RELIABLE DEEP LEARNING WITH APPLICATION TO DIGITAL HISTOPATHOLOGY IMAGE ANALYSIS

A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by Zhilu Zhang
May 2022

© 2022 Zhilu Zhang
ALL RIGHTS RESERVED

Zhilu Zhang, Ph.D.
Cornell University 2022

Deep learning has achieved tremendous success over the past decade, pushing the limit in various application domains such as computer vision and natural language processing. Despite the advancements, recent work has demonstrated potential risks associated with modern neural networks, hindering the reliability of such deep learning systems for real-world applications. In this thesis, we consider several challenges associated with the reliable application of neural networks. Specifically, the broad term of reliability is broken down into two aspects. Firstly, it has been demonstrated that typical deep learning systems are prone to overfitting to noisy labels commonly present in large-scale datasets, thereby leading to sub-optimal performances. To alleviate this problem, we propose a novel loss function and demonstrate its robustness against label noise. Secondly, prior work has highlighted problems in the uncertainty quantification of neural networks. This can significantly hamper the interpretability of neural network predictions. In this thesis, we discuss several strategies that enable us to obtain neural networks with better uncertainty estimations. Lastly, as a case study, we apply deep learning to a real-world problem of a large-scale whole-slide histopathology image classification task, and demonstrate the effectiveness of such a deep learning system for real-world medical application.

BIOGRAPHICAL SKETCH

Zhilu Zhang was born in Wuhan, China.
He moved to Singapore in 2007 and completed high school at Temasek Junior College, Singapore. He then traveled to the US to further his academic career and obtained a Bachelor of Arts degree with a major in Mathematics and Physics from Carleton College, MN. Following his college graduation, Zhilu joined Cornell University as a Ph.D. student in the Fall of 2016. During his Ph.D. career, he worked as a research intern at several companies, including Amazon Web Services, Waymo, and ByteDance. After six years at Cornell, he is graduating in May 2022. In the short term, he looks forward to starting work as an applied scientist at Amazon Web Services.

ACKNOWLEDGEMENTS

First of all, I am deeply indebted to my advisor, Mert R. Sabuncu. As a novice fresh out of college, I was extremely fortunate to have met Mert and to have been given the precious opportunity to learn and work alongside him in a field I knew little about back then. Throughout my Ph.D. career, I have learned so much from him in every single aspect. I am, and will always be, immensely grateful for having been able to work under his supervision for my Ph.D.

I was Mert’s second Ph.D. student, and our lab started small. I am hugely grateful to Evan M. Yu and Meenakshi Khosla for all their support and company during the initial years of my Ph.D. and the countless nights spent together in the lab. I am also thankful to all my lab mates and friends for their support: Heejong Kim, Alan Wang, Batuhan Karaman, Gia H. Ngo, Tianyu Ma, Matthew Pool, Carmen Khoo, Victor Butoi, Cagla Bahadir, and Zijin Gu.

Lastly, I would like to express my gratitude to my parents, Guojun and Hong, my brother, Zhiyuan, and my wife, Vianne. Without their tremendous understanding and encouragement in the past few years, it would have been impossible for me to complete this journey.

TABLE OF CONTENTS

Biographical Sketch
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction and Background
  1.1 A Brief Recap of Deep Learning
    1.1.1 Feed Forward Neural Networks
    1.1.2 Regularization
    1.1.3 Optimization
    1.1.4 Convolutional Neural Networks
  1.2 Challenges
    1.2.1 Vulnerability to Noises
    1.2.2 Unreliable Model Confidence
  1.3 Contributions
2 Label-Noise Robust Learning of Neural Networks with Generalized Cross Entropy Loss
  2.1 Introduction
  2.2 Related Work
  2.3 Generalized Cross Entropy Loss for Noise-Robust Classifications
    2.3.1 Preliminaries
    2.3.2 Lq Loss for Classification
    2.3.3 Truncated Lq Loss
  2.4 Experiments
    2.4.1 Toward a Better Understanding of Lq Loss
    2.4.2 Datasets
    2.4.3 Results and Discussion
3 Improving Confidence Calibration for Convolutional Neural Networks with Structured Dropout
  3.1 Introduction
  3.2 Related Work
  3.3 An Analysis of the Performance of MC Dropout
    3.3.1 MC Dropout as Ensembles of Dropout Models
    3.3.2 Decomposing the Performance of Ensembles
    3.3.3 Performance of MC Dropout and Model Diversity
    3.3.4 Omnibus Dropout
    3.3.5 Enhanced Ensemble Diversity with Structured Dropout
  3.4 Experiments
  3.5 Bayesian Active Learning
4 Enhancing Uncertainty Estimates with Efficient Neural Network Ensembles
  4.1 Introduction
  4.2 Related Works
  4.3 Preliminaries
  4.4 Fixing MC Dropout
    4.4.1 Toward Enhancing Model Diversity
    4.4.2 Enhancing Individual Model Performance
  4.5 Experiments
    4.5.1 Results
    4.5.2 Ablation Study
    4.5.3 How Many Subnetworks Can We Fit?
5 Accelerating Uncertainty Estimates Computation with Uncertainty-Aware Distribution Distillation
  5.1 Introduction
  5.2 Related Work
  5.3 Method
    5.3.1 Preliminary: Dropout for Bayesian Deep Learning
    5.3.2 A Teacher-Student Paradigm for Sample-free Uncertainty Estimation
  5.4 Experiments
    5.4.1 Semantic Segmentation
    5.4.2 Pixel-Wise Depth Estimation
    5.4.3 Ablation Study on Additional Augmentation
    5.4.4 Distilling from Deep Ensemble
    5.4.5 Discussion
6 Towards a Deeper Understanding of Knowledge Distillation
  6.1 Introduction
  6.2 Related Works
  6.3 Preliminaries
    6.3.1 Teacher-Student Training Objective
  6.4 Multi-Generation Self-Distillation: A Close Look
    6.4.1 Predictive Uncertainty
    6.4.2 Confidence Diversity
    6.4.3 Sequential Self-Distillation Experiment
  6.5 An Amortized MAP Perspective of Self-Distillation
    6.5.1 Label Smoothing as MAP
    6.5.2 Self-Distillation as MAP
    6.5.3 On the Relationship between Label Smoothing and Self-Distillation
  6.6 Beta Smoothing Labels
  6.7 Empirical Comparison of Distillation and Label Smoothing
    6.7.1 Experimental Setup
    6.7.2 Results
  6.8 Discussion and Future Directions
7 A Case Study of Deep learning to Digital Pathology Image Analysis
  7.1 Introduction
  7.2 Materials and Methods
    7.2.1 Datasets
    7.2.2 Image Preprocessing
    7.2.3 Model Training
    7.2.4 Model Inference
    7.2.5 Pathologist Evaluation
    7.2.6 Prediction Heatmap
    7.2.7 UMAP Visualization
    7.2.8 Statistical Analysis and Software
    7.2.9 Image Augmentation
  7.3 Results
    7.3.1 ML Models Accurately Predict IDH Mutation Status
    7.3.2 Single-Scale ML Models Make Distinct Errors Relative to Each Other and to Humans
    7.3.3 Patch-Level Predictions Reveal Features that Drive Accurate and Inaccurate Predictions
    7.3.4 Patch-Level Embedding Vectors Reflect Diagnostically Relevant Human-Identifiable Features
  7.4 Discussion
8 Conclusion
Bibliography
A Supplementary Material for “Label-Noise Robust Learning of Neural Networks with Generalized Cross Entropy Loss”
B Supplementary Material for “Improving Confidence Calibration for Convolutional Neural Networks with Structured Dropout”
  B.1 Brief Review of Dropout As Bayesian Approximation
  B.2 Relationship between Different Performance Metrics
  B.3 Additional Results
C Supplementary Material for “Enhancing Uncertainty Estimates with Efficient Neural Network Ensembles”
  C.1 A Brief Review of the Edge-Pop Algorithm
  C.2 Additional Ablation Studies
D Supplementary Material for “Towards a Deeper Understanding of Knowledge Distillation”
  D.1 On Label Smoothing and Predictive Uncertainty Regularization
  D.2 Additional Experiments with Temperature Scaling on Student Models
  D.3 Additional Experiments on Sequential Self-Distillation with Different Temperatures
    D.3.1 Additional Experiments with Different Amount of Label Smoothing ϵ
  D.4 Additional Experiments with Self-Training Using EMA-Predictions
  D.5 Additional Experiments with CIFAR-10 When Varying Trainset Size
  D.6 Additional Experiments with CIFAR-100 When Varying Weight Decay
  D.7 Additional Experiments on Beta Smoothing
  D.8 Additional Experiments on the Effect of Quality of Teachers
  D.9 Additional Experiments on Varying γ

LIST OF TABLES

2.1 Average test accuracy and standard deviation (5 runs) on experiments with closed-set noise. We report accuracies of the epoch where validation accuracy is maximum. Forward T and T̂ represent forward correction with the true and estimated confusion matrices, respectively [168]. q = 0.7 was used for all experiments with Lq loss and truncated Lq loss. Best 2 accuracies are bold faced.
2.2 Average test accuracy on experiments with CIFAR-10. We replicated the exact experimental setup as in [213]. The reported accuracies are the average last epoch accuracies after training for 100 epochs. η = 40%. CCE, Forward and method by Wang et al. are adapted for direct comparison.
3.1 Results on benchmark datasets comparing accuracy and uncertainty estimates produced by different types of methods. The top performing result for each metric is bold-faced. MC omnibus-dropout is consistently the best method. The numbers in brackets next to dropout methods correspond to the optimal drop rate found by grid search.
4.1 Results for ResNet models on various datasets. Best results for efficient ensembles are highlighted in bold. A fixed classification layer is used for orthogonal dropout. See Table 4.3 and the Appendix for further ablation study on this.
4.2 Results for Wide ResNet28-10. The asterisk symbol (*) marks results adapted directly from [77]. Best results for efficient ensembles are highlighted in bold.
4.3 Ablation study of the proposed method. Orthogonal dropout methods are trained without dropout mask optimization. “MO” corresponds to “mask optimization” and “FC” corresponds to “Fixed Classifier”. “Ind Acc” denotes the averaged individual model accuracy in an ensemble, while “Ens Acc” represents the ensemble accuracy.
5.1 Results on the segmentation problem. “T”, “S”, and “AU” correspond to the teacher model, the student model, and the aleatoric uncertainty, respectively. “T+AU” corresponds to a teacher model trained with the aleatoric uncertainty. “DD” corresponds to the student trained using Dropout Distillation [21]. Best performing results for each teacher-student pair are bold-faced.
5.2 Results on the depth estimation. “T”, “S”, and “AU” correspond to the teacher model, the student model, and the aleatoric uncertainty, respectively. “T+AU” corresponds to a teacher model trained with the aleatoric uncertainty.
5.3 Top-4 Rows: Impact of adding augmentation in training on quality of uncertainty produced on the CamVid and NYU datasets. “T” and “S” represent teacher and student models, and “AUG” corresponds to augmentation. Last Row: Uncertainty performance of student model when a deep ensemble with five NNs is used as the teacher model.
7.1 Summary of the demographics for the TCGA training, validation, and test datasets and the WCM test datasets. No significant differences are seen in sex between the IDHmut and IDHwt groups. IDH mutant gliomas show statistically significant enrichment in younger patients, consistent with historic controls. † indicates average simulation p-value: 140 IDH WT slides in the training dataset were randomly sampled and one-way ANOVA was then conducted. Simulations were repeated 1000 times. * indicates propensity score matching accounting for age and sex.
B.1 Results comparing accuracy and uncertainty estimates obtained using a single model when drop rate = 0.1 for all models. The top performing result for each metric is bold-faced. MC omnibus-dropout is the best method in general.
C.1 Comparison against deep ensembles with reduced convolutional kernel size, so that deep ensemble has the same number of parameters as orthogonal dropout. “FC” stands for fixed classification.
C.2 Comparison against baseline methods when all methods have fixed classification layer.
C.3 Comparison against other types of dropout.

LIST OF FIGURES

2.1 (a), (b) Test accuracy against number of epochs for training with CCE (orange) and MAE (blue) loss on clean data with (top) CIFAR-10 and (mid) CIFAR-100 datasets.
(c) Average softmax prediction for correctly (solid) and wrongly (dashed) labeled training samples, for CCE (orange) and Lq (q = 0.7, blue) loss on CIFAR-10 with uniform noise (η = 0.4).
2.2 The test accuracy and validation loss against number of epochs for training with Lq loss at different values of q.
3.1 From left to right: (1) accuracy of MC dropout and deep ensemble; (2) the relative improvements in accuracy of deep ensemble and MC dropout; (3) Brier score of MC dropout and deep ensemble against number of models; (4) the relative improvements in Brier score of deep ensemble and MC dropout against number of models.
3.2 Interrater Agreement (IA) of models with different types of dropout with 0.1 dropout rate on the SVHN, CIFAR-10 and -100 datasets. The lower the IA, the more diverse the predictions of the models. Y-axis indicates different methods. MC dropout produces models with much larger IA, hence less model diversity, than structured dropout techniques in most of the cases.
3.3 Test Brier score (left) and accuracy (right) against number of models for ensemble prediction at test time on CIFAR-10. This corresponds to the number of different MC dropout instantiations at test time of the same model. The model trained with omnibus dropout achieves the best accuracy and Brier score.
3.4 Reliability diagrams of predictions produced by different models.
3.5 Left: Test accuracy against number of training samples for models with different methods of dropout and Variation Ratios as the acquisition function on CIFAR-10. Right: Relative improvements in test accuracy over that of the first iteration with different methods of dropout.
4.1 Bar plots of accuracy of individual orthogonal dropout subnetworks of ResNet models.
“i-th” model represents the i-th subnetwork obtained using Algorithm 1 sequentially.
4.2 Plot of accuracy/NLL/ECE against number of models in the ensembles. For orthogonal dropout, number of models is varied by changing the size of each subnetwork and all the orthogonal dropout ensembles are of the same size. “FC” corresponds to “Fixed Classifier”.
5.1 An illustration of the proposed method. Given a trained teacher, a deterministic student is used to approximately parameterize the predictive distribution of the teacher model, enabling sample-free uncertainty estimation.
5.2 Example predictions on CamVid. Each uncertainty map shows the sum of aleatoric and epistemic uncertainty. Same for all the following example plots.
5.3 Example predictions on Pascal VOC2012.
5.4 (a)-(c): Comparison of performance against the running time for both the teacher (with the aleatoric uncertainty) and student model using the CamVid dataset. (d) Speed-up ratios of uncertainty estimates for the CamVid dataset with the Bayesian SegNet compared to Huang et al. [91] and Postels et al. [172], two other sample-free uncertainty estimation methods.
5.5 Performance of models trained with CamVid and evaluated on Cityscapes.
5.6 Top: Relative means of BALD for samples of seen and unseen classes during training compared to the “Reference” models, which refer to models trained with both seen and unseen classes. Bottom: Distribution of BALD for samples of seen and unseen classes during training.
5.7 Example predictions on CamVid when “pedestrian” and “bicyclist” are held out during training. “Reference” refers to models trained with all classes.
5.8 Example predictions on NYU.
6.1 Results for sequential self-distillation over 10 generations are shown above. Model obtained at the (i − 1)-th generation is used as the teacher model for training at the i-th generation. Accuracy and NLL are obtained on the test set using the student model, whereas the predictive uncertainty and confidence diversity are evaluated on the training set with teacher predictions.
6.2 Results with teacher predictions scaled by varying temperature T. The flat lines in the plots correspond to the largest/smallest values achieved over 10 generations of sequential distillation with T = 1 in the previous experiments for accuracy, predictive uncertainty and confidence diversity/NLL.
6.3 Experimental results on CIFAR-100, CUB-200 and the Tiny-ImageNet dataset. “CE”, “LS”, “B”, and “SD” refer to “Cross Entropy”, “Label Smoothing”, “Beta Smoothing”, and “Self-Distillation”, respectively. The top rows of each experiment show bar charts of accuracy on test set for each experiment conducted, while the bottom rows are bar charts of expected calibration error.
7.1 A schematic for the end-to-end process of model training and deployment. WSI are tiled into patches of 256x256 size at 2.5X, 5X, 10X, and 20X magnification factors (A). In each training iteration (mini-batch), 200 randomly selected and augmented patches from a single magnification of a single WSI were passed to single-scale DenseNet121 classifiers, initialized with ImageNet pre-trained weights.
Feature embedding vectors from each patch were then aggregated using naïve averaging, and the resulting vector was then passed to a final fully connected (linear) classifier (B). Following training, the predictions of three versions of each single-scale model trained with different random seeds were averaged to produce a single-scale ensemble, and the predictions from each single-scale ensemble were averaged to produce the multiscale ensemble (MSE) predictions (C).
7.2 ROC curves for the ML classifiers, pathologists, and hybrid models on the WCM test data. Figure A compares the model performance of the single-scale ensembles and the multi-scale ensemble (MSE). The performance of the semiquantitative predictions of two expert neuropathologists and the two-pathologist averaged consensus are compared in Figure B. Figure C compares the predictions of the top-performing neuropathologist with the MSE, and the hybrid model generated by naïve averaging of pathologist and MSE predictions.
7.3 Patient-level predictions in the WCM test data, for the pathologists and ML models. Panel A compares the semiquantitative prediction scores of the two neuropathologists (κ = 0.656, R = 0.767). Panel B compares the two-neuropathologist consensus predictions to the multiscale classifier (κ = 0.598, R = 0.674). Panel C shows all patient-level predictions using the single-scale models, multiscale ensemble, individual pathologists (P1, P2), two-pathologist consensus (P1+P2), and the hybrid classifier (P+WSIP1+MSE).
7.4 Examples of the sliding-window visualizations, with representative patches from regions of 3 example cases that provide insight into features recognized by the classifier.
(A) shows a low power HE image of a slide that was accurately predicted as IDHmut by the neuropathologists, but was incorrectly classified by the MSE. (B) shows a heatmap of average pixel-level IDH mutation status predictions. Selected patches from image A demonstrate higher IDHmut predictions in regions of solid tumor (C), with higher IDHwt predictions in regions of minimally involved brain parenchyma (D). E and F show an example of a slide from an IDHmut case, which was misclassified by both the neuropathologists and the ML classifier. Regions from this slide containing tumor with monomorphic gemistocytic cytomorphology (G) and regions of minimally involved brain parenchyma with perineuronal and perivascular white space artifact (H) were associated with a higher prediction for IDHmut, while areas of minimally involved brain parenchyma without significant whitespace artifact (I) and regions with more bizarre cytology (J) were associated with a higher prediction of IDHwt status. Figures K and L show a slide from an IDHmut glioma which was accurately predicted by the ML classifier, but inaccurately predicted by the neuropathologists. Areas of mildly cellular tumor, both with and without whitespace artifact (M and N respectively) were associated with higher IDHmut predictions, while regions of necrosis (O) and regions of minimally involved brain parenchyma (P) were associated with higher IDHwt predictions.
7.5 UMAP coordinates of the feature embedding vector activations from patches passed through the 10x classifier. A shows some example tiles in 2D UMAP coordinates. B shows the patch-level IDH status prediction scores as predicted by the 10x classifier. Tiles from region C demonstrate microcystic architecture. Tiles from region D demonstrate hypercellular regions of infiltrating tumor, with round cytology, enriched for tumors with oligodendroglial morphology.
Tiles from region E demonstrate hypercellular regions of tumor with a greater degree of nuclear spindling/elongation and nuclear pleomorphism. Tiles from region F demonstrate brain parenchyma without significant infiltration by tumor cells.
B.1 Test Brier score (left) and accuracy (right) against number of models for ensemble prediction at test time on SVHN and CIFAR-100. This corresponds to the number of different MC dropout instantiations at test time of the same model. The model trained with omnibus dropout achieves the best accuracy and Brier score.
B.2 Plots of test time NLL (left) and accuracy (right) against dropout rate for models trained with different types of dropout on the SVHN, CIFAR-10 and CIFAR-100 datasets.
C.1 Plot of accuracy/NLL/ECE against number of models in the ensembles. For the proposed method, the number of models is varied by changing the size of each subnetwork and all the orthogonal dropout ensembles are of the same size. For MIMO networks, the number of models is varied by changing the number of inputs and outputs (classifier layers) of the networks.
D.1 Left: Test accuracies of ResNet-34 models on the CIFAR-100 dataset when varying temperature. Right: ECE of ResNet-34 models on the CIFAR-100 dataset when varying temperature. “Scale both” corresponds to the originally proposed distillation objective in which both teacher and student models are temperature-scaled during training. “Scale teacher only” corresponds to only temperature scaling teacher models during distillation. The green flat line represents the performance achieved by the teacher model trained with cross-entropy loss.
D.2 Results for sequential self-distillation over 5 generations are shown above for different temperatures.
Top: temperature T = 2.0; Bottom: temperature T = 3.0. The same temperatures are used throughout the entire sequential distillation process. Model obtained at the (i − 1)-th generation is used as the teacher model for training at the i-th generation. Accuracy and NLL are obtained on the test set using the student model, whereas the predictive uncertainty and confidence diversity are evaluated on the training set with teacher predictions.
D.3 Experimental results on CIFAR-100, CUB-200 and the Tiny-ImageNet dataset with different amounts of label smoothing. Left: ϵ = 0.1, Right: ϵ = 0.3. “CE”, “LS”, “B”, and “SD” refer to “Cross Entropy”, “Label Smoothing”, “Beta Smoothing”, and “Self-Distillation”, respectively. The top rows of each experiment show bar charts of accuracy on test set for each experiment conducted, while the bottom rows are bar charts of expected calibration error.
D.4 Additional results to compare Beta smoothing against self-training explicitly with the EMA predictions. “B” and “ST” refer to “beta smoothing” and “self-training”, respectively. The top rows of each experiment show bar charts of accuracy on the test set for each experiment conducted, while the bottom rows are bar charts of expected calibration error.
D.5 Left: Test accuracies of ResNet-34 models on the CIFAR-10 dataset for the teacher and student models when the training set size is varied. Right: The relative improvements in accuracy when the training set size is varied.
D.6 Left: Test accuracies of ResNet-34 models on the CIFAR-100 dataset for the teacher and student models when the weight decay hyper-parameter is varied. Right: The relative improvements in accuracy when the weight decay hyper-parameter is varied.
D.7 Ablation study on Beta smoothing.
“LS”, “RB”, and “B” refer to “Label Smoothing”, “Random Beta Smoothing”, and “Beta Smoothing”, respectively. The top rows of each experiment show bar charts of accuracy on the test set for each experiment conducted, while the bottom rows are bar charts of expected calibration error.
D.8 Additional results on cross-distillation. “SD” and “CD” refer to “self-distillation” and “cross-distillation”, respectively. The top rows of each experiment show bar charts of accuracy on the test set for each experiment conducted, while the bottom rows are bar charts of expected calibration error.
D.9 Additional results on pruned distillation. “SD” and “PD” refer to “self-distillation” and “pruned-distillation”, respectively. The top rows of each experiment show bar charts of accuracy on the test set for each experiment conducted, while the bottom rows are bar charts of expected calibration error.

CHAPTER 1
INTRODUCTION AND BACKGROUND

Ever since the seminal work by Krizhevsky et al. [112], deep learning has demonstrated incredible capabilities in various fields of computer science such as natural language processing and computer vision, significantly advancing the state of the art and pushing the boundaries in such fields [124]. This resurgence of interest in neural networks has even spread to many fields outside of computer science, such as physics, biology, and medicine [199, 212, 214]. With such demonstrated capabilities and potential, the reliability and interpretability of neural networks have come under careful scrutiny. Indeed, on top of the good predictive performance achievable with such neural network systems, it is also paramount to understand how reliable and interpretable such machine-generated predictions are.
Recently, numerous works have raised concerns about the vulnerability of deep learning in real-world applications [64, 232], highlighting the potential dangers of applying deep learning to sensitive application domains like medical image diagnosis and autonomous driving systems. In this thesis, we focus primarily on two aspects of the broad term "reliability" of neural networks. Before discussing the challenges associated with reliability, however, we start by offering a brief overview of deep learning in Section 1.1. We then elaborate in Section 1.2 on two of the challenges associated with the reliability of deep neural networks. Finally, we summarize in Section 1.3 the proposed methods to tackle the aforementioned problems.

1.1 A Brief Recap of Deep Learning

Before giving a brief introduction to deep neural networks, we start by considering a simple linear model. Suppose we are given a set of input-output pairs $\{(x_1, y_1), \dots, (x_p, y_p)\}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}^m$ for all $i = 1, \dots, p$. Such input-output pairs can be any two correlated quantities. For instance, the inputs $x_i$ could be the heights of human beings, and the outputs $y_i$ could be their weights. In modern computer vision applications, the inputs $x_i$ could be vectors that represent the pixels of images, and $y_i$ could be the corresponding classes of the input images. In a real-world medical imaging application, the inputs could be MRI images of patients, and the outputs could be booleans that represent whether the patients have cancer. In short, the goal of the task is to predict $y$ given a new $x$ that is not present in the set of data points already given.

To solve this prediction problem, we can use a mathematical function, or a model. In the simplest case, we can assume that the input-output pairs follow a simple linear relationship.
In this case, a simple linear transformation
$$\hat{y} \triangleq f(x) = xW + b,$$ (1.1)
where $W \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^m$, can be used to map inputs $x$ to their corresponding outputs $y$.

Given the above linear model, the natural question to ask is how to obtain the optimal parameters $W$ and $b$. Indeed, different parameters $W$ and $b$ define distinct linear transformations, and to find a model that is capable of providing accurate predictions given inputs $x$, we need to be able to search for the best set of parameters. Such parameters can be obtained by solving an optimization problem with a cost function. For a regression problem where the outputs $y$ are real-valued scalars (e.g. weights of human beings), a common choice of the cost function is the least squares cost function defined as:
$$L = \frac{1}{p} \sum_{i=1}^{p} \|\hat{y}_i - y_i\|^2.$$ (1.2)
For notational convenience, if we include the constant variable 1 in $x$ and include the bias $b$ in the weight $W$, Eqn 1.1 can be written more compactly as $f(x) = xW$. Stacking the inputs $x_i$ as the rows of a matrix $X$ and the outputs $y_i$ as the entries of a vector $y$, the above summation can then be written, up to the constant factor $1/p$, as the matrix product $L = (y - XW)^T (y - XW)$. Taking the derivative of the cost function and setting it to zero, it is not difficult to see that the optimal solution to the above optimization problem is $W = (X^T X)^{-1} X^T y$ [76].

1.1.1 Feed-Forward Neural Networks

Despite its simplicity, the linear model described above is limiting. Indeed, it can predict accurately only if the input-output pairs follow a linear relationship. Unfortunately, our world is a complex system, and a linear model would fail miserably in most complicated tasks. To solve this problem, a natural question to ask is, how can we model more complex relationships? We can use more complicated functions.

Many methods have been proposed to model non-linear relationships. Feed-forward neural networks are one of the popular approaches. The idea is shockingly straightforward.
Instead of having one linear transformation, we have multiple nested linear transformations one after another, with a non-linear function in between each pair of linear transformations. In this way, the overall function of nested linear transformations would be highly non-linear and expressive, capable of capturing very complex relationships.

For simplicity, let us first consider a neural network with one hidden layer, or two linear transformations. Similar to the case of linear regression, given an input $x$, a linear transformation is first used to transform the input to $xW_1 + b_1$. Then, a non-linear activation $\sigma$ is applied before a second linear transformation is applied. Many types of activation are used in practice. Some popular examples include ReLU, sigmoid, and tanh [176]. Note that such non-linear activation functions are crucial in making the overall neural network function non-linear. Indeed, without them, neural networks would merely be sequences of linear transformations, which collectively are still a linear transformation. Overall, mathematically, such a two-layer feed-forward neural network is defined by
$$\hat{y} \triangleq f(x) = \sigma(xW_1 + b_1)W_2.$$ (1.3)
These models are called feed-forward because information flows in one direction, layer by layer, from inputs $x$ to outputs $\hat{y}$. There are no feedback connections in which intermediate hidden layer outputs of the model are fed back into itself.

Despite their simplicity, neural networks are powerful models capable of representing very complex functions when we stack many of these linear transformations together. Rigorous analysis has shown that deep neural networks are universal function approximators [106, 136, 167].

1.1.2 Regularization

More complex models are not necessarily always better. While deeper neural networks are capable of representing more complex functions, they can also suffer from a phenomenon known as overfitting [76].
Overfitting happens when a model predicts very well only on the training data (the data points used to obtain the optimal parameters of the model) but performs much worse on data not seen during the optimization process. A central problem in machine learning is how to make a model perform satisfactorily not only on the training data but also on the test set. Indeed, predicting well on these unseen data points is the primary goal of a machine learning model. To achieve this goal, many strategies have been proposed to reduce test error or improve generalization performance. These strategies are known collectively as regularization. Often, regularization comes at the cost of increased training error. We scratch the surface and briefly discuss several methods for regularizing neural networks.

Perhaps the simplest form of regularization is the “parameter norm penalty”. Widely used in most machine learning models [76], this form of regularization works by limiting the effective capacity of the models. This is done by adding a norm-based penalty term to the loss function being optimized. The norm is computed with respect to the model parameters so that the search space of parameters is restricted to values closer to zero. For instance, one of the commonly used penalty terms is the L2 norm regularization. In this case, for a two-layer neural network, the overall loss function becomes
$$L = \frac{1}{p} \sum_{i=1}^{p} \|\sigma(x_i W_1 + b_1) W_2 - y_i\|^2 + \lambda_1 \|W_1\|^2 + \lambda_2 \|W_2\|^2 + \lambda_3 \|b_1\|^2,$$ (1.4)
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyper-parameters called weight decay terms. Such hyper-parameters can be selected based on a holdout set of training data not used directly to obtain the optimal weights of neural networks. This dataset is commonly known as the validation set.

Another recently proposed regularization method most commonly used for neural networks is dropout regularization [195, 210].
Similar to L2 norm regularization, dropout regularization also restricts the effective capacity of neural networks. This is done by randomly dropping out a subset of the parameters of neural networks during training. Effectively, a smaller model is used for learning the training data, thereby regularizing the model. Which parameters are removed is typically determined by i.i.d. samples from a Bernoulli distribution.

One of the best ways to improve the generalization performance is arguably to train our models with more data. However, in most scenarios, data can be expensive to obtain. In such scenarios, we can synthetically generate more data as a form of regularization. In this way, our models overfit less to the limited amount of training data. Such data generation processes often leverage the inherent structure of the data so that the inputs are perturbed in ways that do not change the corresponding outputs. For instance, in the case of natural images, we can rely on the limited translation, rotation, and scale invariance of images relative to their output labels as a way to generate additional training data [189]. In general, these methods are collectively known as data augmentation.

1.1.3 Optimization

With a neural network model and a loss function, our next step is to find the set of parameters that predicts outputs accurately given inputs. Unlike the linear model, due to the non-convexity of the objective function, minimizing the loss functions of neural networks generally has no closed-form solution. Instead, we rely on gradient descent to optimize the loss function [14]. In general, suppose we have a loss function $L(\theta)$, where $\theta \in \mathbb{R}^d$ denotes all the parameters associated with the model. The main idea of gradient descent involves updating the parameters $\theta$ in the opposite direction of the gradient of the loss function $\nabla_\theta L(\theta)$ with respect to the parameters.
To visualize, we follow the direction of the slope of the surface created by the loss function downhill until we reach a valley. Then, the set of parameters $\theta$ obtained at this valley should be one with a small loss and, ideally, good generalization performance.

Traditionally, the gradient $\nabla_\theta L(\theta)$ is computed with the entire training data. This is termed batch gradient descent. At the $i$-th iteration of gradient descent, the parameters $\theta$ are updated with the update rule
$$\theta_i = \theta_{i-1} - \eta \nabla_\theta L(\theta_{i-1}),$$ (1.5)
where $\eta$ denotes a hyper-parameter called the learning rate. $\eta$ controls the magnitude of the step taken at each iteration. Due to the non-linearity of the loss function, a large learning rate can hinder convergence and lead to unstable training. On the other hand, a small learning rate can lead to slow convergence of the model.

In the modern era of big data, it is often computationally infeasible to compute the gradient with respect to the entire training set. To overcome this problem, stochastic gradient descent (SGD) can be used instead. In SGD, only a small subset (a mini-batch) of the training data is used to compute a coarse estimate of the batch gradient $\nabla_\theta L(\theta)$ at each iteration of the gradient update. Interestingly, it has recently been observed that such stochasticity is often beneficial in helping deep neural networks generalize better [75, 99, 193].

Optimization is a crucial aspect of machine learning. Better and faster optimization algorithms enable us to obtain models of better quality. Many methods have been proposed to improve the vanilla gradient descent algorithm discussed above. For instance, SGD with momentum [173] was proposed as a way to speed up the convergence of SGD. It helps accelerate SGD in the relevant direction and dampens oscillations in the unwanted directions. This is done by adding a fraction $\gamma$ of the update vector of the past time step to the current gradient step:
$$v_i = \gamma v_{i-1} + \eta \nabla_\theta L(\theta_{i-1})$$ (1.6)
$$\theta_i = \theta_{i-1} - v_i.$$
(1.7)
There are many other more recently proposed algorithms for optimization. Some of the popular ones include Adam, RMSprop, and AdaGrad [109, 185].

1.1.4 Convolutional Neural Networks

The primary focus of this thesis is on computer vision and medical imaging tasks. Within this domain, a group of neural networks called convolutional neural networks (CNNs) has achieved tremendous success [68]. Unlike feed-forward dense neural networks, CNNs typically have a smaller number of parameters and thus are easier to optimize. They are a specialized kind of neural network for processing data that has a known grid-like topology. Being shift invariant or space invariant, they are designed with a built-in inductive bias suited to natural images. At a high level, the parameter efficiency is achieved by the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps [125, 164].

Mathematically, convolutional layers are defined by kernel matrices $K$. Suppose we have an input $I$. Within each convolutional layer, the convolution operation
$$h(x, y, z) = (K * I)(x, y, z) = \sum_i \sum_j \sum_k I(x + i, y + j, z + k) K(i, j, k)$$ (1.8)
is performed to produce the output of the layer. Together with nonlinear activation functions such as ReLU, a simple CNN can then be formed by stacking numerous convolutional layers together.

Following the tremendous initial success of CNNs for solving computer vision tasks [112], numerous efforts have been made to design better CNNs that are easier to train, converge faster, and generalize better. Some of the popular approaches include the use of skip connections in such CNNs [80, 89, 190]. More recently, researchers have started exploring the possibility of casting CNN architecture design itself as an optimization problem to automatically obtain high-performing CNNs without the need for human heuristics [40].
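As an illustration of Eq. 1.8, the following minimal Python sketch evaluates the triple sum at every position where the kernel fits entirely inside the input, and then applies a ReLU. It is not taken from any library: the function names `conv3d_valid` and `relu3d`, the nested-list representation of $I$ and $K$, and the restriction to positions that need no padding are all assumptions made here purely for illustration.

```python
def conv3d_valid(I, K):
    """Evaluate Eq. 1.8 wherever the kernel K fits inside the input I.

    I and K are nested lists indexed as I[x][y][z] and K[i][j][k]
    (an illustrative representation, not a library convention).
    """
    X, Y, Z = len(I), len(I[0]), len(I[0][0])
    A, B, C = len(K), len(K[0]), len(K[0][0])
    out = []
    for x in range(X - A + 1):
        plane = []
        for y in range(Y - B + 1):
            row = []
            for z in range(Z - C + 1):
                total = 0.0
                # Triple sum over the kernel indices i, j, k of Eq. 1.8.
                for i in range(A):
                    for j in range(B):
                        for k in range(C):
                            total += I[x + i][y + j][z + k] * K[i][j][k]
                row.append(total)
            plane.append(row)
        out.append(plane)
    return out


def relu3d(h):
    """Apply the ReLU non-linearity element-wise to a nested-list feature map."""
    return [[[max(0.0, v) for v in row] for row in plane] for plane in h]


if __name__ == "__main__":
    # A 3 x 3 x 1 input and a 2 x 2 x 1 kernel (hypothetical toy values).
    I = [[[1], [2], [3]],
         [[4], [5], [6]],
         [[7], [8], [9]]]
    K = [[[1], [0]],
         [[0], [-1]]]
    h = conv3d_valid(I, K)   # each entry is I[x][y][0] - I[x+1][y+1][0]
    print(h)
    print(relu3d(h))
```

Because the same four kernel weights are reused at every position, the number of parameters in this toy layer does not grow with the input size, which is the weight sharing described above.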
1.2 Challenges

Despite the recent success of deep learning [112, 124], there remain many challenges that need to be overcome for it to be safely adopted in many real-world applications. One such obstacle is the reliability of deep learning. Indeed, in addition to applications like recommendation systems [235] in which safety and reliability are less of a concern, deep learning is being applied to many sensitive domains like medical image diagnosis [41, 175, 178] and autonomous driving systems [67]. Reliability of deep learning is crucial for trustworthy application in these domains. Nevertheless, modern neural networks have several significant shortcomings that can severely hamper such dependable adoption of deep learning systems. In this thesis, we focus on two such aspects.

1.2.1 Vulnerability to Noises

First of all, modern neural networks are vulnerable to noise present in data. For example, it has been shown that adversarially perturbing the inputs can easily fool a trained neural network model and significantly alter its predictions [64, 228]. Moreover, even naturally occurring noise can be very harmful to the generalizability of neural network systems [81]. Such vulnerability to noise in inputs poses significant risks.

While vulnerability to input noise is a significant topic worth further study, in this thesis we focus on noise present in the labels of the dataset. In today's big data era, label noise is arguably inevitable in most scenarios. For example, a recent study has found that even some of the most widely used benchmark datasets like CIFAR-10 and ImageNet contain numerous erroneously labeled samples [160]. As we will further discuss in Chapter 2, as a result of the expressivity of modern neural networks, such noise present in the dataset can significantly hamper the generalization performance of learned models.
As such, devising algorithms that enable us to train neural networks reliably in the presence of such label noise can be extremely beneficial for real-world applications.

1.2.2 Unreliable Model Confidence

In addition to accurate test-time predictions, neural networks also need to produce reliable confidence intervals or uncertainty estimates. This is especially important for sensitive application domains like medical image analysis. For instance, in order for physicians to safely interpret the predictions made by a neural network system and make corresponding treatment plans for potential cancer patients, the model also needs to produce a trustworthy confidence interval on the predictions made. As we will discuss in Chapter 3, this is exactly what a modern neural network system lacks. Typically, the uncertainty estimates produced are overconfident and uninformative [50].

1.3 Contributions

In this thesis, we propose several novel methodologies to tackle the aforementioned challenges associated with deep learning. Here, we give a high-level summary of the methodologies.

In Chapter 2, we discuss a novel, theoretically grounded set of noise-robust loss functions to tackle the problem of label-noise robust learning of deep neural networks. The proposed loss functions can be readily applied with any existing neural network architecture and algorithm.

Recently, a Bayesian perspective has suggested that dropout regularization can be employed to obtain better probabilistic predictions at test time [51]. The authors of the paper termed the method Monte Carlo dropout. In Chapter 3, we take a step further and explore the use of various structured dropout techniques to further improve the calibration of predictions. We also propose an omnibus dropout strategy that combines various structured dropout methods. We demonstrate that using structured dropout yields predictions with consistently better model confidence calibration.
Monte Carlo (MC) dropout [51] is a simple and efficient ensembling method that can improve the accuracy and confidence calibration of high-capacity deep neural network models. However, MC dropout is not as effective as more compute-intensive methods such as deep ensembles [120]. To bridge the gap between MC dropout and deep ensembles, we propose in Chapter 4 a simple, pruning-based approach to compute non-overlapping dropout masks, which allows us to compute an ensemble of subnetworks. The proposed method can be seen as a computationally efficient alternative to deep ensembles.

As we will discuss in Chapter 5, another shortcoming of MC dropout is that it requires multiple forward passes through the network during inference and can therefore be too resource-intensive to be deployed in real-time applications. To address this latency issue, we leverage a concept called knowledge distillation and propose a simple, easy-to-optimize method for learning the conditional predictive distribution of a pre-trained dropout model. This allows us to obtain sample-free uncertainty estimation in computer vision tasks, thereby significantly reducing the inference cost.

In addition to providing fast sample-free uncertainty estimation, we empirically observe that knowledge distillation also helps improve the generalization performance of neural networks. In Chapter 6, we provide a deeper understanding of the observed phenomenon. Specifically, we offer a new interpretation of teacher-student training as amortized MAP estimation, such that teacher predictions enable instance-specific regularization. Our framework allows us to theoretically relate self-distillation to label smoothing, a commonly used technique that regularizes predictive uncertainty, and suggests the importance of predictive diversity in addition to predictive uncertainty.
Finally, in Chapter 7, we apply neural networks to a real-world histopathology image classification task, and demonstrate that a carefully trained neural network system can perform on par with pathologists with years of training.

CHAPTER 2
LABEL-NOISE ROBUST LEARNING OF NEURAL NETWORKS WITH GENERALIZED CROSS ENTROPY LOSS

As discussed in Chapter 1, deep neural networks (DNNs) have achieved tremendous success in a variety of applications across many disciplines. Yet, their superior performance comes at the expensive cost of requiring correctly annotated large-scale datasets. Moreover, due to DNNs' rich capacity, errors in training labels can hamper performance. To combat this problem, mean absolute error (MAE) has recently been proposed as a noise-robust alternative to the commonly used categorical cross entropy (CCE) loss. However, as we show in this chapter, MAE can perform poorly with DNNs and challenging datasets. In this chapter, we present a theoretically grounded set of noise-robust loss functions that can be seen as a generalization of MAE and CCE. The proposed loss functions can be readily applied with any existing DNN architecture and algorithm, while yielding good performance in a wide range of noisy label scenarios. We report results from experiments conducted with the CIFAR-10, CIFAR-100 and FASHION-MNIST datasets and synthetically generated noisy labels.

2.1 Introduction

The resurgence of neural networks in recent years, together with the recent emergence of large-scale datasets, has enabled super-human performance on many classification tasks [112, 151, 159]. However, supervised DNNs often require a large number of training samples to achieve a high level of performance. For instance, the ImageNet dataset [35] has 3.2 million hand-annotated images.
Although crowdsourcing platforms like Amazon Mechanical Turk have made large-scale annotation possible, some error during the labeling process is often inevitable, and mislabeled samples can impair the performance of models trained on these data. Indeed, the sheer capacity of DNNs to memorize massive data with completely randomly assigned labels [231] proves their susceptibility to overfitting when trained with noisy labels. Hence, an algorithm that is robust against noisy labels for DNNs is needed to resolve this potential problem. Furthermore, when examples are cheap and accurate annotations are expensive, it can be more beneficial to have datasets with more but noisier labels than fewer but more accurate labels [105].

Classification with noisy labels is a widely studied topic [48]. Yet, relatively little attention has been given to directly formulating a noise-robust loss function in the context of DNNs. Our work is motivated by Ghosh et al. [59], who theoretically showed that mean absolute error (MAE) can be robust against noisy labels under certain assumptions. However, as we demonstrate below, the robustness of MAE can concurrently cause increased difficulty in training and lead to a performance drop. This limitation is particularly evident when using DNNs on complicated datasets. To combat this drawback, we advocate the use of a more general class of noise-robust loss functions, which encompasses both MAE and CCE. Compared to previous methods for DNNs, which often involve extra steps and algorithmic modifications, changing only the loss function requires minimal intervention to existing architectures and algorithms, and thus can be promptly applied. Furthermore, unlike most existing methods, the proposed loss functions work for both closed-set and open-set noisy labels [213]. Open-set refers to the situation where samples associated with erroneous labels do not always belong to a ground truth class contained within the set of known classes in the training data.
Conversely, closed-set means that all labels (erroneous and correct) come from a known set of labels present in the dataset.

The main contributions of this chapter are two-fold. First, we propose a novel generalization of CCE and present a theoretical analysis of the proposed loss functions in the context of noisy labels. Second, we report a thorough empirical evaluation of the proposed loss functions using the CIFAR-10, CIFAR-100 and FASHION-MNIST datasets, and demonstrate significant improvement in terms of classification accuracy over the baselines of MAE and CCE, under both closed-set and open-set noisy labels.

The rest of the chapter is organized as follows. Section 2.2 discusses existing approaches to the problem. Section 2.3 introduces our noise-robust loss functions. Section 2.4 presents and analyzes the experiments and results.

2.2 Related Work

Numerous methods have been proposed for learning with noisy labels with DNNs in recent years. Here, we briefly review the relevant literature. Firstly, Sukhbaatar and Fergus [197] proposed accounting for noisy labels with a confusion matrix so that the cross entropy loss becomes
$$L(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log p(\tilde{y} = \tilde{y}_n \mid x_n, \theta) = -\frac{1}{N} \sum_{n=1}^{N} \log\Big( \sum_{i}^{c} p(\tilde{y} = \tilde{y}_n \mid y = i)\, p(y = i \mid x_n, \theta) \Big),$$ (2.1)
where $c$ represents the number of classes, $\tilde{y}$ represents noisy labels, $y$ represents the latent true labels and $p(\tilde{y} = \tilde{y}_n \mid y = i)$ is the $(\tilde{y}_n, i)$'th component of the confusion matrix. Usually, the real confusion matrix is unknown. Several methods have been proposed to estimate it [61, 72, 82, 100, 168]. Yet, accurate estimates can be hard to obtain. Even with the real confusion matrix, training with the above loss function might be suboptimal for DNNs. Assuming (1) a DNN with enough capacity to memorize the training set, and (2) a confusion matrix that is diagonally dominant, minimizing the cross entropy with the confusion matrix is equivalent to minimizing the original CCE loss. This is because the right hand side of Eq.
2.1 is minimized when $p(y = i \mid x_n, \theta) = 1$ for $i = \tilde{y}_n$ and 0 otherwise, for all $n$.

In the context of support vector machines, several theoretically motivated noise-robust loss functions like the ramp loss, the unhinged loss and the savage loss have been introduced [18, 146, 207]. More generally, Natarajan et al. [155] presented a way to modify any given surrogate loss function for binary classification to achieve noise-robustness. However, little attention has been given to alternative noise-robust loss functions for DNNs. Ghosh et al. [59, 60] proved and empirically demonstrated that MAE is robust against noisy labels. This chapter can be seen as an extension and generalization of their work.

Another popular approach attempts to clean up noisy labels. Veit et al. [208] suggested using a label cleaning network in parallel with a classification network to achieve more noise-robust prediction. However, their method requires a small set of clean labels. Alternatively, one could gradually replace noisy labels by neural network predictions [179, 200]. Rather than using predictions for training, Northcutt et al. [161] proposed to prune the correct samples based on softmax outputs. As we demonstrate below, this is similar to one of our approaches. Instead of pruning the dataset once, our algorithm iteratively prunes the dataset while training until convergence.

Other approaches include treating the true labels as a latent variable and the noisy labels as an observed variable so that EM-like algorithms can be used to learn the true label distribution of the dataset [105, 206, 221]. Techniques to re-weight confident samples have also been proposed. Jiang et al. [97] used an LSTM network on top of a classification model to learn the optimal weights on each sample, while Ren et al. [181] used a small clean dataset and put more weight on noisy samples whose gradients are closer to those of the clean dataset. In the context of binary classification, Liu et al.
[129] derived an optimal importance weighting scheme for noise-robust classification. Our method can also be viewed as re-weighting individual samples; instead of explicitly obtaining weights, we use the softmax outputs at each iteration as the weightings. Lastly, Azadi et al. [3] proposed a regularizer that encourages the model to select reliable samples for noise-robustness. Another method that uses knowledge distillation for noisy labels has also been proposed [128]. Both of these methods also require a smaller clean dataset to work.

2.3 Generalized Cross Entropy Loss for Noise-Robust Classifications

2.3.1 Preliminaries

We consider the problem of $c$-class classification. Let $\mathcal{X} \subset \mathbb{R}^d$ be the feature space and $\mathcal{Y} = \{1, \cdots, c\}$ be the label space. In an ideal scenario, we are given a clean dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where each $(x_i, y_i) \in (\mathcal{X} \times \mathcal{Y})$. A classifier is a function that maps the input feature space to the label space, $f : \mathcal{X} \to \mathbb{R}^c$. In this chapter, we consider the common case where the function is a DNN with a softmax output layer. For any loss function $L$, the (empirical) risk of the classifier $f$ is defined as $R_L(f) = \mathbb{E}_D[L(f(x), y_x)]$, where the expectation is over the empirical distribution. The most commonly used loss for classification is cross entropy. In this case, the risk becomes:
$$R_L(f) = \mathbb{E}_D[L(f(x; \theta), y_x)] = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} y_{ij} \log f_j(x_i; \theta),$$ (2.2)
where $\theta$ is the set of parameters of the classifier, $y_{ij}$ corresponds to the $j$'th element of the one-hot encoded label of the sample $x_i$, $y_i = e_{y_i} \in \{0, 1\}^c$ such that $\mathbf{1}^\top y_i = 1$ for all $i$, and $f_j$ denotes the $j$'th element of $f$. Note that $\sum_{j=1}^{c} f_j(x_i; \theta) = 1$ and $f_j(x_i; \theta) \ge 0$ for all $j, i, \theta$, since the output layer is a softmax. The parameters of the DNN can be optimized with empirical risk minimization.

We denote a dataset with label noise by $D_\eta = \{(x_i, \tilde{y}_i)\}_{i=1}^{n}$, where the $\tilde{y}_i$'s are the noisy labels of the samples such that $p(\tilde{y}_i = k \mid y_i = j, x_i) = \eta_{jk}^{(x_i)}$.
In this chapter, we make the common assumption that noise is conditionally independent of inputs given the true labels so that
$$p(\tilde{y}_i = k \mid y_i = j, x_i) = p(\tilde{y}_i = k \mid y_i = j) = \eta_{jk}.$$
In general, this noise is defined to be class dependent. Noise is uniform with noise rate $\eta$ if $\eta_{jk} = 1 - \eta$ for $j = k$, and $\eta_{jk} = \frac{\eta}{c - 1}$ for $j \ne k$. The risk of the classifier with respect to the noisy dataset is then defined as $R_L^\eta(f) = \mathbb{E}_{D_\eta}[L(f(x), \tilde{y}_x)]$.

Let $f^*$ be the global minimizer of the risk $R_L(f)$. Then, empirical risk minimization under loss function $L$ is defined to be noise tolerant [145] if $f^*$ is also a global minimizer of the noisy risk $R_L^\eta(f)$.

A loss function is called symmetric if, for some constant $C$,
$$\sum_{j=1}^{c} L(f(x), j) = C, \quad \forall x \in \mathcal{X}, \ \forall f.$$ (2.3)
The main contribution of Ghosh et al. [60] is the proof that if the loss function is symmetric and $\eta < \frac{c - 1}{c}$, then under uniform label noise, for any $f$, $R_L^\eta(f^*) - R_L^\eta(f) \le 0$. Hence, $f^*$ is also the global minimizer of $R_L^\eta$ and $L$ is noise tolerant. Moreover, if $R_L(f^*) = 0$, then $L$ is also noise tolerant under class dependent noise.

Being a nonsymmetric and unbounded loss function, CCE is sensitive to label noise. On the contrary, MAE, as a symmetric loss function, is noise robust. For DNNs with a softmax output layer, MAE can be computed as:
$$L_{MAE}(f(x), e_j) = \|e_j - f(x)\|_1 = 2 - 2 f_j(x).$$ (2.4)
With this particular configuration of DNN, the MAE loss is, up to a constant of proportionality, the same as the unhinged loss $L_{unh}(f(x), e_j) = 1 - f_j(x)$ [207].

2.3.2 Lq Loss for Classification

In this section, we will argue that MAE has some drawbacks as a classification loss function for DNNs, which are normally trained on large-scale datasets using stochastic gradient based techniques.
Let's look at the gradient of the loss functions:
$$\sum_{i=1}^{n} \frac{\partial L(f(x_i; \theta), y_i)}{\partial \theta} = \begin{cases} \sum_{i=1}^{n} -\frac{1}{f_{y_i}(x_i; \theta)} \nabla_\theta f_{y_i}(x_i; \theta) & \text{for CCE} \\ \sum_{i=1}^{n} -\nabla_\theta f_{y_i}(x_i; \theta) & \text{for MAE/unhinged loss.} \end{cases}$$ (2.5)
Thus, in CCE, samples with softmax outputs that are less congruent with the provided labels, and hence smaller $f_{y_i}(x_i; \theta)$ or larger $1 / f_{y_i}(x_i; \theta)$, are implicitly weighted more in the gradient update than samples with predictions that agree more with the provided labels. This means that, when training with CCE, more emphasis is put on difficult samples. This implicit weighting scheme is desirable for training with clean data, but can cause overfitting to noisy labels. Conversely, since the $1 / f_{y_i}(x_i; \theta)$ term is absent in its gradient, MAE treats every sample equally, which makes it more robust to noisy labels. However, as we demonstrate empirically, this can lead to significantly longer training time before convergence. Moreover, without the implicit weighting scheme to focus on challenging samples, the stochasticity involved in the training process can make learning difficult. As a result, classification accuracy might suffer.

To demonstrate this, we conducted a simple experiment using ResNet [79] optimized with the default setting of Adam [109] on the CIFAR datasets [111]. Fig. 2.1(a) shows the test accuracy curves when trained with CCE and MAE respectively on CIFAR-10. As illustrated clearly, it took significantly longer to converge when trained with MAE. In agreement with our analysis, there was also a compromise in classification accuracy due to the increased difficulty of learning useful features. These adverse effects become much more severe when using a more difficult dataset, such as CIFAR-100 (see Fig. 2.1(b)). Not only do we observe significantly slower convergence, but also a substantial drop in test accuracy when using MAE.
In fact, the maximum test accuracy achieved with MAE after 2000 epochs, long after training with CCE had converged, was 38.29%, while CCE achieved a higher accuracy of 39.92% after merely 7 epochs! Despite its theoretical noise-robustness, because of the training shortcomings induced by that same noise-robustness, we conclude that MAE is not suitable for DNNs with challenging datasets like ImageNet.

Figure 2.1: (a), (b) Test accuracy against number of epochs for training with CCE (orange) and MAE (blue) loss on clean data with (top) CIFAR-10 and (mid) CIFAR-100 datasets. (c) Average softmax prediction for correctly (solid) and wrongly (dashed) labeled training samples, for CCE (orange) and Lq (q = 0.7, blue) loss on CIFAR-10 with uniform noise (η = 0.4).

To exploit the benefits of both the noise-robustness provided by MAE and the implicit weighting scheme of CCE, we propose using the negative Box-Cox transformation [15] as a loss function:

$$L_q(f(x), e_j) = \frac{1 - f_j(x)^q}{q}, \tag{2.6}$$

where $q \in (0, 1]$. Using L'Hôpital's rule, it can be shown that the proposed loss function is equivalent to CCE in the limit $q \to 0$, and becomes the MAE/unhinged loss when $q = 1$. Hence, this loss is a generalization of CCE and MAE. Relatedly, Ferrari and Yang [44] viewed the maximization of Eq. 2.6 as a generalization of maximum likelihood and termed the loss function Lq, which we also adopt.

Theoretically, for any input $x$, the sum of Lq loss with respect to all classes is bounded by:

$$\frac{c - c^{(1-q)}}{q} \leq \sum_{j=1}^{c} \frac{1 - f_j(x)^q}{q} \leq \frac{c - 1}{q}. \tag{2.7}$$

Using this bound and under uniform noise with $\eta \leq 1 - \frac{1}{c}$, we can show (see Appendix)

$$A \leq \left(R_{L_q}(f^*) - R_{L_q}(\hat{f})\right) \leq 0, \tag{2.8}$$

where $A = \frac{\eta[1 - c^{(1-q)}]}{q(c - 1 - \eta c)} < 0$, $f^*$ is the global minimizer of $R_{L_q}(f)$, and $\hat{f}$ is the global minimizer of $R^{\eta}_{L_q}(f)$. The larger the $q$, the larger the constant $A$, and the tighter the bound of Eq. 2.8. In the extreme case of $q = 1$ (i.e., for MAE), $A = 0$ and $R_{L_q}(\hat{f}) = R_{L_q}(f^*)$.
In other words, for $q$ values approaching 1, the optimum of the noisy risk will yield a risk value (on the clean data) that is close to that of $f^*$, which implies noise tolerance. It can also be shown that the difference $R^{\eta}_{L_q}(f^*) - R^{\eta}_{L_q}(\hat{f})$ is bounded under class dependent noise, provided $R_{L_q}(f^*) = 0$ and $q_{ij} < q_{ii}\ \forall i \neq j$ (see Thm 2 in Appendix).

The compromise on noise-robustness when using Lq over MAE prompts an easier learning process. To see this, consider the gradient of the Lq loss:

$$\frac{\partial L_q(f(x_i;\theta), y_i)}{\partial \theta} = f_{y_i}(x_i;\theta)^{q} \left(-\frac{1}{f_{y_i}(x_i;\theta)} \nabla_\theta f_{y_i}(x_i;\theta)\right) = -f_{y_i}(x_i;\theta)^{q-1} \nabla_\theta f_{y_i}(x_i;\theta),$$

where $f_{y_i}(x_i;\theta) \in [0, 1]\ \forall i$ and $q \in (0, 1)$. Thus, relative to CCE, Lq loss weighs each sample by an additional $f_{y_i}(x_i;\theta)^{q}$ so that less emphasis is put on samples with weak agreement between softmax outputs and the labels, which should improve robustness against noise. Relative to MAE, a weighting of $f_{y_i}(x_i;\theta)^{q-1}$ on each sample can facilitate learning by giving more attention to challenging datapoints with labels that do not agree with the softmax outputs. On one hand, a larger $q$ leads to a more noise-robust loss function. On the other hand, too large a $q$ can make optimization strenuous. Hence, as we will demonstrate empirically below, it is practically useful to set $q$ between 0 and 1, where a tradeoff is achieved between noise-robustness and better learning dynamics.

2.3.3 Truncated Lq Loss

Since a tighter bound on $\sum_{j=1}^{c} L(f(x), j)$ would imply stronger noise tolerance, we propose the truncated Lq loss:

$$L_{trunc}(f(x), e_j) = \begin{cases} L_q(k) & \text{if } f_j(x) \leq k \\ L_q(f(x), e_j) & \text{if } f_j(x) > k \end{cases} \tag{2.9}$$

where $0 < k < 1$, and $L_q(k) = (1 - k^q)/q$. Note that, when $k \to 0$, the truncated Lq loss becomes the normal Lq loss. Assuming $k \geq 1/c$, the sum of truncated Lq loss with respect to all classes is bounded by (see Appendix):

$$d\, L_q\!\left(\tfrac{1}{d}\right) + (c - d) L_q(k) \leq \sum_{j=1}^{c} L_{trunc}(f(x), e_j) \leq c\, L_q(k), \tag{2.10}$$

where $d = \max\left(1, \frac{(1-q)^{1/q}}{k}\right)$.
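To make the Lq loss of Eq. 2.6 more concrete, the following Python sketch evaluates it on softmax outputs and checks three properties stated above: the two limiting cases ($q \to 0$ and $q = 1$), the per-sample gradient weight $f^{q-1}$, and the bound of Eq. 2.7. The function names and the values $c = 10$, $q = 0.7$ are assumptions chosen for illustration; this is a sketch, not the implementation used in the experiments.

```python
import numpy as np

def lq_loss(p_label, q):
    """Lq loss of Eq. 2.6 given the softmax output for the label class."""
    p_label = np.asarray(p_label, dtype=float)
    return (1.0 - p_label ** q) / q

def grad_weight(p_label, q):
    """Per-sample weight f^(q-1) multiplying the softmax gradient.

    q = 0 gives the CCE weight 1/f; q = 1 gives the constant MAE/unhinged weight 1.
    """
    return np.asarray(p_label, dtype=float) ** (q - 1.0)

def lq_class_sum(p, q):
    """Sum of the Lq loss over all c candidate labels for one softmax vector p."""
    return float(np.sum(lq_loss(p, q)))

def lq_sum_bounds(c, q):
    """Lower and upper bounds of Eq. 2.7."""
    return (c - c ** (1.0 - q)) / q, (c - 1.0) / q

rng = np.random.default_rng(0)
c, q = 10, 0.7                      # illustrative choices
p = rng.dirichlet(np.ones(c))       # a random point on the probability simplex
lower, upper = lq_sum_bounds(c, q)
print("Eq. 2.7 holds:", lower <= lq_class_sum(p, q) <= upper)

# Limiting cases: a very small q approaches CCE, and q = 1 is the unhinged loss.
print("q -> 0 matches CCE:", np.isclose(lq_loss(0.3, 1e-8), -np.log(0.3), atol=1e-5))
print("q = 1 matches 1 - f:", np.isclose(lq_loss(0.3, 1.0), 0.7))

# Weights for a poorly fit sample (f = 0.05) under CCE, Lq and MAE.
print([float(grad_weight(0.05, q_)) for q_ in (0.0, 0.7, 1.0)])
```

The last line illustrates the tradeoff discussed above: the weight on a poorly fit sample shrinks monotonically as $q$ moves from 0 toward 1.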
It can be verified that the difference between the upper and lower bounds for the truncated Lq loss is smaller than that for the Lq loss of Eq. 2.7 if

$$d\left[L_q(k) - L_q\!\left(\tfrac{1}{d}\right)\right] < \frac{c^{(1-q)} - 1}{q}. \tag{2.11}$$

As an example, when $k \geq 0.3$, the above inequality is satisfied for all $q$ and $c$. When $k \geq 0.2$, the inequality is satisfied for all $q$ and $c \geq 10$. Since the derived bounds in Eq. 2.7 and Eq. 2.10 are tight, introducing the threshold $k$ can thus lead to a more noise tolerant loss function.

If the softmax output for the provided label is below a threshold, truncated Lq loss becomes a constant. Thus, the loss gradient is zero for that sample, and it does not contribute to learning dynamics. While Eq. 2.10 suggests that a larger threshold $k$ leads to tighter bounds and hence more noise-robustness, too large a threshold would cause too many samples to be discarded from training. Ideally, we would want the algorithm to train with all available clean data and ignore noisy labels. Thus the optimal choice of $k$ would depend on the noise in the labels. Hence, $k$ can be treated as a (bounded) hyper-parameter and optimized. In our experiments, we set $k = 0.5$, which yields a tighter bound for truncated Lq loss and which we observed to work well empirically.

A potential problem arises when training directly with this loss function. When the threshold is relatively large (e.g., $k = 0.5$ in a 10-class classification problem), at the beginning of the training phase, most of the softmax outputs can be significantly smaller than $k$, resulting in a dramatic drop in the number of effective samples. Moreover, it is suboptimal to prune samples based on softmax values at the beginning of training. To circumvent the problem, observe that, by definition of the truncated Lq loss:

$$\arg\min_{\theta} \sum_{i=1}^{n} L_{trunc}(f(x_i;\theta), y_i) = \arg\min_{\theta} \sum_{i=1}^{n} v_i L_q(f(x_i;\theta), y_i) + (1 - v_i) L_q(k), \tag{2.12}$$

where $v_i = 0$ if $f_{y_i}(x_i) \leq k$ and $v_i = 1$ otherwise, and $\theta$ represents the parameters of the classifier.
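The rewriting in Eq. 2.12 can be checked on a handful of softmax values. The Python sketch below, with hypothetical helper names and illustrative values $q = 0.7$ and $k = 0.5$, computes the truncated Lq loss of Eq. 2.9 directly and again through the indicator $v_i$, and confirms that the two totals agree. It is a minimal sketch under these assumptions rather than the training code used for the experiments.

```python
import numpy as np

def lq_loss(p_label, q):
    """Lq loss of Eq. 2.6 given the softmax output for the label class."""
    return (1.0 - np.asarray(p_label, dtype=float) ** q) / q

def truncated_lq_loss(p_label, q, k):
    """Eq. 2.9: the loss is the constant Lq(k) whenever the label's softmax output is at most k."""
    p_label = np.asarray(p_label, dtype=float)
    return np.where(p_label <= k, lq_loss(k, q), lq_loss(p_label, q))

def keep_mask(p_label, k):
    """Indicator v_i of Eq. 2.12: 1 if the label's softmax output exceeds k, else 0."""
    return (np.asarray(p_label, dtype=float) > k).astype(float)

# Softmax outputs for the provided labels of five hypothetical samples.
p_label = np.array([0.05, 0.45, 0.50, 0.80, 0.99])
q, k = 0.7, 0.5

v = keep_mask(p_label, k)
direct = truncated_lq_loss(p_label, q, k).sum()
masked = (v * lq_loss(p_label, q) + (1.0 - v) * lq_loss(k, q)).sum()

print("v =", v)                                   # samples with v_i = 0 carry no gradient
print("Eq. 2.12 holds:", np.isclose(direct, masked))
```

Samples with $v_i = 0$ contribute only the constant $L_q(k)$, which is why they drop out of the gradient with respect to $\theta$.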
Optimizing the above loss is the same as optimizing the following:

$$\arg\min_{\theta} \sum_{i=1}^{n} v_i L_q(f(x_i;\theta), y_i) - v_i L_q(k) = \arg\min_{\theta,\, w \in [0,1]^n} \sum_{i=1}^{n} w_i L_q(f(x_i;\theta), y_i) - L_q(k) \sum_{i=1}^{n} w_i, \tag{2.13}$$

because for any $\theta$, the optimal $w_i$ is 1 if $L_q(f(x_i;\theta), y_i) \leq L_q(k)$ and 0 if $L_q(f(x_i;\theta), y_i) > L_q(k)$. Hence, we can optimize the truncated Lq loss by optimizing the right hand side of Eq. 2.13.

Algorithm 1: ACS for Training with Lq Loss
  Input: Noisy dataset $D_\eta$, total iterations $T$, threshold $k$
  Output: Optimized NN parameters $\theta$
  Initialize $w_i^{(0)} = 1\ \forall i$;
  Update $\theta^{(0)} = \arg\min_{\theta} \sum_{i=1}^{n} w_i^{(0)} L_q(f(x_i;\theta), y_i) - L_q(k) \sum_{i=1}^{n} w_i^{(0)}$;
  while $t < T$ do
    Update $w^{(t)} = \arg\min_{w} \sum_{i=1}^{n} w_i L_q(f(x_i;\theta^{(t-1)}), y_i) - L_q(k) \sum_{i=1}^{n} w_i$;  // Pruning Step
    Update $\theta^{(t)} = \arg\min_{\theta} \sum_{i=1}^{n} w_i^{(t)} L_q(f(x_i;\theta), y_i) - L_q(k) \sum_{i=1}^{n} w_i^{(t)}$

If $L_q$ is convex with respect to the parameters $\theta$, optimizing Eq. 2.13 is a biconvex optimization problem, and the alternate convex search (ACS) algorithm [7] can be used to find the global minimum. ACS iteratively optimizes $\theta$ and $w$ while keeping the other set of parameters fixed. Despite the high non-convexity of DNNs, we can apply ACS to find a local minimum. We refer to the update of $w$ as "pruning". At every step of iteration, pruning can be carried out easily by computing $f(x_i;\theta^{(t)})$ for all training samples. Only samples with $f_{y_i}(x_i;\theta^{(t)}) \geq k$ and $L_q(f(x_i;\theta), y_i) \leq L_q(k)$ are kept for updating $\theta$ during that iteration (and hence $w_i = 1$). The additional computational complexity from the pruning steps is negligible. Interestingly, the resulting algorithm is similar to that of self-paced learning [117].

2.4 Experiments

The following setup applies to all of the experiments conducted. Noisy datasets were produced by artificially corrupting true labels. 10% of the training data was retained for validation.
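As an illustration of how labels can be corrupted artificially, the Python sketch below builds the uniform-noise transition matrix $\eta_{jk}$ defined in Section 2.3.1 and samples a noisy label for each true label. The function names, the number of classes, and the sample size are assumptions made for this example; the exact sampling procedure used for the reported experiments may differ in detail.

```python
import numpy as np

def uniform_noise_matrix(c, eta):
    """Transition matrix with eta_jk = 1 - eta if j == k, and eta / (c - 1) otherwise."""
    T = np.full((c, c), eta / (c - 1))
    np.fill_diagonal(T, 1.0 - eta)
    return T

def corrupt_labels(labels, T, rng):
    """Sample a noisy label for each true label j from row j of the transition matrix T."""
    labels = np.asarray(labels)
    c = T.shape[0]
    return np.array([rng.choice(c, p=T[j]) for j in labels])

rng = np.random.default_rng(0)
c, eta = 10, 0.4                       # illustrative values
y_true = rng.integers(0, c, size=5000)
T = uniform_noise_matrix(c, eta)
y_noisy = corrupt_labels(y_true, T, rng)

# The empirical fraction of flipped labels should be roughly eta.
print("flip rate:", (y_noisy != y_true).mean())
```

Class dependent noise fits the same interface: only the off-diagonal entries of the transition matrix change, for example by moving probability mass $\eta$ from one class to a single visually similar class.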
To realistically mimic a noisy dataset while justifiably analyzing the performance of the proposed loss function, only the training and validation data were contaminated, and test accuracies were computed with respect to true labels. A mini-batch size of 128 was used. All networks used ReLUs in the hidden layers and softmax layers at the output. All reported experiments were repeated five times with random initialization of neural network parameters and randomly generated noisy labels each time. We compared the proposed functions with CCE, MAE and also the confusion matrix-corrected CCE, as shown in Eq. 2.1. Following [168], we term this "forward correction". All experiments were conducted with identical optimization procedures and architectures, changing only the loss functions.

2.4.1 Toward a Better Understanding of Lq Loss

To better grasp the behavior of Lq loss, we implemented different values of $q$ and uniform noise at different noise levels, and trained ResNet-34 with the default setting of Adam on CIFAR-10. As shown in Fig. 2.2, when trained on the clean dataset, increasing $q$ not only slowed down the rate of convergence, but also lowered the classification accuracy. More interesting phenomena appeared when trained on noisy data. When CCE ($q = 0$) was used, the classifier first learned predictive patterns, presumably from the noise-free labels, before overfitting strongly to the noisy labels, in agreement with Arpit et al.'s observations [2]. Training with increased $q$ values delayed overfitting and attained higher classification accuracies. One interpretation of this behavior is that the classifier could learn more about predictive features before overfitting. This interpretation is supported by our plot of the average softmax values with respect to the correctly and wrongly labeled samples on the training set for CCE and Lq ($q = 0.7$) loss, and with 40% uniform noise (Fig. 2.1(c)).
For CCE, the average softmax for wrongly labeled samples remained small at the beginning, but grew quickly when the model started overfitting. Lq loss, on the other hand, resulted in significantly smaller softmax values for wrongly labeled data. This observation further serves as an empirical justification for the use of truncated Lq loss as described in Section 2.3.3. We also observed that there was a threshold of $q$ beyond which overfitting never kicked in before convergence. When $\eta = 0.2$, for instance, training with Lq loss with $q = 0.8$ produced an overfitting-free training process. Empirically, we noted that the noisier the data, the larger this threshold is. However, too large a $q$ hampers the classification accuracy, and thus a larger $q$ is not always preferred. In general, $q$ can be treated as a hyper-parameter that can be optimized, say via monitoring validation accuracy. In the remaining experiments, we used $q = 0.7$, which yielded a good compromise between fast convergence and noise robustness (no overfitting was observed for $\eta \leq 0.5$).

Table 2.1: Average test accuracy and standard deviation (5 runs) on experiments with closed-set noise. We report accuracies of the epoch where validation accuracy is maximum. Forward T and T̂ represent forward correction with the true and estimated confusion matrices, respectively [168]. q = 0.7 was used for all experiments with Lq loss and truncated Lq loss. Best 2 accuracies are bold faced.
| Dataset | Loss function | Uniform η = 0.2 | Uniform η = 0.4 | Uniform η = 0.6 | Uniform η = 0.8 | Class dependent η = 0.1 | Class dependent η = 0.2 | Class dependent η = 0.3 | Class dependent η = 0.4 |
|---|---|---|---|---|---|---|---|---|---|
| FASHION MNIST | CCE | 93.24 ± 0.12 | 92.09 ± 0.18 | 90.29 ± 0.35 | 86.20 ± 0.68 | 94.06 ± 0.05 | 93.72 ± 0.14 | 92.72 ± 0.21 | 89.82 ± 0.31 |
| FASHION MNIST | MAE | 80.39 ± 4.68 | 79.30 ± 6.20 | 82.41 ± 5.29 | 74.73 ± 5.26 | 74.03 ± 6.32 | 63.03 ± 3.91 | 58.14 ± 0.14 | 56.04 ± 3.76 |
| FASHION MNIST | Forward T | 93.64 ± 0.12 | 92.69 ± 0.20 | 91.16 ± 0.16 | 87.59 ± 0.35 | 94.33 ± 0.10 | 94.03 ± 0.11 | 93.91 ± 0.14 | 93.65 ± 0.11 |
| FASHION MNIST | Forward T̂ | 93.26 ± 0.10 | 92.24 ± 0.15 | 90.54 ± 0.10 | 85.57 ± 0.86 | 94.09 ± 0.10 | 93.66 ± 0.09 | 93.52 ± 0.16 | 88.53 ± 4.81 |
| FASHION MNIST | Lq | 93.35 ± 0.09 | 92.58 ± 0.11 | 91.30 ± 0.20 | 88.01 ± 0.22 | 93.51 ± 0.17 | 93.24 ± 0.14 | 92.21 ± 0.27 | 89.53 ± 0.53 |
| FASHION MNIST | Trunc Lq | 93.21 ± 0.05 | 92.60 ± 0.17 | 91.56 ± 0.16 | 88.33 ± 0.38 | 93.53 ± 0.11 | 93.36 ± 0.07 | 92.76 ± 0.14 | 91.62 ± 0.34 |
| CIFAR-10 | CCE | 86.98 ± 0.44 | 81.88 ± 0.29 | 74.14 ± 0.56 | 53.82 ± 1.04 | 90.69 ± 0.17 | 88.59 ± 0.34 | 86.14 ± 0.40 | 80.11 ± 1.44 |
| CIFAR-10 | MAE | 83.72 ± 3.84 | 67.00 ± 4.45 | 64.21 ± 5.28 | 38.63 ± 2.62 | 82.61 ± 4.81 | 52.93 ± 3.60 | 50.36 ± 5.55 | 45.52 ± 0.13 |
| CIFAR-10 | Forward T | 88.63 ± 0.14 | 85.07 ± 0.29 | 79.12 ± 0.64 | 64.30 ± 0.70 | 91.32 ± 0.21 | 90.35 ± 0.26 | 89.25 ± 0.43 | 88.12 ± 0.32 |
| CIFAR-10 | Forward T̂ | 87.99 ± 0.36 | 83.25 ± 0.38 | 74.96 ± 0.65 | 54.64 ± 0.44 | 90.52 ± 0.26 | 89.09 ± 0.47 | 86.79 ± 0.36 | 83.55 ± 0.58 |
| CIFAR-10 | Lq | 89.83 ± 0.20 | 87.13 ± 0.22 | 82.54 ± 0.23 | 64.07 ± 1.38 | 90.91 ± 0.22 | 89.33 ± 0.17 | 85.45 ± 0.74 | 76.74 ± 0.61 |
| CIFAR-10 | Trunc Lq | 89.7 ± 0.11 | 87.62 ± 0.26 | 82.70 ± 0.23 | 67.92 ± 0.60 | 90.43 ± 0.25 | 89.45 ± 0.29 | 87.10 ± 0.22 | 82.28 ± 0.67 |
| CIFAR-100 | CCE | 58.72 ± 0.26 | 48.20 ± 0.65 | 37.41 ± 0.94 | 18.10 ± 0.82 | 66.54 ± 0.42 | 59.20 ± 0.18 | 51.40 ± 0.16 | 42.74 ± 0.61 |
| CIFAR-100 | MAE | 15.80 ± 1.38 | 9.03 ± 1.54 | 7.74 ± 1.48 | 3.76 ± 0.27 | 13.38 ± 1.84 | 11.50 ± 1.16 | 8.91 ± 0.89 | 8.20 ± 1.04 |
| CIFAR-100 | Forward T | 63.16 ± 0.37 | 54.65 ± 0.88 | 44.62 ± 0.82 | 24.83 ± 0.71 | 71.05 ± 0.30 | 71.08 ± 0.22 | 70.76 ± 0.26 | 70.82 ± 0.45 |
| CIFAR-100 | Forward T̂ | 39.19 ± 2.61 | 31.05 ± 1.44 | 19.12 ± 1.95 | 8.99 ± 0.58 | 45.96 ± 1.21 | 42.46 ± 2.16 | 38.13 ± 2.97 | 34.44 ± 1.93 |
| CIFAR-100 | Lq | 66.81 ± 0.42 | 61.77 ± 0.24 | 53.16 ± 0.78 | 29.16 ± 0.74 | 68.36 ± 0.42 | 66.59 ± 0.22 | 61.45 ± 0.26 | 47.22 ± 1.15 |
| CIFAR-100 | Trunc Lq | 67.61 ± 0.18 | 62.64 ± 0.33 | 54.04 ± 0.56 | 29.60 ± 0.51 | 68.86 ± 0.14 | 66.59 ± 0.23 | 61.87 ± 0.39 | 47.66 ± 0.69 |

Figure 2.2: The test accuracy and validation loss against number of epochs for training with Lq loss at different values of q.

2.4.2 Datasets

CIFAR-10/CIFAR-100: ResNet-34 was used as the classifier optimized with the loss functions mentioned above. Per-pixel mean subtraction, horizontal random flip and 32×32 random crops after padding with 4 pixels on each side were performed as data preprocessing and augmentation. Following [90], we used stochastic gradient descent (SGD) with 0.9 momentum, a weight decay of $10^{-4}$ and a learning rate of 0.01, which was divided by 10 after 40 and 80 epochs (120 in total) for CIFAR-10, and after 80 and 120 epochs (150 in total) for CIFAR-100. To ensure a fair comparison, the identical optimization scheme was used for truncated Lq loss. We trained with the entire dataset for the first 40 epochs for CIFAR-10 and 80 for CIFAR-100, and started pruning and training with the pruned dataset afterwards. Pruning was done every 10 epochs. To prevent overfitting, we used the model at the optimal epoch based on maximum validation accuracy for pruning. Uniform noise was generated by mapping a true label to a random label through uniform sampling. Following Patrini et al. [168], class dependent noise was generated by mapping TRUCK → AUTOMOBILE, BIRD → AIRPLANE, DEER → HORSE, and CAT ↔ DOG with probability η for CIFAR-10. For CIFAR-100, we simulated class dependent noise by flipping each class into the next circularly with probability η.

We also tested the noise-robustness of our loss function on open-set noise using CIFAR-10. For a direct comparison, we followed the identical setup as described in [213]. For this experiment, the classifier was trained for only 100 epochs. We observed that validation loss plateaued after about 10 epochs, and hence started pruning the data afterwards at 10-epoch intervals. The open-set noise was generated by using images from the CIFAR-100 dataset.
A random CIFAR-10 label was assigned to these images.

Table 2.2: Average test accuracy on experiments with CIFAR-10. We replicated the exact experimental setup as in [213]. The reported accuracies are the average last epoch accuracies after training for 100 epochs. η = 40%. CCE, Forward and method by Wang et al. are adapted for direct comparison.

| Noise type | CCE [213] | Forward [213] | Wang et al. [213] | MAE | Lq | Trunc Lq |
|---|---|---|---|---|---|---|
| CIFAR-10 + CIFAR-100 (open-set noise) | 62.92 | 64.18 | 79.28 | 75.06 | 71.10 | 79.55 |
| CIFAR-10 (closed-set noise) | 62.38 | 77.81 | 78.15 | 74.31 | 64.79 | 79.12 |

FASHION-MNIST: ResNet-18 was used. The identical data preprocessing, augmentation, and optimization procedure as in CIFAR-10 was deployed for training. To generate a realistic class dependent noise, we used the t-SNE [138] plot of the dataset to associate classes with similar embeddings, and mapped BOOT → SNEAKER, SNEAKER → SANDALS, PULLOVER → SHIRT, COAT ↔ DRESS with probability η.

2.4.3 Results and Discussion

Experimental results with closed-set noise are summarized in Table 2.1. For uniform noise, the proposed loss functions outperformed the baselines significantly, including forward correction with the ground truth confusion matrices. In agreement with our theoretical expectations, truncating the Lq loss enhanced results. For class dependent noise, in general Forward T offered the best performance, as it relied on the knowledge of the ground truth confusion matrix. Truncated Lq loss produced similar accuracies as Forward T̂ for FASHION-MNIST and better results for CIFAR datasets, and outperformed the other baselines at most noise levels for all datasets. While using Lq loss improved over baselines for CIFAR-100, no improvements were observed for FASHION-MNIST and CIFAR-10 datasets. We believe this is in part because very similar classes were grouped together for the confusion matrices and consequently the DNNs might falsely put high confidence on wrongly labeled samples.
In general, classification accuracy for both uniform and class dependent noise would be further improved relative to baselines with optimized $q$ and $k$ values and more training epochs. Based on the experimental results, we believe the proposed approach would work well when correctly labeled data can be differentiated from wrongly labeled data based on softmax outputs, which is often the case with large-scale data and expressive models. We also observed that MAE performed poorly for all datasets at all noise levels, presumably because DNNs like ResNet struggled to optimize with MAE loss, especially on challenging datasets such as CIFAR-100.

Table 2.2 summarizes the results for open-set noise with CIFAR-10. Following Wang et al. [213], we reported the last-epoch test accuracy after training for 100 epochs. We also repeated the closed-set noise experiment with their setup. Using Lq loss noticeably prevented overfitting, and using truncated Lq loss achieved better results than the state-of-the-art method for open-set noise reported in [213]. Moreover, our method is significantly easier to implement. Lastly, note that the poor performance of Lq loss compared to MAE is due to the fact that the test accuracy reported here is measured long after the model started overfitting, since a shallow CNN without data augmentation was deployed for this experiment.

CHAPTER 3

IMPROVING CONFIDENCE CALIBRATION FOR CONVOLUTIONAL NEURAL NETWORKS WITH STRUCTURED DROPOUT

While neural networks achieve impressive classification accuracy across different tasks, they can suffer from poor calibration of their predictions. A Bayesian perspective has suggested that dropout, a regularization strategy often used during training, can be employed to obtain better probabilistic predictions at test time [51]. However, empirical results so far have not been encouraging, particularly with convolutional networks.
In this chapter, through the lens of ensemble learning, we associate this unsatisfactory performance with the correlation between the models sampled with regular dropout. Motivated by this, we explore the use of various structured dropout techniques to promote model diversity and improve calibration of predictions. We also propose an omnibus dropout strategy that combines various structured dropout methods. Using the SVHN, CIFAR-10 and CIFAR-100 datasets, we empirically demonstrate the superior performance of omnibus dropout, and show its merit in a Bayesian active learning application.

3.1 Introduction

Deep neural networks (NNs) achieve state-of-the-art classification accuracy in many applications. However, in real world scenarios, like medical diagnosis and autonomous driving, reliable probabilistic predictions are often crucial and need to be considered in assessing performance. Most modern NNs are trained with maximum likelihood to produce point estimates that are often over-confident [70]. Bayesian techniques can be used with neural networks to obtain well-calibrated predictions [139, 157]. Monte Carlo (MC) dropout [51], a cheap approximate inference technique which obtains uncertainty by performing dropout [195] at test time, is a popular Bayesian method for uncertainty estimates.

Despite its improvement, MC dropout can still produce over-confident predictions [121], particularly with convolutional architectures. In this chapter, we propose a simple yet effective solution to this problem. Inspired by the recent success of explicit ensembles of neural networks obtained using random initializations [10], we reiterate the original notion of dropout as "an extreme form of model combination with extensive parameter sharing" [195], and interpret MC dropout as an ensemble of models. Borrowing machinery from ensemble learning, we then attribute the poor performance of MC dropout to its limited model diversity compared to that of explicit ensembles.
This perspective reveals how structured dropout methods [58, 204] can improve performance by promoting diversity. While the importance of diversity has been demonstrated by others, prior works consider explicit ensembles of different models. To the best of our knowledge, this is the first chapter to examine structured dropout as a way to enhance diversity in an ensemble obtained from a single model. As discussed below, we also propose to combine different structured dropout methods, which we call omnibus dropout. We empirically verify that omnibus dropout can yield models with superior performance on the SVHN, CIFAR-10 and CIFAR-100 datasets compared to not only MC dropout, but also some of the most widely adopted baselines like temperature scaling [70]. Furthermore, we demonstrate its merit in a Bayesian active learning experiment [54].

Summary of Contributions.

• We experimentally illustrate that the poor performance of standard MC dropout is primarily attributable to lack of model diversity among predictions.
• We show that using structured dropout can significantly boost model diversity, and hence the calibration of NNs.
• We propose the omnibus dropout, a simple combination of different types of structured dropout methods, to further enhance the performance of MC dropout.
• We demonstrate the effectiveness of the proposed method with various benchmark datasets.

3.2 Related Work

Dropout was first introduced as a stochastic regularization technique for NNs [195]. Inspired by the success of dropout, numerous variants have been proposed [55, 58, 65, 90, 191, 204, 211]. Unlike regular dropout, most of these methods drop parts of the NNs in a structured manner. For instance, DropBlock [58] applies dropout to small patches of the feature map in convolutional networks, SpatialDrop [204] drops out entire channels, and Stochastic Depth Net [90] drops out entire ResNet blocks. These methods were proposed to boost test time accuracy.
In this chapter, we show that these structured dropout techniques can be successfully applied to obtain better uncertainty estimates as well.

Dropout can be thought of as performing approximate Bayesian inference [52] and offers estimates of uncertainty. Many other approximate Bayesian inference techniques have also been proposed for NNs [110, 134]. However, these methods can demand a sophisticated implementation, are often harder to scale, and can suffer from sub-optimal performance [12]. Another popular alternative to approximate the intractable posterior is Markov Chain Monte Carlo (MCMC) [157]. More recently, stochastic gradient versions of MCMC were also proposed to allow scalability [63, 137, 216]. Nevertheless, these methods are often computationally expensive, and sensitive to the choice of hyper-parameters. Lastly, there have been efforts to approximate the posterior with Laplace approximation [139, 183]. A related approach, SWA-Gaussian [140], is another technique for Gaussian posterior approximation using the Stochastic Weight Averaging (SWA) algorithm [95].

There are also non-Bayesian techniques to obtain calibrated uncertainty. For instance, temperature scaling [70] has been empirically shown to be effective in calibrating the predictions. A related line of work uses an ensemble of several randomly-initialized NNs [121]. The method, known as deep ensembles, requires training and saving multiple NNs. An ensemble of snapshots of the trained model at different iterations can help obtain better uncertainty estimates [56]. Compared to an explicit ensemble, this approach requires training only one model. Nevertheless, models at different iterations must all be saved to deploy the algorithm, which can be computationally demanding.

3.3 An Analysis of the Performance of MC Dropout

3.3.1 MC Dropout as Ensembles of Dropout Models

Assume a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where each $(x_i, y_i) \in (\mathcal{X} \times \mathcal{Y})$ is i.i.d.
We consider the problem of $k$-class classification, and let $\mathcal{X} \subseteq \mathbb{R}^d$ be the input space and $\mathcal{Y} = \{1, ..., k\}$ be the label space¹. A classifier is a function that maps input features to labels, $f : \mathcal{X} \to \mathbb{R}^k$. We restrict our attention to NN functions $f_w(x) : \mathcal{X} \to \mathbb{R}^k$, where $w = \{W_i\}_{i=1}^{L}$ corresponds to the parameters of a network with $L$ layers, and $W_i$ corresponds to the weight matrix in the $i$-th layer. We define a likelihood model $p(y \mid x, w) = \mathrm{softmax}(f_w(x))$. Maximum likelihood estimation can be performed to compute point estimates for $w$.

Recently, Gal and Ghahramani [51] proposed a novel viewpoint of dropout as approximate Bayesian inference (see Appendix A for a brief review). This perspective offers a simple way to marginalize out model weights at test time to obtain better calibrated predictions:

$$p(y = c \mid x, D_{train}) = \int p(y = c \mid x, w)\, p(w \mid D_{train})\, dw \approx \frac{1}{T} \sum_{t=1}^{T} p(y = c \mid x, w^{(t)}), \tag{3.1}$$

where $w^{(t)} \sim q(w \mid D_{train})$ is assumed to be independently drawn layer-wise weight matrices: $W_i^{(t)} \sim \hat{W}_i \cdot \mathrm{diag}(\mathrm{Bernoulli}(p))$, $\hat{W}_i$ is the parameter matrix learned during training, and $p$ is the dropout rate. In this chapter, we view each dropout sample $w^{(t)}$ in Equation 3.1 as an individual model such that MC dropout is performing ensemble averaging. As we will see in the following sections, this ensemble learning perspective on dropout inference provides us with principled ways to enhance MC dropout. Lastly, note that using structured dropout in lieu of regular dropout amounts to only a change of the approximate distribution $q(w \mid D_{train})$ in Equation 3.1, so that we are performing Bayesian variational inference with a different class of approximate distributions. For instance, in the channel-level dropout, we sample one Bernoulli random variable for each channel.

¹ Extension to regression tasks is straightforward but left out of this chapter.
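The Monte Carlo average in Equation 3.1 can be sketched in a few lines of Python. The example below uses a toy two-layer network with random weights, so every name and dimension in it is hypothetical. It also assumes that each hidden unit is kept with probability $1 - p$ for dropout rate $p$, and that no rescaling is applied; both are assumptions of this sketch rather than statements about the models used in this chapter.

```python
import numpy as np

def softmax(z):
    """Row-wise, numerically stable softmax."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, W1, W2, mask):
    """Toy two-layer ReLU network; `mask` zeroes hidden units.

    Multiplying the hidden activations by a 0/1 mask is equivalent to
    multiplying the next weight matrix by a diagonal Bernoulli matrix.
    """
    h = np.maximum(x @ W1, 0.0) * mask
    return softmax(h @ W2)

def mc_dropout_predict(x, W1, W2, drop_rate, T, rng):
    """Monte Carlo estimate of Eq. 3.1: average T stochastic forward passes."""
    probs = []
    for _ in range(T):
        # Assumption: each hidden unit is kept with probability 1 - drop_rate.
        mask = rng.binomial(1, 1.0 - drop_rate, size=W1.shape[1])
        probs.append(forward(x, W1, W2, mask))
    return np.mean(probs, axis=0)

rng = np.random.default_rng(0)
d, hidden, k = 8, 32, 5                 # hypothetical sizes
W1 = rng.normal(size=(d, hidden))       # stands in for trained weights
W2 = rng.normal(size=(hidden, k))
x = rng.normal(size=(4, d))             # a batch of four inputs

p_mc = mc_dropout_predict(x, W1, W2, drop_rate=0.1, T=30, rng=rng)
print(p_mc.shape, p_mc.sum(axis=1))     # each row is a probability vector
```

Each mask corresponds to one sampled model $w^{(t)}$, so the returned array is the ensemble average over $T$ dropout models. A structured variant would only change how `mask` is sampled.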
Figure 3.1: From left to right: (1) accuracy of MC dropout and deep ensemble, (2) the relative improvements in accuracy of deep ensemble and MC dropout, (3) Brier score of MC dropout and deep ensemble against number of models, (4) the relative improvements in Brier score of deep ensemble and MC dropout against number of models.

Figure 3.2: Interrater Agreement (IA) of models with different types of dropout with 0.1 dropout rate on the SVHN, CIFAR-10 and -100 datasets. The lower the IA, the more diverse the predictions of the models. Y-axis indicates different methods. MC dropout produces models with much larger IA, hence less model diversity, than structured dropout techniques in most of the cases.

3.3.2 Decomposing the Performance of Ensembles

First proposed by Krogh and Vedelsby [114], the error-ambiguity decomposition enables one to quantify the performance of ensembles with respect to individual models. Let $\{h_t\}_{t=1}^{T}$ be an ensemble of $T$ classifiers, and $H(x) = \sum_t h_t(x)/T$ be the ensemble prediction. In classification problems, $h_t(x)$ is often a probability vector such that $h_t^i(x) = p(y = i \mid x, w_t)$. In MC dropout, $h_t(x) = p(y \mid x, w^{(t)})$. Model ambiguity can then be defined as $\alpha(h_t \mid x) = \|h_t(x) - H(x)\|_2^2$, which quantifies the difference between an individual model and the ensemble average.
The Brier score measures both the accuracy and calibration of probabilistic classifications, and is proportional to mean squared error (MSE), which can be decomposed as:

$$MSE(H) = \mathbb{E}_x[MSE(H \mid x)] = \mathbb{E}_x[\overline{MSE}(h \mid x)] - \mathbb{E}_x[\bar{\alpha}(h \mid x)], \tag{3.2}$$

where $MSE(h_t \mid x) = \|y - h_t(x)\|^2$, $y$ is the one-hot encoded vector of the correct label, and

$$\overline{MSE}(h \mid x) = \frac{1}{T} \sum_{t}^{T} MSE(h_t \mid x), \qquad \bar{\alpha}(h \mid x) = \frac{1}{T} \sum_{t}^{T} \alpha(h_t \mid x) \tag{3.3}$$

correspond to the average MSE and the ensemble diversity (average ambiguity), respectively. Equation 3.2 suggests that the more accurate and the more diverse the models, the better the performance achieved by the ensemble. We use MSE instead of the negative log likelihood (NLL), another commonly used measure for quality of uncertainty estimates, for mathematical convenience. The two metrics are closely related, and insights obtained from MSE carry over to NLL. In general, MSE or NLL can be seen as comprehensive measures influenced by both the accuracy and the calibration of the model. We give a brief discussion in Appendix B on the relationship between these metrics.

3.3.3 Performance of MC Dropout and Model Diversity

The discussion of the previous section provides us with a potential recipe to enhance MC dropout. To illustrate the importance of diversity, we conduct an experiment using ResNet-50 on CIFAR-10 to compare MC dropout with an explicit ensemble of five NNs (details can be found in Section 3.4). As we see from Figure 3.1, individual models in deep ensemble, on average, perform better than the ones in MC dropout, likely because of the reduced effective capacity of the latter. Furthermore, the performance of the ensembles improves with more models. Yet the improvement is larger for deep ensemble because of increased ensemble diversity, since we know from Equation 3.2 that the decrease in the Brier score in this analysis is attributable to the increase in ensemble diversity.
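The decomposition in Equation 3.2 is an exact identity for squared error, which can be confirmed numerically. The Python sketch below draws random probability vectors to stand in for the member predictions $h_t(x)$; the function name, array shapes and sample sizes are assumptions for illustration and are unrelated to the experiments reported here.

```python
import numpy as np

def ambiguity_decomposition(member_probs, y_onehot):
    """Terms of Eq. 3.2 for an ensemble.

    member_probs: array of shape (T, n, k) with the predictions h_t(x).
    y_onehot:     array of shape (n, k) with one-hot labels.
    Returns (ensemble MSE, average member MSE, average ambiguity),
    each averaged over the n samples.
    """
    H = member_probs.mean(axis=0)                                   # ensemble prediction
    ens_mse = np.mean(np.sum((y_onehot - H) ** 2, axis=-1))
    avg_mse = np.mean(np.sum((y_onehot[None] - member_probs) ** 2, axis=-1))
    avg_amb = np.mean(np.sum((member_probs - H[None]) ** 2, axis=-1))
    return ens_mse, avg_mse, avg_amb

rng = np.random.default_rng(0)
T, n, k = 5, 200, 10                                   # hypothetical sizes
member_probs = rng.dirichlet(np.ones(k), size=(T, n))  # random stand-in predictions
y_onehot = np.eye(k)[rng.integers(0, k, size=n)]

ens_mse, avg_mse, avg_amb = ambiguity_decomposition(member_probs, y_onehot)
print("identity holds:", np.isclose(ens_mse, avg_mse - avg_amb))
```

Because the ambiguity term is non-negative, the ensemble MSE can never exceed the average member MSE, and the gap grows with the diversity of the members.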
The lack of diversity among MC dropout models is largely because neighboring pixel features are often correlated in convolutional layers. Thus, even with dropout, similar information propagates through the network in every iteration, leading to very similar predictions. Although model diversity can be encouraged naively by increasing dropout rates, doing so often leads to increased MSE of individual models, thereby hampering ensemble performance, as can be seen from Eq. 3.2.

3.3.4 Omnibus Dropout

While model diversity can be promoted via explicit ensembles, they demand more computational resources, which can be prohibitively expensive. Though typically a larger number of samples is needed for dropout based methods at test time, unlike deep ensembles, dropout uncertainty can be obtained sequentially, which has a lower memory requirement.

In this chapter, we hypothesize that the main cause of the lack of diversity with MC dropout for CNNs is the locally similar features present in image data. As such, in order to enhance diversity in an ensemble obtained from a single model, we examine the use of structured dropout, which drops information from contiguous regions of feature maps so that more divergent information is propagated to subsequent layers during training at each iteration. Specifically, we compare dropout at the patch-level which randomly drops out small patches of feature maps [58], the channel-level which drops out entire channels of feature maps at random [204], and layer-level which drops out entire layers of CNNs at random [90]. We denote these as dropBlock, dropChannel and dropLayer respectively. We identify the test-time sampling of models trained with the aforementioned structured dropout methods as MC dropBlock, MC dropChannel, MC dropLayer. Intuitively, using such structured dropout techniques in lieu of standard dropout can be a simple and yet effective way to enhance diversity in an ensemble sampled through dropout. For instance, at the channel level, since entire channels in the hidden layers are randomly dropped out each time, there would be much more diversity associated with the resulting predictions.

Figure 3.3: Test Brier score (left) and accuracy (right) against number of models for ensemble prediction at test time on CIFAR-10. This corresponds to the number of different MC dropout instantiations at test time of the same model. The model trained with omnibus dropout achieves the best in terms of accuracy and Brier score.

As we demonstrate experimentally in the following section, the increased diversity of structured dropouts can come at the cost of reduced performance of individual models. Moreover, given the considerable choice of dropout strategies available, it can be hard to pick the best one. Therefore, we propose a simple omnibus dropout strategy, which combines all the aforementioned methods. The implementation of omnibus dropout involves the sequential execution of the nested group of dropout methods: dropLayer, dropChannel, dropBlock and regular dropout. As such, for each of the layers in the CNN, we first sample a Bernoulli random variable to determine if this particular layer is dropped out. If the layer is retained, we then proceed to determine the channel level dropout, and then the patch level dropout followed lastly by standard dropout. In practice however, all the random variables at different levels are sampled independently from one another. In order to calibrate the overall dropout rate of omnibus dropout, given a predetermined overall dropout rate (e.g.
0.1), we can calculate the corresponding individual dropout rate at each level so that overall only 10% of the features are dropped out on average. We use the same dropout rate for all the dropout methods. Empirically we find this simple choice to work well. As our results show, omnibus dropout yields good performance by promoting model diversity without hampering individual model performance.

3.3.5 Enhanced Ensemble Diversity with Structured Dropout

We empirically investigate the model diversity achieved with the various forms of dropout described above. For a fair comparison, we fix the dropout rate for all methods to 0.1 so that all models have the same effective number of parameters. There are numerous measures to quantify diversity of ensembles [239]. We use Interrater Agreement (IA) [118], defined as:

$$\kappa = 1 - \frac{\frac{1}{T}\sum_{k=1}^{n} \rho(x_k)\,(T - \rho(x_k))}{n\,(T-1)\,\bar{p}\,(1-\bar{p})}, \tag{3.4}$$

where $T$ is the number of individual classifiers, $n$ is the number of test samples, $\rho(x_k)$ is the number of models that classify the $k$-th sample correctly, and $\bar{p}$ is the average classification accuracy across classifiers. When all classifiers perfectly agree on the test set, $\kappa = 1$, and smaller values indicate more diverse predictions.

Figure 3.2 summarizes IA for sampled models trained on different datasets with different dropout methods. We also compare the results with deep ensemble. The number of models used to compute IA, $T$, is fixed to five for all approaches. In general, IA for MC dropout is much higher than that of structured dropout techniques. On the other hand, structured dropout can yield ensembles that are as diverse as the computationally expensive method of deep ensemble, confirming our expectation that dropping out correlated information can produce sampled models with more ambiguity. Note that the large IA for MC dropLayer on SVHN is likely caused by the relatively small model used for that problem, an 18-layer ResNet.
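The Interrater Agreement of Equation 3.4 can be computed directly from a table recording which models classify which test samples correctly. The sketch below is a minimal illustration. The function name and the input layout (a T-by-n boolean array) are assumptions made here, not part of the original experiments.

```python
import numpy as np

def interrater_agreement(correct):
    """Interrater Agreement (kappa) following Equation 3.4.

    correct: array of shape (T, n); correct[t, k] is truthy if model t
             classifies test sample k correctly.
    """
    correct = np.asarray(correct, dtype=float)
    T, n = correct.shape
    rho = correct.sum(axis=0)   # number of models correct on each sample
    p_bar = correct.mean()      # average accuracy across classifiers
    numerator = (rho * (T - rho)).sum() / T
    denominator = n * (T - 1) * p_bar * (1.0 - p_bar)
    return 1.0 - numerator / denominator

# Three identical classifiers agree perfectly.
identical = np.tile(np.array([1, 1, 0, 1]), (3, 1))
kappa_identical = interrater_agreement(identical)

# Two classifiers that are never correct on the same sample.
opposite = np.array([[1, 0], [0, 1]])
kappa_opposite = interrater_agreement(opposite)
```

Note that the denominator vanishes when the average accuracy is exactly 0 or 1, so the measure is only defined for ensembles that make at least one error and one correct prediction.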
Lastly, note that while MC omnibus-dropout yields models much more diverse than MC dropout, it is often not the most diverse one either. The moderate diversity of MC omnibus-dropout, we believe, is the key to its effectiveness.

To better understand its behavior, we study the performance metrics as a function of the number of sampled models in the ensemble. Figure 3.3 shows the Brier score (left) and accuracy (right) against the number of models for the CIFAR-10 dataset (similar results are observed for SVHN and CIFAR-100; see Appendix C). Firstly, as seen from Figure 3.3 (left), while the performance of individual models sampled from MC dropout is one of the best, the gain in Brier score with a larger number of test-time MC samples is much smaller compared to structured dropout techniques. On the other hand, though a larger diversity indeed leads to much sharper improvements as the number of sampled models increases, the Brier scores (hence MSE) of individual models sampled from MC dropBlock, MC dropChannel and MC dropLayer are much larger than that of MC dropout, suggesting a trade-off between diversity and the performance of individual sampled models. MC omnibus-dropout, which enjoys the benefits of both structured and regular dropout, is able to achieve not only good performance on one sampled model (with a Brier score close to MC dropout), but also good model diversity, as evidenced by a significantly larger decrease in Brier score as the number of models increases. Similar observations can be made from the accuracy plot of Figure 3.3 (right).

Figure 3.4: Reliability diagrams of predictions produced by different models. (Panels: SVHN, CIFAR-10, CIFAR-100; x-axis: Confidence; y-axis: Confidence - Accuracy; curves: deterministic, tempScaling, dropout, dropBlock, dropChannel, dropLayer, dropOmnibus, Deep Ensemble.)
Table 3.1: Results on benchmark datasets comparing accuracy and uncertainty estimates produced by different types of methods. The top performing result for each metric is bold-faced. MC omnibus-dropout is consistently the best method. The numbers in brackets next to dropout methods correspond to the optimal drop rate found by grid search.

Dataset    Method                   Accuracy ↑    NLL ↓            Brier ↓ (×10^-3)   ECE ↓ (×10^-2)
SVHN       Temp Scaling             95.7 ± 0.1    0.163 ± 0.002    6.62 ± 0.10        0.995 ± 0.160
SVHN       Dropout (0.35)           96.7 ± 0.1    0.128 ± 0.001    5.11 ± 0.06        0.934 ± 0.045
SVHN       DropBlock (0.1)          96.8 ± 0.1    0.133 ± 0.002    5.19 ± 0.07        1.26 ± 0.14
SVHN       DropChannel (0.2)        96.7 ± 0.1    0.130 ± 0.001    5.15 ± 0.06        0.799 ± 0.032
SVHN       DropLayer (0.25)         96.3 ± 0.1    0.144 ± 0.002    5.69 ± 0.05        0.846 ± 0.250
SVHN       Omnibus dropout (0.15)   96.9 ± 0.1    0.127 ± 0.001    4.97 ± 0.09        1.15 ± 0.06
CIFAR10    Temp Scaling             93.9 ± 0.1    0.189 ± 0.002    9.06 ± 0.08        0.905 ± 0.114
CIFAR10    Dropout (0.2)            93.1 ± 0.1    0.224 ± 0.003    10.2 ± 0.1         1.64 ± 0.07
CIFAR10    DropBlock (0.1)          93.4 ± 0.1    0.203 ± 0.003    9.89 ± 0.10        0.743 ± 0.116
CIFAR10    DropChannel (0.15)       93.7 ± 0.1    0.193 ± 0.002    9.34 ± 0.9         0.812 ± 0.104
CIFAR10    DropLayer (0.1)          94.0 ± 0.2    0.206 ± 0.001    9.09 ± 0.17        0.941 ± 0.068
CIFAR10    Omnibus dropout (0.1)    94.4 ± 0.1    0.173 ± 0.001    8.38 ± 0.10        0.607 ± 0.078
CIFAR100   Temp Scaling             74.5 ± 0.3    1.00 ± 0.01      3.57 ± 0.04        4.02 ± 0.62
CIFAR100   Dropout (0.2)            74.1 ± 0.4    1.18 ± 0.01      3.71 ± 0.05        9.18 ± 0.23
CIFAR100   DropBlock (0.15)         73.7 ± 0.5    1.04 ± 0.02      3.66 ± 0.05        4.46 ± 0.97
CIFAR100   DropChannel (0.15)       74.9 ± 0.5    0.996 ± 0.02     3.46 ± 0.04        3.17 ± 0.11
CIFAR100   DropLayer (0.25)         75.7 ± 0.2    1.01 ± 0.01      3.42 ± 0.03        2.90 ± 0.24
CIFAR100   Omnibus dropout (0.25)   75.3 ± 0.2    0.929 ± 0.005    3.40 ± 0.02        1.65 ± 0.21

3.4 Experiments

We empirically evaluate the performance of MC dropBlock, MC dropChannel, MC dropLayer and MC omnibus-dropout, and compare them to MC dropout and temperature scaling. We include in Appendix C further experiments with explicit dropout ensembles and their comparison to deep ensembles.

Model.
Layer-level dropout requires skip connections so that there is still information flow through the network after dropping out an entire layer. Examples of such architectures include FractalNet [122] and ResNet [79]. We use the PreAct-ResNet [80] for all our experiments. We refer to the PreAct-ResNet trained without dropout as a deterministic model. MC dropout, MC dropBlock and MC dropChannel models are implemented by inserting the corresponding dropout layers with a constant p before each convolutional layer. A block size of 3×3 is used for MC dropBlock. We follow [58] to match the effective dropout rate of MC dropBlock to the desired dropout rate p. MC dropLayer is implemented by randomly dropping out entire ResNet blocks at a constant rate p. We empirically observe that dropping out downsampling ResNet blocks during testing is harmful to the quality of uncertainty estimates. This is in agreement with the experiments of [209] (see footnote 2). Hence, downsampling blocks are only dropped out during training. MC omnibus-dropout is implemented by including all types of aforementioned dropouts, each with the same dropout rate. For a full Bayesian treatment, we also insert a dropout layer before the fully connected layer at the end of the NNs. For all models with dropout of all types, we sample 30 times at test time for Monte Carlo estimation.

Datasets. We conduct experiments using the SVHN [158], CIFAR-10 and CIFAR-100 [111] datasets with the standard train/test-set split. Validation sets of 10000 and 5000 samples are used for SVHN and the CIFARs, respectively. We use the 18-, 50- and 101-layer PreAct-ResNet for SVHN, CIFAR-10 and CIFAR-100, respectively.

Training. We perform preprocessing and data augmentation using per-pixel mean subtraction, horizontal random flip and 32×32 random crops after padding with 4 pixels on each side.
We use stochastic gradient descent (SGD) with 0.9 momentum, a weight decay of $10^{-4}$ and an initial learning rate of 0.01, which is divided by 10 after 125 and 190 epochs (250 in total) for SVHN and CIFAR-10, and after 250 and 375 epochs (500 in total) for CIFAR-100.

Evaluation. All the results are computed on the test set using the model at the optimal epoch based on validation accuracy. We use the Brier score, negative log-likelihood (NLL), expected calibration error (ECE), and classification accuracy to evaluate performance (see Appendix B for definitions). Following [154], we partition predictions into 20 equally spaced bins and take a weighted average of the bins' accuracy and confidence difference to estimate ECE. To visualize calibration performance, we also plot the reliability diagrams [140], which are plots of the difference between accuracy and confidence against confidence. The closer the curve is to the x-axis, the better calibrated the model predictions.

Footnote 2: In their experiments, ResNet blocks are only dropped out during testing, but not training.

Results. Table 3.1 summarizes the performance of various models using the metrics mentioned previously. To ensure a fair comparison, we treat the dropout rate as a hyper-parameter and conduct a linear grid search with an interval of 0.05 for the optimal dropout rate based on NLL. The optimal dropout rates are shown in the table next to the methods. Standard deviations are obtained on five models with random initializations for all dropout models. As seen from Table 3.1 and Figure 3.4, all forms of structured dropout models offer better uncertainty estimates than MC dropout in general. Overall, MC omnibus-dropout is consistently the best performing model. Moreover, we also perform experiments with five explicit ensembles of models trained together with all types of dropout for further comparison against deep ensembles, and most of the dropout models outperform deep ensembles trained without dropout.
Again, omnibus dropout is consistently one of the best methods (see Appendix C). Lastly, as evident from the moderately increased classification accuracy over deterministic temperature scaling models, all types of dropout methods can be incorporated into architectures with no accuracy penalty.

We believe the relatively good performance of MC dropout on SVHN compared to the CIFARs is because the former task is easier, so that the model can still predict accurately at an aggressive dropout rate of 0.35, at which even regular dropout can produce acceptably diverse sampled models. In contrast, as observed in our experiments, while using larger dropout rates for the more difficult CIFAR datasets can lead to more calibrated predictions, accuracy and NLL suffer because the MSE of individual models degrades (see Appendix C). Lastly, we believe the results for MC dropBlock can be improved by optimizing the choice of block size. A pre-fixed block size of 3×3 can be too small for the upstream convolutional layers, where the feature maps are much larger than the block size, and too large for the last few downstream layers, where the feature maps are comparable to the block size, as supported by sharp increases in NLL after the optimal dropout rate.

3.5 Bayesian Active Learning

To further demonstrate the merit of omnibus dropout, we consider the downstream task of Bayesian active learning on CIFAR-10. Active learning involves first training on a small amount of labeled data. Then, an acquisition function based on the outputs of models is used to select a small subset of unlabeled data so that an oracle can provide labels for these queried data. Samples that a model is the least confident about are usually selected for labeling, in order to maximize the information gain. The model is then retrained with the additional labeled data that is provided. The above process can be repeated until a desired accuracy is achieved or the labeling resources are exhausted.
In our experiment, we train models with structured dropout at different scales using the identical setup as described in the beginning of this section, except that only 2000 training samples are used initially. To match up model capacity, the dropout rate is set to 0.1 for all methods. We also compare against a deterministic model. After the first iteration, we acquire 1000 samples from a pool of "unlabeled" data, and combine the acquired samples with the original set of labeled images to retrain the models. Following [54], we consider three acquisition functions: Max Entropy,

$$H[y|x, D_{train}] = -\sum_{c} p(y = c|x, D_{train}) \log p(y = c|x, D_{train}),$$

the BALD metric (Bayesian Active Learning by Disagreement),

$$I[y, w|x, D_{train}] = H[y|x, D_{train}] - \mathbb{E}_{p(w|D_{train})}\big[H[y|x, w]\big],$$

and the Variation Ratios metric,

$$\text{variation-ratio}[x] = 1 - \max_{y} p(y|x, D_{train}).$$

We repeat the acquisition process eight times so that in the last iteration, the training set contains 10000 images. To mimic a real-world scenario in which the number of labeled samples is small, we do not use a validation set, and the accuracies reported for this experiment correspond to the last-epoch accuracies. We repeat experiments five times for consistency.

Figure 3.5: Left: Test accuracy against number of training samples for models with different methods of dropout and Variation Ratios as the acquisition function on CIFAR-10. Right: Relative improvements in test accuracy over that of the first iteration with different methods of dropout. (Panel titles: BALD, Entropy, Variation; x-axis: Number of Training Samples (×10^4); y-axis: Test Accuracy; curves: deterministic, dropout, dropBlock, dropChannel, dropLayer, dropOmnibus.)

Figure 3.5 shows the test accuracy against number of training samples for different models.
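The three acquisition functions can be estimated from Monte Carlo samples of the predictive distribution. The sketch below assumes an array `mc_probs` of shape (S, n, C) holding the softmax outputs of S stochastic forward passes on n pool samples. The function names, the small constant `EPS` and the `acquire` helper are illustrative choices rather than the implementation used in the experiments.

```python
import numpy as np

EPS = 1e-12  # numerical guard for log(0); an implementation detail assumed here

def max_entropy(mc_probs):
    """Entropy of the MC-averaged predictive distribution, one value per sample."""
    mean = mc_probs.mean(axis=0)
    return -(mean * np.log(mean + EPS)).sum(axis=-1)

def bald(mc_probs):
    """Predictive entropy minus the expected entropy of individual MC samples."""
    per_pass = -(mc_probs * np.log(mc_probs + EPS)).sum(axis=-1)  # shape (S, n)
    return max_entropy(mc_probs) - per_pass.mean(axis=0)

def variation_ratios(mc_probs):
    """One minus the largest class probability of the averaged prediction."""
    return 1.0 - mc_probs.mean(axis=0).max(axis=-1)

def acquire(scores, budget):
    """Indices of the `budget` pool samples with the highest acquisition score."""
    return np.argsort(-scores)[:budget]

# Two MC passes on two pool samples: the passes disagree confidently on
# sample 0 and are both uniform on sample 1.
mc_probs = np.array([[[1.0, 0.0], [0.5, 0.5]],
                     [[0.0, 1.0], [0.5, 0.5]]])
entropy_scores = max_entropy(mc_probs)
bald_scores = bald(mc_probs)
vr_scores = variation_ratios(mc_probs)
chosen = acquire(bald_scores, budget=1)
```

In this toy example, Max Entropy and Variation Ratios score both samples equally, whereas BALD singles out the sample on which the stochastic passes disagree.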
In general, MC omnibus-dropout yields the best performance by far. Interestingly, MC omnibus-dropout is able to outperform all other methods consistently by a significant margin after the first iteration when all samples are randomly selected. In addition, it can be seen that, after the first iteration when all 2000 training images are randomly selected, the test accuracy using MC dropout is on par with that of other structured dropout methods. However, as more labeled data are added, the relative increase in accuracy is more significant for models using structured dropout than for those using regular dropout. This suggests that the uncertainty estimates obtained with structured dropout are more useful for assessing "what the model doesn't know", thereby allowing for the selection of samples to be labeled in a way that better helps improve performance. Note also that the comparative gain in accuracy by MC omnibus-dropout during the later stages of the learning process is not as large. We suspect this can be caused by the saturation effect on test accuracy.

CHAPTER 4
ENHANCING UNCERTAINTY ESTIMATES WITH EFFICIENT NEURAL NETWORK ENSEMBLES

In the previous chapter, we presented several ways to improve upon Monte Carlo dropout for uncertainty estimation. Monte Carlo (MC) dropout [51] is a simple and efficient ensembling method that can improve the accuracy and confidence calibration of high-capacity deep neural network models. However, MC dropout is not as effective as more compute-intensive methods such as deep ensembles [120]. This performance gap can be attributed to the relatively poor quality of individual models in the MC dropout ensemble and their lack of diversity. These issues can in turn be traced back to the coupled training and substantial parameter sharing of the dropout models.
Motivated by this perspective, we propose a strategy in this chapter to compute an ensemble of subnetworks, each corresponding to a non-overlapping dropout mask computed via a pruning strategy and trained independently. We show that the proposed subnetwork ensembling method can perform as well as standard deep ensembles in both accuracy and uncertainty estimates, yet with a computational efficiency similar to MC dropout. Lastly, using several computer vision datasets like CIFAR10/100, CUB200, and Tiny-Imagenet, we experimentally demonstrate that subnetwork ensembling also consistently outperforms recently proposed approaches that efficiently ensemble neural networks.

4.1 Introduction

An effective way to improve model accuracy and confidence calibration in deep learning is ensembling. One efficient technique that leverages this idea is "Monte Carlo (MC) dropout" [51], which extends the popular dropout technique used for regularization during training [195]. In MC dropout, test-time inference involves multiple forward passes through the model, each executed with a different random dropout mask, as during the training phase. This yields an ensemble of predictions which are then averaged.

While MC dropout can improve a baseline model, it is still inferior to explicit ensembles of neural networks trained independently with random initialization (called deep ensembles) [120]. Using the perspective of the error-ambiguity decomposition [238], we can attribute this performance gap to the relatively poor performance of individual models and/or limited diversity in the MC dropout ensemble. We further hypothesize these issues are largely due to the extensive parameter sharing among MC dropout models.

With this perspective in mind, we explore the idea of creating an ensemble of subnetworks in which a pre-determined number of non-overlapping dropout masks are used.
We present an easy-to-implement greedy optimization procedure that sequentially computes dropout masks via a recent dropout-mask optimization technique and trains each subnetwork independently. The resulting algorithm enables us to obtain a diverse ensemble of non-overlapping subnetworks within one neural network. That is, we are able to create many models out of one (see footnote 1).

Footnote 1: Hence our title: Ex uno plures.

We demonstrate that subnetwork ensembling consistently outperforms MC dropout and several other recently proposed approaches that efficiently ensemble neural networks in terms of both accuracy and uncertainty estimates. We also show that our proposed approach achieves results on par with those of deep ensembles, yet with much better test-time computational efficiency.

Summary of Contributions.

1. We present the novel idea of ensembling non-overlapping subnetworks within one neural network architecture.
2. We propose a simple sequential pruning-based procedure to enhance the performance of subnetwork ensembling.
3. We demonstrate and discuss the regularization effect achieved by training pruned networks and using a randomized and frozen fully connected layer in the network.
4. Our experiments demonstrate that subnetwork ensembling outperforms MC dropout and several state-of-the-art methods for efficient ensembling.

4.2 Related Works

Ensemble Learning. An ensemble of models has long been known to be an effective method to boost the performance of machine learning models [37, 238]. More recently, with the growing interest in deep learning, ensembles of neural networks have gained much attention. Notably, Lakshminarayanan et al. [120] demonstrated that simple ensembles of neural networks (NNs) trained independently (called "Deep Ensembling") can offer improved predictive uncertainty and accuracy. In fact, somewhat surprisingly, deep ensembles often outperform more sophisticated Bayesian NNs.
Building on this, several recent works have attempted to understand the unexpected effectiveness of deep ensembles [45,131,174,217].

To further enhance the performance of ensembles of neural networks, previous methods have also pinpointed the importance of model diversity [98,113,234], and explored ways to promote diversity in ensembles. For instance, Sinha et al. [192] proposed an Information Bottleneck-based approach to explicitly stimulate diversity among predictions. In related work, Jain et al. [96] utilized out-of-distribution samples to encourage diversity among models. Enhanced diversity can also be obtained by ensembling across different architectures of NNs through neural architecture search [230] or by varying the hyperparameters of models [219].

Our proposed method can be regarded as an approach for efficient ensembles of NNs. Several techniques have been previously proposed in this direction. For instance, BatchEnsemble [218] makes use of rank-one matrices to approximate weight matrices in NNs for fast ensembling of models. Lately, a multi-input multi-output (MIMO) [77] configuration was discovered to be an effective method to utilize a single model's capacity to train multiple subnetworks. Another recent work [39], similar to ours, uses a set of pre-determined masks in place of the stochastic sampling of MC dropout for improved uncertainty estimation. However, their masks are randomly generated at the beginning of training and frozen. There is also significant sharing of parameters among the models, potentially reducing diversity and thus leading to sub-optimal performance.

Neural Network Pruning. Network pruning aims to compress neural networks by reducing the number of parameters present in the model [46,71,73,74,126,127,130]. It often involves selecting and discarding parameters from a pre-trained network, after which the compressed network is fine-tuned.
While earlier approaches choose parameters based on their magnitudes [46,73,74], numerous other selection criteria have also been explored recently [25,186,187]. Inspired by recent success in pruning, our proposed training procedure makes use of an importance score-based optimization approach for dropout mask optimization [177, 187]. There have also been efforts to compress a network at initialization, omitting pre-training [47,78]. However, post-training pruning methods typically outperform these methods. In addition to network compression, network pruning has been used for goals like multi-task learning [144] and continual learning [62]. In this work, we demonstrate that pruning can be an effective regularization strategy and can further be used to obtain a subnetwork ensemble with a performance matching that of an explicit ensemble of full networks.

Dropout. First introduced as a technique to regularize networks, dropout involves independent random removal of neurons or weights during training with a pre-determined probability [195]. Later, Gal and Ghahramani [51] showed that dropout can be applied at test time, called Monte Carlo (MC) dropout, which can be viewed as an approximate Bayesian technique and yields better estimates of uncertainty. Several improvements to MC dropout have also been proposed [1, 28, 53, 104, 236]. Unlike MC dropout, which generates random masks on the fly at every iteration, we demonstrate that pruning-derived dropout masks can lead to significantly better performance.

4.3 Preliminaries

Problem Setup. We consider the problem of k-class classification, though the proposed method can be trivially extended to the regression setting. Suppose we are given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ where each feature-label pair $(x_i, y_i) \in X \times Y$, and $X \subseteq \mathbb{R}^d$ and $Y = \{1, \ldots, k\}$ denote the feature space and label space respectively.
Typically, a neural network (NN) $f_w(x)$ parameterized by $w$ can be used to map input features to corresponding labels for classification purposes. We define a likelihood model $p(y|x; w) = \mathrm{Cat}(\mathrm{softmax}(f_w(x)))$, a categorical distribution with parameters $\mathrm{softmax}(f_w(x)) \in \Delta(k)$. Here $\Delta(k)$ denotes the k-dimensional probability simplex. Typically, maximum likelihood estimation (MLE) is performed on the training dataset to obtain the optimal parameters for the NN. At test time, $p(y|x; w)$ is supposed to reflect the uncertainty in the predictions of the network. However, modern NNs are often poorly calibrated, yielding overly confident predictions [70].

Deep Ensembles. Previous work [120] demonstrates deep ensembles, or ensembles of independently trained NNs, as an effective remedy to the calibration problem. Given an ensemble of $N$ models (each with its own parameters $w_i$), an aggregated prediction can be obtained with

$$p(y|x) = \frac{1}{N}\sum_{i=1}^{N} p(y|x; w_i). \tag{4.1}$$

Leveraging the under-specification property [33] of modern neural networks together with the stochasticity provided through random initialization of NNs and stochastic optimization, deep ensembles often lead to a drastic improvement in terms of both accuracy and quality of uncertainty estimates. However, a crucial shortcoming of deep ensembles is the computational overhead; an ensemble of five models would roughly cost five times more resources, including storage. This can be prohibitive in real-world applications with computational constraints.

MC Dropout. MC dropout can be used as an efficient alternative for ensembling. Instead of training several models independently, one can train with dropout, where a randomly sampled Bernoulli mask is applied to the weights during forward and backward propagation. At test time, an ensemble of models can then be obtained "for free" via multiple forward passes with random instantiations of the dropout masks. Compared to deep ensembles, MC dropout incurs no additional memory cost.
Nevertheless, predictions produced through MC dropout are often outperformed by deep ensembles in both accuracy and uncertainty estimates.

4.4 Fixing MC Dropout

The error-ambiguity decomposition shows that the performance of an ensemble is determined by two factors: the average performance of individual models that make up the ensemble and the degree of diversity across model predictions [98,113]. Based on this perspective, the gap between MC dropout and deep ensemble can be due to two reasons. Firstly, as we see in our Ablation Study below, individual models sampled through dropout, on average, are no better than the independently trained models in a deep ensemble. This is probably due to the common training used for all models and the reduced effective capacity of the models. Moreover, MC dropout exhibits significantly less diversity than deep ensembles, likely due to the high degree of parameter sharing between models and, again, the common training paradigm. In this section, we propose changes to MC dropout to obtain an ensemble of non-overlapping and independently trained subnetworks.

4.4.1 Toward Enhancing Model Diversity

In a standard dropout training scheme, only a very small portion of the parameters is dropped out (typically, dropout rates are set to be around 10% to 20%). As a direct consequence, there is extensive parameter sharing among models sampled through MC dropout, leading to poor model diversity. While model diversity can be naively increased by choosing a higher dropout rate, this often results in much worse individual models, thus hampering the overall performance. Moreover, compared to models in a deep ensemble, which enjoy a completely independent optimization procedure, in dropout a random model is generated on the fly at every training iteration. This can also increase the correlation of predictions among MC dropout models.
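To make the parameter-sharing argument concrete, the short simulation below draws two independent Bernoulli masks at a dropout rate of 10% and measures how many of the weights kept by one mask are also kept by the other. It is only an illustrative sketch: the number of weights, the random seed and the function name are arbitrary choices made here.

```python
import numpy as np

def shared_fraction(mask_a, mask_b):
    """Fraction of the weights kept by mask_a that are also kept by mask_b."""
    return (mask_a & mask_b).sum() / mask_a.sum()

rng = np.random.default_rng(0)
n_weights, drop_rate = 1_000_000, 0.1

# Each entry is True when the corresponding weight is kept (not dropped).
mask_a = rng.random(n_weights) >= drop_rate
mask_b = rng.random(n_weights) >= drop_rate

overlap = shared_fraction(mask_a, mask_b)
```

For independent masks the expected overlap is simply one minus the dropout rate, so at a rate of 10% roughly nine out of ten weights used by one sampled model are also used by any other.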
Orthogonal Dropout. To enhance model diversity, instead of drawing random Bernoulli masks on the fly at each iteration, we propose to use a set of fixed, non-overlapping dropout masks during training. We term this "orthogonal dropout." These masks can be generated by simply randomly partitioning every layer's weights into k non-overlapping sets. For instance, we can randomly partition a standard neural network into an ensemble of five subnetworks, each with 20% of the weights. Since the dropout masks are non-overlapping by construction, we can completely decouple the training of the subnetworks. With orthogonal dropout, each dropout subnetwork is effectively an independent model, thereby allowing us to achieve much more model diversity. To further decouple the dropout models, we can also maintain an independent set of batchnorm layers [94] for each subnetwork.

Similar to MC dropout, at inference, we can apply dropout masks to the weights of NNs before forward propagation and aggregate predictions of the subnetworks following Equation 4.1. Unlike MC dropout, which effectively contains an infinite number of models to be sampled from, with orthogonal dropout there is a predetermined, fixed number of dropout subnetworks for ensembling. Nevertheless, as we show in our experiments, the gain from the decoupled training procedure and the resulting diversity offsets the limitation of the relatively small ensemble size.

We note that orthogonal dropout incurs some additional memory cost. Since the dropout masks are fixed before training, we need to keep track of these. Moreover, for a NN with k subnetworks, we might need k sets of batchnorm layers and k sets of bias terms in any fully connected (FC) layers (see footnote 2). Nevertheless, compared to the number of parameters in modern architectures, this additional memory cost is negligible. The additional cost of batchnorm layers can also be mitigated through batchnorm-free NNs [17].
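The random partitioning described above can be sketched in a few lines. The snippet below splits the weights of a single layer into k disjoint binary masks of (nearly) equal size. The function name and the layer shape are hypothetical, and this is a minimal illustration of random orthogonal masks only, not of the optimized masks introduced later in this section.

```python
import numpy as np

def random_orthogonal_masks(shape, k, rng):
    """Randomly partition a weight tensor of `shape` into k disjoint binary masks."""
    n = int(np.prod(shape))
    # A random permutation taken modulo k assigns every weight to exactly one
    # subnetwork, with subnetwork sizes differing by at most one.
    assignment = rng.permutation(n) % k
    return [(assignment == i).reshape(shape) for i in range(k)]

rng = np.random.default_rng(0)
masks = random_orthogonal_masks((64, 32), k=5, rng=rng)

# Every weight belongs to exactly one subnetwork.
coverage = sum(m.astype(int) for m in masks)
sizes = [int(m.sum()) for m in masks]
```

Because the masks are disjoint, a gradient step on one subnetwork (weights multiplied elementwise by its mask) leaves the weights of all other subnetworks untouched, which is what allows their training to be decoupled.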
Finally, as we present below, we can share FC layers between subnetworks, which further reduces the memory burden.

Footnote 2: We do not have bias terms for convolutional layers, as is common in many deep architectures.

4.4.2 Enhancing Individual Model Performance

Naive orthogonal dropout implementation with randomly generated dropout masks can cause lackluster individual model performance (see Ablation Study below). Indeed, for an orthogonal dropout ensemble with 5 subnetworks, each dropout subnetwork contains 20% of the parameters of a model in a deep ensemble, likely causing an accuracy drop in individual models and hence in the overall ensemble performance.

Algorithm 2: Orthogonal Dropout Optimization
Input: NN parameters w; subnetwork size k.
Output: Optimized NN parameters w; dropout subnetwork masks {m_1, ..., m_k}.
Initialize w_0 = w, m_0 = 1 (an identity mask of all 1's);
for i = 1 : k do
    w_i = w_{i-1} ◦ (1 − m_{i-1});    // mask parameters already in use
    Randomly initialize w_i;
    Minimize E_{(x,y)∼D}[L(p(y|x; w_i), y)] w.r.t. w_i;    // pre-training step
    Apply the modified edge-pop algorithm to find optimal m_i given w_i;    // pruning step
    Minimize E_{(x,y)∼D}[L(p(y|x; w_i ◦ m_i), y)] w.r.t. w_i ◦ m_i;    // finetuning step
end

Orthogonal Dropout Mask Optimization. Our central goal is to maximize orthogonal dropout subnetwork performance given a predetermined level of sparsity. A natural solution to this would be to optimize the dropout masks and weights simultaneously. Let $m_i \in \{0, 1\}^n$ for $i = 1, \ldots, k$ denote binary dropout masks applied to $n$ weights. Then, a general optimization objective can be given by:

$$\min_{w,\, m_1, \ldots, m_k} \mathbb{E}_{(x,y)\sim D}\left[ L\left( \frac{1}{k}\sum_{i=1}^{k} p(y|x; w \circ m_i),\; y \right) \right] \quad \text{s.t.} \quad \frac{n}{\|m_i\|_0} = k \ \text{ and } \ m_i \perp m_j,\ \forall i \neq j \in \{1, \ldots, k\}, \tag{4.2}$$

where $\circ$ denotes the Hadamard product, $\|\cdot\|_0$ denotes the number of non-zero elements in a vector, and $\perp$ indicates orthogonality (i.e., the vector product is zero).
The first condition ensures that all the dropout subnetworks contain the same number of parameters, while the second condition enforces that the masks are non-overlapping.

The above optimization is infeasible to solve in practice. We make two observations that allow us to propose an approximate solver. First, given the individual masks $m_i$, the problem reduces to $k$ independent weight learning problems. Second, optimizing for a dropout mask is similar to the problem of pruning a neural network.

Based on these observations, we propose to simplify the optimization procedure into a series of greedy optimizations of $\{w_i, m_i\}$ for $i = 1, \dots, k$. To do this, we adopt a three-step approach originally proposed for network pruning. Specifically, at the $i$-th iteration, a pre-training step is first executed on all available model weights, excluding those used (i.e., not masked out) so far. Given the pre-trained parameters, an optimization step is then performed to determine the optimal dropout mask $m_i$, followed by a final fine-tuning step with only the retained weights $w_i \circ m_i$.

The pre-training and fine-tuning steps are straightforward with stochastic gradient descent (SGD) given the dropout masks. The intermediate step of optimizing the binary dropout mask $m_i$ requires more care. In this work, we adopt the score-based edge-pop algorithm [177] to find the optimal mask $m_i$ given pre-trained network weights. In essence, the edge-pop algorithm transforms the discrete optimization problem into a differentiable problem where SGD can be used. This is achieved by assigning a continuous score to each weight in the NN indicating its relative importance. During a forward pass, a binary mask $m_i$ can be generated by ranking these scores and choosing the weights with the largest scores. A gradient for this sort-and-choose layer can be approximated via a relaxed backward pass. A detailed description of the algorithm can be found in the Appendix.
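The greedy procedure can be illustrated with the forward-pass selection rule alone. The sketch below, in plain Python, ranks per-weight scores and keeps the largest ones among the weights not claimed by earlier masks. The names `select_mask` and `greedy_masks` and the `score_fn` callback are hypothetical, and the SGD pre-training, the score learning with its relaxed backward pass, and the fine-tuning step are deliberately omitted.

```python
def select_mask(scores, used, num_keep):
    """Keep the num_keep highest-scoring weights that no earlier mask uses."""
    available = [j for j in range(len(scores)) if not used[j]]
    top = sorted(available, key=lambda j: scores[j], reverse=True)[:num_keep]
    mask = [0] * len(scores)
    for j in top:
        mask[j] = 1
    return mask


def greedy_masks(score_fn, num_weights, k):
    """Build k non-overlapping masks one after another.

    score_fn(i) is a stand-in (an assumption of this sketch) for the
    importance scores that would be learned at the i-th iteration.
    """
    used = [0] * num_weights
    masks = []
    for i in range(k):
        scores = score_fn(i)
        mask = select_mask(scores, used, num_weights // k)
        # Mark the newly selected weights as unavailable for later masks.
        used = [1 if (u or m) else 0 for u, m in zip(used, mask)]
        masks.append(mask)
    return masks


# Weight 1 has the highest score but is already in use, so weights 3 and 2 win.
example = select_mask([0.1, 0.9, 0.5, 0.7], used=[0, 1, 0, 0], num_keep=2)

# Toy scores that change with the iteration index; purely illustrative.
masks = greedy_masks(lambda i: [(7 * j + i) % 10 for j in range(10)], 10, 5)
```

Excluding already-used weights before ranking is what keeps the resulting masks orthogonal, regardless of how the scores themselves are obtained.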
The original edge-pop algorithm is applied to a randomly initialized network [177]. As such, the importance scores are also randomly initialized in their setup. In our work, however, the edge-pop algorithm is applied to a pre-trained network. In this scenario, we found it critical to initialize the scores proportional to the magnitude of the weights. This is inspired by the effectiveness of the weight-magnitude-based method for network pruning [46]. A similar observation was also made previously by Sehwag et al. [187]. In practice, we use the magnitudes of the weights divided by the layer-wise maximum as the initial values of the scores so that all scores have values in [−1, 1]. Another modification we applied is that, at the $i$-th iteration, weights that were used in a previous dropout iteration are masked out and excluded from consideration by the edge-pop algorithm.

Thus, we propose to approximately solve the original optimization problem of Equation 4.2 with a greedy, sequential optimization procedure. A summary of the proposed orthogonal dropout algorithm can be found in Algorithm 1.

Fixing The Classification Layer

Applying dropout to the final fully-connected (i.e., classification) layer can significantly reduce the performance of each subnetwork, hampering overall ensemble performance. One way around this is to not apply dropout and have all subnetworks share the classification layer. With this in mind, and inspired by some recent reports [85,205] in addition to our own empirical experience, we found randomly initializing and freezing a shared (no dropout) classification layer to be very effective. Thus, this is our default implementation for the proposed orthogonal dropout method in our experiments. The ablation study below includes results without a fixed classification layer.

4.5 Experiments

To evaluate the performance of the proposed method, we conduct extensive experiments with popular NN architectures on several benchmark datasets.
We use the CIFAR-10 and CIFAR-100 datasets [111], the CUB-200 dataset [215] and the Tiny-Imagenet dataset [35] (footnote 3). We use the ResNet18 [79] and the Wide-ResNet28-10 [229] for the CIFAR datasets, and a ResNet50 model for the CUB-200 and the Tiny-Imagenet datasets.

Footnote 3: https://www.kaggle.com/c/tiny-imagenet

Table 4.1: Results for ResNet models on various datasets. Best results for efficient ensembles are highlighted in bold. A fixed classification layer is used for orthogonal dropout. See Table 4.3 and the Appendix for further ablation study on this.

Method                       Accuracy (↑)   NLL (↓)   ECE (↓)   Size

CIFAR10, ResNet18
Deterministic                93.5%          0.296     0.0408    1×
MC Dropout                   94.4%          0.191     0.0202    1×
BatchEnsemble                94.8%          0.203     0.0269    ∼1×
MIMO Ensemble                94.3%          0.205     0.0180    ∼1×
Masksemble                   93.8%          0.202     0.0099    ∼1×
Orthogonal Dropout (Ours)    95.1%          0.157     0.0082    ∼1×
Deep Ensemble                94.8%          0.175     0.0110    5×

CIFAR100, ResNet18
Deterministic                73.0%          1.28      0.141     1×
MC Dropout                   73.3%          1.11      0.0902    1×
BatchEnsemble                74.3%          1.05      0.0910    ∼1×
MIMO Ensemble                73.8%          1.09      0.0664    ∼1×
Masksemble                   73.7%          0.999     0.0224    ∼1×
Orthogonal Dropout (Ours)    77.7%          0.864     0.0191    ∼1×
Deep Ensemble                76.7%          0.921     0.0377    5×

CUB200, ResNet50
Deterministic                50.1%          2.45      0.196     1×
MC Dropout                   50.2%          2.49      0.171     1×
BatchEnsemble                54.0%          2.33      0.128     ∼1×
MIMO Ensemble                52.0%          2.11      0.0879    ∼1×
Masksemble                   49.6%          2.32      0.118     ∼1×
Orthogonal Dropout (Ours)    61.4%          1.67      0.0335    ∼1×
Deep Ensemble                55.6%          1.98      0.0725    5×

Tiny-Imagenet, ResNet50
Deterministic                56.3%          2.18      0.164     1×
MC Dropout                   56.3%          1.96      0.0871    1×
BatchEnsemble                54.3%          2.31      0.185     ∼1×
MIMO Ensemble                54.0%          1.96      0.0461    ∼1×
Masksemble                   50.1%          2.07      0.0405    ∼1×
Orthogonal Dropout (Ours)    63.2%          1.55      0.0189    ∼1×
Deep Ensemble                61.4%          1.73      0.0336    5×

We use accuracy, the negative log-likelihood (NLL) [120] and the expected calibration error (ECE) [70] to measure performance. ECE, in particular, is a measure of the quality of uncertainty.
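For reference, ECE is commonly computed by binning predictions by confidence and averaging the gap between accuracy and confidence within each bin. The sketch below assumes equal-width bins; the number of bins and the function name are illustrative choices, since the text does not state the binning used in these experiments.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over bins.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct: 1 if the prediction was right, else 0.
    num_bins: assumed number of equal-width bins (not given in the text).
    """
    conf_sum = [0.0] * num_bins
    hit_sum = [0.0] * num_bins
    count = [0] * num_bins
    for c, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        b = min(int(c * num_bins), num_bins - 1)
        conf_sum[b] += c
        hit_sum[b] += hit
        count[b] += 1
    total = len(confidences)
    ece = 0.0
    for b in range(num_bins):
        if count[b]:
            gap = abs(hit_sum[b] / count[b] - conf_sum[b] / count[b])
            ece += count[b] / total * gap
    return ece


# Two confident, correct predictions and two less confident ones, half correct.
ece = expected_calibration_error([0.95, 0.95, 0.65, 0.65], [1, 1, 1, 0])
```

For an ensemble, the confidences would be taken from the averaged predictive distribution rather than from any single member.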
Baselines

In addition to MC dropout and deep ensembles, we compare orthogonal dropout with several other recently proposed state-of-the-art methods for efficient ensembles. These include BatchEnsembles [218], MIMO ensembles [77] and Masksembles [39]. We use an ensemble of 5 models for all types of ensembling methods except MIMO, for which we found an ensemble of 2 models gave the best performance for ResNet models. Lastly, during inference, we do 30 forward passes for MC dropout, which we observed was sufficient to achieve its best performance.

Optimization

For ResNet models, we use SGD with hyper-parameters identical to those originally used in the ResNet paper. We optimize the models for 150 epochs during both the pre-training and fine-tuning steps, and optimize dropout masks for 20 epochs during the pruning step for our orthogonal dropout. To ensure a fair comparison, we train baseline models for longer so that they are fully converged. MC dropout, MIMO ensembles and the models in deep ensembles are each trained for 200 epochs. When training MIMO ensembles, we also use a batch repetition of 4 to enhance the model performance, as suggested in the original paper. We empirically observe that it takes longer for BatchEnsemble and Masksembles to converge, and thus train these models for 500 epochs. For experiments using Wide-ResNet28-10, we follow the identical training procedure used by Havasi et al. [77] for a fair comparison against their experimental results.

4.5.1 Results

Experimental results for the ResNet models are summarized in Table 4.1. As seen from the table, our proposed method significantly outperforms other recently proposed state-of-the-art (SOTA) methods of memory-efficient ensembles in terms of both accuracy and quality of uncertainty estimates. Similar trends can also be observed in experiments with Wide ResNet28-10 (Table 4.2); our proposed method consistently outperforms other SOTA methods.

Table 4.2: Results for Wide ResNet28-10. Asterisk symbol (*) represents results adapted directly from [77]. Best results for efficient ensembles are highlighted in bold.

                             CIFAR10                              CIFAR100
Method                       Accuracy (↑)   NLL (↓)   ECE (↓)     Accuracy (↑)   NLL (↓)   ECE (↓)
Deterministic*               96.0%          0.159     0.023       79.8%          0.875     0.086
MC Dropout*                  95.9%          0.160     0.024       79.6%          0.830     0.050
BatchEnsemble*               96.2%          0.143     0.021       81.5%          0.740     0.056
MIMO*                        96.4%          0.123     0.010       82.0%          0.690     0.022
Masksembles                  94.6%          0.173     0.008       76.7%          0.843     0.015
Orthogonal Dropout (Ours)    96.6%          0.122     0.005       82.8%          0.701     0.021
Deep Ensembles*              96.6%          0.114     0.010       82.7%          0.666     0.021

More interestingly, for experiments with the ResNet models, we can see from Table 4.1 that orthogonal dropout even produces performance better than that of standard deep ensembles, which consume approximately five times more memory at inference time. This improvement is likely due to at least two reasons. Firstly, as we discuss further below, the sequential dropout mask optimization serves as an additional vehicle for regularization. Secondly, as we will further elaborate in Section 4.5.2, this improvement over deep ensembles is also due to fixing the classification layer. Nevertheless, we emphasize that, even without fixing the classification layer, our method consistently outperforms other recently proposed SOTA efficient ensembling techniques. This can be confirmed by comparing against the results for orthogonal dropout without the fixed classification layer summarized in Table 4.3. Lastly, we note that the relative gap between orthogonal dropout and deep ensembles becomes negligible for the experiments with Wide ResNet28-10 on the CIFAR datasets. We conjecture that this is because a higher L2 regularization is used for training of this model, following the exact training configuration of Havasi et al. [77], thereby nullifying the regularization effect of fixing the classifier layers. We leave it as future work to further understand this regularization effect.
[Figure 4.1: Bar plots of accuracy of individual orthogonal dropout subnetworks of ResNet models. Panels: (a) CIFAR-10, (b) CIFAR-100, (c) CUB200, (d) Tiny-Imagenet; x-axis: subnetwork index (1st to 5th); y-axis: accuracy. "i-th" model represents the i-th subnetwork obtained using Algorithm 1 sequentially.]

Individual Model Performance

To gain further insights into orthogonal dropout, we show in Figure 4.1 bar plots of the accuracy of the individual subnetworks in ResNet models on various datasets. Firstly, it can be seen that for all datasets except CUB200, subnetworks obtained later during the proposed greedy optimization procedure, in general, exhibit poorer performance. Intuitively, this is because later subnetworks have fewer parameters for learning. For instance, for an orthogonal dropout ensemble with five subnetworks, during training of the third subnetwork, only 60% of the weights are available for parameter and dropout mask optimization. Interestingly, we see that the 2nd model is consistently the best performing subnetwork out of the ensemble of five subnetworks. We hypothesize that removing a small portion of parameters in neural networks can implicitly regularize a network and account for these results. This could also help explain why, for the CUB200 dataset, even the 5th model outperformed the 1st by far. Compared to the other three datasets, the CUB200 dataset is significantly smaller in size, consisting of only approximately 6000 images for training. As such, aggressive regularization can potentially significantly improve generalization performance. Lastly, we note that, despite the significantly lower accuracy of later subnetworks, we found including them in the ensemble still leads to a positive gain.

Table 4.3: Ablation study of the proposed method. Orthogonal dropout methods are trained without dropout mask optimization. "MO" corresponds to "mask optimization" and "FC" corresponds to "Fixed Classifier". "Ind Acc" denotes the averaged individual model accuracy in an ensemble, while "Ens Acc" represents the ensemble accuracy.

CIFAR10
Method                      Ind Acc (↑)   Ens Acc (↑)   NLL (↓)   ECE (↓)   IA (↓)
MC Dropout                  93.4%         94.4%         0.287     0.0415    0.708
Orthogonal Dropout          93.2%         94.3%         0.188     0.0138    0.597
Orthogonal Dropout+MO       93.8%         94.9%         0.176     0.0122    0.601
Orthogonal Dropout+MO+FC    93.9%         95.1%         0.157     0.0082    0.594
Deep Ensemble               93.4%         94.8%         0.175     0.0110    0.581
Deep Ensemble+FC            93.5%         95.1%         0.151     0.0091    0.580

CIFAR100
Method                      Ind Acc (↑)   Ens Acc (↑)   NLL (↓)   ECE (↓)   IA (↓)
MC Dropout                  71.3%         73.3%         1.11      0.0902    0.776
Orthogonal Dropout          71.4%         75.1%         0.977     0.0455    0.655
Orthogonal Dropout+MO       72.3%         76.3%         0.935     0.0328    0.648
Orthogonal Dropout+MO+FC    73.7%         77.7%         0.864     0.0191    0.638
Deep Ensemble               72.7%         76.7%         0.921     0.0377    0.652
Deep Ensemble+FC            74.1%         77.8%         0.858     0.0221    0.645

4.5.2 Ablation Study

In this section, we conduct additional experiments using CIFAR10/100 and the ResNet18 model to decompose the contribution of each component of our proposed method. Specifically, we compare standard MC dropout against (1) dropout with randomly generated orthogonal masks, (2) orthogonal dropout with mask optimization, and (3) orthogonal dropout with both mask optimization and a fixed classification layer. To further demonstrate the effect of fixing the classification layer, we also train a deep ensemble with a fixed classification layer.
In addition to accuracy, NLL and ECE, we also report the individual-level accuracy and compute the Inter-rater Agreement (IA) [118] between individual models in an ensemble as a measure of diversity in the ensemble to gain further insights. The lower the IA, the more diverse the ensemble.

[Figure 4.2: Plot of accuracy/NLL/ECE against number of models in the ensembles. Panels: (a), (b) accuracy on CIFAR-10 and CIFAR-100; (c), (d) NLL on CIFAR-10 and CIFAR-100; (e), (f) ECE on CIFAR-10 and CIFAR-100; x-axis: Number of Models (2 to 10); legend: Ours, Deep Ensemble with FC, Deep Ensemble w/o FC. For orthogonal dropout, number of models is varied by changing the size of each subnetwork and all the orthogonal dropout ensembles are of the same size. "FC" corresponds to "Fixed Classifier".]

Results for this ablation study can be found in Table 4.3. Firstly, we note that even orthogonal dropout without dropout mask optimization outperforms standard MC dropout significantly for CIFAR100, despite the significantly smaller ensemble size (for MC dropout, results are obtained with 30 forward passes, whereas the orthogonal dropout ensemble contains only five subnetworks). This can be explained by increased diversity among models in the ensemble, as evident from the considerably lower IA for orthogonal dropout. In fact, the amount of diversity in orthogonal dropout is almost identical to that of deep ensembles.

Furthermore, we see that optimizing dropout masks and fixing the classifier layer further boost the performance of orthogonal dropout substantially. The improvement is primarily attributable to the increase in individual model performance.
To isolate the effect of fixing the classifier layer, we also train a deep ensemble of 5 models with a fixed classifier layer. Note that fixing the classification layer consistently yields an ensemble with better performance in this particular experimental setup. Remarkably, orthogonal dropout models with mask optimization and a fixed classifier layer perform as well as deep ensembles of 5 models with a fixed classifier layer in terms of both accuracy and calibration.

4.5.3 How Many Subnetworks Can We Fit?

We also investigate how many subnetworks we can fit into a ResNet18 model. This can be achieved by adjusting the percentage of parameters each subnetwork consumes. For instance, if each subnetwork consists of 50% of the parameters, we can fit 2 subnetworks into a ResNet18 model, and if each subnetwork consists of 10% of the parameters, the orthogonal dropout ensemble would contain 10 subnetworks. Nevertheless, increasing the ensemble size can decrease the quality of individual model performance. As such, there is an inherent trade-off that needs to be balanced.

In Figure 4.2, we plot model performance (ensemble accuracy, NLL and ECE) against the number of subnetworks in a single ResNet18 model. Surprisingly, we see that we can fit even 10 subnetworks into a ResNet18 model using orthogonal dropout with extremely competitive ensemble performance. This is in stark contrast with the MIMO ensemble, which also aims at fitting multiple subnetworks into one model, but whose performance degrades drastically with 4 or more subnetworks in the model [77]. Moreover, WRN28-10, a network with many more parameters, was used in their experiment. With ResNet18, we empirically observe that even 3 subnetworks in MIMO yield poor performance (see Appendix).

To further understand the quality of the orthogonal dropout ensemble, we also plot the performance achievable by an explicit deep ensemble with the same number of models in the ensemble.
In general, we see that orthogonal dropout is capable of outperforming even a standard deep ensemble of 10 models for this particular experimental setup. Moreover, we see that orthogonal dropout matches a deep ensemble of 8 models with fixed classification layers, giving us significant memory savings.

CHAPTER 5
ACCELERATING UNCERTAINTY ESTIMATES COMPUTATION WITH UNCERTAINTY-AWARE DISTRIBUTION DISTILLATION

Calibrated estimates of uncertainty are critical for many real-world computer vision applications of deep learning. While there are several widely-used uncertainty estimation methods, dropout inference [51] stands out for its simplicity and efficacy. This technique, however, requires multiple forward passes through the network during inference and therefore can be too resource-intensive to be deployed in real-time applications. We propose a simple, easy-to-optimize distillation method for learning the conditional predictive distribution of a pre-trained dropout model for fast, sample-free uncertainty estimation in computer vision tasks. We empirically test the effectiveness of the proposed method on both semantic segmentation and depth estimation tasks, and demonstrate that our method can significantly reduce the inference time, enabling real-time uncertainty quantification, while achieving improved quality of both the uncertainty estimates and predictive performance over the regular dropout model.

5.1 Introduction

Uncertainty exists in many machine learning problems due to noise in the observations and incomplete coverage of the domain. How much can we trust a model built upon limited and imperfect data? Reliable uncertainty estimates are crucial for trustworthy applications such as medical diagnosis and autonomous driving. Many algorithms have been proposed to estimate the uncertainty of neural networks (NN) [13,110,134,220].
Among these, the MC dropout [51] is arguably one of the most popular approaches due to its simplicity and scalability. This approach has been adopted in different computer vision tasks recently [11, 43]. A follow-up work [104] further enhanced the quality of uncertainty estimates by incorporating both aleatoric and epistemic uncertainty into deep neural networks. Despite its success, MC dropout requires test-time sampling to obtain uncertainty. This costly sampling process can introduce severe latency in real-time prediction tasks such as the perception system of self-driving vehicles and lead to undesired consequences.

[Figure 5.1: An illustration of the proposed method. Diagram labels: the teacher (Bayesian SegNet) maps an input to a sample, with a loss against the ground truth; the student (SegNet) maps an input to a mean and a variance, with a loss against samples from the teacher. Given a trained teacher, a deterministic student is used to approximately parameterize the predictive distribution of the teacher model, enabling sample-free uncertainty estimation.]

In order to eliminate the expensive dropout sampling at test time, prior work [21] has explored distilling knowledge from MC dropout samples of a teacher model into a student network (Dropout Distillation or DD). Nevertheless, DD has several limitations. Specifically, the student model only learns from the predictive means of the dropout teacher model, and the dispersion of the teacher's predictions, which carries important uncertainty information associated with the predictions [142], is completely neglected in their approach. To address the problem, in this chapter, we propose an easy-to-optimize, generally applicable distillation framework for fast, sample-free uncertainty estimation. Specifically, we approximate the entire predictive distribution produced by an MC dropout teacher with flexible parametric distributions. At test time, the parameters of the distribution are output by a single deterministic student network to obtain reliable uncertainty estimates in one forward pass.
In addition, we show that our method can distill both epistemic and aleatoric uncertainty with little extra computation.

We examine the effectiveness of the proposed method on regression and classification with high-resolution, real-world datasets. For regression, we experiment on monocular depth estimation using NYU Depth V2 [156] and KITTI [57]. For classification, we experiment on semantic segmentation using CamVid [19] and VOC2012 [42]. In addition to significantly faster inference, quantitative and qualitative results show that the student network produces uncertainty estimates of better quality than those of the teacher model, i.e., the pre-trained MC dropout model. We also demonstrate that the predictive mean and uncertainty obtained with our method are superior to those learned with DD [21].

5.2 Related Work

Uncertainty estimates can be obtained for deep learning in a principled manner through Bayesian neural networks [139, 157]. However, they typically suffer from significant computational burdens due to the intractability of posteriors. As such, computational tractability has been a primary focus of research. One such direction is through Markov Chain Monte Carlo (MCMC) [157]. For instance, stochastic gradient versions of MCMC have been proposed [29, 63, 137, 216] to scale MCMC methods to large datasets. Nevertheless, these approaches can be difficult to scale to high-dimensional data. An alternative solution is variational inference, in which parametric distributions are used to approximate the intractable true posteriors of the weights of neural networks [13,110,134,220]. However, variational inference can suffer from sub-optimal performance [12].

There have also been non-Bayesian techniques for uncertainty estimation. For instance, an ensemble of randomly initialized NNs [121] has been shown to be effective. However, it requires training and saving multiple NNs, which can be costly in practice.
Methods for obtaining ensembles more efficiently exist [56, 88], but these can come at the cost of the quality of uncertainty estimates.

Most of the above-mentioned methods require multiple forward passes of NNs at test time, which prohibits their deployment in real-time computer vision systems. Several techniques have been proposed to speed up uncertainty estimation. Postels et al. [172] proposed a method for sampling-free uncertainty estimation through variance propagation. However, the simplistic assumptions they made about the covariance matrices of NN activations might lead to inaccurate approximations. Another approach speeds up the sampling process by leveraging the temporal information in videos [91]. However, the method cannot be generically applied and results in a non-trivial drop in predictive performance. Ilg et al. [93] proposed the use of a multi-hypotheses NN with a novel loss function to obtain sample-free uncertainty estimates in the optical flow estimation task. However, [93] uses an additional network to merge the hypotheses, which incurs extra memory cost at inference time.

Distillation-based methods have also been explored. [21] proposed to distill predictive means from a dropout teacher to a student network. As discussed above, this leads to the loss of the epistemic uncertainty of the MC dropout teacher. Most similar to our method is [143], which uses the Dirichlet distribution to approximate the predictive distribution of an ensemble of networks. However, their method requires a large ensemble to effectively train a student network, which can be prohibitively expensive for challenging computer vision tasks. In addition, the Dirichlet distribution is not only hard to optimize in practice, but also not applicable to regression tasks.

5.3 Method

Suppose we have a dataset $D = (X, Y) = \{(x_i, y_i)\}_{i=1}^{n}$, where each $(x_i, y_i) \in (\mathcal{X} \times \mathcal{Y})$ is i.i.d. and $\mathcal{X} \subseteq \mathbb{R}^d$ corresponds to the feature space.
Most tasks in computer vision can be considered as either regression or classification. For regression, $\mathcal{Y} \subseteq \mathbb{R}^k$ for some integer $k$, and in the context of $k$-class classification, $\mathcal{Y} = \{1, \dots, k\}$ is the label space. We define $f_w(x)$ to be a neural network such that $f : \mathcal{X} \to \mathcal{Y}$, and $w = \{W_i\}_{i=1}^{L}$ corresponds to the parameters of the network with $L$ layers, where each $W_i$ is the weight matrix in the $i$-th layer. We define the model likelihood $p(y|x, w) = p(y|f_w(x))$. For regression tasks, it is common to assume $p(y|f_w(x)) = \mathcal{N}(f_w(x), \sigma^2)$ for some noise term $\sigma$. For classification tasks, $p(y|f_w(x)) = \mathrm{Softmax}(f_w(x))$ is commonly assumed.

To capture epistemic uncertainty, we put a prior distribution on the weights of the network, $p(w)$. A common choice is the zero-mean Gaussian $\mathcal{N}(0, I)$. Bayes' theorem can then be used to obtain the posterior $p(w|X, Y) = p(Y|X, w)\,p(w)/p(Y|X)$, with which the predictive distribution can be determined by

$$p(y|x, D_{\text{train}}) = \int p(y|x, w)\, p(w|D_{\text{train}})\, dw. \tag{5.1}$$

5.3.1 Preliminary: Dropout for Bayesian Deep Learning

The marginal distribution $p(Y|X)$, and thus $p(w|X, Y)$, are often intractable. Variational inference uses a tractable family of distributions $q_\theta(w)$ parameterized by $\theta$ to approximate the true posterior $p(w|X, Y)$, thereby turning the problem into a tractable optimization task. MC dropout, which casts dropout regularization as approximate Bayesian inference, is one such example [51]. It involves training NNs with dropout after each weight layer. With an optimized model, the approximate predictive distribution is given by $q(y|x, D_{\text{train}}) = \int p(y|x, w)\, q_\theta(w)\, dw$. The integral can be approximated by performing Monte Carlo integration over $q_\theta(w)$. This corresponds to dropout at test time. In classification, for example,

$$p(y = c|x, D_{\text{train}}) \approx \frac{1}{T}\sum_{t=1}^{T} \mathrm{Softmax}(f_{w_t}(x)), \tag{5.2}$$

where $w_t \sim q_\theta(w)$ are dropout samples from the NN.

Epistemic uncertainty can be computed with the approximate inference framework as derived above.
For regression, epistemic uncertainty is captured by the predictive variance, which can be approximated by computing the variance of the dropout approximate distribution:

$$\sigma_y^2 \approx \frac{1}{T}\sum_{t=1}^{T} f_{w_t}(x)^\top f_{w_t}(x) - \mu_y^\top \mu_y, \tag{5.3}$$

where $\mu_y = \frac{1}{T}\sum_{t=1}^{T} f_{w_t}(x)$. In the context of classification, numerous measures have been proposed as uncertainty estimates [54]. In this chapter, we use the mutual information between the predictions and the model posterior (BALD),

$$I[y, w|x, D_{\text{train}}] = H[y|x, D_{\text{train}}] - \mathbb{E}\left[H[y|x, w]\right], \tag{5.4}$$

as the uncertainty estimate in classification tasks.

As proposed by Kendall and Gal [104], aleatoric uncertainty can also be incorporated into the dropout model for concurrent estimation of both the epistemic and aleatoric uncertainty. To do so, an input-dependent observation noise parameter $\hat{\sigma}^2$ is output together with the prediction,

$$[\hat{\mu}, \hat{\sigma}^2] = f_w(x), \tag{5.5}$$

where $\hat{\sigma}^2$ is a vector with the same dimension as $\hat{y}$ that represents a diagonal covariance matrix. $\hat{\sigma}$ can be optimized with maximum likelihood estimation by assuming that aleatoric uncertainty follows a parametric distribution (e.g., Gaussian) for regression. For classification, this Gaussian distribution is placed over the logit space. Although the epistemic and aleatoric uncertainties are not mutually exclusive, the total uncertainty can be approximated using

$$\mathrm{Var}(y) \approx \frac{1}{T}\sum_{t=1}^{T} \hat{\mu}_t^\top \hat{\mu}_t - \mu_y^\top \mu_y + \frac{1}{T}\sum_{t=1}^{T} \hat{\sigma}_t^2, \tag{5.6}$$

where $\mu_y = \frac{1}{T}\sum_{t=1}^{T} \hat{\mu}_t$. The first part of the expression (the variance of the predictive means, $\frac{1}{T}\sum_t \hat{\mu}_t^\top \hat{\mu}_t - \mu_y^\top \mu_y$) approximates the epistemic uncertainty, and the second part (the average predicted noise variance, $\frac{1}{T}\sum_t \hat{\sigma}_t^2$) approximates the aleatoric uncertainty.

5.3.2 A Teacher-Student Paradigm for Sample-free Uncertainty Estimation

Despite the success of MC dropout, inferring uncertainty at test time often requires multiple forward passes to generate samples of predictions, limiting its use in many time-sensitive applications.
In this chapter, we propose to use a deterministic neural network $f_\phi(x)$ to parameterize a distribution $r(y|x, D_{\text{train}})$ that approximates the predictive distribution $q(y|x, D_{\text{train}})$ of the dropout model. Specifically, $f_\phi(x)$ learns to directly output the parameters of $r(y|x, D_{\text{train}})$. Once trained, $f_\phi(x)$ requires only one forward pass to infer both the predictive mean and the uncertainty from the parameterized distribution $r(y|x, D_{\text{train}})$, thus eliminating expensive sampling processes at test time.

Training $f_\phi(x)$ is straightforward using a teacher-student paradigm similar to knowledge distillation [84]. We first train a Bayesian neural network (BNN) $f_w(x)$ (e.g., a dropout model) on $D_{\text{train}}$. We then generate samples of predictions from the pre-trained $f_w(x)$. These samples serve as "observations" from the distribution $q(y|x, D_{\text{train}})$ for $f_\phi(x)$ to learn the parameters of $r(y|x, D_{\text{train}})$ given each input $x \in D_{\text{train}}$. Eventually, $f_\phi(x)$ learns an efficient mapping from input images to the parameters of the distribution $r(y|x, D_{\text{train}})$ that accurately approximates $q(y|x, D_{\text{train}})$. For simplicity, in the following we refer to the BNN $f_w(x)$ as the teacher model and $f_\phi(x)$ as the student model.

Sampling from the Bayesian Teacher

As mentioned above, predictive samples $\{\hat{y}_t = f_{w_t}(x)\}_{t=1}^{m}$ are generated from the teacher to train the student. In the more complicated scenario where aleatoric uncertainty is modeled by the teacher, we incorporate aleatoric uncertainty into each predictive sample with

$$\hat{y}_t = \hat{\mu}_t + \hat{\sigma}_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \tag{5.7}$$

where $\hat{\sigma}_t$ is the aleatoric uncertainty output by the teacher given an input. In practice, $\hat{\sigma}_t^2$ can be noisy. To stabilize training, instead of $\hat{\sigma}_t^2$, we first compute the empirical mean $\tilde{\sigma}^2 \triangleq \frac{1}{T}\sum_{t=1}^{T} \hat{\sigma}_t^2$, and use $\tilde{\sigma}^2$ to generate all samples $\{\hat{y}_t\}_{t=1}^{m}$.

The larger the number of samples, the more accurately the student can approximate the teacher's predictive distribution. However, drawing a large number of samples requires intensive computational resources.
To cope with this challenge, we generate a small number $m$ of predictive samples from the teacher for each input on the fly at each epoch during training. In order to learn aleatoric uncertainty, we further draw $k$ random samples from $\mathcal{N}(0, I)$ for each predictive sample $[\hat{\mu}_t, \tilde{\sigma}^2]$ (see Eq. 5.7). In practice, we use $m = 5$ and $k = 10$. As we demonstrate in the experimental section, a small number of $m$ and $k$ per input is sufficient to learn a student model with excellent performance.

Optimizing the Student

We use maximum likelihood estimation (MLE) to optimize $f_\phi(x)$. Given the samples $\{\hat{y}_t\}_{t=1}^{m}$ generated by the teacher, we minimize the negative log-likelihood for each input $x$,

$$L_s = -\sum_{t} \log r(\hat{y}_t|x; \phi), \tag{5.8}$$

where $r(\hat{y}_t|x; \phi)$ is parameterized by $f_\phi(x)$. In order to avoid division by zero and enable unconstrained optimization of the variance, we use the log variance $s = \log(\hat{\sigma}^2)$ as the output of the student. Thus, we have $[\hat{\mu}, s] = f_\phi(x)$.

For regression problems, we use the Laplace distribution to approximate the variational predictive distribution. For simplicity, we assume independence among all the dimensions of the outputs so that the log variance $s$ is a vector of the same dimension as $\hat{\mu}$. Given the Laplace assumption, a numerically stable MLE training objective can be derived from Eq. 5.8 as

$$L_s = \frac{1}{N}\frac{1}{M}\sum_{i,t}\left[\sqrt{2}\,\exp\left(-\tfrac{1}{2}s_i\right)|\hat{y}_{ti} - \hat{\mu}_i| + \tfrac{1}{2}s_i\right], \tag{5.9}$$

where $i$ and $t$ index the output space and the generated samples, respectively, and $N$ and $M$ are the number of instances (e.g., pixels) in the output space and the number of generated predictive samples from the teacher, respectively. We choose the Laplace distribution over the Gaussian distribution because it is more appropriate for modeling the variances of residuals under the $\ell_1$ loss, which usually outperforms the $\ell_2$ loss in computer vision tasks.

For classification problems, we use a logit-normal distribution to model the teacher's approximate predictive distribution $q(y|x, D_{\text{train}})$ on the simplex (footnote 1).
In practice, we use a Gaussian distribution with a diagonal covariance matrix to approximate the teacher’s predictive distribution on the logit space. As a result, the student model outputs $\hat{\mu}_i$ and $s_i$ as the mean and log variance of the Gaussian for each member of the logits. Similar to the regression set-up, we derive a numerically stable Gaussian MLE training objective

$$L_s = \frac{1}{N} \frac{1}{M} \sum_{i,t} \left( \tfrac{1}{2} \exp(-s_i) \, \|\hat{y}_{ti} - \hat{\mu}_i\|^2 + \tfrac{1}{2} s_i \right), \tag{5.10}$$

where $\hat{y}_{ti}$ are predicted logits sampled from the teacher. Since no closed-form solution exists for the moments of a logit-normal distribution, Monte Carlo sampling on the logit space is performed at test time to obtain uncertainty estimates. This incurs only a tiny computational overhead during inference, as it amounts to multiple forward passes of one layer of the student network (the softmax function). As shown in the experiments, the student model still has a large advantage in inference time over its teacher, in addition to better performance.

¹ The Dirichlet distribution is an obvious alternative, but we empirically observe that training with the logit-normal distribution is much more numerically stable.

We empirically observe that training solely with the above loss functions sometimes leads to sub-optimal predictive performance. This may be due to the noisy signal provided by the generated samples. Thus, we leverage ground truth labels in addition to predictive samples from the teacher to stabilize the training of the student model. We use the loss function with which the teacher model is trained in conjunction with $L_s$, leading to the total loss

$$L_{total} = L_s + \lambda L_t, \tag{5.11}$$

where $\lambda$ is a hyper-parameter to be tuned and $L_t$ corresponds to the categorical cross-entropy loss for classification tasks or the L1 loss for regression tasks. We found that $\lambda = 1$ generally performs well in our experiments.
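The sampling step (Eq. 5.7) and the student objectives (Eqs. 5.9 to 5.11) can be sketched for a single scalar output as follows. This is a minimal illustration in plain Python rather than the implementation used in the experiments: the function names are hypothetical, the per-dimension losses average only over samples (the 1/M factor), and a real model would additionally average over all output dimensions (the 1/N factor) inside a deep learning framework.

```python
import math
import random


def sample_teacher_targets(mu_hats, sigma2_hats, k, rng):
    """Draw targets y = mu_t + sigma_tilde * eps (Eq. 5.7) for one scalar output.

    mu_hats:     m teacher means, one per stochastic forward pass
    sigma2_hats: m teacher aleatoric variances, averaged into sigma_tilde^2
    k:           number of Gaussian noise draws per teacher mean
    """
    sigma_tilde = math.sqrt(sum(sigma2_hats) / len(sigma2_hats))
    return [mu + sigma_tilde * rng.gauss(0.0, 1.0)
            for mu in mu_hats for _ in range(k)]


def laplace_nll(samples, mu, s):
    """Eq. 5.9 restricted to one output dimension; s is the log variance."""
    scale = math.sqrt(2.0) * math.exp(-0.5 * s)
    return sum(scale * abs(y - mu) + 0.5 * s for y in samples) / len(samples)


def gaussian_nll(samples, mu, s):
    """Eq. 5.10 restricted to one logit; s is the log variance."""
    return sum(0.5 * math.exp(-s) * (y - mu) ** 2 + 0.5 * s
               for y in samples) / len(samples)


def total_loss(l_s, l_t, lam=1.0):
    """Eq. 5.11: distribution-matching loss plus lambda times the task loss."""
    return l_s + lam * l_t
```

Predicting the log variance `s` instead of the variance keeps both losses free of divisions and lets `s` range over all real numbers, which is the numerical-stability argument made in the text.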
Additional Augmentation

When the same training dataset is used to train both the teacher and the student, the student may underestimate the epistemic uncertainty of the teacher because the teacher network overfits to the training data. Ideally, in order to fully capture the teacher's predictive distribution, the dataset used to train the student should not overlap with the one used for the teacher. However, training on only a subset of the available samples can lead to sub-optimal performance. To alleviate this problem, we perturb the training set during student training with extra data augmentation that was not used when training the teacher, in order to synthetically generate new samples unseen by the teacher model. When training the student, we use color jittering, which perturbs each image with random variation in the range of [−0.2, 0.2] in four aspects: brightness, contrast, hue, and saturation.

As we demonstrate below, this extra augmentation during student training can be crucial for the quality of uncertainty estimates. We emphasize that the additional gain in uncertainty estimates does not come directly from data augmentation, but rather from teacher predictions that more closely correspond to the test-time predictive distributions as a consequence of this augmentation. In the experiments below, we show that the teacher model does not obtain the same performance boost from the extra augmentation.

Figure 5.2: Example predictions on CamVid. Each uncertainty map shows the sum of aleatoric and epistemic uncertainty. Same for all the following example plots.

5.4 Experiments

We conduct experiments on two pixel-wise computer vision tasks: semantic segmentation and depth regression. We compare the performance of the proposed method with that of the teacher models using MC dropout. For a holistic evaluation, we consider teacher networks trained both with and without the aleatoric uncertainty.
Following [104, 172], we use 50 samples for MC dropout to evaluate the teacher’s performance and uncertainty. Architectures identical to those of the teacher models, but without the dropout layers, are used as student models. As discussed in the previous section, we use 50 samples from the logit space to evaluate the uncertainty (BALD) of student models for classification tasks. To demonstrate general applicability, we also show the effectiveness of the proposed method when the teacher network corresponds to a Deep Ensemble [121].

Evaluation Metrics

On top of metrics that evaluate the predictive means of our models, we measure both the Area Under the Sparsification Error curve (AUSE) [93] and the expected calibration error (ECE) [70] to evaluate the quality of uncertainty estimates. In essence, AUSE measures how well the estimated uncertainty coincides with the true predictive errors. The Brier score and the mean absolute error are used as predictive errors to compute AUSE for classification and regression tasks respectively. In the context of classification, ECE measures how representative the predictive mean probabilities from the softmax function are of the true correctness of predictions. In the context of regression, we use the ECE described in [115] to quantify the mismatch between the predictive distribution and the empirical CDFs. We follow [115] and compute ECE with the ℓ2 norm with a bin size of 30.

5.4.1 Semantic Segmentation

Bayesian SegNet [103], which contains dropout layers inserted after the central four encoder and decoder units, was proposed to obtain uncertainty estimates for semantic segmentation. In this work, we use this architecture with a dropout rate of 0.5 as the teacher model for all our experiments. We use the CamVid and VOC2012 datasets. For CamVid, following Kendall et al. [103], we use 11 generalized classes and a downsampled image size of 360 × 480.
For the teacher network, we train using Stochastic Gradient Descent (SGD) with an initial learning rate of $10^{-3}$, a momentum of 0.9, and a weight decay of $5 \times 10^{-4}$ for 100,000 steps. In order to achieve faster convergence, we initialize the student network with the weights of the teacher network. Accordingly, a smaller initial learning rate of $5 \times 10^{-4}$ is used to train the student network for 80,000 steps. We employ a “poly” learning rate policy for both the teacher and student networks, as done by Chen et al. [27]. Both networks use a batch size of 4 per step.

For VOC2012, we use the same augmented “train” and “val” split as in [27]. Input images are resized to 224 × 224. For optimal performance of the teacher model, SGD with a higher initial learning rate of $10^{-2}$ is used instead, with a batch size of 8 for 150,000 steps. Similarly, we initialize the student model with the weights of the teacher. The student model is trained for 100,000 steps with an initial learning rate of $10^{-3}$ and a batch size of 8 per step. The performance on the “val” split is reported in the results. We also include the results of student models trained using Dropout Distillation (DD) [21] as a baseline comparison.

Figure 5.3: Example predictions on Pascal VOC2012.

Evaluation

Results for both the teacher and the student are summarized in Table 5.1. On top of a significant boost in run-time, the student network also leads to improvements in most of the metrics evaluated. We believe the observed improvements in both predictive performance and uncertainty estimates arise mainly because learning the entire predictive distribution implicitly, through samples from the teacher models and with the proposed optimization objective, can have the loss attenuation effect described in [104].
In contrast, Dropout Distillation (DD) [21], which only distills the mean prediction of the teacher as in standard knowledge distillation, yields a student that performs worse than the teacher on all predictive and uncertainty metrics. This further demonstrates the benefit of distilling the entire predictive distribution from the teacher.

Figures 5.2 and 5.3 show randomly selected examples from the validation sets of CamVid and Pascal VOC respectively. The visual examples suggest that the student model can accurately capture both the predictive mean and the uncertainty of the teacher model. Furthermore, a closer comparison reveals the exceptional quality of the uncertainty estimates produced by the student model. For instance, in the second example from the CamVid dataset in Figure 5.2, a small part of the ego vehicle is captured by the camera at the bottom of the figure. While the teacher model confidently predicts the area as “road surface”, the student model highlights this subtle anomaly with high uncertainty estimates. A similar contrast is also observed in the top example of Figure 5.3, where the boundary of people is assigned much higher uncertainty by the student model. In addition, the bowls and plates on the dining table, which are not in the list of labeled classes for the dataset, also “confuse” the student model, but not the teacher.

Table 5.1: Results on the segmentation problem. “T”, “S” and “AU” correspond to the teacher model, the student model, and the aleatoric uncertainty respectively. “T+AU” corresponds to a teacher model trained with the aleatoric uncertainty. “DD” corresponds to the student trained using Dropout Distillation [21]. Best performing results for each teacher-student pair are bold-faced.
CamVid

| Model | T | S | DD [21] | T+AU | S+AU |
|---|---|---|---|---|---|
| Accuracy ↑ | 0.906 | 0.907 | 0.903 | 0.907 | 0.909 |
| Classwise Acc ↑ | 0.764 | 0.765 | 0.747 | 0.766 | 0.750 |
| IOU ↑ | 0.645 | 0.650 | 0.642 | 0.645 | 0.650 |
| ECE ↓ (×10⁻³) | 3.78 | 2.23 | 6.73 | 3.67 | 2.86 |
| AUSE ↓ (×10⁻²) | 1.47 | 1.60 | 2.59 | 1.63 | 1.60 |
| Runtime (s) ↓ | 1.6 | 0.078 | 0.078 | 2.1 | 0.078 |

Pascal VOC

| Model | T | S | DD [21] | T+AU | S+AU |
|---|---|---|---|---|---|
| Accuracy ↑ | 0.834 | 0.851 | 0.828 | 0.831 | 0.848 |
| Classwise Acc ↑ | 0.813 | 0.828 | 0.806 | 0.809 | 0.827 |
| IOU ↑ | 0.697 | 0.727 | 0.691 | 0.693 | 0.722 |
| ECE ↓ (×10⁻³) | 62.7 | 59.0 | 67.5 | 63.0 | 59.0 |
| AUSE ↓ (×10⁻²) | 4.35 | 3.82 | 4.86 | 4.20 | 4.31 |
| Runtime (s) ↓ | 0.51 | 0.028 | 0.028 | 0.68 | 0.028 |

Run-Time Comparison

Figure 5.4 (a)-(c) compares running time and performance for different numbers of MC dropout samples. While the running time of MC dropout can be shortened with fewer samples, this comes at the cost of the quality of predictions and uncertainty estimates. For a fair comparison, the running time of MC dropout is optimized by caching results before the first dropout layer.

We further demonstrate the merit of the proposed method by comparing the running time of the student with several other recently proposed sample-free methods for uncertainty estimation. Figure 5.4 (d) illustrates the speed boost of different methods on the CamVid dataset with Bayesian SegNet. The ratios are computed with respect to the same baseline of MC dropout with 50 samples at test time. Our proposed method achieves a more significant boost in speed than previously proposed methods for accelerating dropout inference, in addition to other advantages such as wider applicability and improved predictive performance.

Figure 5.4: (a)-(c): Comparison of performance against the running time for both the teacher (with the aleatoric uncertainty) and student model using the CamVid dataset. (d) Speed-up ratios of uncertainty estimates for the CamVid dataset with the Bayesian SegNet compared to Huang et al. [91] and Postels et al. [172], two other sample-free uncertainty estimation methods.
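Both sides of this run-time comparison end with the same computation: a set of sampled probability vectors is reduced to a predictive mean and an uncertainty score. For the teacher, each sample costs a stochastic forward pass through the dropout layers; for the student, each sample costs only a Gaussian draw on the logits followed by a softmax. The sketch below illustrates this under stated assumptions: BALD is computed with its usual definition (entropy of the mean prediction minus the mean entropy of the individual predictions), and all function names are hypothetical rather than taken from the original code.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * math.log(p + eps) for p in probs)


def mean_and_bald(prob_samples):
    """Reduce sampled probability vectors to (predictive mean, BALD score)."""
    n = len(prob_samples)
    k = len(prob_samples[0])
    mean = [sum(p[j] for p in prob_samples) / n for j in range(k)]
    expected_entropy = sum(entropy(p) for p in prob_samples) / n
    return mean, entropy(mean) - expected_entropy


def student_prob_samples(mu, s, n_samples, rng):
    """Sample logits from N(mu, diag(exp(s))) and map each draw to the simplex."""
    return [softmax([m + math.exp(0.5 * v) * rng.gauss(0.0, 1.0)
                     for m, v in zip(mu, s)])
            for _ in range(n_samples)]
```

For example, `mean_and_bald(student_prob_samples(mu, s, 50, random.Random(0)))` mirrors the 50 logit-space samples used to evaluate the student, with `mu` and `s` standing in for the student's per-class mean and log variance.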
Performance under Distribution Shift

We also evaluate the performance of the proposed method under a distribution shift using models trained with the CamVid dataset. The Cityscapes dataset [31], which contains street scenes collected from different cities, is an ideal dataset for such an evaluation. We emphasize that neither the teacher nor the student sees images from the Cityscapes dataset during training. The results, evaluated on the classes shared by CamVid and Cityscapes, are summarized in Figure 5.5. Surprisingly, while both the teacher and student models perform unsatisfactorily, the student performs significantly better than the teacher on all of the metrics evaluated, suggesting enhanced robustness against the distribution shift when trained with the proposed teacher-student pipeline. We hypothesize that, by seeing the distribution of soft labels from a Bayesian teacher during the distillation process, the student learns to produce less confident, more generalizable outputs. We leave an investigation of the true cause to future work. This robustness can be important for many application domains with long-tail scenarios, such as autonomous driving.

Figure 5.5: Performance of models trained with CamVid and evaluated on Cityscapes.

Figure 5.6: Top: Relative means of BALD for samples of seen and unseen classes during training compared to the “Reference” models, which refer to models trained with both seen and unseen classes. Bottom: Distribution of BALD for samples of seen and unseen classes during training.

Outlier Detection

In addition, we examine the effectiveness of the uncertainty estimates for outlier detection using the CamVid dataset. Following [172], we use “pedestrian” and “bicyclist” as held-out classes and exclude them from training. Ideally, classes unseen during training should have much higher uncertainty estimates than the seen classes.
We show in Figure 5.6 comparisons of the relative means of the uncertainty estimates against those of “reference” models, which refer to models trained with both seen and unseen classes, for both inlier and outlier classes. While both teacher and student assign higher uncertainty to outlier classes than the “reference” models do on average, the relative mean is much higher for the student. To further quantify the performance, we also compute the Jensen–Shannon distance between the distributions of uncertainty estimates of inlier and outlier classes [140]. Again, the difference between the inlier and outlier distributions is larger for the student network, suggesting its enhanced ability for outlier detection.

Lastly, we show in Figure 5.7 two randomly chosen examples to illustrate the difference between teacher and student. Regions with pedestrians and bicyclists clearly have higher uncertainty estimates for both the teacher and the student when these classes are not present in training. The magnitude is much larger for the student, as shown by the bright spots in the uncertainty plot.

Figure 5.7: Example predictions on CamVid when “pedestrian” and “bicyclist” are held out during training. “Reference” refers to models trained with all classes.

5.4.2 Pixel-Wise Depth Estimation

For pixel-wise depth estimation, we conduct experiments on NYU DEPTH V2 (NYU) and the KITTI Odometry dataset (KITTI). We use the same ResNet-based architecture as [141] for RGB-based depth estimation on both datasets, with dropout (p = 0.2) placed after each convolutional layer except the final one. For NYU, we use the same train/test split as in [141], and for KITTI we train our models on sequences 00-10 and evaluate them on sequences 11-21. Identical procedures are used to train the teacher models for both NYU and KITTI.
During training, an SGD optimizer with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of $10^{-4}$ is adopted with a “poly” learning rate policy for a total of 40 epochs. For NYU, we initialize the student model with the weights of the teacher and train the student model for 30 epochs using a smaller learning rate of 0.005. We empirically observe that initializing with the teacher model for KITTI leads to overfitting to the training set, and thus we train the student model from scratch with the same procedure as used for teacher training. We use a batch size of 8 in all of the depth estimation experiments.

Evaluation

The quantitative performance of both the teacher and the student models is summarized in Table 5.2. Similar to the segmentation tasks, the student model outperforms the teacher on most of the evaluation metrics. Example predictions shown in Figure 5.8 again illustrate that the student network is able to closely approximate the uncertainty estimates produced by the teacher model. Moreover, as more dropout layers are inserted into the NNs for the depth estimation experiments, the relative speed-up ratio achieved by the student model is further increased, because less computation can be cached for the teacher. For instance, the student model achieves a speed-up ratio of 46 for the NYU dataset.

Table 5.2: Results on the depth estimation. “T”, “S” and “AU” correspond to the teacher model, the student model, and the aleatoric uncertainty respectively. “T+AU” corresponds to a teacher model trained with the aleatoric uncertainty.

| Metric | NYU T | NYU S | NYU T+AU | NYU S+AU | KITTI T | KITTI S | KITTI T+AU | KITTI S+AU |
|---|---|---|---|---|---|---|---|---|
| RMSE ↓ | 0.542 | 0.540 | 0.548 | 0.548 | 4.80 | 4.75 | 4.83 | 4.81 |
| REL ↓ | 0.155 | 0.152 | 0.158 | 0.154 | 0.123 | 0.122 | 0.117 | 0.117 |
| log 10 ↓ | 0.065 | 0.064 | 0.065 | 0.064 | 0.053 | 0.052 | 0.052 | 0.051 |
| δ1 ↑ | 0.793 | 0.798 | 0.794 | 0.799 | 0.843 | 0.847 | 0.845 | 0.846 |
| δ2 ↑ | 0.947 | 0.949 | 0.945 | 0.946 | 0.948 | 0.951 | 0.950 | 0.949 |
| δ3 ↑ | 0.985 | 0.984 | 0.982 | 0.981 | 0.981 | 0.982 | 0.981 | 0.981 |
| ECE ↓ (×10⁻²) | 9.38 | 8.09 | 5.79 | 5.13 | 7.80 | 2.95 | 4.53 | 2.18 |
| AUSE ↓ (×10⁻²) | 6.01 | 6.06 | 5.88 | 5.82 | 0.701 | 0.660 | 0.597 | 0.595 |
| Runtime (s) ↓ | 0.73 | 0.016 | 0.739 | 0.016 | 0.28 | 0.007 | 0.29 | 0.007 |

Table 5.3: Top four rows: Impact of adding augmentation in training on the quality of uncertainty produced on the CamVid and NYU datasets. “T” and “S” represent teacher and student models, and “AUG” corresponds to augmentation. Last row: Uncertainty performance of the student model when a deep ensemble with five NNs is used as the teacher model.

| Model | CamVid ECE (×10⁻³) | CamVid AUSE (×10⁻²) | NYU ECE (×10⁻³) | NYU AUSE (×10⁻²) |
|---|---|---|---|---|
| T w/o AUG | 3.67 | 1.62 | 57.9 | 5.88 |
| T w/ AUG | 3.90 | 1.62 | 57.1 | 5.90 |
| S w/o AUG | 4.63 | 2.19 | 54.0 | 5.91 |
| S w/ AUG | 2.86 | 1.60 | 51.3 | 5.80 |
| S w/ Ens T | 2.96 | 1.91 | 56.3 | 5.93 |

Figure 5.8: Example predictions on NYU.

5.4.3 Ablation Study on Additional Augmentation

To demonstrate the importance of additional augmentation during student training, we also summarize in Table 5.3 the results obtained when the student is trained without extra augmentation. Using extra augmentation in the student training process, as discussed in Section 5.3.2, helps the student produce much better uncertainty estimates. We can also see that the same extra augmentation does not improve the teacher’s uncertainty estimation, suggesting that the student model benefits from seeing teacher predictions that are more closely aligned with the test-time predictive distributions, rather than from data augmentation itself.
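For readers who want to reproduce the calibration column of this ablation, the classification form of ECE [70] can be sketched as below. This is a minimal illustration of the standard binned definition, not the evaluation code behind Table 5.3: the function name and the default of 10 equal-width bins are assumptions, and the regression variant of [115] used for NYU is a different computation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin.

    confidences: predicted probability of the chosen class, one per sample
    correct:     1 if the prediction was right, 0 otherwise, one per sample
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

In pixel-wise segmentation, each labeled pixel would contribute one `(confidence, correct)` pair, so the lists are simply the flattened per-pixel maximum softmax probabilities and correctness indicators.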
5.4.4 Distilling from Deep Ensemble

To examine the effectiveness of using deep ensembles as teachers [143], we train an ensemble of deterministic neural networks with aleatoric uncertainty [121]. The training details are identical to those described above. Due to limited computational resources, we fix the number of models in the ensemble to five. Unlike [143], we do not use a Dirichlet distribution to approximate the teacher’s predictive distribution for classification, because we empirically found it to be numerically very unstable, leading to failure of convergence. We show the uncertainty results in Table 5.3. Full results can be found in the Appendix. As seen from Table 5.3, the student obtained from the ensemble teacher has worse calibration performance than the student distilled from MC dropout teachers. The gap is likely due to the difficulty of learning a good predictive distribution from just 5 samples.

5.4.5 Discussion

Our experiments show that incorporating aleatoric uncertainty can result in only minimal improvements for both teachers and students. This could be caused by significant overlap between the two types of uncertainty learned by the teacher model, since the two types are not mutually exclusive and can coincide significantly [104]. Nonetheless, aleatoric uncertainty can be beneficial for other tasks and datasets. The goal of this chapter is to propose a general distillation strategy that is also capable of incorporating aleatoric uncertainty. As seen from the experimental results, students trained with the proposed approach can match or surpass their teacher models in performance with or without aleatoric uncertainty.

We also stress that, since the student is supervised by both the ground truth labels and the teacher’s predictions, there can be discrepancies in predictive distributions between the teacher and the student models.
Nevertheless, we believe these discrepancies can be beneficial and account for the improved performance of the student. As demonstrated, student models produce well-calibrated uncertainty maps that also make sense semantically, without the need for expensive multiple forward passes.

CHAPTER 6
TOWARDS A DEEPER UNDERSTANDING OF KNOWLEDGE DISTILLATION

It has been recently demonstrated that multi-generational self-distillation can improve generalization [49]. Despite this intriguing observation, reasons for the enhancement remain poorly understood. In this chapter, we first demonstrate experimentally that the improved performance of multi-generational self-distillation is in part associated with the increasing diversity in teacher predictions. With this in mind, we offer a new interpretation for teacher-student training as amortized MAP estimation, such that teacher predictions enable instance-specific regularization. Our framework allows us to theoretically relate self-distillation to label smoothing, a commonly used technique that regularizes predictive uncertainty, and suggests the importance of predictive diversity in addition to predictive uncertainty. We present experimental results using multiple datasets and neural network architectures that, overall, demonstrate the utility of predictive diversity. Finally, we propose a novel instance-specific label smoothing technique that promotes predictive diversity without the need for a separately trained teacher model. We provide an empirical evaluation of the proposed method, which, we find, often outperforms classical label smoothing.
6.1 Introduction

First introduced as a simple method to compress high-capacity neural networks into low-capacity counterparts for computational efficiency, knowledge distillation [84] has since gained much popularity across various application domains, ranging from computer vision to natural language processing [108, 128, 166, 224, 226], as an effective method to transfer knowledge or features learned by a teacher network to a student network. This empirical success is often justified with the intuition that deeper teacher networks learn better representations with greater model complexity, and that the “dark knowledge” that teacher networks provide helps student networks learn better representations and hence achieve better generalization performance. Nevertheless, it remains an open question how exactly student networks benefit from this dark knowledge. The problem is made more puzzling by the recent observation that even self-distillation, a special case of the teacher-student training framework in which the teacher and student networks have identical architectures, can lead to better generalization performance [49]. It was also demonstrated that repeating the self-distillation process over multiple generations can further improve classification accuracy.

In this work, we aim to shed some light on self-distillation. We start off by revisiting the multi-generational self-distillation strategy, and experimentally demonstrate that the performance improvement observed in multi-generational self-distillation is correlated with increasing diversity in teacher predictions. Inspired by this, we view self-distillation as instance-specific regularization on the neural network softmax outputs, and cast the teacher-student training procedure as performing amortized maximum a posteriori (MAP) estimation of the softmax probability outputs.
The proposed framework provides us with a new interpretation of the teacher predictions as instance-specific priors conditioned on the inputs. This interpretation allows us to theoretically relate distillation to label smoothing, a commonly used technique to regularize the predictive uncertainty of NNs, and suggests that regularization on the softmax probability simplex space, in addition to regularization on predictive uncertainty, can be the key to better generalization. To verify the claim, we systematically design experiments to compare teacher-student training against label smoothing. Lastly, to further demonstrate the potential gain from regularization on the probability simplex space, we also design a new regularization procedure based on label smoothing that we term “Beta smoothing.”

Our contributions can be summarized as follows:

1. We provide a plausible explanation for recent findings on multi-generational self-distillation.
2. We offer an amortized MAP interpretation of the teacher-student training strategy.
3. We attribute the success of distillation to regularization on both the label space and the softmax probability simplex space, and verify the importance of the latter with systematically designed experiments on several benchmark datasets.
4. We propose a new regularization technique termed “Beta smoothing” that improves upon classical label smoothing at little extra cost.
5. We demonstrate that self-distillation can improve calibration.

6.2 Related Works

Knowledge distillation was first proposed as a way to compress models [4, 20, 84]. In addition to the standard approach in which the student model is trained to match the teacher predictions, numerous other objectives have been explored for enhanced distillation performance. For instance, distilling knowledge from intermediate hidden layers was found to be beneficial [83, 92, 107, 184, 194, 224].
Recently, data-free distillation, a novel scenario in which the original data for the teacher is unavailable to students, has also been extensively studied [22, 26, 148, 225].

The original knowledge distillation technique for neural networks [84] has stimulated a flurry of interest in the topic, with a large number of published improvements and applications. For instance, prior works [5, 188] have proposed Bayesian techniques in which distributions are distilled with Monte Carlo samples into more compact models such as a neural network. More recently, there has also been work on the importance of distillation from an ensemble of models [143], which provides a complementary view on the role of predictive diversity. Lopez-Paz et al. [132] combined distillation with the theory of privileged information, and offered a generalized framework for distillation. To simplify distillation, Zhu et al. [240] proposed a method for one-stage online distillation. There have also been successful applications of distillation for adversarial robustness [166].

Several papers have attempted to study the effect of distillation training on student models. Furlanello et al. [49] examined the effect of distillation by comparing the gradients of the distillation loss against those of the standard cross-entropy loss with ground truth labels. Phuong et al. [171] considered a special case of distillation using linear and deep linear classifiers, and theoretically analyzed the effect of distillation on student models. Cho and Hariharan [30] conducted a thorough experimental analysis of knowledge distillation, and observed that larger models may not be better teachers. Another experimentally driven study of the effect of distillation was conducted in the context of natural language processing [237]. Most similar to our work is [227], in which the authors also established a connection between label smoothing and distillation. However, our argument
However, our argument 99 comes from a different theoretical perspective and offers complementary insights. Specifically, [227] does not highlight the importance of instance-specific regulariza- tion. We also provide a general MAP framework and a careful empirical comparison of label smoothing and self-distillation. 6.3 Preliminaries We consider the problem of k-class classification. Let X ⊆ Rd be the feature space and Y = {1, .., k} be the label space. Given a dataset D = {xi, y ni}n=1 where each feature-label pair (xi, yi) ∈ X × Y , and we are interested in finding a function that maps input features to corresponding labels f : X → Rc. In this work, we restrict the function class to the set of neural networks fw(x) where w = {W }Li i=1 are the parameters of a neural network with L layers. We define a likelihood model p(y|x;w) = Cat (softmax (fw(x))), a categorical distribution with param- eters softmax (fw(x)) ∈ ∆(L). Here ∆(L) denotes the L-dimensional probability simplex. Typically, maximum likelihood estimation (MLE) is performed. This leads to the cross-entropy loss ∑n ∑k Lcce(w) = − yij log p(y = j|xi;w), (6.1) i=1 j=1 where yij corresponds to the j-th element of the one-hot encoded label yi. 6.3.1 Teacher-Student Training Objective Given a pre-trained mod∑el (∑teacher) fwt , distillation loss can be defined as:n k ( ) Ldist(w) = − [softmax fwt(x)/T ]j log p(y = j|xi;w), (6.2) i=1 j=1 100 where [·]j denotes the j’th element of a vector. A second network (student) fw can then be trained with the following total loss: L(w) = αLcce(w) + (1− α)Ldist(w), (6.3) where α ∈ [0, 1] is a hyper-parameter, and T corresponds to the temperature scaling hyper-parameter that flattens teacher predictions. In self-distillation, both teacher and student models have the same network architecture. In the original self-distillation experiments conducted by Furlanello et al. [49], α and T are set to 0 and 1, respectively throughout the entire training process. 
Note that, temperature scaling has been applied differently compared to pre- vious literature on distillation [84]. As addressed in Section 6.5, we only apply temperature scaling to teacher predictions in computing distillation loss. We em- pirically observe that this yields results consistent with previous reports. More- over, as we show in the Appendix D.2, performing temperature scaling only on the teacher but not the student models can lead to significantly more calibrated predictions. 6.4 Multi-Generation Self-Distillation: A Close Look Self-distillation can be repeated iteratively such that during training of the i-th generation, the model obtained at (i−1)-th generation is used as the teacher model. This approach is referred to as multi-generational self-distillation, or “Born-Again Networks” (BAN). Empirically it has been observed that student predictions can consistently improve with each generation. However, the mechanism behind this improvement has remained elusive. In this work, we argue that the main attribute that leads to better performance is the increasing uncertainty and diversity in 101 teacher predictions. Similar observations that more “tolerant” teacher predictions lead to better students were also made by Yang et al. [223]. Indeed, due the mono- tonicity and convexity of the negative log likelihood function, since the element that corresponds to the true label class of the softmax output p(y = yi|xi;w) is often much greater than that of the other classes, together with early stopping, each subsequent model will likely have increasingly unconfident softmax outputs corresponding to the true label class. 6.4.1 Predictive Uncertainty We use Shannon Entropy to quantify the uncertainty in instance-specific teacher predictions p(y|x;wi), averaged over the training set, which we call “Average Pre- dictive Uncertainty,” and define as: ∑n ∑n ∑k Ex [H (p(·|x;wi))] ≈ 1 1 H (p(·|xj;wi)) = −p(yc|xj;wi) log p(yc|xj;wi). 
n n j=1 j=1 c=1 (6.4)

Note that previous literature [38,169] has also proposed using the above measure as a regularizer to prevent over-confident predictions. Label smoothing [169,198] is a closely related technique that also penalizes over-confident predictions by explicitly smoothing out ground-truth labels. A detailed discussion on the relationship between the two can be found in Appendix D.1.

6.4.2 Confidence Diversity

Average Predictive Uncertainty is insufficient to fully capture the variability associated with teacher predictions. In this chapter, we argue it is also important to consider the amount of spreading of teacher predictions over the probability simplex among different (training) samples. For instance, two teachers can have very similar Average Predictive Uncertainty values, but drastically different amounts of spread on the probability simplex if the softmax predictions of one teacher are much more diverse among different samples than the other. We coin this population spread in predictive probabilities "Confidence Diversity." As we show below, characterizing the Confidence Diversity can be important for understanding teacher-student training.

The differential entropy¹ over the entire probability simplex is a natural measure to quantify the confidence diversity. However, accurate entropy estimation can be challenging, and its computation is severely hampered by the curse of dimensionality, particularly in applications with a large number of classes. To alleviate the problem, in this chapter, we propose to measure only the entropy of the softmax element corresponding to the true label class, thereby simplifying the measure to a one-dimensional entropy estimation task. Mathematically, if we denote $c = \phi(x, y) := [\mathrm{softmax}(f_w(x))]_y$, and let $p_C$ be the probability density function of the random variable $C := \phi(X, Y)$ where $(X, Y) \sim p(x, y)$, then we quantify Confidence Diversity via the differential entropy of $C$:

$$h(C) = -\int p_C(c) \log p_C(c) \, dc. \tag{6.5}$$

We use the KNN-based entropy estimator to compute h(C) over the training set [9]. In essence, the above measure quantifies the amount of spread associated with the teacher predictions on the true label class. The smaller the value, the more similar the softmax values are across different samples.

¹ This is distinct from the average predictive uncertainty discussed in the previous section, which measures the average Shannon entropy of probability vectors.

[Figure 6.1 panels, each plotted against Generations (0 to 10): Test Accuracy, Test NLL, Predictive Uncertainty, Confidence Diversity.]

Figure 6.1: Results for sequential self-distillation over 10 generations. The model obtained at the (i − 1)-th generation is used as the teacher model for training at the i-th generation. Accuracy and NLL are obtained on the test set using the student model, whereas the predictive uncertainty and confidence diversity are evaluated on the training set with teacher predictions.

6.4.3 Sequential Self-Distillation Experiment

We perform sequential self-distillation with ResNet-34 on the CIFAR-100 dataset for 10 generations. At each generation, we train the neural networks for 150 epochs using the same optimization procedure as in the original ResNet paper [79]. Following Furlanello et al. [49], α and T are set to 0 and 1 respectively throughout the entire training process. Additional experiments with different values of T can be found in Appendix D.3.

Fig. 6.1 summarizes the results. As indicated by the general increasing trend in test accuracy, sequential distillation indeed leads to improvements. The entropy plots also support the hypothesis that subsequent generations exhibit increasing diversity and uncertainty in predictions.
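The two entropy measures tracked in these plots can be sketched in a few lines. This is a minimal illustration rather than the implementation used for the experiments: the function names are hypothetical, and the Kozachenko-Leonenko k-nearest-neighbour estimator is assumed here as one common KNN-based choice, which may differ from the estimator of [9].

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def average_predictive_uncertainty(probs):
    """Mean Shannon entropy of the per-sample probability vectors."""
    total = 0.0
    for p in probs:
        total -= sum(q * math.log(q) for q in p if q > 0.0)
    return total / len(probs)


def confidence_diversity(true_class_probs, k=3):
    """Kozachenko-Leonenko estimate of the 1-D differential entropy h(C).

    Assumption: h ~ psi(n) - psi(k) + log(2) + mean(log r_i), where r_i is
    the distance from sample i to its k-th nearest neighbour and 2 is the
    length of the unit ball in one dimension. Requires n >= k + 1 samples.
    """
    xs = sorted(true_class_probs)
    n = len(xs)
    log_r_sum = 0.0
    for i, x in enumerate(xs):
        # In sorted 1-D data the k nearest neighbours lie within k positions.
        lo, hi = max(0, i - k), min(n, i + k + 1)
        dists = sorted(abs(x - xs[j]) for j in range(lo, hi) if j != i)
        log_r_sum += math.log(max(dists[k - 1], 1e-12))  # guard against ties
    return digamma_int(n) - digamma_int(k) + math.log(2.0) + log_r_sum / n


# Illustrative inputs only (synthetic values, not teacher predictions).
random.seed(0)
spread_out = [random.random() for _ in range(2000)]
concentrated = [0.1 * v for v in spread_out]
diversity_gap = confidence_diversity(spread_out) - confidence_diversity(concentrated)
```

Values that are spread over a wider interval yield a larger estimate than values concentrated in a narrow one, which is the behaviour the Confidence Diversity measure is meant to capture.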
Despite the same increasing trend, the two entropy metrics quantify different things. The increase in average predictive uncertainty suggests an overall drop in the confidence of the categorical distribution, while the growth in confidence diversity suggests an increasing variability in teacher predictions. Interestingly, we also see obvious improvements in terms of NLL, suggesting in addition that BAN can improve calibration of predictions [70].

To further study the apparent correlation between student performance and entropy of teacher predictions over generations, we conduct a new experiment, where we instead train a single teacher. This teacher is then used to train a single generation of students while varying the temperature hyper-parameter T in Eq. 6.3, which explicitly adjusts the uncertainty and diversity of teacher predictions. For consistency, we keep α = 0. Results are illustrated in Fig. 6.2. As expected, increasing T leads to greater predictive uncertainty and diversity in teacher predictions. Importantly, we see this increase leads to drastic improvements in the test accuracy of students. In fact, the gain is much greater than the best achieved with 10 generations of BAN with T = 1 (indicated with the flat line in the plot). The identified correlation is consistent with the recent finding that early-stopped models, which typically have much larger entropy than fully trained ones, serve as better teachers [30]. Lastly, we also see improvements in NLL with increasing entropy of teacher predictions. However, an excessively high T leads to a subsequent increase in NLL, likely due to teacher predictions that lack confidence.

A closer look at the entropy metrics of the above experiment reveals an important insight. While the average predictive uncertainty is strictly increasing with T, the confidence diversity plateaus after T = 2.5.
The plateau of confidence diversity coincides closely with the stagnation of student test accuracy, hinting at the importance of confidence diversity in teacher predictions. The apparent correlation between accuracy and confidence diversity can also be seen in the additional sequential self-distillation experiments found in Appendix D.3. This makes intuitive sense. Given a training set, we would expect some of the samples to be much more representative of the label class than others. Ideally, we would hope to classify the typical examples with much greater confidence than an ambiguous example of the same class. Previous results show that training with such instance-specific uncertainty can indeed lead to better performance [170]. Our view is that in self-distillation, the teacher provides the means for instance-specific regularization.

[Figure 6.2 panels, each plotted against temperature (1.0 to 4.0): Test Accuracy, Test NLL, Predictive Uncertainty, Confidence Diversity.]

Figure 6.2: Results with teacher predictions scaled by varying temperature T. The flat lines in the plots correspond to the largest values (for accuracy, predictive uncertainty, and confidence diversity) or the smallest value (for NLL) achieved over 10 generations of sequential distillation with T = 1 in the previous experiments.

6.5 An Amortized MAP Perspective of Self-Distillation

The instance-specific regularization perspective on self-distillation motivates us to recast the training procedure as performing Maximum a posteriori (MAP) estimation on the softmax probability vector.
Specifically, suppose now that the likelihood $p(y \mid x, z) = \mathrm{Cat}(z)$ is a categorical distribution with parameter $z \in \Delta(L)$ and the conditional prior $p(z \mid x) = \mathrm{Dir}(\alpha_x)$ is a Dirichlet distribution with instance-specific parameter $\alpha_x$. Due to conjugacy of the Dirichlet prior, a closed-form solution

$$\hat{z}_i = \frac{c_i + [\alpha_x]_i - 1}{\sum_j \left( c_j + [\alpha_x]_j - 1 \right)},$$

where $c_i$ corresponds to the number of occurrences of the i-th category, can be easily obtained.

The above framework is not useful for classification when given a new sample x without any observations y. Moreover, in the common supervised learning setup, only one observation of label y is available for each sample x. The MAP solution shown above merely relies on the provided label y for each sample x, without exploiting the potential similarities among different samples $x_i$ in the entire dataset for more accurate estimation. For example, we could have different samples that are almost duplicates (cf. [6]), but have different $y_i$'s, which could inform us about other labels that could be drawn from $z_i$. Thus, instead of relying on the instance-level closed-form solution, we can train a (student) network to amortize the MAP estimation $\hat{z}_i \approx \mathrm{softmax}(f_w(x_i))$ with a given training set, resulting in an optimization problem of:

$$
\begin{aligned}
\max_w \sum_{i=1}^{n} \log p(z \mid x_i, y_i; w, \alpha_x)
&= \max_w \sum_{i=1}^{n} \log p(y = y_i \mid z, x_i; w) + \log p(z \mid x_i; w, \alpha_x) \\
&= \max_w \underbrace{\sum_{i=1}^{n} \log [\mathrm{softmax}(f_w(x_i))]_{y_i}}_{\text{Cross entropy}} + \underbrace{\sum_{i=1}^{n} \sum_{c=1}^{k} \left( [\alpha_{x_i}]_c - 1 \right) \log [z]_c}_{\text{Instance-specific regularization}}.
\end{aligned}
\tag{6.6}
$$

Eq. 6.6 is an objective that provides us with a function to obtain a MAP solution of z given an input sample x. Note that we do not make any assumptions about the availability or number of label observations of y for each sample x. This enables us to find an approximate MAP solution for x at test time, when $\alpha_x$ and y are unavailable. The resulting framework can be generally applicable to various scenarios like semi-supervised learning or learning from multiple labels per sample.
Nevertheless, in the following, we restrict our attention to supervised learning with a single label per training sample.

6.5.1 Label Smoothing as MAP

The difficulty now lies in obtaining the instance-specific prior $\mathrm{Dir}(\alpha_x)$. A naive independence assumption that $p(z \mid x) = p(z)$ can be made. Under such an assumption, a sensible choice of prior would be a uniform distribution across all possible labels. Choosing $[\alpha_x]_c = [\alpha]_c = \frac{\beta}{k} + 1$ for all $c \in \{1, ..., k\}$ for some hyper-parameter β, the MAP objective becomes

$$\mathcal{L}_{LS} = \sum_{i=1}^{n} -\log [z]_{y_i} + \beta \sum_{i=1}^{n} \sum_{c=1}^{k} -\frac{1}{k} \log [z]_c. \tag{6.7}$$

As noted in prior work, this loss function is equivalent to the commonly used label smoothing (LS) regularization [169, 198] (derivations can be found in Appendix D.1). Observe also that the training objective in essence promotes predictions with larger predictive uncertainty, but not confidence diversity.

6.5.2 Self-Distillation as MAP

A better instance-specific prior distribution can be obtained using a pre-trained (teacher) neural network. Let us consider a network $f_{w_t}$ trained with the regular MLE objective, by maximizing $p(y \mid x; w_t) = \mathrm{Cat}(\mathrm{softmax}(f_{w_t}(x)))$, where

$$[\mathrm{softmax}(f_{w_t}(x))]_i = \frac{[\exp(f_{w_t}(x))]_i}{\sum_j [\exp(f_{w_t}(x))]_j}.$$

Now, due to conjugacy of the Dirichlet prior, the marginal likelihood $p(y \mid x; \alpha_x)$ is a Dirichlet-multinomial distribution [150]. In the case of the single label observation considered, the marginal likelihood reduces to a categorical distribution. As such, we have $p(y \mid x; \alpha_x) = \mathrm{Cat}(\bar{\alpha}_x)$, where $\bar{\alpha}_x$ is normalized such that $[\bar{\alpha}_x]_i = \frac{[\alpha_x]_i}{\sum_j [\alpha_x]_j}$. We can thus interpret $\exp(f_{w_t}(x))$ as the parameters of the Dirichlet distribution to obtain a useful instance-specific prior on z. However, we observe that there is a scale ambiguity that needs resolving, since any of the following will yield the same $\bar{\alpha}_x$:

$$\alpha_x = \beta \exp(f_{w_t}(x)/T) + \gamma, \tag{6.8}$$

where T = 1 and γ = 0, and β corresponds to some hyper-parameter.
Using T > 1 and γ > 0 corresponds to flattening the prior distribution, which we found to be useful in practice, an observation consistent with prior work. Note that in the limit of T → ∞, the instance-specific prior reduces to a uniform prior corresponding to classical label smoothing. Setting γ = 1 (we also experimentally explore the effect of varying γ; see Appendix D.9 for details), we obtain

$$\alpha_x = \beta \exp(f_{w_t}(x)/T) + 1 = \beta \sum_j [\exp(f_{w_t}(x)/T)]_j \, \mathrm{softmax}(f_{w_t}(x)/T) + 1. \tag{6.9}$$

Plugging this into Eq. 6.6 yields

$$\mathcal{L}_{SD} = \sum_{i=1}^{n} -\log [z]_{y_i} + \beta \sum_{i=1}^{n} \omega_{x_i} \sum_{c=1}^{k} -[\mathrm{softmax}(f_{w_t}(x_i)/T)]_c \log [z]_c, \tag{6.10}$$

very similar to the distillation loss of Eq. 6.3, with an additional sample-specific weighting term $\omega_{x_i} = \sum_j [\exp(f_{w_t}(x_i)/T)]_j$.

Despite the interesting result, we empirically observe that, with temperature values T found to be useful in practice, the relative weightings of samples are too close to yield a significant difference from the regular distillation loss. Hence, for all of our experiments, we still adopt the distillation loss of Eq. 6.3. However, we believe that, with teacher models trained with an objective more appropriate than MLE, the difference might be bigger. We hope to explore alternative ways of obtaining teacher models to effectively utilize the sample re-weighted distillation objective as future work.

The MAP interpretation, together with the empirical experiments conducted in Section 6.4, suggests that multi-generational self-distillation can in fact be seen as an inefficient approach to implicitly flatten and diversify the instance-specific prior distribution. Our experiments suggest that instead, we can more effectively tune the hyper-parameters T and γ to achieve similar, if not better, results. Moreover, from this perspective, distillation in general can be understood as a regularization strategy. Some empirical evidence for this can be found in Appendix D.5 and D.6.
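The step from the Dirichlet prior of Eq. 6.9 to the sample-weighted loss of Eq. 6.10 can be checked numerically for a single sample. The sketch below is illustrative only: the function names and the toy logits are assumptions, and a practical implementation would work in log space to avoid overflow in the exponentials.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    m = max(logits)
    exps = [math.exp((v - m) / temperature) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def map_sample_loss(student_probs, label, teacher_logits, beta, temperature):
    """Negated per-sample MAP objective of Eq. 6.6 with the prior of Eq. 6.9,
    i.e. alpha_x = beta * exp(f_wt(x) / T) + 1 (gamma = 1)."""
    alpha = [beta * math.exp(v / temperature) + 1.0 for v in teacher_logits]
    cross_entropy = -math.log(student_probs[label])
    regularizer = -sum((a - 1.0) * math.log(z) for a, z in zip(alpha, student_probs))
    return cross_entropy + regularizer


def weighted_distillation_sample_loss(student_probs, label, teacher_logits, beta, temperature):
    """Per-sample term of Eq. 6.10: cross entropy plus the sample-weighted
    cross entropy against the tempered teacher softmax."""
    omega = sum(math.exp(v / temperature) for v in teacher_logits)  # sample weight
    soft_targets = softmax(teacher_logits, temperature)
    cross_entropy = -math.log(student_probs[label])
    distillation = -sum(t * math.log(z) for t, z in zip(soft_targets, student_probs))
    return cross_entropy + beta * omega * distillation


# Toy values chosen for illustration only.
student = softmax([0.2, 1.0, -0.5])
teacher_logits = [2.0, 0.5, -1.0]
loss_map = map_sample_loss(student, 0, teacher_logits, beta=0.3, temperature=2.0)
loss_sd = weighted_distillation_sample_loss(student, 0, teacher_logits, beta=0.3, temperature=2.0)
```

The two functions return the same value up to floating-point error, since $\beta \exp(f/T) = \beta \, \omega \, \mathrm{softmax}(f/T)$ elementwise.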
6.5.3 On the Relationship between Label Smoothing and Self-Distillation

The MAP perspective reveals an intimate relationship between self-distillation and label smoothing. Label smoothing increases the uncertainty of predictive probabilities. However, as discussed in Section 6.4, this might not be enough to prevent overfitting, as evidenced by the stagnant test accuracy despite increasing uncertainty in Fig. 6.2. Indeed, the MAP perspective suggests that, ideally, each sample should have a distinct probabilistic label. Instance-specific regularization can encourage confidence diversity, in addition to predictive uncertainty.

While the predictive uncertainty can be explicitly used for regularization as previously discussed in Section 6.4.1, we observe empirically that promoting confidence diversity directly through the proposed measure in Section 6.4.2 can be hard in practice, yielding unsatisfactory results. This could have been caused by difficulty in estimating confidence diversity accurately using mini-batch samples. Naively promoting confidence diversity during the early stage of training could also have harmed learning. As such, we can view distillation as an indirect way of achieving this objective. We leave it as future work to further explore alternative techniques to enable direct regularization of confidence diversity.

6.6 Beta Smoothing Labels

Self-distillation requires training a separate teacher model. In this chapter, we propose an efficient enhancement to the label smoothing strategy where the amount of smoothing will be proportional to the uncertainty of predictions. Specifically, we make use of the exponential moving average (EMA) predictions, as implemented by Tarvainen and Valpola [201], of the model during training, and obtain a ranking based on the confidence (the magnitude of the largest element of the softmax) of predictions at each mini-batch, on the fly, from smallest to largest.
Instead of assigning uniform distributions $[\alpha_x]_c = \frac{\beta}{k} + 1$ for all $c \in \{1, ..., k\}$ to all samples as priors, during each iteration, we sample and sort a set of i.i.d. random variables $\{b_1 \le ... \le b_m\}$ from Beta(a, 1), where m corresponds to the mini-batch size and a corresponds to the hyper-parameter associated with the Beta distribution. Then, we assign $[\alpha_{x_i}]_{y_i} = \beta b_i + 1$ and $[\alpha_{x_i}]_c = \beta \frac{1 - b_i}{k - 1} + 1$ for all $c \neq y_i$ as the prior to each sample $x_i$, based on the ranking obtained. In this way, samples with larger confidence obtained through the EMA predictions will receive less label smoothing and vice versa. Thus, the amount of label smoothing applied to a sample decreases with the confidence the model has in that sample's prediction. Those instances that are more challenging to classify will, therefore, have more smoothing applied to their labels.

In practice, for consistency with distillation, Eq. 6.3 is used for training. Beta-smoothed labels of $b_i$ on the ground-truth class and $\frac{1 - b_i}{k - 1}$ on all other classes are used in lieu of teacher predictions for each $x_i$. Lastly, note that EMA predictions are used in order to stabilize the ranking obtained at each iteration of training. We empirically observe a significant performance boost with the EMA predictions. We term this method Beta smoothing.

To better examine the role that EMA predictions play in Beta smoothing, we conduct two ablation studies. Firstly, since the EMA predictions are used for Beta smoothed labels, we compare the effectiveness of Beta smoothing against self-training explicitly using the EMA predictions (see Appendix D.4 for details). Moreover, to test the importance of the ranking obtained from EMA predictions, we include in Appendix D.7 an additional experiment in which random Beta smoothing is applied to each sample.
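The label construction described above can be sketched for one mini-batch as follows. This is a minimal illustration under stated assumptions: `beta_smoothed_labels` and its arguments are hypothetical names, the EMA confidences are taken as given inputs, and the resulting rows stand in for the teacher predictions of Eq. 6.3.

```python
import random


def beta_smoothed_labels(ema_confidences, labels, num_classes, a, rng=random):
    """Build Beta-smoothed soft labels for one mini-batch.

    ema_confidences: largest softmax element of the EMA model per sample.
    labels: ground-truth class index per sample.
    a: shape parameter of Beta(a, 1).
    """
    m = len(labels)
    # Sample m i.i.d. Beta(a, 1) variables and sort them, b_1 <= ... <= b_m.
    draws = sorted(rng.betavariate(a, 1.0) for _ in range(m))
    # Rank samples by EMA confidence, from smallest to largest.
    order = sorted(range(m), key=lambda i: ema_confidences[i])
    targets = [None] * m
    for rank, i in enumerate(order):
        b = draws[rank]  # more confident samples receive a larger b
        row = [(1.0 - b) / (num_classes - 1)] * num_classes
        row[labels[i]] = b
        targets[i] = row
    return targets


# Toy mini-batch with made-up confidences, for illustration only.
rng = random.Random(0)
confidences = [0.9, 0.2, 0.6, 0.4]
labels = [0, 1, 2, 1]
soft_labels = beta_smoothed_labels(confidences, labels, num_classes=3, a=3.0, rng=rng)
```

Each row sums to one, and the ground-truth entries follow the same order as the EMA confidences. Since the mean of Beta(a, 1) is a/(a + 1), the parameter a controls the average mass left on the ground-truth class.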
Beta smoothing regularization implements an instance-specific prior that encourages confidence diversity, and yet does not require the expensive step of training a separate teacher model. We note that, due to the constantly changing prior used at every iteration of training, Beta smoothing does not, strictly speaking, correspond to the MAP estimation in Eq. 6.6. Nevertheless, it is a simple and effective way to implement the instance-specific prior strategy. As we demonstrate in the following section, it can lead to much better performance than label smoothing. Moreover, unlike teacher predictions, which have unique softmax values for all classes, the difference between Beta and label smoothing only comes from the ground-truth softmax element. This enables us to conduct more systematic experiments to illustrate the additional gain from promoting confidence diversity.

6.7 Empirical Comparison of Distillation and Label Smoothing

To further demonstrate the benefits of the additional regularization on the softmax probability vector space, we design a systematic experiment to compare self-distillation against label smoothing. In addition, experiments on Beta smoothing are also conducted to further verify the importance of confidence diversity, and to promote Beta smoothing as a simple alternative that can lead to better performance than label smoothing at little extra cost. We note that, while previous works have highlighted the similarity between distillation and label smoothing from another perspective [227], we provide a detailed empirical analysis that uncovers additional benefits of instance-specific regularization.

6.7.1 Experimental Setup

We conduct experiments on CIFAR-100 [111], CUB-200 [215] and Tiny-ImageNet [35] using ResNet [79] and DenseNet [89]. We follow the original optimization configurations, and train the ResNet models for 150 epochs and the DenseNet models for 200 epochs. 10% of the training data is split as the validation set.
All experiments are repeated 5 times with random initialization. For simplicity, label smoothing is implemented with explicit soft labels instead of the objective in Eq. 6.7. We fix ϵ = 0.15 in label smoothing for all our experiments (additional experiments with ϵ = 0.1, 0.3 can be found in Appendix D.3.1). The hyper-parameter α of Eq. 6.3 is taken to be 0.6 for self-distillation. Only one generation of distillation is performed for all experiments. To systematically decompose the effect of the two regularizations in self-distillation, given a pre-trained teacher and α, we manually search for a temperature T such that the average effective label of the ground-truth class, $\alpha + (1 - \alpha)[\mathrm{softmax}(f_{w_t}(x_i)/T)]_{y_i}$, is approximately equal to 0.85, to match the hyper-parameter ϵ chosen for label smoothing. Eq. 6.3 is also used for Beta smoothing with α = 0.4. The parameter a of the Beta distribution is set such that $E[\alpha + (1 - \alpha) b_i] = 1 - \epsilon$, to make the average probability of the ground-truth class the same as in ϵ-label smoothing.

We emphasize that the goal of the experiment is to methodically decompose the gain from the two aforementioned regularizations of distillation. Note that both α and T can influence the amount of predictive uncertainty and confidence diversity in teacher predictions at the same time. This coupled effect can make hyper-parameter tuning hard. Due to limited computational resources, hyper-parameter tuning is not performed, and the results for all methods can be potentially enhanced. Lastly, we also incorporate an additional distillation experiment in which the deeper DenseNet model is used as the teacher model for comparison against self-distillation. Results can be found in Appendix D.8.

6.7.2 Results

Test accuracies are summarized in the top row for each experiment in Fig. 6.3. Firstly, all regularization techniques lead to improved accuracy compared to the baseline model trained with cross-entropy loss.
In agreement with previous results, self-distillation performs better than label smoothing in all of the experiments with our setup, in which the effective degree of label smoothing in distillation is, on average, the same as that of regular label smoothing. The results suggest the importance of confidence diversity in addition to predictive uncertainty. It is worth noting that we obtain encouraging results with Beta smoothing. Outperforming label smoothing in all but the CIFAR-100 ResNet experiment, it can even achieve performance comparable to that of self-distillation for the CUB-200 dataset with no separate teacher model required. The improvements of Beta smoothing over label smoothing also serve as direct evidence of the importance of confidence diversity, as the only difference between the two is the additional spreading of the ground-truth classes. We hypothesize that the gap in accuracy between Beta smoothing and self-distillation is mainly due to better instance-specific priors set by a pre-trained teacher network. The differences in the non-ground-truth classes between the two methods could also account for the small gap in accuracy.

[Figure 6.3 panels, each showing Accuracy (top) and ECE (bottom) for CE, LS, B and SD: (a) CIFAR-100, ResNet-34; (b) CIFAR-100, DenseNet-100-12; (c) CUB-200, ResNet-34; (d) CUB-200, DenseNet-121-12; (e) Tiny-Imagenet, ResNet-34; (f) Tiny-Imagenet, DenseNet-100-12.]

Figure 6.3: Experimental results on the CIFAR-100, CUB-200 and Tiny-Imagenet datasets. "CE", "LS", "B" and "SD" refer to "Cross Entropy", "Label Smoothing", "Beta Smoothing" and "Self-Distillation" respectively. The top rows of each experiment show bar charts of accuracy on the test set, while the bottom rows are bar charts of expected calibration error.

Results on calibration are shown in the bottom rows of Fig. 6.3, where we report the expected calibration error (ECE) [70]. As anticipated, all regularization techniques lead to enhanced calibration. Nevertheless, we see that the errors obtained with self-distillation are much smaller in general compared to label smoothing. As such, instance-specific priors can also lead to more calibrated models. Beta smoothing again not only produces models with much more calibrated predictions compared to label smoothing, but also compares favorably to self-distillation in a majority of the experiments.

6.8 Discussion and Future Directions

Recent literature shows that label smoothing leads to better calibration performance [153]. In this chapter, we demonstrate that distillation can also yield more calibrated models. We believe this is a direct consequence of not performing temperature scaling on student models during training. Indeed, with temperature scaling also on the student models, the student logits are likely pushed larger during training, leading to over-confident predictions.

More generally, we have only discussed the teacher-student training strategy as MAP estimation. There have been other recently proposed techniques involving training with soft labels, which we can interpret as encouraging confidence diversity or implementing instance-specific regularization. For instance, the mixup regularization [233] technique creates label diversity by taking random convex combinations of the training data, including the labels. Recently proposed consistency-based semi-supervised learning methods such as [119,201], on the other hand, utilize predictions on unlabeled training samples as an instance-specific prior. We believe this unifying view of regularization with soft labels can stimulate further
We believe this unifying view of regularization with soft labels can stimulate further 116 ideas on instance-specific regularization. 117 CHAPTER 7 A CASE STUDY OF DEEP LEARNING TO DIGITAL PATHOLOGY IMAGE ANALYSIS Machine Learning (ML) algorithms have been increasingly applied to a range of histopathology tasks. An area that remains under-explored is the in-depth analysis of the errors made by pathologists and ML models in these tasks. Furthermore, little has been reported on hybrid approaches combining human and ML predic- tions. In this final chapter, we present a detailed empirical analysis comparing expert neuropathologists and ML models for the task of predicting IDH mutation status in HE-stained histology slides of infiltrating gliomas, independently and synergistically. We find that errors made by neuropathologists and ML models trained using the TCGA dataset are distinct, representing modest agreement be- tween predictions (human-vs-human κ = 0.656; human-vs-ML model κ = 0.598). While no ML model surpassed human performance on an independent institu- tional test dataset (human AUC = 0.918, max ML AUC = 0.883), a hybrid model aggregating human and ML predictions demonstrates predictive performance com- parable to the consensus of two expert neuropathologists (hybrid classifier AUC = 0.921 vs. two-neuropathologist consensus AUC = 0.920). In addition, we show that models trained at different optical objective levels can exhibit different types of errors, underscoring the significance of aggregating across spatial scales in the ML approach. Finally, we present a detailed interpretation of our multi-scale ML ensemble model which reveals that predictions are driven by human-identifiable features at the patch-level. 
7.1 Introduction

With the advancement of computer processing power and the demonstrated utility of deep learning approaches across multiple data-rich domains, the adoption of machine learning in medical diagnostics is anticipated to have a transformative effect on patient care. Already, methylation-based machine learning (ML) approaches to the classification of tumors of the central nervous system (CNS) have demonstrated performance that can exceed traditional histology-based diagnosis [24], and have allowed for the identification of novel entities [123] and molecular subtypes within established classification systems [101, 123, 202]. Molecularly-defined entities continue to emerge, many demonstrating overlapping histology with other established tumor classes [180]. However, routine histopathologic examination remains the mainstay of oncologic diagnosis due to its relatively low cost, ubiquitous accessibility, the limited availability of advanced molecular assays, and established robustness, particularly when performed by experienced subspecialty expert histopathologists. Even in healthcare centers with access to advanced molecular assays, the availability of subspecialty experts needed to perform organ-specific histopathologic examination and integrate molecular results into the overall diagnostic picture may be lacking.
Developing robust machine learning models that leverage the immense, data-rich trove of existing and prospective histology slides via digitally scanned whole slide images (WSI) and that reproduce or augment subspecialist histopathology expertise can 1) help general pathologists render accurate subspecialty diagnoses, 2) serve as a check on human sources of error by acting as a highly reproducible and fatigue-free assistant, 3) help prioritize the highest yield assays for a given specimen, reducing costs and tissue expenditure, and 4) reveal discordant biases between ML models and human pathologists, which when approached synergistically could increase the detection of clinically pertinent biomarkers beyond what either achieves in isolation. Moreover, interrogating and understanding the features that drive ML classification could reveal avenues for improvement in human expert assessments.

Infiltrating gliomas are the most common primary tumors of the CNS in adults [149, 165], and despite significant advances in the understanding of their biology, they are considered incurable by current standards of care, including surgical gross total resection, radiotherapy, and chemotherapy [196]. Historically, infiltrating gliomas were classified into the broad categories of astrocytoma and oligodendroglioma on cytomorphological grounds, and assigned histologic grades based on particular features including mitotic activity, necrosis, and microvascular proliferation. The term 'glioblastoma' (GBM) was synonymous with the highest grade variant of infiltrating astrocytoma (IV of IV), and such tumors carry a poor prognosis with an average survival of less than 2 years [16, 133].
With the discovery of isocitrate dehydrogenase (IDH) mutation as a key driver of gliomagenesis in 25-30% of infiltrating gliomas and its correlation with a favorable prognosis, recent consensus guidelines regard IDH-mutant (IDHmut) tumors as biologically distinct entities from IDH-wildtype (IDHwt) tumors, and indeed the term 'glioblastoma' is now only applied to IDHwt infiltrating astrocytomas with high grade histological and/or molecular features [16, 87, 133, 182, 222]. While IDHmut gliomas are enriched for tumors with lower-grade histologic features, there is no known definitive histologic standard for determining IDH status from histomorphology alone, and immunohistochemical or molecular means are currently required for such a determination; however, histomorphologic correlates of molecular alterations are well-recognized in many tumor types, including infiltrating gliomas. As noted by the WHO, certain histologic features have a stronger association with IDHmut status, including gemistocytic and oligodendroglial-like cytomorphology, while higher grade features such as palisading necrosis and microvascular proliferation are enriched in IDHwt tumors; however, these features lack sensitivity and specificity [162,163]. Our experience suggests that subspecialty neuropathologists who review a high volume of infiltrating gliomas can predict the presence of IDH mutation from routine HE stains with a relatively high degree of accuracy. Therefore, we believed that histological prediction of IDH status represented an ideal prototype for the more general paradigm of designing computer vision models to interrogate whole-slide images (WSI) to predict critical, clinically relevant tumor biomarkers, and to combine these results with human assessment.

As discussed in previous chapters, convolutional neural networks (CNNs) are a class of neural network architectures that have achieved state-of-the-art performance in a range of computer vision problems.
A challenge in the application of CNNs to WSI processing is that there is a practical limit to the input image size that can be handled (typically less than 1000 × 1000 pixels) by today's hardware resources, such as GPU compute power and memory. WSI often have in the range of 10^5 pixels in each dimension, and key diagnostic features are usually seen only in small foci, necessitating tiling of the source image into appropriately sized training patches, and aggregation of patch-level class predictions to generate slide-level predictions. Previous work has shown that CNNs can be used to classify WSI histology data, particularly in epithelial cancers, including the prediction of driver mutations in some cancers [23, 32, 34, 36, 41, 86, 102, 135]. Furthermore, integrating CNN predictions from histology with genomic information has been found to predict behavior in infiltrating gliomas better than traditional histologic grading alone [152].

Prior studies have largely trained ML classifiers on image patches derived at a single level of magnification without aggregating across scales. This is in contrast to what pathologists typically do, which is to use a range of magnifications in assessing tissue; i.e., pathologists scan slides at low magnification both to identify features better appreciated at low power and to identify regions of interest for closer examination at higher power. We therefore hypothesized that the accuracy of our prototypical classification task would be magnification-level dependent, and that ensembling multiple ML models trained at different scales would generate more robust classification. Finally, we hypothesized that neuropathologists and ML models would make different types of errors in classification, and that the aggregate assessment of a hybrid pathologist/ML model would be superior to either human or ML assessment alone.
7.2 Materials and Methods

7.2.1 Datasets

In this study, we used datasets from two cohorts of infiltrating glioma patients obtained from The Cancer Genome Atlas (TCGA) [203] and Weill-Cornell Medicine (WCM).

(1) TCGA: We downloaded HE-stained WSI along with gender and age information from the TCGA-LGG and TCGA-GBM datasets. Only formalin-fixed paraffin-embedded (FFPE) HE-stained diagnostic slides from primary tumor sites were used. From these datasets, we obtained a total of 801 slide images (601 IDHwt and 200 IDHmut) from 372 patients (261 IDHwt and 111 IDHmut) (Table 7.1). We then split the TCGA data into training, validation, and test sets, with all slides from an individual patient assigned to the same subset. To ensure IDH class balance during model evaluation for straightforward interpretation, we randomly sampled 30 IDHwt slides and 30 IDHmut slides for each of the TCGA validation and test sets. All other slides in the TCGA cohort were used for training.

(2) WCM: We queried the in-house clinical database at WCM for infiltrating gliomas with available HE-stained slides, with recorded IDH mutation and 1p19q codeletion status, from 2011 to 2020. From these cases, a balanced dataset of IDHwt and IDHmut gliomas (including both astrocytomas and oligodendrogliomas) was scanned using the Aperio T2 system at 40X. This test dataset comprised 87 slides from 74 patients with IDHwt gliomas, and 87 slides from 67 patients with IDHmut gliomas. The scanned images were reviewed by author CS for adequacy, and the evaluating authors (BL and DP) were blinded to all information about the cases beyond the scanned HE slides. The WCM dataset was used as an independent external test set to evaluate ML model robustness and generalizability and to compare the ML models with human IDH prediction performance.
A subset of the WCM test set was also created to exclude confounding effects of age and gender by propensity score matching. Specifically, we calculated the IDH mutation propensity score of each patient in the WCM test set based on age and gender. We then stratified the propensity scores, ranging from 0 to 1, into 10 bins of equal width. The same number of samples from the two IDH groups were selected as matching pairs from each propensity score bin. The unmatched samples left over in the IDHmut group were those with relatively high propensity scores, and those left over in the IDHwt group had relatively low propensity scores. This yielded a demographically balanced WCM test* dataset with 36 patients in each IDH class.

Table 7.1: Summary of the demographics for the TCGA training, validation, and test datasets and the WCM test datasets. No significant differences are seen in sex between the IDHmut and IDHwt groups. IDH mutant gliomas show statistically significant enrichment in younger patients, consistent with historic controls. † indicates the average simulation p-value: 140 IDH WT slides in the training dataset were randomly sampled and a one-way ANOVA was then conducted; simulations were repeated 1000 times. * indicates propensity score matching accounting for age and sex.

                              Overall         IDH WT          IDH MUT         p value
Count (n), Slide (Patient)
  Training                    681 (312)       541 (232)       140 (80)
  Validation                  60 (29)         30 (13)         30 (16)
  Test                        60 (31)         30 (16)         30 (15)
  TCGA Overall                801 (372)       601 (261)       200 (111)
  WCM Test                    174 (141)       87 (74)         87 (67)
  WCM Test*                   85 (72)         41 (36)         44 (36)
Age (Years), Mean (SD)
  Training                    52.5 (16.4)     58.0 (13.1)     36.5 (14.6)
  Validation                  41.5 (19.7)     59.9 (12.1)     26.5 (8.56)     0.131†
  Test                        47.5 (21.0)     62.1 (15.7)     32.0 (13.4)
  TCGA Overall                51.2 (17.3)     58.4 (13.2)     34.5 (14.1)     <0.0001
  WCM Test                    52.4 (16.6)     62.7 (12.8)     41.1 (12.5)     <0.0001
  WCM Test*                   51.7 (12.05)    54.4 (12.5)     49.0 (11.1)     0.055
Female, n (%)
  Training                    115 (36.9)      84 (36.2)       31 (38.8)
  Validation                  13 (44.8)       8 (61.5)        5 (31.3)        0.821†
  Test                        13 (41.9)       6 (37.5)        7 (46.7)
  TCGA Overall                141 (37.9)      98 (37.5)       43 (38.7)       0.921
  WCM Test                    63 (44.7)       38 (51.4)       25 (37.3)       0.132
  WCM Test*                   33 (45.8)       18 (50)         15 (41.7)       0.636

7.2.2 Image Preprocessing

We first tiled all WSI into non-overlapping patches of size 256 by 256 pixels at levels of down-sampling corresponding to 2.5X, 5X, 10X, and 20X magnification (Figure 7.1). Pixel values ranging between 40 and 215 in greyscale space were treated as informative tissue, and pixels outside this range were considered uninformative, either as background whitespace (> 215) or folded tissue (< 40). Only patches with a tissue percentage over 75% were kept for further training and testing. All patches with significant blurriness or pen marks were excluded using heuristically chosen thresholds on RGB values.

Figure 7.1: A schematic for the end-to-end process of model training and deployment. WSI are tiled into patches of size 256 × 256 at 2.5X, 5X, 10X, and 20X magnification factors (A). In each training iteration (mini-batch), 200 randomly selected and augmented patches from a single magnification of a single WSI were passed to single-scale DenseNet-121 classifiers, initialized with ImageNet pre-trained weights. Feature embedding vectors from each patch were then aggregated using naïve averaging, and the resulting vector was then passed to a final fully connected (linear) classifier (B). Following training, the predictions of three versions of each single-scale model trained with different random seeds were averaged to produce a single-scale ensemble, and the predictions from each single-scale ensemble were averaged to produce the multiscale ensemble (MSE) predictions (C).

7.2.3 Model Training

After the image preprocessing step, each WSI had four sets of patches corresponding to magnifications of 2.5X, 5X, 10X, and 20X. Single-scale models were trained for each scale. We used a pre-trained DenseNet-121 architecture [89], without the last dense layer, as the feature extractor to generate patch-level embeddings of length 1024. All patch-level embeddings from one slide generated in each iteration were aggregated into slide-level embeddings using average pooling. A fully connected layer with 1024 nodes was then implemented to take the aggregated slide-level features as input and output slide-level IDH mutation probabilities. Due to memory constraints, only 200 patches from one WSI were randomly selected and passed to the network for each training step (Figure 7.1 B). If there were fewer than 200 patches for one slide, we used all available patches in that mini-batch. Note that each mini-batch consisted of patches from a single WSI. To keep IDH classes balanced during training, we randomly sampled 140 IDHwt slides and used all 140 IDHmut slides in each training epoch. We used Adam to minimize the binary cross-entropy loss [35, 112] with a learning rate of 0.00001 and a maximum of 100 epochs [110]. Models from the epoch with the lowest validation loss were used. At each scale, three separate single-scale models were trained using different random initial seeds.
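
The slide-level forward pass described in Section 7.2.3 can be sketched as follows. This is a framework-free illustration under stated assumptions, not the PyTorch implementation used in this work: the function name `slide_probability`, its parameters, and the toy two-dimensional embeddings are hypothetical, and the DenseNet-121 feature extractor is replaced by precomputed embedding lists.

```python
import math
import random


def slide_probability(patch_embeddings, weights, bias, max_patches=200, rng=None):
    """Average-pool patch embeddings, then apply a linear layer and a sigmoid."""
    if not patch_embeddings:
        raise ValueError("a slide needs at least one patch embedding")
    rng = rng or random.Random(0)
    # Slides with more than max_patches patches are randomly subsampled;
    # smaller slides contribute every available patch.
    if len(patch_embeddings) > max_patches:
        patch_embeddings = rng.sample(patch_embeddings, max_patches)
    n = len(patch_embeddings)
    # Average pooling: element-wise mean over patches gives one slide-level vector.
    slide_embedding = [sum(p[d] for p in patch_embeddings) / n
                       for d in range(len(weights))]
    logit = sum(w * x for w, x in zip(weights, slide_embedding)) + bias
    # Numerically stable sigmoid turns the logit into a probability.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)
```

With two toy patches `[[1.0, 0.0], [3.0, 0.0]]`, weights `[0.5, 1.0]` and bias `-1.0`, the pooled vector is `[2.0, 0.0]`, the logit is zero, and the returned probability is 0.5. In the real pipeline the embeddings would have length 1024 and the weights would be learned.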

7.2.4 Model Inference

The trained models from the previous step can be used to predict both patch-level and slide-level IDH mutation status. We first averaged the three slide-level probabilistic predictions at a given scale to compute single-scale predictions. A multi-scale ensemble (MSE) was then computed by averaging all four single-scale predictions (Figure 7.1 C). For patients with multiple slides, patient-level predictions were computed by averaging slide-level predictions.

7.2.5 Pathologist Evaluation

The WCM test set was separately evaluated by two neuropathologists, blinded to all patient information and ancillary testing beyond the WSI, to compare the model predictions to human observers. For each case, both pathologists were asked to issue a prediction for IDH status on a semiquantitative scale, normalized to a range between 0 and 1 (i.e., 0 for a prediction of IDHwt, 1 for IDHmut, and values close to 0.5 for cases with low certainty). The pathologists’ predictions were then averaged to generate a two-pathologist consensus score. The predictions from each pathologist were averaged with the MSE prediction to generate hybrid classifier scores, and the two-pathologist consensus score was averaged with the MSE predictions to generate a two-pathologist consensus-hybrid model.

7.2.6 Prediction Heatmap

Eight cases in the WCM test set were selected for visualization, covering all possible IDH status combinations of ground truth, pathologists’ ensemble, and slide-level MSE predictions. We used a sliding window strategy to generate an MSE prediction heatmap. We set the window size to 256 × 256 and the step size to 256, 128, 64 and 32 for 20X, 10X, 5X, and 2.5X, respectively. Using this sliding window process, we passed patches containing greater than 50% tissue pixels through the single-scale models. Pixel-level predictions were computed by averaging model predictions for patches that contained that pixel, excluding patches below the 50% tissue threshold.
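
The pixel-level averaging used for the heatmap can be sketched as below for a single scale. This is a minimal illustration, not the code used in this work: `prediction_heatmap`, `predict`, and `tissue_fraction` are hypothetical names, the two callables stand in for a trained model and the tissue filter, and combining the per-scale maps into an MSE heatmap is left out.

```python
def prediction_heatmap(height, width, window, step, predict, tissue_fraction,
                       min_tissue=0.5):
    """Average window-level scores over every pixel that each window covers.

    predict(top, left) and tissue_fraction(top, left) are caller-supplied
    callables (assumed here) returning the model score and the fraction of
    tissue pixels for the window whose top-left corner is (top, left).
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top in range(0, height - window + 1, step):
        for left in range(0, width - window + 1, step):
            # Keep only windows with more than min_tissue tissue pixels.
            if tissue_fraction(top, left) <= min_tissue:
                continue
            score = predict(top, left)
            for r in range(top, top + window):
                for c in range(left, left + window):
                    total[r][c] += score
                    count[r][c] += 1
    # Pixels never covered by a kept window have no prediction (None).
    return [[total[r][c] / count[r][c] if count[r][c] else None
             for c in range(width)]
            for r in range(height)]
```

A step smaller than the window makes neighbouring windows overlap, so each pixel receives the mean of several predictions, which smooths the map.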
The resulting heatmap regions were then manually examined by pathologists to gain insights into the histologic features impacting predictions.

7.2.7 UMAP Visualization

We randomly selected five 10X patches from each WSI in the WCM test set for UMAP visualization [8, 147]. Patch embeddings extracted by the trained convolutional base of the best-performing 10X classifier were used as patch representations. We used the Python UMAP package with default hyper-parameters to obtain the UMAP representations for each patch. For visualization purposes, the first two dimensions of the UMAP projections were used as coordinates to show the original input patches, ground-truth IDH mutation status, ground-truth integrated molecular diagnosis (oligodendroglioma, IDHmut astrocytoma, IDHwt astrocytoma), patch-level IDH prediction scores, and slide-level IDH prediction scores from the classifier. The patches were then reviewed by the pathologists to determine the presence of human-identifiable features in each cluster, and the association of histomorphology with specific diagnoses.

7.2.8 Statistical Analysis and Software

All model training and inference were performed on 4 NVIDIA Titan X GPUs. Image preprocessing, model training and inference were conducted in Python, version 3.7.4. OpenSlide Python was used for reading and tiling WSI. PyTorch was used for training neural networks. All statistical analyses were performed in R, version 4.0.3. Slide prediction heatmaps were plotted using the ComplexHeatmap R package [69]. Age differences were evaluated using a t-test. A chi-square test was used to test the gender difference between the two IDH status groups. Confidence intervals of model performance metrics were estimated by bootstrap resampling repeated 1000 times. All statistical tests were two-sided with a significance threshold of p < 0.05.
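
As an illustration of bootstrapped confidence intervals for a metric such as the AUC, the sketch below resamples patients with replacement. It is written in Python for consistency with the other examples, whereas the analyses in this work were performed in R; the function names, the percentile method for the interval, and the fixed seed are assumptions made for the example.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed CI method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        # Redraw resamples that lack one of the two classes.
        if 0 < sum(ys) < n:
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

For example, `auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` evaluates to 0.75, since three of the four positive-negative pairs are ranked correctly.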

7.2.9 Image Augmentation

To increase model generalizability and reduce potential overfitting, we implemented several image augmentation strategies during training. Since all patches within each batch were from one WSI, color augmentations were performed at the slide level for each iteration, i.e., we used only one set of color augmentation parameters per iteration for all patches from a slide. We first transformed RGB patches into HSV color space. Pixel values were then augmented channel-wise as

$I_c^{\mathrm{aug}} = \alpha_c I_c + \beta_c$,

where $I_c$ denotes the pixel values in channel $c$, and $\alpha_c$ and $\beta_c$ are channel-specific color augmentation factors sampled for each slide from the uniform distributions $U(1 - \sigma, 1 + \sigma)$ and $U(-\sigma, \sigma)$, respectively. We set $\sigma = 0.05$ to control the degree of augmentation. In addition, each patch had a 50% probability of being flipped either vertically or horizontally and an equal probability (25% each) of being rotated by 0, 90, 180 or 270 degrees. Distinct augmentation parameters were randomly generated during patch selection for each mini-batch.

Figure 7.2: ROC curves for the ML classifiers, pathologists, and hybrid models on the WCM test data. Panel A compares the performance of the single-scale ensembles and the multi-scale ensemble (MSE). The performance of the semiquantitative predictions of two expert neuropathologists and the two-pathologist averaged consensus are compared in Panel B. Panel C compares the predictions of the top-performing neuropathologist with the MSE, and the hybrid model generated by naïve averaging of pathologist and MSE predictions.

7.3 Results

7.3.1 ML Models Accurately Predict IDH Mutation Status

WSI obtained from the publicly available TCGA database were used for training, including 801 (601 IDHwt and 200 IDHmut) slides (Table 7.1). These were split into training, validation, and test sets.
As an external validation set, WSI from our institution (Weill Cornell Medicine) were used, comprising 174 (87 IDHwt and 87 IDHmut) slides. WSI were tiled into 256 × 256 pixel patches over multiple down-sampled levels corresponding to 2.5X, 5X, 10X, and 20X magnification (Figure 7.1; see Methods). Single-scale models were trained using the DenseNet-121 CNN architecture [89], and patch-level embeddings were aggregated into slide-level embeddings via average pooling, which were then used to generate slide-level IDH mutation probabilities at output. 200 patches from each WSI were randomly selected and passed to the network during each training step (Figure 7.1 B). A multi-scale ensemble (MSE) was then generated by averaging the predictions over the single-scale models (Figure 7.1 C; see Methods for detail).

Receiver operating characteristic (ROC) curves were generated for patient-level predictions of IDH status evaluated on the WCM test dataset using 1) single-scale models, 2) the multiscale ensemble (MSE) ML model, 3) expert neuropathologists, and 4) hybrid neuropathologist-MSE scores. Among the ML models (Figure 7.2 A), a trend towards higher accuracy at intermediate magnifications is noted, with the highest accuracy achieved by the 10X classifier (AUC = 0.881, 95% confidence interval = 0.88-0.883), and with diminished AUCs seen in models using the lowest (2.5X) and highest (20X) levels of magnification. No ML model demonstrated a superior AUC compared to the neuropathologists (Figure 7.2 B), and consensus averaging of the two neuropathologists’ semiquantitative predictions demonstrated a higher AUC than each neuropathologist individually.
Averaging the top-performing neuropathologist’s semiquantitative predictions with the MSE prediction scores to generate a human-ML hybrid classifier (Figure 7.2 C) shows a higher AUC than either the ML classifier or the pathologist alone and demonstrates performance similar to that of the two-neuropathologist consensus (hybrid classifier AUC = 0.921, 95% confidence interval = 0.920-0.923 vs. neuropathologist consensus AUC = 0.92, 95% confidence interval = 0.918-0.921). Additionally, averaging the two-neuropathologist consensus with the ML model provides an incremental increase in prediction accuracy (AUC = 0.928, 95% confidence interval = 0.927-0.929).

[Figure 7.3, panel C labels: IDH Wild Type, IDH Mutant; rows 2.5X, 5X, 10X, 20X, Multi-Scale Ensemble, P1, P2, P1+P2, P+MSE; scale: Predicted IDH Mutant Probability, 0 to 1]

Figure 7.3: Patient-level predictions in the WCM test data, for the pathologists and ML models. Panel A compares the semiquantitative prediction scores of the two neuropathologists (κ = 0.656, R = 0.767). Panel B compares the two-neuropathologist consensus predictions to the multiscale classifier (κ = 0.598, R = 0.674). Panel C shows all patient-level predictions using the single-scale models, multiscale ensemble, individual pathologists (P1, P2), two-pathologist consensus (P1+P2), and the hybrid classifier (P+MSE).

7.3.2 Single-Scale ML Models Make Distinct Errors Relative to Each Other and to Humans

Comparisons of patient-level predictions of the pathologists and classifiers using the WCM data are shown in Figure 7.3. Figure 7.3 A shows a scatter plot comparing the semiquantitative prediction scores of the two pathologists. Concordant predictions are found in the yellow quadrants, while discordant predictions appear in the pink quadrants. High densities of accurate predictions are located at the extremes of the concordant regions, while inaccurate predictions are enriched in regions of lower certainty.
The Pearson coefficient r for the semiquantitative predictions of the pathologists is 0.767, while Cohen’s kappa for the binary predictions of the pathologists is 0.656. Figure 7.3 B shows a scatter plot of the pathologist consensus score (averaged semiquantitative predictions of the pathologists) compared to the MSE predictions. The correlation between the MSE and the pathologist consensus is lower than that between the two pathologists (Pearson coefficient r = 0.674), and correspondingly there is a lower degree of concordance between the binary classifications (Cohen’s kappa = 0.598). Among discordant cases, there is a slight enrichment of IDHmut cases that are accurately predicted by the pathologists and missed by the MSE, while there is a slight enrichment of IDHwt cases accurately predicted by the MSE and missed by the pathologists. Figure 7.3 C shows patient-level IDH prediction scores from the single-scale and multi-scale ensemble classifiers, pathologists, and hybrid predictions, highlighting the orthogonal nature of errors made at individual levels of magnification.

Figure 7.4: Examples of the sliding window visualizations, with representative patches from regions of 3 example cases that provide insight into features recognized by the classifier. (A) shows a low-power HE image of a slide that was accurately predicted as IDHmut by the neuropathologists, but was incorrectly classified by the MSE. (B) shows a heatmap of average pixel-level IDH mutation status predictions. Selected patches from image A demonstrate higher IDHmut predictions in regions of solid tumor (C), with higher IDHwt predictions in regions of minimally involved brain parenchyma (D). E and F show an example of a slide from an IDHmut case, which was misclassified by both the neuropathologists and the ML classifier.
Regions from this slide containing tumor with monomorphic gemistocytic cytomorphology (G) and regions of minimally involved brain parenchyma with perineuronal and perivascular white space artifact (H) were associated with a higher prediction for IDHmut, while areas of minimally involved brain parenchyma without significant whitespace artifact (I) and regions with more bizarre cytology (J) were associated with a higher prediction of IDHwt status. K and L show a slide from an IDHmut glioma which was accurately predicted by the ML classifier, but inaccurately predicted by the neuropathologists. Areas of mildly cellular tumor, both with and without whitespace artifact (M and N respectively) were associated with higher IDHmut predictions, while regions of necrosis (O) and regions of minimally involved brain parenchyma (P) were associated with higher IDHwt predictions.

7.3.3 Patch-Level Predictions Reveal Features that Drive Accurate and Inaccurate Predictions

To gain insight into (1) the decision-related morphological features of the ML models and (2) the types of errors made by both the classifiers and pathologists, sliding patch-level IDH predictions were generated for selected slides using the MSE, three of which are examined in further detail here (Figure 7.4). In the first informative case (Figure 7.4 A-D), neuropathologists were correct in predicting IDH mutation, but the case was inaccurately predicted by the MSE to be IDHwt at the slide level. Regions shown in yellow (Figure 7.4 C) were predicted by the MSE as consistent with IDH mutation, and were also recognized by the neuropathologists as harboring relatively hypercellular infiltrating tumor that was likely IDH-mutant. In contrast, regions encoded in blue (Figure 7.4 D) drove the overall slide-level misclassification by the MSE.
These regions were enriched in brain parenchyma with minimal to no infiltration by tumor cells (as determined by human examination) and were disregarded as non-contributory to the IDH classification task by the neuropathologists. Thus, although the classifier was correct in determining that these areas were not enriched for IDH-mutated tumor, the binary classification task of determining the slide’s overall IDH status was evidently hampered by the large presence of uninvolved brain.

In a second case (Figure 7.4 E-J), that of an IDHmut glioma that was inaccurately classified by both the neuropathologists and the MSE, many regions harbored a relatively monomorphic gemistocytic cytomorphology (Figure 7.4 G). These regions were accurately interpreted by the classifier as consistent with IDHmut status, and in retrospect would also likely have been favored by the neuropathologists to represent IDH-mutated tumor if presented in isolation. However, one region of marked nuclear pleomorphism (Figure 7.4 J) was considered by the classifier as IDHwt, and this same region also drove the neuropathologists’ decision to classify the entire slide as likely IDHwt. Human-determined ‘uninformative regions’ again drove inaccurate MSE classification of particular areas: regions of uninvolved brain with increased white space around individual neurons and vascular channels due to tissue processing (Figure 7.4 H) were predicted as IDHmut by the MSE, while regions of relatively uninvolved brain without significant intraparenchymal white space (Figure 7.4 I) were again erroneously predicted as IDHwt, similar to the first sample case.

The final example (Figure 7.4 K-P) illustrates an IDHmut glioma inaccurately predicted by the neuropathologists as IDHwt, but correctly predicted by the MSE. In this case, solid regions of tumor (Figure 7.4 M and N) were accurately predicted by the ML classifier as areas with (IDH-mutated) tumor.
A large area of necrosis was present in this slide (Figure 7.4 O), which drove the inaccurate prediction of IDHwt by both neuropathologists, and this area in isolation was also classified as IDHwt by the MSE. Once again, regions of minimally involved normal brain (Figure 7.4 P) were predicted as correlating with IDHwt by the MSE.

7.3.4 Patch-Level Embedding Vectors Reflect Diagnostically Relevant Human-Identifiable Features

To gain further insight into the histological features encoded by our trained ML models, 5 random patches were selected from each slide in the WCM dataset and uniform manifold approximation and projection (UMAP) was performed on the patch-level embedding vectors from the best-performing 10X-scale classifier (Figure 7.5 A-B). Review of the histological features of the clustered patches revealed remarkably consistent patterns across patches from distinct slides. Emergent human-identifiable features included: 1) microcystic architecture (Figure 7.5 C), which is correlated with IDHmut status; 2) hypercellular regions of tumor with round, monomorphic nuclei, reminiscent of oligodendrocytes, which were appropriately enriched for IDHmut tumors (Figure 7.5 D); 3) hypercellular tumor areas with spindled nuclei and greater pleomorphism, enriched for IDHwt tumors (Figure 7.5 E); and 4) brain parenchyma without significant human-detectable involvement by tumor (by HE), that were predicted by the classifier as harboring IDH mutation irrespective of the ground truth slide-level class (Figure 7.5 F). Other features apparently captured by the embedding vectors include patches with a significant amount of whitespace (Figure 7.5 A, top-right) and regions with abundant hemorrhage or necrosis (Figure 7.5 A, top center).

Figure 7.5: UMAP coordinates of the feature embedding vector activations from patches passed through the 10X classifier. A shows some example tiles in 2D UMAP coordinates. B shows the patch-level IDH status prediction scores as predicted by the 10X classifier. Tiles from region C demonstrate microcystic architecture. Tiles from region D demonstrate hypercellular regions of infiltrating tumor, with round cytology, enriched for tumors with oligodendroglial morphology. Tiles from region E demonstrate hypercellular regions of tumor with a greater degree of nuclear spindling/elongation and nuclear pleomorphism. Tiles from region F demonstrate brain parenchyma without significant infiltration by tumor cells.

7.4 Discussion

Just as our understanding of the molecular biology and oncogenic drivers of neoplastic disease evolves quickly, iteratively impacting the diagnostic framework upon which clinicians rely to treat their patients, the accessibility of computing power and ML techniques is also evolving at a rapid pace. An open question in medical diagnostics is whether existing but previously untapped data-rich resources, such as histological images made computationally accessible through scanning and digitization, effectively encode information that is comparable or even superior to current standard-of-care diagnostic and/or prognostic testing modalities in guiding effective patient care. To begin to address this question, we selected a prototypical problem in glioma diagnostics, endeavoring to predict one of the most prognostically significant molecular markers in glioma biology, IDH mutation, using CNN models with HE-based histological information as the sole input. Moreover, we compared the performance of this task over multiple magnification scales. As a reference point, we compared the performance of the CNN models, trained on the order of hours, with that of subspecialty-trained expert neuropathologists, trained on the order of years to decades.

Comparison of ML model predictions to expert pathologists shows that while similar degrees of accuracy are obtained on the classification task, the types of errors made were distinct, and the combination of human pathologist predictions and ML predictions results in greater classification robustness than either alone. Manual interrogation of the patch-level predictions from accurately and inaccurately classified slides demonstrates several likely confounders exploited by the ML models which, interestingly, were found to be reproducible at all levels of magnification. The most striking sources of error in our models correlate with regions of human-interpreted low informativeness within the underlying tissue. In particular, regions with abundant non-neoplastic brain tissue and/or without human-detectable evidence of infiltrating tumor cells were most often classified as IDHwt, while regions with increased white space secondary to vacuolation, edema, and/or tissue artifacts, including histologic ‘cracks’ in the tissue and cautery artifact, were most often classified as IDHmut. While areas lacking tumor are indeed IDHwt per se given the putative absence of tumor cells, the task was built around slide-level classification of de facto tumors. Our interpretation, therefore, is that classifying these regions as IDHwt on average drove the classifier to a higher degree of accuracy overall, despite the patch-level ‘uninformativeness’ as determined by human observers. A future direction to address the possible confounding effects of such regions is to explicitly annotate and train toward a third class label, that of “non-neoplastic brain”, through the incorporation of autopsy and epilepsy histological slides into the training set.
Surprisingly, in the set of sliding window heatmaps analyzed, the models were not clearly driven by features that pathologists often use to predict IDH class due to their enrichment in IDHwt tumors, such as well-formed palisading necrosis and microvascular proliferation.

The presence of human-identifiable features as seen in the UMAP projections demonstrates that the CNNs are capable of recognizing some of the features used by humans in the classification of gliomas. In particular, we showed that patches demonstrating microcystic architecture or oligodendroglial cytomorphology were enriched for IDHmut classification while patches with increased spindled cells and pleomorphism were enriched for IDHwt classification. In fact, histomorphologic correlates of certain driver alterations have been previously identified, such as giant-cell morphology in IDHwt glioblastomas harboring TP53 mutation, and epithelioid morphology in high grade gliomas harboring BRAF mutations; however, given the heterogeneity of infiltrating gliomas, and particularly in IDHwt astrocytomas/glioblastomas, these morphologic correlates as assessed by human pathologists have relatively poor predictive utility. The UMAP also clearly illustrated that regions of human-interpreted low informational value relative to the task were enriched for particular classes, such as normal-appearing brain being enriched for the IDHwt class. Again, the identification of recurrent potential confounders across these models suggests that strategies to devalue or exclude uninformative patches could further improve classification accuracy, and expanding the number of available classes to include non-neoplastic samples, as alluded to above, may improve ML performance.
In addition, we believe that given a sufficiently large dataset of histologic data paired with RNA transcriptome and DNA methylation profiling, histomorphologic correlates may be identified; however, further studies will be necessary to assess this.

How to aggregate patch-level predictions into a slide-level classification is a widely studied problem, often in the multiple-instance learning literature. For example, attention mechanisms that increase the weight of highly informative patches on the final classification prediction have been found to be useful in other cancer types. However, the differing biological characteristics of tumor types that are reflected in histology (for example that infiltrating gliomas typically have an ill-defined border with respect to the surrounding non-neoplastic tissue, a feature that differs significantly from that of epithelial cancers) are likely to impact the relative efficacy of any particular ML algorithm, and the strategies employed are not likely to be universally applicable to models trained for relatively discrete diagnostic tasks. In our experiments conducted with this dataset, attention mechanisms did not provide a significant improvement in classification performance relative to naïve averaging of patch embeddings (data not shown). That said, as the number of potential target outputs of the model increases, attention mechanisms may help boost performance, but future studies using a broader variety of target classes are necessary to better assess this.

Our results demonstrate that the level of magnification used for input images does impact ML model accuracy, with the greatest levels of accuracy achieved at intermediate levels of magnification (corresponding to the 10X objective in our study).
One interpretation of this finding is that while lower levels of magnification provide a larger field of view with a greater degree of overall tissue sampling and increased architectural information, higher levels of magnification provide increased cytologic detail but with a smaller field of view. Intermediate levels of magnification may represent a sweet spot that captures both low-power and high-power information relevant to the task. Of practical importance, some of the errors made at different levels of magnification were found to be orthogonal to each other, and to the errors made by human observers, providing a rationale for multi-scale ML models and hybrid ML-human approaches to histological diagnosis. At the same time, we believe that designing ML models to more explicitly recapitulate the human methodology of examining the entire slide at lower power and then selecting regions of interest to interrogate at higher power could result in more robust model prediction accuracy while also reducing the computational cost necessary to interrogate an entire image at high power. However, future studies will be necessary to confirm this.

This study has demonstrated that ML models can achieve near human-level performance at predicting clinically relevant oncologic biomarkers using HE-based histological information alone, even with a completely external test set and with training times and slide exposure that are minimal compared to those needed to train human subspecialty experts. Moreover, by analyzing single-magnification and multi-scale input models and by interrogating encoded features through heatmap and UMAP visualizations of patch-level predictions, crucial insights into how to iteratively improve the ML models can be obtained.
This study represents a proof-of-principle that ML models hold great promise in approaching and potentially surpassing human-level performance in critical biomarker detection via deep learning of widely accessible HE slides, potentially heralding a full-circle return to the HE as the gold standard for oncological diagnostics and prognostication.

CHAPTER 8

CONCLUSION

As we have discussed in this thesis, reliability is a crucial aspect of successful applications of deep neural networks to real-world problem solving. We proposed several methods to address this important problem.

In Chapter 2, we proposed theoretically grounded and easy-to-use classes of noise-robust loss functions, the $L_q$ loss and the truncated $L_q$ loss, for classification with noisy labels that can be employed with any existing DNN algorithm. We empirically verified noise robustness on various datasets with both closed- and open-set noise scenarios.

In Chapter 3, we reinterpreted MC dropout as an ensemble averaging strategy, and attributed its poor performance in convolutional neural networks to a lack of diversity among sampled models using the error-ambiguity decomposition of the Brier score (or mean squared error), a widely used performance metric that captures both accuracy and calibration of probabilistic outputs. As we demonstrated empirically, omnibus dropout, which is simple to implement and computationally efficient, strikes a balance between diversity among sampled models and reasonable performance of the individual models, thereby consistently improving the quality of the ensemble’s predictions.

In Chapter 4, we proposed orthogonal dropout, an easy-to-implement technique that allows us to split a single high-capacity neural network model into an ensemble of subnetworks. In our experiments, we demonstrated and discussed the regularization effect achieved by training pruned networks and using a randomized and frozen fully connected layer in the network.
Finally, we presented exhaustive results showing that our method consistently outperforms several recently proposed state-of-the-art methods for efficient ensembles. Furthermore, our method achieved accuracy and uncertainty values matching those of an explicit deep ensemble while demanding significantly less storage. Lastly, we described several shortcomings of the proposed method for potential future exploration.

In Chapter 5, we presented a two-stage teacher-student framework for fast uncertainty estimates. The proposed student training procedure is not only capable of producing uncertainty estimates at no extra cost but also leads to improved predictive performance and more informative uncertainty estimates. We believe the method gets us one step closer to the realm of trustworthy deep learning for computer vision.

In Chapter 6, we provided empirical evidence that diversity in teacher predictions is correlated with student performance in self-distillation. Inspired by this observation, we offered an amortized MAP interpretation of the popular teacher-student training strategy. This novel viewpoint provides us with insights on self-distillation and suggests ways to improve it. For example, the results obtained with Beta smoothing suggest that there may be better and/or more efficient ways to obtain priors for instance-specific regularization.

Finally, in Chapter 7, we applied deep learning to a real-world digital whole-slide histopathology image analysis problem. Specifically, we showed that combining an expert pathologist’s assessments with ML model predictions can classify IDH mutation status in infiltrating gliomas at a level comparable to two-expert consensus. This serves as a proof of principle for the broader application of neural network models in deriving clinically relevant molecular markers based on histopathology alone.
We also demonstrated that ML classification performance varies with level of mag- 144 nification, and that discordant errors are made across scales, suggesting value in ensembling across levels of magnification. 145 BIBLIOGRAPHY [1] Javier Antorán, James Urquhart Allingham, and José Miguel Hernández- Lobato. Depth uncertainty in neural networks. arXiv preprint arXiv:2006.08437, 2020. [2] Devansh Arpit, Stanis law Jastrzebski, Nicolas Ballas, David Krueger, Em- manuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep net- works. arXiv preprint arXiv:1706.05394, 2017. [3] Samaneh Azadi, Jiashi Feng, Stefanie Jegelka, and Trevor Darrell. Auxil- iary image regularization for deep cnns with noisy labels. arXiv preprint arXiv:1511.07069, 2015. [4] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, pages 2654–2662, 2014. [5] Anoop Korattikara Balan, Vivek Rathod, Kevin P Murphy, and Max Welling. Bayesian dark knowledge. In Advances in Neural Information Pro- cessing Systems, pages 3438–3446, 2015. [6] Björn Barz and Joachim Denzler. Do we train on test data? purging cifar of near-duplicates. arXiv preprint arXiv:1902.00423, 2019. [7] Mokhtar S Bazaraa, Hanif D Sherali, and Chitharanjan M Shetty. Nonlinear programming: theory and algorithms. John Wiley & Sons, 2013. [8] Etienne Becht, Leland McInnes, John Healy, Charles-Antoine Dutertre, Im- manuel WH Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell. Dimensionality reduction for visualizing single-cell data using umap. Nature biotechnology, 37(1):38–44, 2019. [9] Jan Beirlant, Edward J Dudewicz, László Györfi, and Edward C Van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17–39, 1997. [10] William H Beluch, Tim Genewein, Andreas Nürnberger, and Jan M Köhler. 
The power of ensembles for active learning in image classification. In Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9368–9377, 2018. 146 [11] Lorenzo Bertoni, Sven Kreiss, and Alexandre Alahi. Monoloco: Monocular 3d pedestrian localization and uncertainty estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 6861–6871, 2019. [12] Léonard Blier and Yann Ollivier. The description length of deep learning models. In Advances in Neural Information Processing Systems, pages 2216– 2226, 2018. [13] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wier- stra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613–1622, 2015. [14] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. Siam Review, 60(2):223–311, 2018. [15] George EP Box and David R Cox. An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological), pages 211–252, 1964. [16] Daniel J Brat, Kenneth Aldape, Howard Colman, Dominique Figrarella- Branger, Gregory N Fuller, Caterina Giannini, Eric C Holland, Robert B Jenkins, Bette Kleinschmidt-DeMasters, Takashi Komori, et al. cimpact-now update 5: recommended grading criteria and terminologies for idh-mutant astrocytomas. Acta neuropathologica, 139(3):603–608, 2020. [17] Andrew Brock, Soham De, Samuel L Smith, and Karen Simonyan. High- performance large-scale image recognition without normalization. arXiv preprint arXiv:2102.06171, 2021. [18] J Paul Brooks. Support vector machines with the ramp loss and the hard margin loss. Operations research, 59(2):467–479, 2011. [19] Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV (1), pages 44–57, 2008. [20] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model com- pression. 
In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541, 2006. [21] Samuel Rota Bulò, Lorenzo Porzi, and Peter Kontschieder. Dropout dis- 147 tillation. In International Conference on Machine Learning, pages 99–107, 2016. [22] Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Zeroq: A novel zero shot quantization framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13169–13178, 2020. [23] Gabriele Campanella, Matthew G Hanna, Luke Geneslaw, Allen Miraflor, Vitor Werneck Krauss Silva, Klaus J Busam, Edi Brogi, Victor E Reuter, David S Klimstra, and Thomas J Fuchs. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Na- ture medicine, 25(8):1301–1309, 2019. [24] David Capper, David TW Jones, Martin Sill, Volker Hovestadt, Daniel Schrimpf, Dominik Sturm, Christian Koelsche, Felix Sahm, Lukas Chavez, David E Reuss, et al. Dna methylation-based classification of central nervous system tumours. Nature, 555(7697):469–474, 2018. [25] Shih-Kang Chao, Zhanyu Wang, Yue Xing, and Guang Cheng. Directional pruning of deep neural networks. Advances in Neural Information Processing Systems, 33, 2020. [26] Hanting Chen, Yunhe Wang, Chang Xu, Zhaohui Yang, Chuanjian Liu, Boxin Shi, Chunjing Xu, Chao Xu, and Qi Tian. Data-free learning of student networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3514–3522, 2019. [27] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017. [28] Liyan Chen, Philip Gautier, and Sergul Aydore. Dropcluster: A structured dropout for convolutional networks. arXiv preprint arXiv:2002.02997, 2020. [29] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient hamil- tonian monte carlo. 
In International conference on machine learning, pages 1683–1691, 2014. [30] Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge dis- tillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4794–4802, 2019. 148 [31] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [32] Nicolas Coudray, Paolo Santiago Ocampo, Theodore Sakellaropoulos, Navneet Narula, Matija Snuderl, David Fenyö, Andre L Moreira, Narges Razavian, and Aristotelis Tsirigos. Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nature medicine, 24(10):1559–1567, 2018. [33] Alexander D’Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D Hoffman, et al. Underspecification presents challenges for credi- bility in modern machine learning. arXiv preprint arXiv:2011.03395, 2020. [34] Jeffrey De Fauw, Joseph R Ledsam, Bernardino Romera-Paredes, Stanislav Nikolov, Nenad Tomasev, Sam Blackwell, Harry Askham, Xavier Glorot, Brendan O’Donoghue, Daniel Visentin, et al. Clinically applicable deep learn- ing for diagnosis and referral in retinal disease. Nature medicine, 24(9):1342– 1350, 2018. [35] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Im- agenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248– 255. IEEE, 2009. [36] James A Diao, Jason K Wang, Wan Fung Chui, Victoria Mountain, Sai Chowdary Gullapally, Ramprakash Srinivasan, Richard N Mitchell, Ben- jamin Glass, Sara Hoffman, Sudha K Rao, et al. 
Human-interpretable image features derived from densely mapped cancer pathology slides predict diverse molecular phenotypes. Nature communications, 12(1):1–15, 2021. [37] Thomas G Dietterich. Ensemble methods in machine learning. In Interna- tional workshop on multiple classifier systems, pages 1–15. Springer, 2000. [38] Abhimanyu Dubey, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Maximum-entropy fine grained classification. In Advances in Neural Infor- mation Processing Systems, pages 637–647, 2018. [39] Nikita Durasov, Timur Bagautdinov, Pierre Baque, and Pascal Fua. 149 Masksembles for uncertainty estimation. arXiv preprint arXiv:2012.08334, 2020. [40] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. The Journal of Machine Learning Research, 20(1):1997– 2017, 2019. [41] Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. nature, 542(7639):115–118, 2017. [42] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal- network.org/challenges/VOC/voc2012/workshop/index.html. [43] Di Feng, Lars Rosenbaum, and Klaus Dietmayer. Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3d vehicle detection. In 2018 21st International Conference on Intelligent Transporta- tion Systems (ITSC), pages 3266–3273. IEEE, 2018. [44] Davide Ferrari, Yuhong Yang, et al. Maximum lq-likelihood estimation. The Annals of Statistics, 38(2):753–783, 2010. [45] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019. [46] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Find- ing sparse, trainable neural networks. In International Conference on Learn- ing Representations, 2018. 
[47] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Pruning neural networks at initialization: Why are we missing the mark? arXiv preprint arXiv:2009.08576, 2020. [48] Benôıt Frénay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems, 25(5):845–869, 2014. [49] Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In International Confer- ence on Machine Learning, pages 1607–1616, 2018. 150 [50] Yarin Gal et al. Uncertainty in deep learning. [51] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059. PMLR, 2016. [52] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pages 1019–1027, 2016. [53] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 3584–3593, 2017. [54] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1183–1192. JMLR. org, 2017. [55] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017. [56] Yonatan Geifman, Guy Uziel, and Ran El-Yaniv. Bias-reduced uncertainty estimation for deep neural classifiers. International Conference on Learning Representations, 2019. [57] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 2013. [58] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Dropblock: A regulariza- tion method for convolutional networks. 
In Advances in Neural Information Processing Systems, pages 10727–10737, 2018. [59] Aritra Ghosh, Himanshu Kumar, and PS Sastry. Robust loss functions under label noise for deep neural networks. In AAAI, pages 1919–1925, 2017. [60] Aritra Ghosh, Naresh Manwani, and PS Sastry. Making risk minimization tolerant to label noise. Neurocomputing, 160:93–107, 2015. [61] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. 2016. 151 [62] Siavash Golkar, Michael Kagan, and Kyunghyun Cho. Continual learning via neural pruning. arXiv preprint arXiv:1903.04476, 2019. [63] Wenbo Gong, Yingzhen Li, and José Miguel Hernández-Lobato. Meta- learning for stochastic gradient mcmc. International Conference on Learning Representations, 2019. [64] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014. [65] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. International Conference on Machine Learning, 2013. [66] Alex Graves. Practical variational inference for neural networks. In Advances in neural information processing systems, pages 2348–2356, 2011. [67] Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362–386, 2020. [68] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al. Recent advances in convolutional neural networks. Pattern Recognition, 77:354–377, 2018. [69] Zuguang Gu, Roland Eils, and Matthias Schlesner. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics, 32(18):2847–2849, 2016. [70] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian QWeinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. 
PMLR, 2017. [71] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 1387–1395, 2016. [72] Bo Han, Jiangchao Yao, Gang Niu, Mingyuan Zhou, Ivor Tsang, Ya Zhang, and Masashi Sugiyama. Masking: A new perspective of noisy supervision. arXiv preprint arXiv:1805.08193, 2018. 152 [73] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015. [74] Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 1, pages 1135–1143, 2015. [75] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize bet- ter: Stability of stochastic gradient descent. In International conference on machine learning, pages 1225–1234. PMLR, 2016. [76] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Fried- man. The elements of statistical learning: data mining, inference, and pre- diction, volume 2. Springer, 2009. [77] Marton Havasi, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu, Jasper Snoek, Balaji Lakshminarayanan, Andrew M Dai, and Dustin Tran. Training independent subnetworks for robust prediction. arXiv preprint arXiv:2010.06610, 2020. [78] Soufiane Hayou, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh. Robust pruning at initialization. [79] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [80] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity map- pings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016. 
[81] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019. [82] Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. arXiv preprint arXiv:1802.05300, 2018. [83] Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, and 153 Jin Young Choi. A comprehensive overhaul of feature distillation. In Pro- ceedings of the IEEE International Conference on Computer Vision, pages 1921–1930, 2019. [84] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. [85] Elad Hoffer, Itay Hubara, and Daniel Soudry. Fix your classifier: the marginal value of training the last weight layer. In International Confer- ence on Learning Representations, 2018. [86] Todd C Hollon, Balaji Pandian, Arjun R Adapa, Esteban Urias, Akshay V Save, Siri Sahib S Khalsa, Daniel G Eichberg, Randy S D’Amico, Zia U Farooq, Spencer Lewis, et al. Near real-time intraoperative brain tumor diagnosis using stimulated raman histology and deep neural networks. Nature medicine, 26(1):52–58, 2020. [87] Craig Horbinski, Julia Kofler, Lindsey M Kelly, Geoffrey H Murdoch, and Marina N Nikiforova. Diagnostic use of idh1/2 mutation analysis in routine clinical testing of formalin-fixed, paraffin-embedded glioma tissues. Journal of Neuropathology & Experimental Neurology, 68(12):1319–1325, 2009. [88] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E Hopcroft, and Kilian Q Weinberger. Snapshot ensembles: Train 1, get m for free. arXiv preprint arXiv:1704.00109, 2017. [89] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian QWeinberger. Densely connected convolutional networks. In Proceedings of the IEEE con- ference on computer vision and pattern recognition, pages 4700–4708, 2017. 
[90] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016. [91] Po-Yu Huang, Wan-Ting Hsu, Chun-Yueh Chiu, Ting-Fan Wu, and Min Sun. Efficient uncertainty estimation for semantic segmentation in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 520–535, 2018. [92] Zehao Huang and Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219, 2017. 154 [93] Eddy Ilg, Ozgun Cicek, Silvio Galesso, Aaron Klein, Osama Makansi, Frank Hutter, and Thomas Brox. Uncertainty estimates and multi-hypotheses net- works for optical flow. In Proceedings of the European Conference on Com- puter Vision (ECCV), pages 652–667, 2018. [94] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. [95] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. Conference on Uncertainty in Artificial Intelligence, 2018. [96] Siddhartha Jain, Ge Liu, Jonas Mueller, and David Gifford. Maximizing overall diversity for improved uncertainty estimates in deep ensembles. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4264–4271, 2020. [97] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Men- tornet: Regularizing very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055, 2017. [98] Zhengshen Jiang, Hongzhi Liu, Bin Fu, and Zhonghai Wu. Generalized am- biguity decompositions for classification with applications in active learning and unsupervised ensemble pruning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017. 
[99] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732. PMLR, 2017. [100] Ishan Jindal, Matthew Nokleby, and Xuewen Chen. Learning deep networks from noisy labels with dropout regularization. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pages 967–972. IEEE, 2016. [101] Pascal D Johann, Serap Erkek, Marc Zapatka, Kornelius Kerl, Ivo Buch- halter, Volker Hovestadt, David TW Jones, Dominik Sturm, Carl Hermann, Maia Segura Wang, et al. Atypical teratoid/rhabdoid tumors are comprised of three epigenetic subgroups with distinct enhancer landscapes. Cancer cell, 29(3):379–393, 2016. [102] Jakob Nikolas Kather, Alexander T Pearson, Niels Halama, Dirk Jäger, Jeremias Krause, Sven H Loosen, Alexander Marx, Peter Boor, Frank Tacke, 155 Ulf Peter Neumann, et al. Deep learning can predict microsatellite insta- bility directly from histology in gastrointestinal cancer. Nature medicine, 25(7):1054–1056, 2019. [103] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015. [104] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Proceedings of the 31st International Con- ference on Neural Information Processing Systems, pages 5580–5590, 2017. [105] Ashish Khetan, Zachary C Lipton, and Anima Anandkumar. Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577, 2017. [106] Patrick Kidger and Terry Lyons. Universal approximation with deep narrow networks. In Conference on learning theory, pages 2306–2327. PMLR, 2020. [107] Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex net- work: Network compression via factor transfer. In Advances in neural infor- mation processing systems, pages 2760–2769, 2018. 
[108] Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Lan- guage Processing, pages 1317–1327, 2016. [109] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic opti- mization. arXiv preprint arXiv:1412.6980, 2014. [110] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Pro- cessing Systems, pages 2575–2583, 2015. [111] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009. [112] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classifi- cation with deep convolutional neural networks. Advances in neural infor- mation processing systems, 25, 2012. [113] Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross valida- 156 tion and active learning. In Proceedings of the 7th International Conference on Neural Information Processing Systems, pages 231–238, 1994. [114] Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross vali- dation, and active learning. In Advances in neural information processing systems, pages 231–238, 1995. [115] Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate un- certainties for deep learning using calibrated regression. In International Conference on Machine Learning, pages 2796–2804, 2018. [116] Volodymyr Kuleshov and Percy S Liang. Calibrated structured prediction. In Advances in Neural Information Processing Systems, pages 3474–3482, 2015. [117] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010. [118] Ludmila I Kuncheva and Christopher J Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine learning, 51(2):181–207, 2003. [119] Samuli Laine and Timo Aila. 
Temporal ensembling for semi-supervised learn- ing. arXiv preprint arXiv:1610.02242, 2016. [120] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474, 2016. [121] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Sim- ple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017. [122] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. International Conference on Learning Representations, 2017. [123] Maria L astowska, Joanna Trubicka, Anna Sobocińska, Bartosz Wojtas, Mag- dalena Niemira, Anna Sza lkowska, Adam Kretowski, Agnieszka Karkucińska- Wieckowska, Magdalena Kaleta, Maria Ejmont, et al. Molecular identifica- tion of cns nb-foxr2, cns eft-cic, cns hgnet-mn1 and cns hgnet-bcor pediatric 157 brain tumors using tumor-specific signature genes. Acta neuropathologica communications, 8(1):1–14, 2020. [124] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015. [125] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989. [126] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. Snip: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2018. [127] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016. [128] Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Jia Li. Learning from noisy labels with distillation. arXiv preprint arXiv:1703.02391, 2017. [129] Tongliang Liu and Dacheng Tao. 
Classification with noisy labels by im- portance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2016. [130] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2018. [131] Ekaterina Lobacheva, Nadezhda Chirkova, Maxim Kodryan, and Dmitry Vetrov. On power laws in deep ensembles. arXiv preprint arXiv:2007.08483, 2020. [132] David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vap- nik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015. [133] David N Louis, Hiroko Ohgaki, Otmar D Wiestler, Webster K Cavenee, Peter C Burger, Anne Jouvet, Bernd W Scheithauer, and Paul Kleihues. The 2007 who classification of tumours of the central nervous system. Acta neuropathologica, 114(2):97–109, 2007. 158 [134] Christos Louizos and Max Welling. Multiplicative normalizing flows for vari- ational bayesian neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2218–2227. JMLR. org, 2017. [135] Ming Y Lu, Tiffany Y Chen, Drew FK Williamson, Melissa Zhao, Maha Shady, Jana Lipkova, and Faisal Mahmood. Ai-based pathology predicts origins for cancers of unknown primary. Nature, 594(7861):106–110, 2021. [136] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. Advances in neural information processing systems, 30, 2017. [137] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochas- tic gradient mcmc. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015. [138] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008. [139] David JC MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992. 
[140] Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and An- drew Gordon Wilson. A simple baseline for bayesian uncertainty in deep learning. arXiv preprint arXiv:1902.02476, 2019. [141] Fangchang Mal and Sertac Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018. [142] Andrey Malinin and Mark Gales. Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems, pages 7047–7058, 2018. [143] Andrey Malinin, Bruno Mlodozeniec, and Mark Gales. Ensemble distribution distillation. arXiv preprint arXiv:1905.00076, 2019. [144] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018. 159 [145] Naresh Manwani and PS Sastry. Noise tolerance under risk minimization. IEEE transactions on cybernetics, 43(3):1146–1151, 2013. [146] Hamed Masnadi-Shirazi and Nuno Vasconcelos. On the design of loss func- tions for classification: theory, robustness to outliers, and savageboost. In Advances in neural information processing systems, pages 1049–1056, 2009. [147] Leland McInnes, John Healy, and James Melville. Umap: Uniform mani- fold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018. [148] Paul Micaelli and Amos J Storkey. Zero-shot knowledge transfer via adversar- ial belief matching. In Advances in Neural Information Processing Systems, pages 9551–9561, 2019. [149] Kimberly D Miller, Quinn T Ostrom, Carol Kruchko, Nirav Patil, Tarik Tihan, Gino Cioffi, Hannah E Fuchs, Kristin A Waite, Ahmedin Jemal, Rebecca L Siegel, et al. Brain and other central nervous system tumor statistics, 2021. CA: a cancer journal for clinicians, 71(5):381–406, 2021. [150] Thomas Minka. 
Estimating a dirichlet distribution, 2000. [151] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fid- jeland, Georg Ostrovski, et al. Human-level control through deep reinforce- ment learning. Nature, 518(7540):529, 2015. [152] Pooya Mobadersany, Safoora Yousefi, Mohamed Amgad, David A Gutman, Jill S Barnholtz-Sloan, José E Velázquez Vega, Daniel J Brat, and Lee AD Cooper. Predicting cancer outcomes from histology and genomics using convolutional networks. Proceedings of the National Academy of Sciences, 115(13):E2970–E2979, 2018. [153] Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? In Advances in Neural Information Processing Systems, pages 4696–4705, 2019. [154] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. [155] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Am- 160 buj Tewari. Learning with noisy labels. In Advances in neural information processing systems, pages 1196–1204, 2013. [156] Pushmeet Kohli Nathan Silberman, Derek Hoiem and Rob Fergus. Indoor segmentation and support inference from rgbd images. ECCV, 2012. [157] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012. [158] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learn- ing, volume 2011, page 5, 2011. [159] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolu- tion network for semantic segmentation. In Proceedings of the IEEE Inter- national Conference on Computer Vision, pages 1520–1528, 2015. [160] Curtis G Northcutt, Anish Athalye, and Jonas Mueller. 
Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749, 2021.
[161] Curtis G Northcutt, Tailin Wu, and Isaac L Chuang. Learning with confident examples: Rank pruning for robust classification with noisy labels. arXiv preprint arXiv:1705.01936, 2017.
[162] Adriana Olar and Kenneth D Aldape. Using the molecular classification of glioblastoma to inform personalized treatment. The Journal of pathology, 232(2):165–177, 2014.
[163] Adriana Olar, Khalida M Wani, Kristin D Alfaro-Munoz, Lindsey E Heathcock, Hinke F van Thuijl, Mark R Gilbert, Terri S Armstrong, Erik P Sulman, Daniel P Cahill, Elizabeth Vera-Bolanos, et al. Idh mutation status and role of who grade and mitotic index in overall survival in grade ii–iii diffuse gliomas. Acta neuropathologica, 129(4):585–596, 2015.
[164] Keiron O'Shea and Ryan Nash. An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458, 2015.
[165] Quinn T Ostrom, Haley Gittleman, Gabrielle Truitt, Alexander Boscia, Carol Kruchko, and Jill S Barnholtz-Sloan. Cbtrus statistical report: primary brain and other central nervous system tumors diagnosed in the united states in 2011–2015. Neuro-oncology, 20(suppl 4):iv1–iv86, 2018.
[166] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pages 582–597. IEEE, 2016.
[167] Sejun Park, Chulhee Yun, Jaeho Lee, and Jinwoo Shin. Minimum width for universal approximation. arXiv preprint arXiv:2006.08859, 2020.
[168] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: a loss correction approach. stat, 1050:22, 2017.
[169] Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton.
Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
[170] Joshua C Peterson, Ruairidh M Battleday, Thomas L Griffiths, and Olga Russakovsky. Human uncertainty makes classification more robust. In Proceedings of the IEEE International Conference on Computer Vision, pages 9617–9626, 2019.
[171] Mary Phuong and Christoph Lampert. Towards understanding knowledge distillation. In International Conference on Machine Learning, pages 5142–5151, 2019.
[172] Janis Postels, Francesco Ferroni, Huseyin Coskun, Nassir Navab, and Federico Tombari. Sampling-free epistemic uncertainty estimation using approximated variance propagation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2931–2940, 2019.
[173] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
[174] Rahul Rahaman and Alexandre H Thiery. Uncertainty quantification and deep ensembles. arXiv preprint arXiv:2007.08792, 2020.
[175] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
[176] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
[177] Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's hidden in a randomly weighted neural network? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11893–11902, 2020.
[178] Muhammad Imran Razzak, Saeeda Naz, and Ahmad Zaib. Deep learning for medical image processing: Overview, challenges and the future. Classification in BioApps, pages 323–350, 2018.
[179] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
[180] Annekathrin Reinhardt, Damian Stichel, Daniel Schrimpf, Felix Sahm, Andrey Korshunov, David E Reuss, Christian Koelsche, Kristin Huang, Annika K Wefers, Volker Hovestadt, et al. Anaplastic astrocytoma with piloid features, a novel molecular class of idh wildtype glioma with recurrent mapk pathway, cdkn2a/b and atrx alterations. Acta neuropathologica, 136(2):273–291, 2018.
[181] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050, 2018.
[182] David E Reuss, Yasin Mamatjan, Daniel Schrimpf, David Capper, Volker Hovestadt, Annekathrin Kratz, Felix Sahm, Christian Koelsche, Andrey Korshunov, Adriana Olar, et al. Idh mutant diffuse and anaplastic astrocytomas have similar age at presentation and little difference in survival: a grading problem for who. Acta neuropathologica, 129(6):867–873, 2015.
[183] Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable laplace approximation for neural networks. International Conference on Learning Representations, 2018.
[184] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[185] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
[186] Victor Sanh, Thomas Wolf, and Alexander Rush. Movement pruning: Adaptive sparsity by fine-tuning. Advances in Neural Information Processing Systems, 33, 2020.
[187] Vikash Sehwag, Shiqi Wang, Prateek Mittal, and Suman Jana. Hydra: Pruning adversarially robust neural networks. Advances in Neural Information Processing Systems (NeurIPS), 7, 2020.
[188] Yichen Shen, Zhilu Zhang, Mert R Sabuncu, and Lin Sun. Learning the distribution: A unified distillation paradigm for fast uncertainty estimation in computer vision. arXiv preprint arXiv:2007.15857, 2020.
[189] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of big data, 6(1):1–48, 2019.
[190] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[191] Saurabh Singh, Derek Hoiem, and David Forsyth. Swapout: Learning an ensemble of deep architectures. In Advances in neural information processing systems, pages 28–36, 2016.
[192] Samarth Sinha, Homanga Bharadhwaj, Anirudh Goyal, Hugo Larochelle, Animesh Garg, and Florian Shkurti. Dibs: Diversity inducing information bottleneck in model ensembles. arXiv preprint arXiv:2003.04514, 2020.
[193] Samuel Smith, Erich Elsen, and Soham De. On the generalization benefit of noise in stochastic gradient descent. In International Conference on Machine Learning, pages 9058–9067. PMLR, 2020.
[194] Suraj Srinivas and Francois Fleuret. Knowledge transfer with jacobian matching. In International Conference on Machine Learning, pages 4723–4731, 2018.
[195] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
[196] Roger Stupp, Monika E Hegi, Mark R Gilbert, and Arnab Chakravarti. Chemoradiotherapy in malignant glioma: standard of care and future directions. Journal of Clinical Oncology, 25(26):4127–4136, 2007.
[197] Sainbayar Sukhbaatar and Rob Fergus. Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080, 2(3):4, 2014.
[198] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
[199] Akinori Tanaka, Akio Tomiya, and Kōji Hashimoto. Deep Learning and Physics. Springer, 2021.
[200] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. arXiv preprint arXiv:1803.11364, 2018.
[201] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pages 1195–1204, 2017.
[202] Michael D Taylor, Paul A Northcott, Andrey Korshunov, Marc Remke, Yoon-Jae Cho, Steven C Clifford, Charles G Eberhart, D Williams Parsons, Stefan Rutkowski, Amar Gajjar, et al. Molecular subgroups of medulloblastoma: the current consensus. Acta neuropathologica, 123(4):465–472, 2012.
[203] Katarzyna Tomczak, Patrycja Czerwińska, and Maciej Wiznerowicz. The cancer genome atlas (tcga): an immeasurable source of knowledge. Contemporary oncology, 19(1A):A68, 2015.
[204] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656, 2015.
[205] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9446–9454, 2018.
[206] Arash Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In Advances in Neural Information Processing Systems, pages 5601–5610, 2017.
[207] Brendan Van Rooyen, Aditya Menon, and Robert C Williamson. Learning with symmetric label noise: The importance of being unhinged. In Advances in Neural Information Processing Systems, pages 10–18, 2015.
[208] Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge Belongie. Learning from noisy large-scale datasets with minimal supervision. In The Conference on Computer Vision and Pattern Recognition, 2017.
[209] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In Advances in neural information processing systems, pages 550–558, 2016.
[210] Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regularization. Advances in neural information processing systems, 26, 2013.
[211] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International conference on machine learning, pages 1058–1066, 2013.
[212] Fei Wang, Lawrence Peter Casalino, and Dhruv Khullar. Deep learning in medicine—promise, progress, and challenges. JAMA internal medicine, 179(3):293–294, 2019.
[213] Yisen Wang, Weiyang Liu, Xingjun Ma, James Bailey, Hongyuan Zha, Le Song, and Shu-Tao Xia. Iterative learning with open-set noisy labels. arXiv preprint arXiv:1804.00092, 2018.
[214] Sarah Webb. Deep learning for biology. Nature, 554(7690):555–558, 2018.
[215] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[216] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688, 2011.
[217] Yeming Wen, Ghassen Jerfel, Rafael Muller, Michael W Dusenberry, Jasper Snoek, Balaji Lakshminarayanan, and Dustin Tran. Combining ensembles and data augmentation can harm your calibration. arXiv preprint arXiv:2010.09875, 2020.
[218] Yeming Wen, Dustin Tran, and Jimmy Ba. Batchensemble: an alternative approach to efficient ensemble and lifelong learning.
In International Conference on Learning Representations, 2019.
[219] Florian Wenzel, Jasper Snoek, Dustin Tran, and Rodolphe Jenatton. Hyperparameter ensembles for robustness and uncertainty quantification. arXiv preprint arXiv:2006.13570, 2020.
[220] Anqi Wu, Sebastian Nowozin, Edward Meeds, Richard E Turner, José Miguel Hernández-Lobato, and Alexander L Gaunt. Deterministic variational inference for robust bayesian neural networks. International Conference on Learning Representations, 2018.
[221] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2699, 2015.
[222] Hai Yan, D Williams Parsons, Genglin Jin, Roger McLendon, B Ahmed Rasheed, Weishi Yuan, Ivan Kos, Ines Batinic-Haberle, Siân Jones, Gregory J Riggins, et al. Idh1 and idh2 mutations in gliomas. New England journal of medicine, 360(8):765–773, 2009.
[223] Chenglin Yang, Lingxi Xie, Siyuan Qiao, and Alan L Yuille. Training deep neural networks in generations: A more tolerant teacher educates better students. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5628–5635, 2019.
[224] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4133–4141, 2017.
[225] Jaemin Yoo, Minyong Cho, Taebum Kim, and U Kang. Knowledge extraction with no observable data. In Advances in Neural Information Processing Systems, pages 2705–2714, 2019.
[226] Ruichi Yu, Ang Li, Vlad I Morariu, and Larry S Davis. Visual relationship detection with internal and external linguistic knowledge distillation. In Proceedings of the IEEE international conference on computer vision, pages 1974–1982, 2017.
[227] Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, and Jiashi Feng. Revisit knowledge distillation: a teacher-free framework. arXiv preprint arXiv:1909.11723, 2019.
[228] Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li. Adversarial examples: Attacks and defenses for deep learning. IEEE transactions on neural networks and learning systems, 30(9):2805–2824, 2019.
[229] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[230] Sheheryar Zaidi, Arber Zela, Thomas Elsken, Chris Holmes, Frank Hutter, and Yee Whye Teh. Neural ensemble search for performant and calibrated predictions. arXiv preprint arXiv:2006.08573, 2020.
[231] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[232] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
[233] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
[234] Shaofeng Zhang, Meng Liu, and Junchi Yan. The diversified ensemble neural network. Advances in Neural Information Processing Systems, 33, 2020.
[235] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 52(1):1–38, 2019.
[236] Zhilu Zhang, Adrian V Dalca, and Mert R Sabuncu. Confidence calibration for convolutional neural networks using structured dropout. arXiv preprint arXiv:1906.09551, 2019.
[237] Chunting Zhou, Graham Neubig, and Jiatao Gu. Understanding knowledge distillation in non-autoregressive machine translation. arXiv preprint arXiv:1911.02727, 2019.
[238] Zhi-Hua Zhou. Ensemble learning. Encyclopedia of biometrics, 1:270–273, 2009.
[239] Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. Chapman and Hall/CRC, 2012.
[240] Xiatian Zhu, Shaogang Gong, et al. Knowledge distillation by on-the-fly native ensemble. In Advances in neural information processing systems, pages 7517–7527, 2018.

APPENDIX A
SUPPLEMENTARY MATERIAL FOR "LABEL-NOISE ROBUST LEARNING OF NEURAL NETWORKS WITH GENERALIZED CROSS ENTROPY LOSS"

Lemma 1. $\lim_{q \to 0} \mathcal{L}_q(f(x), e_j) = \mathcal{L}_C(f(x), e_j)$, where $\mathcal{L}_q$ represents the $\mathcal{L}_q$ loss, and $\mathcal{L}_C$ represents the categorical cross entropy loss.

Proof. From Equation 2.6, and using L'Hôpital's rule,
$$\lim_{q \to 0} \mathcal{L}_q(f(x), e_j) = \lim_{q \to 0} \frac{1 - f_j(x)^q}{q} = \lim_{q \to 0} \frac{\frac{d}{dq}\left(1 - f_j(x)^q\right)}{\frac{d}{dq}\, q} = \lim_{q \to 0} -f_j(x)^q \log(f_j(x)) = -\log(f_j(x)) = \mathcal{L}_C(f(x), e_j).$$

Lemma 2. For any $x$ and $q \in (0, 1]$, the sum of the $\mathcal{L}_q$ loss with respect to all classes is bounded by:
$$\frac{c - c^{(1-q)}}{q} \le \sum_{j=1}^{c} \frac{1 - f_j(x)^q}{q} \le \frac{c - 1}{q}. \tag{A.1}$$

Proof. Observe that, since we have a softmax layer at the end, $f_j(x) \le 1$ for all $j$, and $\sum_{j=1}^{c} f_j(x) = 1$. Now, since $q \in (0, 1]$, we have $f_j(x) \le f_j(x)^q$, and $(1 - f_j(x)) \ge (1 - f_j(x)^q)$. Hence,
$$\sum_{j=1}^{c} \frac{1 - f_j(x)^q}{q} \le \sum_{j=1}^{c} \frac{1 - f_j(x)}{q} = \frac{c - \sum_{j=1}^{c} f_j(x)}{q} = \frac{c - 1}{q}.$$
Moreover, since $\sum_{j=1}^{c} f_j(x)^q \le \sum_{j=1}^{c} (1/c)^q$ for all $x$ and $q \in (0, 1]$, we have $\sum_{j=1}^{c} (1 - f_j(x)^q) \ge \sum_{j=1}^{c} (1 - (1/c)^q)$, and
$$\sum_{j=1}^{c} \frac{1 - f_j(x)^q}{q} \ge \sum_{j=1}^{c} \frac{1 - (1/c)^q}{q} = \frac{c - c^{(1-q)}}{q}.$$

Theorem 1. Under uniform noise with $\eta \le 1 - \frac{1}{c}$,
$$0 \le R^{\eta}_{\mathcal{L}_q}(f^*) - R^{\eta}_{\mathcal{L}_q}(\hat{f}) \le A, \tag{A.2}$$
and
$$A' \le R_{\mathcal{L}_q}(f^*) - R_{\mathcal{L}_q}(\hat{f}) \le 0, \tag{A.3}$$
where $A = \frac{\eta[c^{(1-q)} - 1]}{q(c-1)} \ge 0$, $A' = \frac{\eta[1 - c^{(1-q)}]}{q(c - 1 - \eta c)} < 0$, $f^*$ is the global minimizer of $R_{\mathcal{L}_q}(f)$, and $\hat{f}$ is the global minimizer of $R^{\eta}_{\mathcal{L}_q}(f)$.

Proof.
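Recall that for any softmax output $f$, $R_{\mathcal{L}_q}(f) = \mathbb{E}_D[\mathcal{L}_q(f(x), y_x)] = \mathbb{E}_{x, y_x}[\mathcal{L}_q(f(x), y_x)]$, and since for uniform noise with noise rate $\eta$, $\eta_{jk} = 1 - \eta$ for $j = k$, and $\eta_{jk} = \frac{\eta}{c-1}$ for $j \ne k$, we have
$$\begin{aligned}
R^{\eta}_{\mathcal{L}_q}(f) &= \mathbb{E}_D[\mathcal{L}_q(f(x), \tilde{y}_x)] = \mathbb{E}_{x, \tilde{y}_x}[\mathcal{L}_q(f(x), \tilde{y}_x)] \\
&= \mathbb{E}_x \mathbb{E}_{y_x | x} \mathbb{E}_{\tilde{y}_x | y_x, x}[\mathcal{L}_q(f(x), \tilde{y}_x)] \\
&= \mathbb{E}_x \mathbb{E}_{y_x | x}\Big[(1 - \eta)\mathcal{L}_q(f(x), y_x) + \frac{\eta}{c-1} \sum_{i \ne y_x} \mathcal{L}_q(f(x), i)\Big] \\
&= \mathbb{E}_x \mathbb{E}_{y_x | x}\Big[(1 - \eta)\mathcal{L}_q(f(x), y_x) + \frac{\eta}{c-1} \Big(\sum_{i=1}^{c} \mathcal{L}_q(f(x), i) - \mathcal{L}_q(f(x), y_x)\Big)\Big] \\
&= (1 - \eta) R_{\mathcal{L}_q}(f) - \frac{\eta}{c-1} R_{\mathcal{L}_q}(f) + \frac{\eta}{c-1} \mathbb{E}_x \mathbb{E}_{y_x | x}\Big[\sum_{i=1}^{c} \mathcal{L}_q(f(x), i)\Big] \\
&= \Big(1 - \frac{\eta c}{c-1}\Big) R_{\mathcal{L}_q}(f) + \frac{\eta}{c-1} \mathbb{E}_x \mathbb{E}_{y_x | x}\Big[\sum_{i=1}^{c} \mathcal{L}_q(f(x), i)\Big].
\end{aligned}$$
Now, from Lemma 2, we have:
$$\Big(1 - \frac{\eta c}{c-1}\Big) R_{\mathcal{L}_q}(f) + \frac{\eta[c - c^{(1-q)}]}{q(c-1)} \le R^{\eta}_{\mathcal{L}_q}(f) \le \Big(1 - \frac{\eta c}{c-1}\Big) R_{\mathcal{L}_q}(f) + \frac{\eta}{q}.$$
We can also write the inequality in terms of $R_{\mathcal{L}_q}(f)$:
$$\Big(R^{\eta}_{\mathcal{L}_q}(f) - \frac{\eta}{q}\Big) \Big/ \Big(1 - \frac{\eta c}{c-1}\Big) \le R_{\mathcal{L}_q}(f) \le \Big(R^{\eta}_{\mathcal{L}_q}(f) - \frac{\eta[c - c^{(1-q)}]}{q(c-1)}\Big) \Big/ \Big(1 - \frac{\eta c}{c-1}\Big).$$
Thus, for $\hat{f}$,
$$R^{\eta}_{\mathcal{L}_q}(f^*) - R^{\eta}_{\mathcal{L}_q}(\hat{f}) \le A + \Big(1 - \frac{\eta c}{c-1}\Big)\big(R_{\mathcal{L}_q}(f^*) - R_{\mathcal{L}_q}(\hat{f})\big) \le A,$$
or equivalently,
$$R_{\mathcal{L}_q}(f^*) - R_{\mathcal{L}_q}(\hat{f}) \ge A' + \big(R^{\eta}_{\mathcal{L}_q}(f^*) - R^{\eta}_{\mathcal{L}_q}(\hat{f})\big) \Big/ \Big(1 - \frac{\eta c}{c-1}\Big) \ge A',$$
where $A = \frac{\eta[c^{(1-q)} - 1]}{q(c-1)} \ge 0$ and $A' = \frac{\eta[1 - c^{(1-q)}]}{q(c - 1 - \eta c)}$, since $\eta \le \frac{c-1}{c}$, and $f^*$ is a minimizer of $R_{\mathcal{L}_q}(f)$. Lastly, since $\hat{f}$ is the minimizer of $R^{\eta}_{\mathcal{L}_q}(f)$, we have that $R^{\eta}_{\mathcal{L}_q}(f^*) - R^{\eta}_{\mathcal{L}_q}(\hat{f}) \ge 0$, or $R_{\mathcal{L}_q}(f^*) - R_{\mathcal{L}_q}(\hat{f}) \le 0$. This completes the proof.

Remark. Note that, when $q = 1$, $A = 0$, and $f^*$ is also a minimizer of the risk under uniform noise.

Theorem 2. Under class dependent noise when $\eta_{ij} < (1 - \eta_i)$, $\forall j \ne i$, $\forall i, j \in [c]$, where $\eta_{ij} = p(\tilde{y} = j | y = i)$, $\forall j \ne i$, and $(1 - \eta_i) = p(\tilde{y} = i | y = i)$, if $R_{\mathcal{L}_q}(f^*) = 0$, then
$$0 \le R^{\eta}_{\mathcal{L}_q}(f^*) - R^{\eta}_{\mathcal{L}_q}(\hat{f}) \le B, \tag{A.4}$$
where $B = \frac{c^{(1-q)} - 1}{q} \mathbb{E}_D(1 - \eta_{y_x}) \ge 0$, $f^*$ is the global minimizer of $R_{\mathcal{L}_q}(f)$, and $\hat{f}$ is the global minimizer of $R^{\eta}_{\mathcal{L}_q}(f)$.

Proof.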
For class dependent noise, from Lemma 2, for any softmax output function $f$ we have
$$\begin{aligned}
R^{\eta}_{\mathcal{L}_q}(f) &= \mathbb{E}_D[(1 - \eta_{y_x})\mathcal{L}_q(f(x), y_x)] + \mathbb{E}_D\Big[\sum_{i \ne y_x} \eta_{y_x i}\, \mathcal{L}_q(f(x), i)\Big] \\
&\le \mathbb{E}_D\Big[(1 - \eta_{y_x})\Big(\frac{c-1}{q} - \sum_{i \ne y_x} \mathcal{L}_q(f(x), i)\Big)\Big] + \mathbb{E}_D\Big[\sum_{i \ne y_x} \eta_{y_x i}\, \mathcal{L}_q(f(x), i)\Big] \\
&= \frac{c-1}{q} \mathbb{E}_D(1 - \eta_{y_x}) - \mathbb{E}_D\Big[\sum_{i \ne y_x} (1 - \eta_{y_x} - \eta_{y_x i})\, \mathcal{L}_q(f(x), i)\Big],
\end{aligned}$$
and
$$R^{\eta}_{\mathcal{L}_q}(f) \ge \frac{c - c^{(1-q)}}{q} \mathbb{E}_D(1 - \eta_{y_x}) - \mathbb{E}_D\Big[\sum_{i \ne y_x} (1 - \eta_{y_x} - \eta_{y_x i})\, \mathcal{L}_q(f(x), i)\Big].$$
Hence,
$$R^{\eta}_{\mathcal{L}_q}(f^*) - R^{\eta}_{\mathcal{L}_q}(\hat{f}) \le \frac{c^{(1-q)} - 1}{q} \mathbb{E}_D(1 - \eta_{y_x}) + \mathbb{E}_D \sum_{i \ne y_x} (1 - \eta_{y_x} - \eta_{y_x i})\big[\mathcal{L}_q(\hat{f}(x), i) - \mathcal{L}_q(f^*(x), i)\big].$$
Now, from our assumption that $R_{\mathcal{L}_q}(f^*) = 0$, we have $\mathcal{L}_q(f^*(x), y_x) = 0$. This holds iff $f^*_i(x) = 1$ when $i = y_x$, and $f^*_i(x) = 0$ if $i \ne y_x$. Hence, $\mathcal{L}_q(f^*(x), i) = 1/q$ $\forall i \ne y_x$. Moreover, by our assumption, we have $(1 - \eta_{y_x} - \eta_{y_x i}) > 0$. As a result, to derive an upper bound for the expression above, we need to maximize the second term. Note that by definition of the $\mathcal{L}_q$ loss, $\mathcal{L}_q(\hat{f}(x), i) \le 1/q$ $\forall i \in [c]$, and hence the second term is maximized iff $\mathcal{L}_q(\hat{f}(x), i) = 1/q$ $\forall i \ne y_x$. This implies that the maximum of the second term is non-positive, so we have
$$R^{\eta}_{\mathcal{L}_q}(f^*) - R^{\eta}_{\mathcal{L}_q}(\hat{f}) \le \frac{c^{(1-q)} - 1}{q} \mathbb{E}_D(1 - \eta_{y_x}).$$
Lastly, since $\hat{f}$ is the minimizer of $R^{\eta}_{\mathcal{L}_q}(f)$, we have that $R^{\eta}_{\mathcal{L}_q}(f^*) - R^{\eta}_{\mathcal{L}_q}(\hat{f}) \ge 0$. This completes the proof.

Lemma 3. For any $x$ and $q \in (0, 1)$, assuming $1/c \le k < 1$ where $c$ represents the number of classes, the sum of the truncated $\mathcal{L}_q$ loss with respect to all classes is bounded by:
$$\tilde{d}\, \mathcal{L}_q\Big(\frac{1}{\tilde{d}}\Big) + (c - \tilde{d})\, \mathcal{L}_q(k) \le \sum_{j=1}^{c} \mathcal{L}_{trunc}(f(x), e_j) \le c\, \mathcal{L}_q(k), \tag{A.5}$$
where $\tilde{d} = \max\big(1, \frac{(1-q)^{1/q}}{k}\big)$.

Proof. For the upper bound, by definition of the truncated $\mathcal{L}_q$ loss, $\mathcal{L}_{trunc}(f(x), e_j) \le \mathcal{L}_q(k)$ for any $x$ and $j$. Hence, $\sum_{j=1}^{c} \mathcal{L}_{trunc}(f(x), e_j) \le c\, \mathcal{L}_q(k)$.

For the lower bound, it can be verified that
$$\sum_{j=1}^{c} \mathcal{L}_{trunc}(\tilde{f}(x), e_j) \le \sum_{j=1}^{c} \mathcal{L}_{trunc}(f(x), e_j),$$
where $\tilde{f}(x) = (p, \cdots, p, 0, \cdots, 0)$, with $p = 1/d \ge k$, and $d$ is the number of elements in $f(x)$ with a value $\ge k$. Note that since $p \ge k$, $1 \le d \le 1/k$:
$$\sum_{j=1}^{c} \mathcal{L}_{trunc}(\tilde{f}(x), e_j) = d\, \mathcal{L}_q(p) + (c - d)\, \mathcal{L}_q(k) = d\, \mathcal{L}_q\Big(\frac{1}{d}\Big) + (c - d)\, \mathcal{L}_q(k).$$
We can get a universal lower bound (that does not depend on $f$) by minimizing the above function with respect to $d$. To do so, we treat $d$ as continuous. By definition of the $\mathcal{L}_q$ loss, and recalling that $0 < q < 1$,
$$\operatorname*{arg\,min}_{d \in [1, 1/k]} \Big[d\, \mathcal{L}_q\Big(\frac{1}{d}\Big) + (c - d)\, \mathcal{L}_q(k)\Big] = \operatorname*{arg\,min}_{d \in [1, 1/k]} d\Big[\frac{1 - (\frac{1}{d})^q}{q} - \frac{1 - k^q}{q}\Big] = \operatorname*{arg\,min}_{d \in [1, 1/k]} d\Big[k^q - \Big(\frac{1}{d}\Big)^q\Big].$$
We can verify using the second derivative test that the above objective function is convex. As a result, we can find the minimum by setting its derivative to zero. Doing so, we find that $d = \frac{(1-q)^{1/q}}{k}$ minimizes the above objective function. Hence, the lower bound is
$$\tilde{d}\, \mathcal{L}_q\Big(\frac{1}{\tilde{d}}\Big) + (c - \tilde{d})\, \mathcal{L}_q(k) \le \sum_{j=1}^{c} \mathcal{L}_{trunc}(f(x), e_j),$$
where $\tilde{d} = \max\big(1, \frac{(1-q)^{1/q}}{k}\big)$.

Remark. Using Lemma 3, we can prove that the proposed truncated loss leads to more noise-robust training, following the same arguments as in Theorems 1 and 2.

APPENDIX B
SUPPLEMENTARY MATERIAL FOR "IMPROVING CONFIDENCE CALIBRATION FOR CONVOLUTIONAL NEURAL NETWORKS WITH STRUCTURED DROPOUT"

B.1 Brief Review of Dropout As Bayesian Approximation

Let us assume a dataset $D = (X, Y) = \{(x_i, y_i)\}_{i=1}^{n}$, where each $(x_i, y_i) \in (\mathcal{X} \times \mathcal{Y})$ is i.i.d. In this chapter, we consider the problem of $k$-class classification, and let $\mathcal{X} \subseteq \mathbb{R}^d$ be the feature space and $\mathcal{Y} = \{1, \cdots, k\}$ be the label space. A classifier is a function that maps the input feature space to the label space, $f: \mathcal{X} \to \mathbb{R}^k$. We restrict our attention to functions that can be implemented as a DNN, and denote it by $f_w(x)$, where $w = \{W_i\}_{i=1}^{L}$ corresponds to the parameters of a network with $L$ layers, and $W_i$ corresponds to the weight matrix in the $i$-th layer. We define a likelihood model $p(y|x, w) = \mathrm{softmax}(f_w(x))$. It is common practice to perform maximum likelihood to compute point estimates for $w$. Uncertainty estimates can be obtained through Bayesian DNNs by first assuming a prior distribution on the weights, $p(w)$. A common choice is the zero mean Gaussian $\mathcal{N}(0, I)$.
Bayes' theorem can then be used to obtain the posterior $p(w|X, Y) = p(Y|X, w)\,p(w)/p(Y|X)$, with which inference can be carried out:
$$p(y = c | x, D_{train}) = \int p(y = c | x, w)\, p(w | D_{train})\, dw. \tag{B.1}$$
The marginal distribution $p(Y|X)$, and thus $p(w|X, Y)$, are often intractable. Variational inference uses a tractable family of distributions $q_\theta(w)$ parameterized by $\theta$ to approximate the true posterior $p(w|X, Y)$, by minimizing the Kullback-Leibler divergence $\mathrm{KL}(q_\theta(w) \,\|\, p(w|X, Y))$, which is equivalent to optimizing a bound on the true objective [66]. To interpret dropout as a variational inference strategy [51], the approximate distribution is defined as:
$$W_i = \Theta_i \cdot \mathrm{diag}\big([z_{i,j}]_{j=1}^{K_i}\big), \tag{B.2}$$
$$z_{i,j} \sim \mathrm{Bernoulli}(p_i) \text{ for } i = 1, \cdots, L,\ j = 1, \cdots, K_{i-1}, \tag{B.3}$$
where $\theta = \{\Theta_i\}_{i=1}^{L}$ are variational parameters to be optimized and $\{p_i\}_{i=1}^{L}$ are user-defined hyper-parameters that correspond to layerwise dropout rates. Minimizing the KL-divergence is mathematically equivalent to maximizing the following objective:
$$\mathcal{L}_{VI}(\theta) = \sum_{i=1}^{n} \int q_\theta(w) \log p(y_i | x_i, w)\, dw - \mathrm{KL}(q_\theta(w) \,\|\, p(w)). \tag{B.4}$$
Using Monte Carlo integration with one sample $w_i \sim q_\theta(w)$ for each training datum $(x_i, y_i)$ to approximate the integral in the above equation, and optimizing over mini-batches of size $m$, the approximated objective becomes:
$$\hat{\mathcal{L}}_{VI}(\theta) = \frac{n}{m} \sum_{i=1}^{m} \log p(y_i | x_i, w_i) - \mathrm{KL}(q_\theta(w) \,\|\, p(w)). \tag{B.5}$$
As shown in [51], there is a direct correspondence between optimizing the above objective and regular dropout training for DNNs. Furthermore, uncertainty estimates can be obtained through marginalizing and performing Monte Carlo integration over the approximate distribution $q_\theta(w)$. This corresponds to dropout at test time:
$$p(y = c | x, D_{train}) \approx \int p(y = c | x, w)\, q_\theta(w)\, dw \approx \frac{1}{T} \sum_{t=1}^{T} p(y = c | x, w_t), \tag{B.6}$$
where $w_t \sim q_\theta(w)$ are dropout samples from the NN. This is referred to as MC dropout.
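
The test-time averaging in Eq. (B.6) can be illustrated with a minimal sketch in plain Python. It is not the implementation used in the experiments: the two-layer network, the toy weights `W1` and `W2`, the helper names, and the inverted-dropout rescaling are all assumptions made for illustration only.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix W."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def dropout_forward(x, W1, W2, drop_rate, rng):
    """One stochastic forward pass with dropout kept active at test time.

    Each hidden unit is zeroed with probability drop_rate; survivors are
    rescaled by 1 / (1 - drop_rate) (inverted dropout, an assumption here).
    """
    hidden = [max(0.0, h) for h in matvec(W1, x)]  # ReLU
    keep = 1.0 - drop_rate
    hidden = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    return softmax(matvec(W2, hidden))


def mc_dropout_predict(x, W1, W2, drop_rate=0.1, T=50, seed=0):
    """Average T dropout samples, i.e. the right-hand side of Eq. (B.6)."""
    rng = random.Random(seed)
    mean = [0.0] * len(W2)
    for _ in range(T):
        probs = dropout_forward(x, W1, W2, drop_rate, rng)
        mean = [m + p / T for m, p in zip(mean, probs)]
    return mean


# Toy weights: 3 inputs -> 4 hidden units -> 2 classes (hypothetical values).
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.6, 0.4, 0.9], [0.2, 0.2, 0.2]]
W2 = [[1.0, -1.0, 0.5, 0.3], [-0.7, 0.6, 0.2, -0.4]]
p_mc = mc_dropout_predict([1.0, 2.0, -1.0], W1, W2, drop_rate=0.1, T=50)
```

Each of the $T$ passes draws a fresh dropout mask, so `p_mc` is an average over sampled subnetworks rather than the output of a single deterministic network; setting `drop_rate=0.0` recovers the usual point prediction.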
B.2 Relationship between Different Performance Metrics

Brier score, negative log-likelihood (NLL) and the expected calibration error (ECE) are three of the most commonly used metrics for evaluating the quality of uncertainty estimates. In this section, we discuss the relationship between them. As we noted in Section 3.3, the Brier score is equal to the normalized MSE in the context of classification. Recall, the ECE is defined as:
$$\mathrm{ECE}(H) = \mathbb{E}_x\big[(\mathbb{E}_y[y | H(x)] - H(x))^2\big], \tag{B.7}$$
which measures the expected difference between the true class probability and the confidence of the model [116]. In addition to the error-ambiguity decomposition that we have discussed, MSE can also be decomposed as:
$$\mathrm{MSE}(H) = \mathbb{E}_x\big[(y - H(x))^2\big] \tag{B.8}$$
$$= \mathbb{E}_x\big[(y - \mathbb{E}_y[y | H(x)])^2\big] + \mathrm{ECE}(H) \tag{B.9}$$
$$= \mathrm{Var}_x[y] - \mathrm{Var}_x\big[\mathbb{E}_y[y | H(x)]\big] + \mathrm{ECE}(H), \tag{B.10}$$
where $\mathbb{E}_y[y | H(x)]$ corresponds to the true probability of $y = 1$ conditioned on $H(x)$. $\mathrm{Var}_x[\mathbb{E}_y[y | H(x)]]$ measures the variation of the true class probabilities across the level-sets of the ensemble model $H$ [116]. Thus for this metric, the numeric values of $H(x)$ are not important. It is minimized if $H(x)$ is a constant and maximized when $H(x) = f(y)$, for any bijective function $f$. One can therefore view $\mathrm{Var}_x[\mathbb{E}_y[y | H(x)]]$ as a weak metric of accuracy that is not sensitive to calibration. Note that $\mathrm{Var}_x[y]$ does not depend on the models. The Brier score can thus be seen as a metric that is influenced by both the accuracy and the ECE of the models.

[Figure B.1 panels: Brier score and accuracy versus number of models in the ensemble, for SVHN and CIFAR-100; curves for dropout, dropBlock, dropChannel, dropLayer, and dropOmnibus.]
Figure B.1: Test Brier score (left) and accuracy (right) against number of models for ensemble prediction at test time on SVHN and CIFAR-100. This corresponds to the number of different MC dropout instantiations at test time of the same model. The model trained with omnibus dropout achieves the best accuracy and Brier score.

Similarly, NLL is a metric closely related to the Brier score on a log scale. Consequently, better uncertainty estimates in terms of NLL or Brier score can sometimes come with slight drops in accuracy, as the reduction in calibration error outweighs the increase in classification error. This phenomenon is indeed observed in practice as well. Figure B.2 shows the plot of both NLL and accuracy against dropout rates for all dropout methods considered in the chapter. For instance, it can be seen that while increasing the dropout rate for the MC dropout model on the CIFAR-100 dataset from 0.1 to 0.2 leads to a reduction in NLL, there is also quite a significant drop in classification accuracy. Similar trends can be seen for MC dropChannel on CIFAR-10 as well. Nevertheless, the trade-off is not always present. For example, increasing the dropout rate of MC dropout on the SVHN dataset also leads to an increase in accuracy. In conclusion, when tuning for the optimal dropout rate in practice, it can be beneficial to look at different metrics for a holistic consideration.
Table B.1: Results comparing accuracy and uncertainty estimates obtained using a single model when drop rate = 0.1 for all models. The top performing result for each metric is bold-faced. MC omnibus-dropout is the best method in general.

| Dataset | Method | Accuracy ↑ | NLL ↓ | Brier ↓ (×10⁻³) | ECE ↓ (×10⁻²) |
|---|---|---|---|---|---|
| SVHN | Temp Scaling | 95.7 ± 0.1 | 0.163 ± 0.002 | 6.62 ± 0.10 | 0.995 ± 0.160 |
| SVHN | Dropout | 96.4 ± 0.1 | 0.179 ± 0.004 | 5.68 ± 0.07 | 1.34 ± 0.10 |
| SVHN | DropBlock | 96.8 ± 0.1 | 0.133 ± 0.002 | 5.19 ± 0.07 | 1.26 ± 0.14 |
| SVHN | DropChannel | 96.5 ± 0.1 | 0.148 ± 0.002 | 5.41 ± 0.04 | 0.663 ± 0.050 |
| SVHN | DropLayer | 96.2 ± 0.1 | 0.154 ± 0.002 | 5.94 ± 0.10 | 1.13 ± 0.10 |
| SVHN | Omnibus dropout | 96.8 ± 0.1 | 0.133 ± 0.003 | 5.07 ± 0.07 | 0.616 ± 0.077 |
| CIFAR10 | Temp Scaling | 93.9 ± 0.1 | 0.189 ± 0.002 | 9.06 ± 0.08 | 0.905 ± 0.114 |
| CIFAR10 | Dropout | 93.8 ± 0.1 | 0.226 ± 0.008 | 9.44 ± 0.10 | 2.30 ± 0.09 |
| CIFAR10 | DropBlock | 93.4 ± 0.1 | 0.203 ± 0.003 | 9.89 ± 0.10 | 0.743 ± 0.116 |
| CIFAR10 | DropChannel | 93.7 ± 0.1 | 0.196 ± 0.006 | 9.20 ± 0.136 | 0.970 ± 0.171 |
| CIFAR10 | DropLayer | 94.0 ± 0.2 | 0.206 ± 0.001 | 9.09 ± 0.17 | 0.941 ± 0.068 |
| CIFAR10 | Omnibus dropout | 94.4 ± 0.1 | 0.173 ± 0.001 | 8.38 ± 0.10 | 0.607 ± 0.078 |
| CIFAR100 | Temp Scaling | 74.5 ± 0.3 | 1.00 ± 0.01 | 3.57 ± 0.04 | 4.02 ± 0.62 |
| CIFAR100 | Dropout | 74.8 ± 0.4 | 1.21 ± 0.01 | 3.71 ± 0.05 | 11.1 ± 0.4 |
| CIFAR100 | DropBlock | 75.6 ± 0.2 | 1.04 ± 0.01 | 3.46 ± 0.02 | 6.98 ± 0.19 |
| CIFAR100 | DropChannel | 75.3 ± 0.2 | 1.02 ± 0.01 | 3.43 ± 0.03 | 5.57 ± 0.08 |
| CIFAR100 | DropLayer | 75.8 ± 0.3 | 1.04 ± 0.02 | 3.46 ± 0.04 | 7.42 ± 0.32 |
| CIFAR100 | Omnibus dropout | 76.3 ± 0.1 | 1.00 ± 0.01 | 3.37 ± 0.02 | 7.11 ± 0.20 |

B.3 Additional Results

Supplementary Results on Diversity of Dropout Models. In Figure B.1, we show plots of Brier score and accuracy against the number of models used for prediction on the SVHN and CIFAR-100 datasets. As discussed in Section 3.3.5, patterns similar to the plots obtained on the CIFAR-10 dataset in Figure 3 are also observed consistently here. The only exception is the MC dropLayer model on the SVHN dataset, which obtains better performance as an individual model but much smaller improvements in both Brier score and test accuracy compared to the other dropout methods.
We would like to highlight that this seemingly contradictory result is likely caused by the shallow network used, an 18-layer ResNet. As no down-sampling layers are dropped out for layer dropout, the effective number of ResNet blocks that can be dropped is very small, leading to a much smaller dropout rate compared to the other methods. This is not an issue with deeper models, in which the non-downsampling blocks far outnumber the downsampling ones.

[Figure B.2 panels: test NLL and test accuracy versus dropout rate, for SVHN, CIFAR-10 and CIFAR-100; curves for dropout, dropBlock, dropChannel, dropLayer, and dropOmnibus.]
Figure B.2: Plots of test time NLL (left) and accuracy (right) against dropout rate for models trained with different types of dropout on the SVHN, CIFAR-10 and CIFAR-100 datasets.

We also investigate the sensitivity of the methods to the choice of dropout rate. To that end, we report the results obtained with a single model for each method, with a fixed dropout rate of 0.1, a reasonable default value for dropout rate in general. The results are shown in Table B.1. Possibly due to the combination of all dropout methods, omnibus dropout also seems to be relatively insensitive to the choice of dropout rate, performing well in all the experiments.

Results on Tuning the Dropout Rate. Figure B.2 illustrates the plots of NLL and accuracy against dropout rate for all models on all of the datasets.
As discussed in Appendix B, conflict between NLL and accuracy can sometimes occur. Interestingly, for dropBlock the NLL increases drastically after its minimum on all three datasets, suggesting that the block size for dropBlock may be too large in later convolutional layers, where the size of the feature maps is comparable to the block size.

APPENDIX C
SUPPLEMENTARY MATERIAL FOR "ENHANCING UNCERTAINTY ESTIMATES WITH EFFICIENT NEURAL NETWORK ENSEMBLES"

C.1 A Brief Review of the Edge-Pop Algorithm

We give a brief review of the Edge-Pop algorithm in this section. For simplicity, we describe the algorithm with a fully connected neural network; it can be easily extended to the case of CNNs. Suppose we have an $L$-layer fully connected NN with parameters $w = \{W^{(1)}, \ldots, W^{(L)}\}$. If we let $x = x^{(0)}$ be the input to the NN and $x^{(h)}$ be the $h$-th hidden layer of the NN, then a standard NN can be defined recursively by

$$x^{(h)} = \sigma\left(W^{(h)} x^{(h-1)}\right), \quad 1 \le h \le L,$$

where $\sigma$ denotes a non-linear activation function such as the ReLU.

Now, in order to select a subset of weights from $w$, we learn a popup score for each weight in the parameters $w$. We denote the popup scores by $s = \{S^{(1)}, \ldots, S^{(L)}\}$. Note that each score matrix $S^{(h)}$ has the same dimensions as $W^{(h)}$. Then, given the set of score matrices, a set of binary masks $m = \{M^{(1)}, \ldots, M^{(L)}\}$ can be generated. Specifically, for each score matrix $S^{(h)}$, we sort the popup scores by magnitude within each layer. With a pre-determined ratio $k\%$, $M^{(h)}_{ij} = f\left(S^{(h)}_{ij}\right) = 1$ if $\left|S^{(h)}_{ij}\right|$ is among the top $k\%$ highest scores in the $h$-th layer, and $M^{(h)}_{ij} = 0$ otherwise. Then, during the forward pass of the NN with the Edge-Pop algorithm, the binary masks are applied to the weight matrices before the forward propagation:

$$x^{(h)} = \sigma\left(\left(M^{(h)} \circ W^{(h)}\right) x^{(h-1)}\right), \quad 1 \le h \le L,$$

where $\circ$ denotes the Hadamard product.
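The masked forward pass described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function names `topk_mask` and `masked_forward` are hypothetical, ReLU is used for $\sigma$, bias terms are omitted, and ties between equal score magnitudes are broken arbitrarily.

```python
import numpy as np


def topk_mask(scores, k_percent):
    """Binary mask that keeps the top k% of entries by |score| within one layer."""
    flat = np.abs(scores).ravel()
    n_keep = max(1, int(round(flat.size * k_percent / 100.0)))
    # Indices of the n_keep largest magnitudes (order among them is irrelevant).
    keep = np.argpartition(flat, flat.size - n_keep)[flat.size - n_keep:]
    mask = np.zeros(flat.size)
    mask[keep] = 1.0
    return mask.reshape(scores.shape)


def relu(x):
    return np.maximum(x, 0.0)


def masked_forward(x, weights, scores, k_percent):
    """Compute x^(h) = relu((M^(h) * W^(h)) @ x^(h-1)) layer by layer."""
    for W, S in zip(weights, scores):
        M = topk_mask(S, k_percent)   # binary mask from popup scores
        x = relu((M * W) @ x)         # Hadamard product, then linear map
    return x
```

For example, with scores `[[0.9, -0.1], [0.2, -0.8]]` and `k_percent=50`, only the two entries with the largest magnitude (0.9 and -0.8) survive, so the layer uses just the diagonal of its weight matrix.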
During the entire learning procedure of the Edge-Pop algorithm, the weight matrices stay fixed, and only the score matrices are updated with gradient descent. Note that, due to the use of binary masks, direct computation of the gradient is impossible. As such, the straight-through gradient estimator is used, so that the thresholding function $f(\cdot)$ is replaced by the identity function in the backward pass. This allows us to approximate the gradient for $S^{(h)}_{ij}$ by

$$\frac{\partial \mathcal{L}}{\partial x^{(h)}_i} \frac{\partial x^{(h)}_i}{\partial S^{(h)}_{ij}} = \frac{\partial \mathcal{L}}{\partial x^{(h)}_i} W^{(h)}_{ij} x^{(h-1)}_j,$$

where $\mathcal{L}$ denotes the cross-entropy loss. Given the gradient estimator, the popup scores can then be updated via stochastic gradient descent.

Lastly, we note that a naive random initialization of popup scores, as proposed by Ramanujan et al. [177], can lead to significantly worse performance. We therefore choose to initialize the popup scores based on the weights of the trained NNs. This is inspired by the recent success of magnitude-based pruning techniques. As such, for each layer $h$, we initialize the scores by

$$S^{(h)}_{ij} = \frac{W^{(h)}_{ij}}{\max\left(\left|W^{(h)}\right|\right)},$$

where $\max(|W^{(h)}|)$ denotes the maximum magnitude of the entries of $W^{(h)}$, so that all popup scores are normalized to lie in $[-1, 1]$. A similar approach was also adopted by Sehwag et al. [187].

C.2 Additional Ablation Studies

Optimal Number of Subnetworks in MIMO Networks. We experimentally investigate the number of subnetworks that can be fit into a MIMO network with the ResNet18 model using CIFAR10 and CIFAR100. We use the identical training procedure for all models as described in Section 5, varying only the number of subnetworks, as a direct comparison against our proposed orthogonal dropout method. Note that for MIMO models, changing the number of subnetworks amounts to simply adjusting the number of input images and the number of linear classification layers; three subnetworks correspond to a network with three inputs and three outputs. Results are summarized in Figure C.1.
A direct comparison of MIMO networks against our proposed orthogonal dropout strategy reveals that we can fit many more models into a network of the same capacity. Indeed, for the ResNet18 model, having even three subnetworks in the MIMO network can significantly degrade accuracy. We hypothesize that this is due to the way subnetworks are implemented in MIMO networks. During the training of MIMO networks, multiple inputs are concatenated together and fed into the network simultaneously. When the number of subnetworks grows, the number of channels in the concatenated inputs also grows proportionally, thereby making the simultaneous training of the subnetworks harder. After all, each input in the stack of inputs is independent of the others. Yet, there is no explicit constraint or regularization in MIMO networks to enforce such independence. As such, when the number of subnetworks becomes large, it can be hard for the network to capture the independence between the input images by itself.

[Figure C.1: six panels (a)-(f) for CIFAR-10 and CIFAR-100; x-axis: Number of Models (1 to 5); y-axes: Accuracy, NLL and ECE; curves: MIMO, Ours.]
Figure C.1: Plot of accuracy/NLL/ECE against number of models in the ensembles. For the proposed method, the number of models is varied by changing the size of each subnetwork and all the orthogonal dropout ensembles are of the same size. For MIMO networks, the number of models is varied by changing the number of inputs and outputs (classifier layers) of the networks.
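To make the channel growth mentioned above concrete, the following minimal sketch stacks several images along the channel axis, as the text describes for MIMO inputs. The helper name `stack_mimo_inputs` and the 3×32×32 image shape (typical for CIFAR) are assumptions for illustration only; the sketch does not reproduce the actual MIMO training code.

```python
import numpy as np


def stack_mimo_inputs(images):
    """Concatenate M images of shape (C, H, W) along the channel axis."""
    return np.concatenate(images, axis=0)


rng = np.random.default_rng(0)
for num_subnetworks in (1, 3, 5):
    images = [rng.random((3, 32, 32)) for _ in range(num_subnetworks)]
    stacked = stack_mimo_inputs(images)
    # The channel count is 3 * num_subnetworks; spatial size is unchanged.
    print(num_subnetworks, stacked.shape)
```

The first convolutional layer therefore sees an input whose channel count scales linearly with the number of subnetworks, while nothing in the input itself tells the network which channels belong to which independent image.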
Table C.1: Comparison against deep ensembles with a reduced number of convolutional filters, so that the deep ensemble has the same number of parameters as orthogonal dropout. "FC" stands for fixed classification.

                           CIFAR10                              CIFAR100
Method                     Accuracy (↑)   NLL (↓)   ECE (↓)     Accuracy (↑)   NLL (↓)   ECE (↓)
Rescaled Ensemble + FC     94.6%          0.174     0.0104      76.0%          0.911     0.0256
Orthogonal Dropout + FC    95.1%          0.157     0.0082      77.7%          0.864     0.0191

Table C.2: Comparison against baseline methods when all methods have a fixed classification layer.

                           CIFAR10                              CIFAR100
Method                     Accuracy (↑)   NLL (↓)   ECE (↓)     Accuracy (↑)   NLL (↓)   ECE (↓)
Dropout + FC               94.5%          0.187     0.0185      73.6%          1.12      0.0863
Batch Ensemble + FC        94.6%          0.210     0.0303      74.6%          1.02      0.0898
MIMO + FC                  94.6%          0.182     0.0146      75.1%          0.988     0.0384
Masksembles + FC           93.5%          0.203     0.0090      73.6%          0.969     0.0143
Orthogonal Dropout + FC    95.1%          0.157     0.0082      77.7%          0.864     0.0191

Additional Experiments with Deep Ensembles. To further demonstrate the effectiveness of the proposed method, we conduct a comparison against an ensemble of 5 independently trained networks using ResNet, each with the number of convolutional filters reduced by a factor of $\sqrt{1/5}$. Because the parameter count of a convolutional layer scales with the product of its input and output channel counts, each member then has roughly one fifth of the parameters, so that in total the size of this explicit ensemble is the same as that of an orthogonal dropout model. We report the comparison in Table C.1. Our proposed method outperforms the rescaled ensemble on every metric on both datasets.

Additional Experiments with Fixed Classifier Layer. To further demonstrate that fixing the classification layer is not the main source of the increase in performance, we conduct an additional experiment in which we fix the classification layer for all baseline methods using ResNet-18 and the CIFAR datasets. Results of the experiments are summarized in Table C.2. The proposed method outperforms all other methods in accuracy and NLL on both datasets; only Masksembles attains a lower ECE on CIFAR100.

Table C.3: Comparison against other types of dropout.
                           CIFAR10                              CIFAR100
Method                     Accuracy (↑)   NLL (↓)   ECE (↓)     Accuracy (↑)   NLL (↓)   ECE (↓)
Dropout                    94.4%          0.191     0.0202      73.3%          1.11      0.0902
DropChannel                94.3%          0.239     0.0343      73.9%          1.21      0.136
DropBlock                  94.5%          0.163     0.0139      73.3%          1.226     0.137
Orthogonal Dropout         95.1%          0.157     0.0082      77.7%          0.864     0.0191

Additional Experiments with Dropout. We conduct an additional ablation study to compare against more advanced forms of dropout such as DropChannel and DropBlock. Results of the experiments are summarized in Table C.3.

APPENDIX D
SUPPLEMENTARY MATERIAL FOR "TOWARDS A DEEPER UNDERSTANDING OF KNOWLEDGE DISTILLATION"

D.1 On Label Smoothing and Predictive Uncertainty Regularization

We first give a derivation of the equivalence between label smoothing regularization and Eq. 6.7. With some simple rearrangement of the terms,

$$\mathcal{L}_{LS} = -\sum_{i=1}^{n} \log [z_i]_{y_i} + \beta \sum_{i=1}^{n} \sum_{c=1}^{k} -\frac{1}{k} \log [z_i]_c = -(1+\beta) \sum_{i=1}^{n} \left( \frac{k+\beta}{k(1+\beta)} \log [z_i]_{y_i} + \frac{\beta}{k(1+\beta)} \sum_{c \neq y_i} \log [z_i]_c \right).$$

The above objective is clearly equivalent to label smoothing regularization with $1 - \epsilon = \frac{k+\beta}{k(1+\beta)}$, up to a constant factor of $(1+\beta)$.

Label smoothing regularizes predictive uncertainty. The amount of regularization is controlled by the amount of smoothing applied. Evidently, the objective does not regularize confidence diversity. Indeed, assuming a NN with enough capacity to fit the entire training data, predictions on the training data will be pushed arbitrarily close to the smoothed soft label. Empirical evidence for this form of overfitting can be seen in the experiments of Müller et al. [153], in which the authors demonstrated that applying label smoothing hampers distillation performance. The authors hypothesize that this is likely due to the erasure of "relative information between logits" when label smoothing is applied, hinting at the overfitting of predictions to the smoothed labels.
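The equivalence stated in D.1 can be checked numerically. The sketch below is an illustration rather than code from the thesis: the function names are hypothetical, and it assumes that $[z_i]_c$ denotes the softmax probability of class $c$ for sample $i$. It evaluates the objective with the uniform penalty weighted by $\beta$, and compares it with $(1+\beta)$ times the cross-entropy against smoothed labels that place $\frac{k+\beta}{k(1+\beta)}$ on the true class and $\frac{\beta}{k(1+\beta)}$ on every other class.

```python
import numpy as np


def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def loss_uniform_penalty(logits, labels, beta):
    """-sum_i log z_{y_i} + beta * sum_i sum_c -(1/k) log z_c."""
    logp = log_softmax(logits)
    n, k = logp.shape
    nll = -logp[np.arange(n), labels].sum()
    return nll + beta * (-logp.sum() / k)


def loss_label_smoothing(logits, labels, beta):
    """(1 + beta) times the cross-entropy against the smoothed labels."""
    logp = log_softmax(logits)
    n, k = logp.shape
    # Off-target classes receive beta / (k (1 + beta)) each.
    soft = np.full((n, k), beta / (k * (1.0 + beta)))
    # The true class receives 1 - epsilon = (k + beta) / (k (1 + beta)).
    soft[np.arange(n), labels] = (k + beta) / (k * (1.0 + beta))
    return (1.0 + beta) * -(soft * logp).sum()
```

The two functions agree up to floating-point error for any logits, labels and $\beta \ge 0$, which mirrors the rearrangement in the derivation; each row of the smoothed label matrix sums to one.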
A closely related regularization technique is to explicitly regularize predictive uncertainty:

$$\mathcal{L}_{PU} = -\sum_{i=1}^{n} \log [z_i]_{y_i} + \beta \sum_{i=1}^{n} \sum_{c=1}^{k} [z_i]_c \log [z_i]_c.$$

Prior papers [38, 169] have demonstrated that directly regularizing predictive uncertainty can lead to better performance than label smoothing. However, we note that the above objective does not regularize confidence diversity either. In fact, it can be shown with the method of Lagrange multipliers that the optimum of the objective above is attained when

$$[z_i]_{y_i} = \frac{1}{\beta\, W\!\left(\exp(-1/\beta)(k-1)/\beta\right) + 1},$$

where $W$ corresponds to the Lambert W function, and $[z_i]_c = \frac{1 - [z_i]_{y_i}}{k-1}$ for all $c \neq y_i$, for all sample pairs $(x_i, y_i)$. As such, the global optimum obtained by directly regularizing predictive uncertainty takes the same form as that of label smoothing: a fixed confidence on the true class and uniform mass on the remaining classes. In practice, differences between the two can arise due to the details of the optimization procedure (such as early stopping) and/or due to model capacity.

D.2 Additional Experiments with Temperature Scaling on Student Models

To examine the effect of not applying temperature scaling to student models, we conduct an experiment comparing models trained with and without temperature scaling on the student in the distillation loss, using ResNet-34 on the CIFAR-100 dataset and the training objective of Eq. 6.3. On top of the hyper-parameter α = 0.4 used for experiments in Section 6.7, we also include results with α = 0.1, a widely used value for knowledge distillation in prior work [30]. We vary the amount of temperature scaling applied to illustrate the effect that different temperatures have on student models.
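The two variants compared in this experiment can be sketched as follows. Eq. 6.3 is not reproduced in this appendix, so the form below is an assumption: a commonly used objective in which α weights the cross-entropy against the hard labels and (1 − α) weights the cross-entropy against temperature-scaled teacher predictions. The function name and the `scale_student` flag are hypothetical and only serve to contrast scaling both models with scaling the teacher only.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)
    e = np.exp(scaled)
    return e / e.sum(axis=1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, labels,
                      alpha, temperature, scale_student):
    """Assumed form: alpha * CE(hard labels) + (1 - alpha) * CE(teacher soft labels)."""
    n = student_logits.shape[0]
    # Cross-entropy against ground-truth labels (student at temperature 1).
    hard = -np.log(softmax(student_logits)[np.arange(n), labels]).mean()
    # The teacher predictions are always temperature-scaled.
    soft_targets = softmax(teacher_logits, temperature)
    # "Scale both" also divides the student logits by T; "scale teacher only" does not.
    student_t = temperature if scale_student else 1.0
    student_probs = softmax(student_logits, student_t)
    soft = -(soft_targets * np.log(student_probs)).sum(axis=1).mean()
    return alpha * hard + (1.0 - alpha) * soft
```

With `temperature=1.0` the two variants coincide; they differ only in whether the student's own predictions are softened before being matched to the softened teacher predictions.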
[Figure D.1: four panels (a)-(d) for alpha=0.1 and alpha=0.4; x-axis: Temperature (2.0 to 5.0); y-axes: Accuracy (left) and ECE (right); curves: Scale both, Scale teacher only.]
Figure D.1: Left: Test accuracies of ResNet-34 models on the CIFAR-100 dataset when varying temperature. Right: ECE of ResNet-34 models on the CIFAR-100 dataset when varying temperature. "Scale both" corresponds to the originally proposed distillation objective in which both teacher and student models are temperature-scaled during training. "Scale teacher only" corresponds to only temperature scaling teacher models during distillation. The green flat line represents the performance achieved by the teacher model trained with cross-entropy loss.

Plots of test accuracy and ECE against the amount of temperature scaling applied are shown in Fig. D.1. Firstly, we observe that models trained with student scaling have ECE almost identical to that of the teacher models. In direct contrast, the student models trained without student scaling are in general much better calibrated than their teachers. Note that the relatively large ECE when α = 0.4 and T > 3 is likely due to overly unconfident teacher predictions. In addition, we highlight that, with the optimal hyper-parameters α and T, student models trained without student scaling can also achieve significantly higher test accuracy. We acknowledge that there can be conflicts between ECE and accuracy, as seen from the superior test accuracy but poor ECE achieved for α = 0.4 and T = 4.0. In practice, we can use the negative log likelihood, a metric influenced by both ECE and accuracy, to find the optimal α and T.
Lastly, we note that both α and T alter the amount of predictive uncertainty and confidence diversity in teacher predictions at the same time. This coupled effect could be the reason for the observed conflict between ECE and accuracy. We leave it as future work to explore alternative ways to decouple the two measures for more efficient and effective parameter search. We believe a decoupled set of parameters can lead to models with better calibration and accuracy at the same time.

D.3 Additional Experiments on Sequential Self-Distillation with Different Temperatures

[Figure D.2: two rows of four panels; x-axis: Generations (0 to 5); y-axes: Test Accuracy, Test NLL, Predictive Uncertainty and Confidence Diversity.]
Figure D.2: Results for sequential self-distillation over 5 generations are shown above for different temperatures. Top: temperature T = 2.0; Bottom: temperature T = 3.0. The same temperatures are used throughout the entire sequential distillation process. The model obtained at the (i − 1)-th generation is used as the teacher model for training at the i-th generation. Accuracy and NLL are obtained on the test set using the student model, whereas the predictive uncertainty and confidence diversity are evaluated on the training set with teacher predictions.

To further verify the observation on predictive uncertainty and confidence diversity made empirically in Section 4, we conduct additional sequential self-distillation experiments with different values of temperature. Figure D.2 summarizes the results when the temperature is 2 (top) and 3 (bottom), respectively.
Test accuracy and NLL correlate strongly with confidence diversity, further demonstrating the importance of confidence diversity for greater generalizability in neural networks.

D.3.1 Additional Experiments with Different Amount of Label Smoothing ϵ

In order to verify that the conclusions drawn from our empirical experiments hold more generally, we conduct additional experiments varying the amount of label smoothing ϵ. Additional smoothing parameters of ϵ = 0.1 and ϵ = 0.3 are used. For a fair comparison, given the label smoothing parameter ϵ, the hyper-parameters for Beta smoothing and self-distillation are adjusted so that the average amount of smoothing applied per sample matches that of label smoothing. Experimental results are summarized in Figure D.3. Observe that the general trend in terms of both accuracy and calibration holds across different values of ϵ.

[Figure D.3: six panels (a)-(f) for CIFAR-100, CUB-200 and Tiny-ImageNet with ResNet-34; bars: LS, B, SD; y-axes: Accuracy (top) and ECE (bottom).]
Figure D.3: Experimental results on CIFAR-100, CUB-200 and the Tiny-ImageNet dataset with different amounts of label smoothing. Left: ϵ = 0.1, Right: ϵ = 0.3. "CE", "LS", "B" and "SD" refer to "Cross Entropy", "Label Smoothing", "Beta Smoothing" and "Self-Distillation" respectively. The top rows of each experiment show bar charts of accuracy on the test set for each experiment conducted, while the bottom rows are bar charts of expected calibration error.
[Figure D.4: three panels (a)-(c) for CIFAR-100, CUB-200 and Tiny-ImageNet with ResNet-34; bars: B, ST; y-axes: Accuracy (top) and ECE (bottom).]
Figure D.4: Additional results to compare Beta smoothing against self-training explicitly with the EMA predictions. "B" and "ST" refer to "Beta smoothing" and "self-training" respectively. The top rows of each experiment show bar charts of accuracy on the test set for each experiment conducted, while the bottom rows are bar charts of expected calibration error.

D.4 Additional Experiments with Self-Training Using EMA-Predictions

The proposed Beta smoothing uses EMA predictions to rank the confidence of samples within each minibatch during training in order to achieve instance-specific regularization. To further demonstrate that the gain in accuracy and calibration obtained through Beta smoothing mainly comes from instance-specific regularization, we compare Beta smoothing against explicit self-training with the EMA predictions, in which the EMA predictions are directly used as soft labels to compute the cross-entropy loss. We follow the training procedure described in [201] for self-training with EMA predictions. Results using ResNet for all the datasets considered in this chapter are summarized in Figure D.4. Beta smoothing outperforms self-training using EMA predictions in all of the experiments conducted in terms of both accuracy and calibration. As such, while EMA predictions can be used as a reliable proxy to rank the relative confidence of samples, the predictions themselves are sub-optimal when used as teachers directly.

D.5 Additional Experiments with CIFAR-10 When Varying Trainset Size

Recent results show relatively small gains when performing knowledge distillation on the CIFAR-10 dataset [30, 49].
Our perspective of distillation as regularization provides a plausible explanation for this observation. Like all other forms of regularization, its effect diminishes as the size of the training data increases. We experimentally verify the claim by training ResNet-34 models with a varying number of training samples. The experiments are repeated 3 times. Fig. D.5 summarizes the results. As expected, increasing the sample size leads to an increase in test accuracy for both of the models. Nevertheless, the relative improvement in the accuracy of the student model compared to the teacher decreases as the size of the training set increases, consistent with distillation being a form of regularization.

[Figure D.5: two panels (a)-(b); x-axis: Training Set Size (5000 to 45000); y-axis of panel (a): test accuracy; curves: Teacher, Student.]
Figure D.5: Left: Test accuracies of ResNet-34 models on the CIFAR-10 dataset for the teacher and student models when the training set size is varied. Right: The relative improvements in accuracy when the training set size is varied.

D.6 Additional Experiments with CIFAR-100 When Varying Weight Decay

To further demonstrate that distillation is a regularization process, we also conduct an additional experiment on the CIFAR-100 dataset using ResNet-34, varying only the weight decay hyper-parameter. Intuitively, larger weight decay regularization makes NNs less prone to overfitting, which should, in turn, reduce the additional benefits obtainable from self-distillation, if it is indeed a form of regularization.
To keep the quality of priors identical across all student models, we use the same teacher model, obtained using a weight decay of $10^{-4}$, for all distillation. Our results are summarized in Fig. D.6. It is evident that increasing the weight decay hyper-parameter leads to a much smaller improvement in test accuracy. Interestingly, we see a noticeable gain in accuracy for baseline models trained with cross-entropy when adjusting the weight decay term, contradicting some of the recent findings that weight decay is ineffective for neural networks.

[Figure D.6: two panels (a)-(b); x-axis: Weight Decay (0.0000 to 0.0010); y-axes: Test Accuracy (curves: Student, Teacher) and Accuracy Improvement.]
Figure D.6: Left: Test accuracies of ResNet-34 models on the CIFAR-100 dataset for the teacher and student models when the weight decay hyper-parameter is varied. Right: The relative improvements in accuracy when the weight decay hyper-parameter is varied.

D.7 Additional Experiments on Beta Smoothing

[Figure D.7: six panels (a)-(f) for CIFAR-100, CUB-200 and Tiny-ImageNet with ResNet-34 and DenseNet; bars: LS, RB, B; y-axes: Accuracy (top) and ECE (bottom).]
Figure D.7: Ablation study on Beta smoothing. "LS", "RB" and "B" refer to "Label Smoothing", "Random Beta Smoothing" and "Beta Smoothing" respectively. The top rows of each experiment show bar charts of accuracy on the test set for each experiment conducted, while the bottom rows are bar charts of expected calibration error.

We conduct an ablation study on the proposed Beta smoothing regularization in order to demonstrate the importance of relative ranking.
To do so, we run experiments with the identical setup as described in Section 6.7 for Beta smoothing, but with soft label noise drawn from the Beta distribution and assigned to samples completely at random. We term this "random Beta smoothing". Results are shown in Fig. D.7. For convenience, we also include results obtained with regular label smoothing as a benchmark. The proposed Beta smoothing, with ranking obtained from EMA predictions, leads to much better results in general in terms of both accuracy and ECE. This suggests that naively encouraging confidence diversity does not lead to significant improvements, and that the relative confidence among different samples is also important for obtaining better student models. This ablation study also serves as indirect evidence for why self-distillation still outperforms Beta smoothing: with a pre-trained model, much more reliable relative confidence among training samples can be obtained.

D.8 Additional Experiments on the Effect of Quality of Teachers

We also perform an additional experiment with the identical setup as described in Section 6.7 on cross-distillation of the ResNet and the DenseNet models, in which a ResNet-34 teacher is used to train the DenseNet-100 student and vice versa, in an attempt to examine the effect of better or worse priors in self-distillation. Hyper-parameters are fixed in this case such that the predictive uncertainty and diversity associated with the label predictions remain the same as for self-distillation. Results are summarized in Fig. D.8. As seen from the consistently better/worse performance of cross-distillation for ResNet/DenseNet, better teachers lead to better performance.
Thus, in addition to diversity among teacher predictions, the quality of the instance-specific prior used is also important for better generalization performance. Lastly, we also see an apparent benefit in terms of model calibration when a better teacher model is used.

[Figure D.8: six panels (a)-(f) for CIFAR-100, CUB-200 and Tiny-ImageNet with ResNet-34 and DenseNet; bars: SD, CD; y-axes: Accuracy (top) and ECE (bottom).]
Figure D.8: Additional results on cross-distillation. "SD" and "CD" refer to "self-distillation" and "cross-distillation" respectively. The top rows of each experiment show bar charts of accuracy on the test set for each experiment conducted, while the bottom rows are bar charts of expected calibration error.

The interpretation of distillation as sample-specific regularization provides us with a reasonable explanation of why deeper NNs are potentially better teachers. With greater capacity, deeper networks can learn better representations that capture more closely the true underlying relative confidence among samples, thereby generating better priors and hence better performance. When overly expressive models are used, however, there can be so much overfitting to the ground truth labels that the meaningful rankings are destroyed, despite better accuracy. Recent findings experimentally corroborate our argument [30]. Similar observations were also made when label smoothing is applied [153]. From the regularization perspective, distillation can also be applied to very deep networks for potential improvements, and shallower models can also serve as teachers for deeper student networks.
D.9 Additional Experiments on Varying γ

In addition, we consider a simple variation of the distillation loss obtained by varying γ. However, directly adjusting γ can be problematic in practice. To understand the effect of changing γ, suppose we have some γ such that $[\alpha_x]_c - 1 < 0$ for some $c \in \{1, \ldots, k\}$. Since the minimization objective with respect to this class is $-([\alpha_x]_c - 1)\log([z]_c)$, the closer $[z]_c$ is to 0, the smaller the loss. This leads to numerical issues, as the overall loss can be pushed to negative infinity by forcing $[z]_c$ arbitrarily close to zero.

[Figure D.9: six panels (a)-(f) for CIFAR-100, CUB-200 and Tiny-ImageNet with ResNet-34 and DenseNet; bars: SD, PD; y-axes: Accuracy (top) and ECE (bottom).]
Figure D.9: Additional results on pruned distillation. "SD" and "PD" refer to "self-distillation" and "pruned-distillation" respectively. The top rows of each experiment show bar charts of accuracy on the test set for each experiment conducted, while the bottom rows are bar charts of expected calibration error.

To circumvent the numerical problem during optimization, we make the observation that the above objective is essentially equivalent to setting the particular element with $[\alpha_x]_c - 1 < 0$ to zero. As such, adjusting the threshold γ enables us to prune out the smallest elements of the teacher predictions. To further force the pruned elements to zero, a new softmax probability vector is computed with the remaining elements. In practice, setting the optimal γ can be challenging. We instead choose to prune out a fixed percentage of classes for all samples.
For instance, pruning 50% of the classes in a 100-class classification task amounts to using only the 50 most confident classes to compute the softmax and setting the remaining probabilities to zero. We term this method pruned-distillation.

We show some preliminary results for pruned-distillation with 50% of the classes pruned during distillation in Fig. D.9. While the overall performance remains the same for the CIFAR-100 and Tiny-Imagenet datasets, a slight improvement can be seen for CUB-200 in terms of both accuracy and ECE, suggesting that the method is an easy-to-implement adjustment that does no harm.
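The pruning step described above can be sketched as follows. This is a minimal illustration and not the thesis implementation: the function name `pruned_teacher_probs` and the `keep_fraction` argument are hypothetical, and the sketch assumes pruning operates on teacher logits (any temperature scaling is assumed to have been applied beforehand). Since softmax is monotonic, selecting the largest logits selects the most confident classes, and a softmax over the kept logits equals renormalizing the kept probabilities.

```python
import numpy as np


def pruned_teacher_probs(teacher_logits, keep_fraction=0.5):
    """Keep the top classes per sample, recompute softmax over them, zero the rest."""
    n, k = teacher_logits.shape
    n_keep = max(1, int(round(k * keep_fraction)))
    # Column indices of the n_keep largest logits in each row.
    top = np.argpartition(teacher_logits, k - n_keep, axis=1)[:, k - n_keep:]
    rows = np.arange(n)[:, None]
    kept = teacher_logits[rows, top]
    kept = kept - kept.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(kept)
    probs = np.zeros_like(teacher_logits, dtype=float)
    probs[rows, top] = e / e.sum(axis=1, keepdims=True)
    return probs
```

With `keep_fraction=0.5` on a 100-class problem, each row of the returned matrix has exactly 50 non-zero entries that sum to one, and these soft labels would then replace the unpruned teacher predictions in the distillation loss.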