RELIABLE DEEP LEARNING WITH APPLICATION
TO DIGITAL HISTOPATHOLOGY IMAGE ANALYSIS
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Zhilu Zhang
May 2022
© 2022 Zhilu Zhang
ALL RIGHTS RESERVED
RELIABLE DEEP LEARNING WITH APPLICATION TO DIGITAL
HISTOPATHOLOGY IMAGE ANALYSIS
Zhilu Zhang, Ph.D.
Cornell University 2022
Deep learning has achieved tremendous success over the past decade, pushing the
limit in various application domains such as computer vision and natural language
processing. Despite the advancements, recent work has demonstrated potential
risks associated with modern neural networks, hindering the reliability of such
deep learning systems for real-world applications. In this thesis, we consider several
challenges associated with the reliable application of neural networks. Specifically,
the broad term of reliability is broken down into two aspects. Firstly, it
has been demonstrated that typical deep learning systems are prone to overfitting
to noisy labels commonly present in large-scale datasets, thereby leading to
sub-optimal performance. To alleviate this problem, we propose a novel loss function
and demonstrate its robustness against label noise. Secondly, prior work has
highlighted problems in the uncertainty quantification of neural networks. This
can significantly hamper the interpretability of neural network predictions. In this
thesis, we discuss several strategies that enable us to obtain neural networks with
better uncertainty estimates. Lastly, as a case study, we apply deep learning
to the real-world problem of large-scale whole-slide histopathology image classification,
and demonstrate the effectiveness of such a deep learning system for
real-world medical application.
BIOGRAPHICAL SKETCH
Zhilu Zhang was born in Wuhan, China. He moved to Singapore in 2007 and
completed his high school at Temasek Junior College, Singapore. He traveled to
the US to further pursue his academic career and obtained a Bachelor of Arts degree
with a major in Mathematics and Physics from Carleton College, MN. Following his
college graduation, Zhilu joined Cornell University as a Ph.D. student in the Fall
of 2016. During his Ph.D. career, he worked as a research intern at several companies,
including Amazon Web Services, Waymo, and ByteDance. After six years
at Cornell, he is graduating in May 2022. In the short term, he looks forward to
starting work as an applied scientist at Amazon Web Services.
ACKNOWLEDGEMENTS
First of all, I am deeply indebted to my advisor, Mert R. Sabuncu. As a novice
fresh out of college, I am extremely fortunate to have met Mert and to have been given
the precious opportunity to learn and work alongside him in a field I knew little
about back then. Throughout my Ph.D. career, I have learned so much in every
single aspect from him. I am, and will always be, immensely grateful for having
been able to work under his supervision for my Ph.D.
Because I was Mert’s second Ph.D. student, our lab started small. I am hugely grateful to
Evan M. Yu and Meenakshi Khosla for all their support and companionship during the
initial years of my Ph.D. and for the countless nights spent together in the lab. I
am also thankful to all my lab mates and friends for their support: Heejong Kim,
Alan Wang, Batuhan Karaman, Gia H. Ngo, Tianyu Ma, Matthew Pool, Carmen
Khoo, Victor Butoi, Cagla Bahadir, and Zijin Gu.
Lastly, I would like to express my gratitude to my parents, Guojun and Hong,
my brother, Zhiyuan, and my wife, Vianne. Without their tremendous under-
standing and encouragement in the past few years, it would have been impossible
for me to complete this journey.
TABLE OF CONTENTS
Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction and Background 1
1.1 A Brief Recap of Deep Learning . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Feed Forward Neural Networks . . . . . . . . . . . . . . . . . 3
1.1.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.4 Convolutional Neural Networks . . . . . . . . . . . . . . . . 8
1.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.1 Vulnerability to Noises . . . . . . . . . . . . . . . . . . . . . 10
1.2.2 Unreliable Model Confidence . . . . . . . . . . . . . . . . . . 10
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Label-Noise Robust Learning of Neural Networks with General-
ized Cross Entropy Loss 14
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Generalized Cross Entropy Loss for Noise-Robust Classifications . . 18
2.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Lq Loss for Classification . . . . . . . . . . . . . . . . . . . . 20
2.3.3 Truncated Lq Loss . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.1 Toward a Better Understanding of Lq Loss . . . . . . . . . . 27
2.4.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . 31
3 Improving Confidence Calibration for Convolutional Neural Net-
works with Structured Dropout 33
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 An Analysis of the Performance of MC Dropout . . . . . . . . . . . 36
3.3.1 MC Dropout as Ensembles of Dropout Models . . . . . . . . 36
3.3.2 Decomposing the Performance of Ensembles . . . . . . . . . 38
3.3.3 Performance of MC Dropout and Model Diversity . . . . . . 39
3.3.4 Omnibus Dropout . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.5 Enhanced Ensemble Diversity with Structured Dropout . . . 43
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5 Bayesian Active Learning . . . . . . . . . . . . . . . . . . . . . . . . 49
4 Enhancing Uncertainty Estimates with Efficient Neural Network
Ensembles 52
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 Fixing MC Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.1 Toward Enhancing Model Diversity . . . . . . . . . . . . . . 58
4.4.2 Enhancing Individual Model Performance . . . . . . . . . . 60
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5.2 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.5.3 How Many Subnetworks Can We Fit? . . . . . . . . . . . . . 70
5 Accelerating Uncertainty Estimates Computation with Uncertainty-
Aware Distribution Distillation 72
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.1 Preliminary: Dropout for Bayesian Deep Learning . . . . . . 77
5.3.2 A Teacher-Student Paradigm for Sample-free Uncertainty
Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.4.1 Semantic Segmentation . . . . . . . . . . . . . . . . . . . . . 84
5.4.2 Pixel-Wise Depth Estimation . . . . . . . . . . . . . . . . . 92
5.4.3 Ablation Study on Additional Augmentation . . . . . . . . . 94
5.4.4 Distilling from Deep Ensemble . . . . . . . . . . . . . . . . . 94
5.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6 Towards a Deeper Understanding of Knowledge Distillation 96
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3.1 Teacher-Student Training Objective . . . . . . . . . . . . . . 100
6.4 Multi-Generation Self-Distillation: A Close Look . . . . . . . . . . . 101
6.4.1 Predictive Uncertainty . . . . . . . . . . . . . . . . . . . . . 102
6.4.2 Confidence Diversity . . . . . . . . . . . . . . . . . . . . . . 102
6.4.3 Sequential Self-Distillation Experiment . . . . . . . . . . . . 104
6.5 An Amortized MAP Perspective of Self-Distillation . . . . . . . . . 106
6.5.1 Label Smoothing as MAP . . . . . . . . . . . . . . . . . . . 107
6.5.2 Self-Distillation as MAP . . . . . . . . . . . . . . . . . . . . 108
6.5.3 On the Relationship between Label Smoothing and Self-
Distillation . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.6 Beta Smoothing Labels . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.7 Empirical Comparison of Distillation and Label Smoothing . . . . . 112
6.7.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . 113
6.7.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.8 Discussion and Future Directions . . . . . . . . . . . . . . . . . . . 116
7 A Case Study of Deep Learning to Digital Pathology Image
Analysis 118
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.2.2 Image Preprocessing . . . . . . . . . . . . . . . . . . . . . . 124
7.2.3 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.2.4 Model Inference . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.2.5 Pathologist Evaluation . . . . . . . . . . . . . . . . . . . . . 127
7.2.6 Prediction Heatmap . . . . . . . . . . . . . . . . . . . . . . 127
7.2.7 UMAP Visualization . . . . . . . . . . . . . . . . . . . . . . 128
7.2.8 Statistical Analysis and Software . . . . . . . . . . . . . . . 128
7.2.9 Image Augmentation . . . . . . . . . . . . . . . . . . . . . . 129
7.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.3.1 ML Models Accurately Predict IDH Mutation Status . . . . 130
7.3.2 Single-Scale ML Models Make Distinct Errors Relative to
Each Other and to Humans . . . . . . . . . . . . . . . . . . 133
7.3.3 Patch-Level Predictions Reveal Features that Drive Accurate
and Inaccurate Predictions . . . . . . . . . . . . . . . . . . . 134
7.3.4 Patch-Level Embedding Vectors Reflect Diagnostically Rel-
evant Human-Identifiable Features . . . . . . . . . . . . . . 136
7.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
8 Conclusion 143
Bibliography 146
A Supplementary Material for ”Label-Noise Robust Learning of
Neural Networks with Generalized Cross Entropy Loss” 170
B Supplementary Material for ”Improving Confidence Calibration
for Convolutional Neural Networks with Structured Dropout” 175
B.1 Brief Review of Dropout As Bayesian Approximation . . . . . . . . 175
B.2 Relationship between Different Performance Metrics . . . . . . . . . 177
B.3 Additional Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
C Supplementary Material for ”Enhancing Uncertainty Estimates
with Efficient Neural Network Ensembles” 182
C.1 A Brief Review of the Edge-Pop Algorithm . . . . . . . . . . . . . . 182
C.2 Additional Ablation Studies . . . . . . . . . . . . . . . . . . . . . . 184
D Supplementary Material for ”Towards a Deeper Understanding of
Knowledge Distillation” 188
D.1 On Label Smoothing and Predictive Uncertainty Regularization . . 188
D.2 Additional Experiments with Temperature Scaling on Student Models 189
D.3 Additional Experiments on Sequential Self-Distillation with Differ-
ent Temperatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
D.3.1 Additional Experiments with Different Amount of Label
Smoothing ϵ . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
D.4 Additional Experiments with Self-Training Using EMA-Predictions 194
D.5 Additional Experiments with CIFAR-10 When Varying Trainset Size 195
D.6 Additional Experiments with CIFAR-100 When Varying Weight Decay 196
D.7 Additional Experiments on Beta Smoothing . . . . . . . . . . . . . 198
D.8 Additional Experiments on the Effect of Quality of Teachers . . . . 199
D.9 Additional Experiments on Varying γ . . . . . . . . . . . . . . . . . 201
LIST OF TABLES
2.1 Average test accuracy and standard deviation (5 runs) on exper-
iments with closed-set noise. We report accuracies of the epoch
where validation accuracy is maximum. Forward T and T̂ rep-
resent forward correction with the true and estimated confusion
matrices, respectively [168]. q = 0.7 was used for all experiments
with Lq loss and truncated Lq loss. Best 2 accuracies are bold faced. 28
2.2 Average test accuracy on experiments with CIFAR-10. We repli-
cated the exact experimental setup as in [213]. The reported accu-
racies are the average last epoch accuracies after training for 100
epochs. η = 40%. CCE, Forward and method by Wang et al. are
adapted for direct comparison. . . . . . . . . . . . . . . . . . . . . 31
3.1 Results on benchmark datasets comparing accuracy and uncer-
tainty estimates produced by different types of methods. The
top performing result for each metric is bold-faced. MC omnibus-dropout
is consistently the best method. The numbers in brackets
next to dropout methods correspond to the optimal drop rate
found by grid search. . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1 Results for ResNet models on various datasets. Best results for
efficient ensembles are highlighted in bold. Fixed classification layer
is used for orthogonal Dropout. See Table 4.3 and the Appendix
for further ablation study on this. . . . . . . . . . . . . . . . . . . . 64
4.2 Results for Wide ResNet28-10. Asterisk symbol (*) represents re-
sults adapted directly from [77]. Best results for efficient ensembles
are highlighted in bold. . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3 Ablation study of the proposed method. Orthogonal dropout methods
are trained without dropout mask optimization. “MO” corresponds
to “mask optimization” and “FC” corresponds to “Fixed
Classifier”. “Ind Acc” denotes the averaged individual model accuracy
in an ensemble, while “Ens Acc” represents the ensemble
accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1 Results on the segmentation problem. “T”, “S” and “AU”
correspond to the teacher model, the student model, and the aleatoric
uncertainty, respectively. “T+AU” corresponds to a teacher model
trained with the aleatoric uncertainty. “DD” corresponds to the
student trained using Dropout Distillation [21]. Best performing
results for each teacher-student pair are bold-faced. . . . . . . . . . 87
5.2 Results on depth estimation. “T”, “S” and “AU” correspond
to the teacher model, the student model, and the aleatoric
uncertainty, respectively. “T+AU” corresponds to a teacher model
trained with the aleatoric uncertainty. . . . . . . . . . . . . . . . . 93
5.3 Top-4 Rows: Impact of adding augmentation in training on quality
of uncertainty produced on the CamVid and NYU datasets. “T”
and “S” represent the teacher and student models, and “AUG” corresponds
to augmentation. Last Row: Uncertainty performance of
student model when a deep ensemble with five NNs is used as the
teacher model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.1 Summary of the demographics for the TCGA training, validation,
and test datasets and the WCM test datasets. No significant differ-
ences are seen in sex between the IDHmut and IDHwt groups. IDH
mutant gliomas show statistically significant enrichment in younger
patients, consistent with historic controls. † indicates average simulation
p-value: 140 IDH WT slides in the training dataset were
randomly sampled and one-way ANOVA was then conducted. Simulations
were repeated 1000 times. * indicates propensity score
matching accounting for age and sex . . . . . . . . . . . . . . . . . 124
B.1 Results comparing accuracy and uncertainty estimates obtained us-
ing a single model when drop rate = 0.1 for all models. The
top performing result for each metric is bold-faced. MC omnibus-
dropout is the best method in general. . . . . . . . . . . . . . . . . 179
C.1 Comparison against deep ensembles with reduced convolutional ker-
nel size, so that deep ensemble has the same number of parameters
as orthogonal dropout. ”FC” stands for fixed classification. . . . . 186
C.2 Comparison against baseline methods when all methods have fixed
classification layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
C.3 Comparison against other types of dropout. . . . . . . . . . . . . . 187
LIST OF FIGURES
2.1 (a), (b) Test accuracy against number of epochs for training with
CCE (orange) and MAE (blue) loss on clean data with (top)
CIFAR-10 and (mid) CIFAR-100 datasets. (c) Average softmax
prediction for correctly (solid) and wrongly (dashed) labeled train-
ing samples, for CCE (orange) and Lq (q = 0.7, blue) loss on
CIFAR-10 with uniform noise (η = 0.4). . . . . . . . . . . . . . . . 22
2.2 The test accuracy and validation loss against number of epochs for
training with Lq loss at different values of q. . . . . . . . . . . . . . 29
3.1 From left to right: (1) accuracy of MC dropout and deep ensemble;
(2) the relative improvements in accuracy of deep ensemble and MC
dropout; (3) Brier score of MC dropout and deep ensemble against
number of models; (4) the relative improvements in Brier score
of deep ensemble and MC dropout against number of models. . . . 38
3.2 Interrater Agreement (IA) of models with different types of dropout
with 0.1 dropout rate on the SVHN, CIFAR-10 and -100 datasets.
The lower the IA, the more diverse the predictions of the models.
Y-axis indicates different methods. MC dropout produces models
with much larger IA, hence less model diversity, than structured
dropout techniques in most of the cases. . . . . . . . . . . . . . . . 38
3.3 Test Brier score (left) and accuracy (right) against number of mod-
els for ensemble prediction at test time on CIFAR-10. This corre-
sponds to the number of different MC dropout instantiations at test
time of the same model. The model trained with omnibus dropout
performs best in terms of accuracy and Brier score. . . . . . . . . 41
3.4 Reliability diagrams of predictions produced by different models. . 45
3.5 Left : Test accuracy against number of training samples for models
with different methods of dropout and Variation Ratios as the ac-
quisition function on CIFAR-10. Right : Relative improvements in
test accuracy over that of the first iteration with different methods
of dropout. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1 Bar plots of accuracy of individual orthogonal dropout subnetworks
of ResNet models. ”i-th” model represents the i-th subnetwork
obtained using Algorithm 1 sequentially. . . . . . . . . . . . . . . . 67
4.2 Plot of accuracy/NLL/ECE against number of models in the en-
sembles. For orthogonal dropout, number of models is varied by
changing the size of each subnetwork and all the orthogonal dropout
ensembles are of the same size. ”FC” corresponds to ”Fixed Clas-
sifier”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1 An illustration of the proposed method. Given a trained teacher,
a deterministic student is used to approximately parameterize the
predictive distribution of the teacher model, enabling sample-free
uncertainty estimation. . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Example predictions on CamVid. Each uncertainty map shows
the sum of aleatoric and epistemic uncertainty. Same for all the
following example plots. . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3 Example predictions on Pascal VOC2012. . . . . . . . . . . . . . . 85
5.4 (a)-(c): Comparison of performance against the running time for
both the teacher (with the aleatoric uncertainty) and student model
using the CamVid dataset. (d) Speed-up ratios of uncertainty esti-
mates for the CamVid dataset with the Bayesian SegNet compared
to Huang et al. [91] and Postels et al. [172], two other sample-free
uncertainty estimation methods. . . . . . . . . . . . . . . . . . . . 88
5.5 Performance of models trained with CamVid and evaluated on
Cityscapes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.6 Top: Relative means of BALD for samples of seen and unseen
classes during training compared to the “Reference” models, which
refer to models trained with both seen and unseen classes. Bottom:
Distribution of BALD for samples of seen and unseen classes during
training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.7 Example predictions on CamVid when “pedestrian” and “bicyclist”
are held out during training. “Reference” refers to models trained
with all classes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.8 Example predictions on NYU. . . . . . . . . . . . . . . . . . . . . . 93
6.1 Results for sequential self-distillation over 10 generations are shown
above. Model obtained at the (i − 1)-th generation is used as the
teacher model for training at the i-th generation. Accuracy and
NLL are obtained on the test set using the student model, whereas
the predictive uncertainty and confidence diversity are evaluated
on the training set with teacher predictions. . . . . . . . . . . . . . 104
6.2 Results with teacher predictions scaled by varying temperature T .
The flat lines in the plots correspond to the largest/smallest values
achieved over 10 generations of sequential distillation with T = 1 in
the previous experiments for accuracy, predictive uncertainty and
confidence diversity/NLL. . . . . . . . . . . . . . . . . . . . . . . . 106
6.3 Experimental Results performed on CIFAR-100, CUB-200 and the
Tiny-Imagenet dataset. ”CE”, ”LS”, ”B” and ”SD” refer to
”Cross Entropy”, ”Label Smoothing”, ”Beta Smoothing” and ”Self-
Distillation” respectively. The top rows of each experiment show
bar charts of accuracy on test set for each experiment conducted,
while the bottom rows are bar charts of expected calibration error. 115
7.1 A schematic for the end-to-end process of model training and de-
ployment. WSI are tiled into patches of 256x256 size at 2.5X, 5X,
10X, and 20X magnification factors (A). In each training iteration
(mini-batch), 200 randomly selected and augmented patches
from a single magnification of a single WSI were passed to single-scale
DenseNet121 classifiers, initialized with ImageNet pre-trained
weights. Feature embedding vectors from each patch were then aggregated
using naïve averaging, and the resulting vector was then
passed to a final fully connected (linear) classifier (B). Following
training, the predictions of three versions of each single-scale model
trained with different random seeds were averaged to produce a
single-scale ensemble, and the predictions from each single-scale
ensemble were averaged to produce the multiscale ensemble (MSE)
predictions (C). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.2 ROC curves for the ML classifiers, pathologists, and hybrid models
on the WCM test data. Figure A compares the model performance
of the single-scale ensembles and the multi-scale ensemble
(MSE). The performance of the semiquantitative predictions of two
expert neuropathologists and the two-pathologist averaged consensus
are compared in Figure B. Figure C compares the predictions
of the top-performing neuropathologist with the MSE, and the hybrid
model generated by naïve averaging of pathologist and MSE
predictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.3 Patient-level predictions in the WCM test data, for the pathologists
and ML models. Panel A compares the semiquantitative prediction
scores of the two neuropathologists (κ = 0.656, R = 0.767). Panel
B compares the two-neuropathologist consensus predictions to the
multiscale classifier (κ = 0.598, R = 0.674). Panel C shows all
patient-level predictions using the single-scale models, multiscale
ensemble, individual pathologists (P1, P2), two-pathologist con-
sensus (P1+P2), and the hybrid classifier (P+WSIP1+MSE). . . . 132
7.4 Examples of the sliding-window visualizations, with
representative patches from regions from 3 example cases that provide
insight into features recognized by the classifier. (A) shows
a low power HE image of a slide that was accurately predicted
as IDHmut by the neuropathologists, but was incorrectly classified
by the MSE. (B) shows a heatmap of average pixel-level IDH mu-
tation status predictions. Selected patches from image A demon-
strate higher IDHmut predictions in regions of solid tumor (C),
with higher IDHwt predictions in regions of minimally involved
brain parenchyma (D). E and F show an example of a slide from
an IDHmut case, which was misclassified by both the neuropathol-
ogists and the ML classifier. Regions from this slide containing
tumor with monomorphic gemistocytic cytomorphology (G) and regions
of minimally involved brain parenchyma with perineuronal
and perivascular white space artifact (H) were associated with a
higher prediction for IDHmut, while areas of minimally involved
brain parenchyma without significant whitespace artifact (I) and
regions with more bizarre cytology (J) were associated with a higher
prediction of IDHwt status. Figures K and L show a slide from an
IDHmut glioma which was accurately predicted by the ML clas-
sifier, but inaccurately predicted by the neuropathologists. Areas
of mildly cellular tumor, both with and without whitespace arti-
fact (M and N respectively) were associated with higher IDHmut
predictions, while regions of necrosis (O) and regions of minimally
involved brain parenchyma (P) were associated with higher IDHwt
predictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.5 UMAP coordinates of the feature embedding vector activations
from patches passed through the 10x classifier. A shows some ex-
ample tiles in 2D UMAP coordinates. B shows the patch-level IDH
status prediction scores as predicted by the 10x classifier. Tiles
from region C demonstrate microcystic architecture. Tiles from re-
gion D demonstrate hypercellular regions of infiltrating tumor, with
round cytology, enriched for tumors with oligodendroglial morphol-
ogy. Tiles from region E demonstrate hypercellular regions of tumor
with a greater degree of nuclear spindling/elongation and nuclear
pleomorphism. Tiles from region F demonstrate brain parenchyma
without significant infiltration by tumor cells. . . . . . . . . . . . 137
B.1 Test Brier score (left) and accuracy (right) against number of mod-
els for ensemble prediction at test time on SVHN and CIFAR-100.
This corresponds to the number of different MC dropout instantiations
at test time of the same model. The model trained with
omnibus dropout performs best in terms of accuracy and Brier
score. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
B.2 Plots of test time NLL (left) and accuracy (right) against dropout
rate for models trained with different types of dropout on the
SVHN, CIFAR-10 and CIFAR-100 datasets. . . . . . . . . . . . . . 180
C.1 Plot of accuracy/NLL/ECE against number of models in the en-
sembles. For the proposed method, the number of models is varied
by changing the size of each subnetwork and all the orthogonal
dropout ensembles are of the same size. For MIMO networks, the
number of models is varied by changing the number of inputs and
outputs (classifier layers) of the networks. . . . . . . . . . . . . . . 185
D.1 Left: Test accuracies of ResNet-34 models on the CIFAR-100
dataset when varying temperature. Right: ECE of ResNet-34 mod-
els on the CIFAR-100 dataset when varying temperature. ”Scale
both” corresponds to the originally proposed distillation objective
in which both teacher and student models are temperature-scaled
during training. ”Scale teacher only” corresponds to only temper-
ature scaling teacher models during distillation. The green flat line
represents the performance achieved by the teacher model trained
with cross-entropy loss. . . . . . . . . . . . . . . . . . . . . . . . . 190
D.2 Results for sequential self-distillation over 5 generations are shown
above for different temperatures. Top: temperature T = 2.0; Bot-
tom: temperature T = 3.0. The same temperatures are used
throughout the entire sequential distillation process. Model ob-
tained at the (i− 1)-th generation is used as the teacher model for
training at the i-th generation. Accuracy and NLL are obtained
on the test set using the student model, whereas the predictive un-
certainty and confidence diversity are evaluated on the training set
with teacher predictions. . . . . . . . . . . . . . . . . . . . . . . . . 191
D.3 Experimental Results performed on CIFAR-100, CUB-200 and the
Tiny-Imagenet dataset with different amounts of label smoothing.
Left: ϵ = 0.1, Right: ϵ = 0.3. ”CE”, ”LS”, ”B” and ”SD” refer to
”Cross Entropy”, ”Label Smoothing”, ”Beta Smoothing” and ”Self-
Distillation” respectively. The top rows of each experiment show
bar charts of accuracy on test set for each experiment conducted,
while the bottom are bar charts of expected calibration error. . . . 193
D.4 Additional results to compare Beta smoothing against self-training
explicitly with the EMA predictions. ”B” and ”ST” refer to ”beta
smoothing” and ”self-training” respectively. The top rows of each
experiment show bar charts of accuracy on the test set for each
experiment conducted, while the bottom rows are bar charts of
expected calibration error. . . . . . . . . . . . . . . . . . . . . . . . 194
D.5 Left: Test accuracies of ResNet-34 models on the CIFAR-10 dataset
for the teacher and student models when the training set size is
varied. Right: The relative improvements in accuracy when the
training set size is varied. . . . . . . . . . . . . . . . . . . . . . . . 196
D.6 Left: Test accuracies of ResNet-34 models on the CIFAR-100
dataset for the teacher and student models when the weight de-
cay hyper-parameter is varied. Right: The relative improvements
in accuracy when the weight decay hyper-parameter is varied. . . . 197
D.7 Ablation study on Beta smoothing. ”LS”, ”RB” and ”B” refer
to ”Label Smoothing”, ”Random Beta Smoothing” and ”Beta
Smoothing” respectively. The top rows of each experiment show bar
charts of accuracy on the test set for each experiment conducted,
while the bottom rows are bar charts of expected calibration error. 198
D.8 Additional results on cross-distillation. ”SD” and ”CD” refer
to ”self-distillation” and ”cross-distillation” respectively. The top
rows of each experiment show bar charts of accuracy on the test
set for each experiment conducted, while the bottom rows are bar
charts of expected calibration error. . . . . . . . . . . . . . . . . . 200
D.9 Additional results on pruned distillation. "SD" and "PD" refer to
"self-distillation" and "pruned-distillation" respectively. The top
rows of each experiment show bar charts of accuracy on the test
set for each experiment conducted, while the bottom rows are bar
charts of expected calibration error.
CHAPTER 1
INTRODUCTION AND BACKGROUND
Ever since the seminal work by Krizhevsky et al. [112], deep learning has demon-
strated incredible capabilities in various fields in computer science like natural lan-
guage processing and computer vision, significantly advancing the state-of-the-art
and pushing the boundaries in such fields [124]. This resurgence of interest in neural
networks has even spread to many fields outside of computer science such
as physics, biology, and medicine [199, 212, 214]. With such demonstrated capa-
bilities and potentials, the reliability and interpretability of neural networks come
under careful scrutiny. Indeed, on top of good predictive performance achievable
with such neural network systems, it is also paramount to be able to understand
how reliable and interpretable such machine-generated predictions are. Recently,
numerous works have raised concerns on the vulnerability of deep learning for real-
world application purposes [64,232], highlighting the potential dangers of applying
deep learning for sensitive application domains like medical image diagnosis and
autonomous driving systems. In this thesis, we focus primarily on two aspects
of the broad term "reliability" of neural networks. Before discussing the challenges
associated with reliability, however, we start by offering a brief overview
of deep learning in Section 1.1. We then further elaborate in Section 1.2 on two
of the challenges associated with the reliability of deep neural networks. Finally,
we summarize in Section 1.3 the proposed methods to tackle the aforementioned
problems.
1.1 A Brief Recap of Deep Learning
Before giving a brief introduction to deep neural networks, we start by consid-
ering a simple linear model. Suppose we are given a set of input-output pairs
{(x_1, y_1), ..., (x_p, y_p)}, where x_i ∈ R^n and y_i ∈ R^m for all i = 1, ..., p. Such input-output
pairs can be any quantities that are correlated. For instance, the inputs x_i could
be the heights of human beings, and the outputs y_i could be their weights.
In modern computer vision applications, the inputs x_i could be vectors
that represent the pixels of images, and y_i could be the corresponding classes of
the input images. In a real-world medical imaging application, the inputs could be
MRI images of patients, and the outputs could be booleans that represent whether
the patients have cancer. In short, the goal of the task is to predict y given a
new x that is not present in the set of data points already given.
To solve this prediction problem, we can use a mathematical function, or model.
In the simplest case, we can assume that the
input-output pairs follow a simple linear relationship. In this case, a simple linear
transformation
ŷ ≜ f(x) = xW + b, (1.1)
where W ∈ R^{n×m} and b ∈ R^m, can be used to map inputs x to their corresponding
outputs y.
Given the above linear model, the natural question to ask is how to obtain
the optimal parameters W and b. Indeed, different parameters W and b define
distinctive linear transformations, and to find a model that is capable of providing
accurate predictions given inputs x, we need to be able to search for the set of
parameters. Such parameters can be obtained by solving an optimization problem
with a cost function. For a regression problem where the outputs y are real-valued
scalars (e.g. weights of human beings), a common choice of the cost function is
the least square cost function defined as:
L = \frac{1}{p} \sum_{p} \| \hat{y}_p - y_p \|^2. (1.2)
For notational convenience, if we include the constant variable 1 in x and include
the bias b in the weight W, Eqn 1.1 can be written more compactly as f(x) = xW.
As such, we can denote the above cost function involving a summation using
matrix multiplication L = (y − XW)^T (y − XW). Setting the derivative of
the cost function to zero, it is not difficult to see that the optimal solution to the above
optimization problem is W = (X^T X)^{-1} X^T y [76].
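As a minimal illustration of the closed-form solution above, the sketch below fits a linear model with a single input feature and a bias term in plain Python. The function name `fit_least_squares` and the toy data are assumptions made for this example. With one feature plus the constant 1, the normal equations reduce to a 2×2 linear system that can be inverted by hand.

```python
def fit_least_squares(xs, ys):
    """Solve the normal equations for y ≈ w * x + b (one feature plus a constant 1)."""
    n = len(xs)
    # Entries of X^T X and X^T y, where each row of X is (x_i, 1).
    sxx = sum(x * x for x in xs)
    sx = sum(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sy = sum(ys)
    det = sxx * n - sx * sx  # determinant of X^T X
    if det == 0:
        raise ValueError("X^T X is singular; inputs must not all be identical")
    # Apply the inverse of the 2x2 matrix [[sxx, sx], [sx, n]] to (sxy, sy).
    w = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical values).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_least_squares(xs, ys)
```

On this noiseless toy data the recovered slope and intercept match the generating line. With more features, a linear solver would normally replace the explicit inverse.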
1.1.1 Feed Forward Neural Networks
Despite its simplicity, the linear model described above is limiting. Indeed, it
predicts accurately only if the input-output pairs
follow a linear relationship. Unfortunately, our world is a complex system, and a
linear model would fail in most complicated tasks. To solve this problem,
a natural question to ask is: how can we model more complex relationships? We
can use more complicated functions.
Many methods have been proposed to model non-linear relationships. Feed-forward
neural networks are one of the popular approaches. The idea is shockingly
straightforward. Instead of having one linear transformation, we have multiple
nested linear transformations one after another, with some non-linear functions
in-between each one of the linear transformations. In this way, the overall function
of nested linear transformations would be highly non-linear and expressive, capable
of capturing very complex relationships.
For simplicity, let us first consider a neural network with one hidden layer or
two linear transformations. Similar to the case of linear regression, given an input
x, a linear transformation is first used to transform the input to xW + b. Then, a non-linear
activation σ is applied before a second linear transformation is applied.
Many types of activation are used in practice. Some popular examples include
ReLU, sigmoid, and tanh [176]. Note that such non-linear activation functions are
crucial in making the overall neural network function non-linear. Indeed, without
them, neural networks would merely be sequences of linear transformations, which
are also linear transformations collectively. Overall, mathematically, such a two-
layer feed-forward neural network is defined by
ŷ ≜ f(x) = σ(xW_1 + b)W_2. (1.3)
These models are called feed-forward because information flows linearly layer by
layer from inputs x to outputs ŷ. There are no feedback connections in which
intermediate hidden layer outputs of the model are fed back into itself.
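The forward pass of Eq. 1.3 can be sketched in a few lines of plain Python. The choice of ReLU for σ, the helper names, and the weight values below are assumptions made purely for illustration.

```python
def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, a) for a in v]


def affine(x, W, b=None):
    """Compute the row-vector product xW (+ b), with W stored as a list of rows."""
    out = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]
    if b is not None:
        out = [o + bj for o, bj in zip(out, b)]
    return out


def two_layer_forward(x, W1, b, W2):
    """Eq. 1.3: y_hat = sigma(x W1 + b) W2, with sigma chosen as ReLU here."""
    hidden = relu(affine(x, W1, b))
    return affine(hidden, W2)


# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
W1 = [[1.0, -1.0],
      [0.5, 2.0]]
b = [0.0, -1.0]
W2 = [[1.0],
      [3.0]]
y_hat = two_layer_forward([2.0, 1.0], W1, b, W2)
```

Here the second hidden unit receives a negative pre-activation and is clipped to zero by the ReLU, which is exactly the kind of behaviour a purely linear model cannot express.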
Despite their simplicity, neural networks are powerful models capable of
representing very complex functions when we stack many of these linear transformations
together. Rigorous analysis has shown that deep neural networks are
universal function approximators [106,136,167].
1.1.2 Regularization
More complex models are not necessarily always better. While deeper neural
networks are capable of representing more complex functions, they can also lead to
a phenomenon known as overfitting [76]. Overfitting happens when a model
predicts very well only on the training data (data points used to obtain the optimal
parameters of the model) but performs much worse on data not seen during the
optimization process. A central problem in machine learning is how to make a
model perform satisfactorily not only on the training data but also on the test
set. Indeed, predicting well on these unseen data points is the primary goal of a
machine learning model. To achieve this goal, many strategies have been proposed
to reduce test error or improve generalization performance. These strategies are
known collectively as regularization. Often, regularization can potentially come at
the cost of increased training errors. We scratch the surface and briefly discuss
several methods for regularizing neural networks.
Perhaps the simplest form of regularization is the parameter norm penalty.
Widely used in most machine learning models [76], this form of regularization works
by limiting the effective capacity of the models. This is done by adding a norm-
based penalty term to the loss function of optimization. This norm is computed
with respect to the model parameters so that the search space of parameters is
restricted to values closer to zero. For instance, one of the commonly used penalty
terms is the L2 norm regularization. In this case, for a two-layer neural network,
the overall loss function would then become
L = \frac{1}{p} \sum_{p} \| \sigma(x W_1 + b) W_2 - y_p \|^2 + \lambda_1 \|W_1\|^2 + \lambda_2 \|W_2\|^2 + \lambda_3 \|b_1\|^2, (1.4)
where λ_1, λ_2 and λ_3 are hyper-parameters called weight decay terms. Such hyper-parameters
can be selected based on a holdout set of training data not used directly
to obtain the optimal weights of neural networks. This dataset is commonly known
as the validation set.
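To make the penalty terms of Eq. 1.4 concrete, the following sketch adds a squared L2 norm penalty for each parameter group to a precomputed data-fitting loss. The function names, the flattened parameter lists, and the numerical values are hypothetical.

```python
def squared_l2_norm(values):
    """Sum of squared entries of a flat list of parameters."""
    return sum(v * v for v in values)


def l2_penalized_loss(data_loss, param_groups, lambdas):
    """Add lambda_k * ||group_k||^2 for each parameter group, as in Eq. 1.4."""
    penalty = sum(lam * squared_l2_norm(group)
                  for lam, group in zip(lambdas, param_groups))
    return data_loss + penalty


# Hypothetical flattened parameters W1, W2, b1 and weight decay terms.
W1 = [1.0, -2.0]
W2 = [0.5]
b1 = [0.0]
loss = l2_penalized_loss(0.3, [W1, W2, b1], [0.1, 0.2, 0.0])
```

Because the penalty grows with the magnitude of the weights, minimizing the total loss trades some training error for parameters closer to zero.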
Another recently proposed regularization method most commonly used for neu-
ral networks is dropout regularization [195,210]. Similar to L2 norm regularization,
dropout regularization also restricts the effective capacity of neural networks. This
is done by randomly dropping out a subset of the parameters of neural networks
during training. Effectively, a smaller model is used for learning the training data,
thereby regularizing the model. The parameters to remove are typically sampled
i.i.d. from a Bernoulli distribution.
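A minimal sketch of the Bernoulli masking step is given below. It is applied to a vector of hidden activations and rescales the surviving entries by 1/(1 - drop_prob). Both choices follow the common "inverted dropout" convention and are assumptions of this example, not details stated in the text.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each entry independently with probability drop_prob (Bernoulli mask).

    Survivors are rescaled by 1 / (1 - drop_prob) so that the expected value
    of each entry is unchanged (assumed "inverted dropout" convention).
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)
hidden = [1.0] * 10000
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
fraction_zeroed = dropped.count(0.0) / len(dropped)
```

At test time the mask is usually switched off, unless stochastic forward passes are wanted, as in the Monte Carlo dropout methods discussed in later chapters.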
One of the best ways to improve the generalization performance is arguably
to train our models with more data. However, in most scenarios, data can be
expensive to obtain. In such scenarios, we can synthetically generate more data
as a form of regularization. In this way, our models can overfit less to the lim-
ited amount of training data. Such data generation processes often leverage the
inherent structure of the data so that the inputs are perturbed in ways that do
not change the corresponding outputs. For instance, in the case of natural images,
we can rely on the invariance of image labels to a limited amount of translation,
rotation, and scaling as a way to generate additional training
data [189]. In general, these methods are collectively known as data augmentation.
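As a small, hedged example of label-preserving augmentation, the sketch below generates horizontally translated copies of an image stored as a list of pixel rows. The helper names, the zero padding, and the choice of shifts are assumptions for illustration only.

```python
def translate(image, dx, fill=0):
    """Shift each row by dx pixels (positive = right), padding with `fill`."""
    width = len(image[0])
    dx = max(-width, min(width, dx))
    if dx >= 0:
        return [[fill] * dx + row[:width - dx] for row in image]
    return [row[-dx:] + [fill] * (-dx) for row in image]


def augment(image, label, shifts=(-1, 0, 1)):
    """Return translated copies of one example, all sharing the original label."""
    return [(translate(image, dx), label) for dx in shifts]


image = [[1, 2, 3],
         [4, 5, 6]]
augmented = augment(image, label="cat")
```

Each original example yields three training pairs here, and the label is left untouched because a small shift does not change what the image depicts.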
1.1.3 Optimization
With a neural network model and a loss function, our next step is to find the set of
parameters that predicts outputs accurately given inputs. Unlike the linear model,
due to the non-convexity of the objective function, the loss functions of neural
networks do not have closed-form solutions. Instead, we rely on gradient descent
to optimize the loss function [14]. In general, suppose we have a loss function L(θ),
where θ ∈ R^d denotes all the parameters associated with the model. The main idea
of gradient descent involves updating the parameters θ in the opposite direction
of the gradient of the loss function ∇_θ L(θ) with respect to the parameters. To
visualize, we follow the direction of the slope of the surface created by the loss
function downhill until we reach a valley. Then, the set of parameters θ obtained
at this valley should be one with a small loss, and hence good generalization
performance.
Traditionally, the gradient ∇_θ L(θ) is computed with the entire training data.
This is termed batch gradient descent. At the i-th iteration of the gradient descent,
parameters θ are updated with the update rule
θ_i = θ_{i−1} − η∇_θ L(θ_{i−1}), (1.5)
where η denotes a hyper-parameter called the learning rate. η controls the magnitude
of the step taken at each iteration. Due to the non-linearity of the
loss function, a large learning rate can hinder convergence and lead to unstable
training. On the other hand, a small learning rate can lead to slow convergence of
the model.
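The update rule of Eq. 1.5 and the effect of the learning rate can be illustrated on a toy quadratic loss. The loss L(θ) = Σ_k (θ_k − 3)^2, the starting point, and the two learning rates below are hypothetical choices made for this sketch.

```python
def gradient_descent(grad, theta0, lr, steps):
    """Iterate theta_i = theta_{i-1} - lr * grad(theta_{i-1}) (Eq. 1.5)."""
    theta = list(theta0)
    for _ in range(steps):
        g = grad(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


def grad_quadratic(theta):
    """Gradient of the toy loss L(theta) = sum_k (theta_k - 3)^2."""
    return [2.0 * (t - 3.0) for t in theta]


theta_small_lr = gradient_descent(grad_quadratic, [0.0, 10.0], lr=0.1, steps=100)
theta_large_lr = gradient_descent(grad_quadratic, [0.0, 10.0], lr=1.1, steps=100)
```

With lr=0.1 the distance to the minimizer shrinks by a factor of 0.8 per step, whereas with lr=1.1 it is multiplied by -1.2 per step, so the iterates oscillate and diverge.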
In the modern era of big data, it is often computationally infeasible to compute
the gradient with respect to the entire training set. To overcome this problem,
stochastic gradient descent (SGD) can be used instead. Only a small subset (a
mini-batch) of the training data is used to compute a coarse estimate of the batch
gradient ∇_θ L(θ) at each iteration of the gradient update. Interestingly, it has recently
been observed that such stochasticity is often beneficial in helping deep neural networks
generalize better [75, 99, 193].
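A minimal mini-batch SGD loop for a one-parameter linear model y ≈ wx is sketched below. The data, batch size, learning rate, and function names are assumptions for this example. Each update uses only the gradient of the mean squared error over the current mini-batch.

```python
import random


def minibatch_gradient(w, batch):
    """Gradient of the mean squared error of y ≈ w * x over one mini-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def sgd(data, w0, lr, epochs, batch_size, rng):
    """Shuffle the data every epoch and take one gradient step per mini-batch."""
    w = w0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            w -= lr * minibatch_gradient(w, batch)
    return w


rng = random.Random(0)
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 21)]  # noiseless y = 3x
w_hat = sgd(data, w0=0.0, lr=0.1, epochs=50, batch_size=4, rng=rng)
```

Each step touches only four of the twenty examples, yet the estimate still converges to the generating slope on this noiseless toy problem.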
Optimization is a crucial aspect of machine learning. Better and faster opti-
mization algorithms enable us to obtain models of better qualities. Many methods
have been proposed to improve the vanilla gradient descent algorithm discussed
above. For instance, SGD with momentum [173] was proposed as a way to speed
up the convergence of SGD. It helps accelerate SGD in the relevant direction and
dampens oscillations in the unwanted directions. This is done by adding a fraction
γ of the update vector of the past time step to the current gradient step:
v_i = γ v_{i−1} + η∇_θ L(θ_{i−1}) (1.6)
θ_i = θ_{i−1} − v_i. (1.7)
There are many other more recently proposed algorithms for optimization. Some
of the popular ones include Adam, RMSprop, and AdaGrad [109,185].
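The momentum update of Eqs. 1.6 and 1.7 can be sketched as follows. The ill-conditioned quadratic loss, the starting point, and the hyper-parameter values are hypothetical and chosen only to make the effect of momentum visible. Setting gamma to zero recovers plain gradient descent.

```python
def sgd_momentum(grad, theta0, lr, gamma, steps):
    """Eqs. 1.6-1.7: v_i = gamma * v_{i-1} + lr * grad(theta_{i-1}); theta_i = theta_{i-1} - v_i."""
    theta = list(theta0)
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        velocity = [gamma * v + lr * gi for v, gi in zip(velocity, g)]
        theta = [t - v for t, v in zip(theta, velocity)]
    return theta


def grad_ill_conditioned(theta):
    """Gradient of L(theta) = 0.5 * (theta_0^2 + 50 * theta_1^2)."""
    return [theta[0], 50.0 * theta[1]]


start = [10.0, 1.0]
plain = sgd_momentum(grad_ill_conditioned, start, lr=0.02, gamma=0.0, steps=100)
with_momentum = sgd_momentum(grad_ill_conditioned, start, lr=0.02, gamma=0.9, steps=100)
```

On this toy problem the learning rate is limited by the steep direction, so plain gradient descent makes slow progress along the flat direction. With gamma=0.9 the accumulated velocity brings the flat coordinate much closer to the minimum in the same number of steps.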
1.1.4 Convolutional Neural Networks
The primary focus of this thesis is on computer vision and medical imaging tasks.
Within this domain, a group of neural networks called the convolutional neural
networks (CNNs) has achieved tremendous success [68]. Unlike feed-forward dense
neural networks, CNNs typically have a smaller number of parameters and thus
are easier to optimize. They are a specialized kind of neural network for processing
data that has a known grid-like topology. Being shift invariant or space invariant,
they are designed with a built-in inductive bias of natural images. At a high
level, the parameter efficiency is achieved by the shared-weight architecture of the
convolution kernels or filters that slide along input features and provide translation-
equivariant responses known as feature maps [125,164].
Mathematically, convolutional layers are defined by kernel matrices K. Suppose
we have an input I. Within each convolutional layer, the convolution operation
h(x, y, z) = (K * I)(x, y, z) = \sum_i \sum_j \sum_k I(x + i, y + j, z + k) K(i, j, k) (1.8)
is performed to produce the output of each convolutional layer. Together with
nonlinear activation functions such as ReLU, a simple CNN can then be
formed by stacking numerous convolutional layers together.
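The sliding-window sum of Eq. 1.8 can be sketched for a single-channel image, i.e. with the third index dropped. The function name, the restriction to positions where the kernel fits entirely inside the image, and the example kernel are assumptions of this illustration.

```python
def conv2d_valid(image, kernel):
    """h(x, y) = sum_i sum_j I(x + i, y + j) * K(i, j) over positions where K fits."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[x + i][y + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for y in range(out_w)]
            for x in range(out_h)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
edge_kernel = [[1, -1]]  # horizontal difference filter (hypothetical)
feature_map = conv2d_valid(image, edge_kernel)
```

The same two kernel weights are reused at every spatial position, which is the weight sharing that makes convolutional layers parameter-efficient.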
Following the tremendous initial success of CNNs for solving computer vision
tasks [112], numerous efforts have been made to design better CNNs that are easier
to train, converge faster, and generalize better. Some of the popular approaches
include the use of skip connections in such CNNs [80, 89, 190]. More recently, re-
searchers have started exploring the possibility of casting CNN architecture design
itself as an optimization problem to automatically obtain high-performing CNNs
without the need for human heuristics [40].
1.2 Challenges
Despite the recent success in deep learning [112, 124], there remain many chal-
lenges that need to be overcome for it to be safely adopted for many real-world
applications. One such obstacle is the reliability of deep learning. Indeed, in
addition to applications like recommendation systems [235] in which safety and
reliability are less of a concern, deep learning is being applied to many sensitive
domains like medical image diagnosis [41, 175, 178] and autonomous driving sys-
tems [67]. Reliability of deep learning is crucial for trustworthy application in these
aforementioned domains. Nevertheless, modern neural networks have several sig-
nificant shortcomings that can severely hamper such dependable adoption of deep
learning systems. In this thesis, we focus on two such aspects.
1.2.1 Vulnerability to Noises
First of all, modern neural networks are prone to noises present in data. For
example, it has been shown that adversarially perturbing the inputs can easily
fool a trained neural network model and significantly alter the prediction outcome
of models [64, 228]. Moreover, even naturally present noises can be very harmful
to the generalizability of neural network systems [81]. Such vulnerability to noises
in inputs poses significant risks.
While vulnerability to input noises is a significant topic worth further studying,
in this thesis, we focus on noises present in the labels of the dataset. In today’s big
data era, noises in labels are arguably inevitable in most scenarios. To exemplify, a
recent study has found that even some of the most widely used benchmark datasets
like CIFAR10 and ImageNet contain numerous erroneously labeled samples [160].
As we will further discuss in Chapter 2, as a result of the expressivity of modern
neural networks, such noises present in the dataset can significantly hamper the
generalization performance of learned models. As such, devising algorithms that
enable us to train neural networks reliably in the presence of such noises in labels
can be extremely beneficial for real-world applications.
1.2.2 Unreliable Model Confidence
In addition to accurate test-time predictions, neural networks also need to produce
reliable model confidence intervals or uncertainty estimates. This is especially im-
portant for sensitive application domains like medical image analysis. For instance,
in order for physicians to safely interpret the predictions made by a neural network
system and make corresponding treatment plans for potential cancer patients, the
model also needs to produce a trustworthy confidence interval on the predictions
made. As we will discuss in Chapter 3, this is exactly what a modern neural
network system lacks. Typically, the uncertainty estimates produced are
overconfident and uninformative [50].
1.3 Contributions
In this thesis, we propose several novel methodologies to tackle the aforementioned
challenges associated with deep learning. Here, we give a high-level summary of
the methodologies.
In Chapter 2, we discuss a novel theoretically grounded set of noise-robust
loss functions to tackle the problem of label-noise robust learning of deep neural
networks. The proposed loss functions can be readily applied with any existing
neural network architectures and algorithms.
Recently, a Bayesian perspective has suggested that dropout regularization can
be employed to obtain better probabilistic predictions at test time [51]. The au-
thors of the paper termed the method Monte Carlo dropout. In Chapter 3, we
take a step further and explore the use of various structured dropout techniques
to further improve calibration of predictions. We also propose an omnibus dropout
strategy that combines various structured dropout methods. We demonstrate that
using structured dropout yields predictions with consistently better model confidence
calibration.
Monte Carlo (MC) dropout [51] is a simple and efficient ensembling method
that can improve the accuracy and confidence calibration of high-capacity deep
neural network models. However, MC dropout is not as effective as more compute-
intensive methods such as deep ensembles [120]. To bridge the gap between MC
dropout and deep ensembles, we propose in Chapter 4 a simple, pruning-based ap-
proach to compute non-overlapping dropout masks, which allows us to compute an
ensemble of subnetworks. The proposed method can be seen as a computationally
efficient alternative to deep ensembles.
As we will discuss in Chapter 5, another shortcoming of MC dropout is that it
requires multiple forward passes through the network during inference and therefore
can be too resource-intensive to be deployed in real-time applications. To reduce
this latency, we leverage a concept called knowledge distillation and propose a simple,
easy-to-optimize method for learning the conditional predictive distribution
of a pre-trained dropout model. This allows us to obtain sample-free uncertainty
estimation in computer vision tasks, thereby significantly reducing the inference
cost.
In addition to providing fast sample-free uncertainty estimation, we empirically
observe that knowledge distillation also helps improve the generalization performance
of neural networks. In Chapter 6, we provide a deeper understanding of
the observed phenomenon. Specifically, we offer a new interpretation for teacher-student
training as amortized MAP estimation, such that teacher predictions enable
instance-specific regularization. Our framework allows us to theoretically
relate self-distillation to label smoothing, a commonly used technique that regu-
larizes predictive uncertainty, and suggests the importance of predictive diversity
in addition to predictive uncertainty.
Finally, in Chapter 7, we apply a neural network to a real-world histopathology
image classification task, and demonstrate that a carefully trained neural network
system can perform on par with pathologists with years of training.
CHAPTER 2
LABEL-NOISE ROBUST LEARNING OF NEURAL NETWORKS
WITH GENERALIZED CROSS ENTROPY LOSS
As discussed in Chapter 1, deep neural networks (DNNs) have achieved tremen-
dous success in a variety of applications across many disciplines. Yet, their superior
performance comes with the expensive cost of requiring correctly annotated large-
scale datasets. Moreover, due to DNNs’ rich capacity, errors in training labels
can hamper performance. To combat this problem, mean absolute error (MAE)
has recently been proposed as a noise-robust alternative to the commonly-used
categorical cross entropy (CCE) loss. However, as we show in this chapter, MAE
can perform poorly with DNNs and challenging datasets. In this chapter, we
present a theoretically grounded set of noise-robust loss functions that can be
seen as a generalization of MAE and CCE. Proposed loss functions can be readily
applied with any existing DNN architecture and algorithm, while yielding good
performance in a wide range of noisy label scenarios. We report results from ex-
periments conducted with CIFAR-10, CIFAR-100 and FASHION-MNIST datasets
and synthetically generated noisy labels.
2.1 Introduction
The resurrection of neural networks in recent years, together with the recent emer-
gence of large scale datasets, has enabled super-human performance on many clas-
sification tasks [112, 151, 159]. However, supervised DNNs often require a large
number of training samples to achieve a high level of performance. For instance,
the ImageNet dataset [35] has 3.2 million hand-annotated images. Although crowd-
sourcing platforms like Amazon Mechanical Turk have made large-scale annotation
possible, some error during the labeling process is often inevitable, and mislabeled
samples can impair the performance of models trained on these data. Indeed,
the sheer capacity of DNNs to memorize massive data with completely randomly
assigned labels [231] proves their susceptibility to overfitting when trained with
noisy labels. Hence, an algorithm that is robust against noisy labels for DNNs is
needed to resolve the potential problem. Furthermore, when examples are cheap
and accurate annotations are expensive, it can be more beneficial to have datasets
with more but noisier labels than fewer but more accurate labels [105].
Classification with noisy labels is a widely studied topic [48]. Yet, relatively
little attention is given to directly formulating a noise-robust loss function in the
context of DNNs. Our work is motivated by Ghosh et al. [59] who theoretically
showed that mean absolute error (MAE) can be robust against noisy labels under
certain assumptions. However, as we demonstrate below, the robustness of MAE
can concurrently cause increased difficulty in training, and lead to performance
drop. This limitation is particularly evident when using DNNs on complicated
datasets. To combat this drawback, we advocate the use of a more general class
of noise-robust loss functions, which encompass both MAE and CCE. Compared
to previous methods for DNNs, which often involve extra steps and algorithmic
modifications, changing only the loss function requires minimal intervention to
existing architectures and algorithms, and thus can be promptly applied. Fur-
thermore, unlike most existing methods, the proposed loss functions work for both
closed-set and open-set noisy labels [213]. Open-set refers to the situation where
samples associated with erroneous labels do not always belong to a ground truth
class contained within the set of known classes in the training data. Conversely,
closed-set means that all labels (erroneous and correct) come from a known set of
labels present in the dataset.
The main contributions of this chapter are two-fold. First, we propose a novel
generalization of CCE and present a theoretical analysis of proposed loss func-
tions in the context of noisy labels. And second, we report a thorough empir-
ical evaluation of the proposed loss functions using CIFAR-10, CIFAR-100 and
FASHION-MNIST datasets, and demonstrate significant improvement in terms of
classification accuracy over the baselines of MAE and CCE, under both closed-set
and open-set noisy labels.
The rest of the chapter is organized as follows. Section 2.2 discusses existing
approaches to the problem. Section 2.3 introduces our noise-robust loss functions.
Section 2.4 presents and analyzes the experiments and results.
2.2 Related Work
Numerous methods have been proposed for learning with noisy labels with DNNs
in recent years. Here, we briefly review the relevant literature. Firstly, Sukhbaatar
and Fergus [197] proposed accounting for noisy labels with a confusion matrix so
that the cross entropy loss becomes
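L(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log p(\tilde{y} = \tilde{y}_n | x_n, \theta) = -\frac{1}{N} \sum_{n=1}^{N} \log\Big( \sum_{i}^{c} p(\tilde{y} = \tilde{y}_n | y = i) \, p(y = i | x_n, \theta) \Big), (2.1)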
where c represents the number of classes, ỹ represents noisy labels, y represents the
latent true labels and p(ỹ = ỹ_n | y = i) is the (ỹ_n, i)-th component of the confusion
matrix. Usually, the real confusion matrix is unknown. Several methods have
been proposed to estimate it [61, 72, 82, 100, 168]. Yet, accurate estimations can
be hard to obtain. Even with the real confusion matrix, training with the above
loss function might be suboptimal for DNNs. Assuming (1) a DNN with enough
capacity to memorize the training set, and (2) a confusion matrix that is diagonally
dominant, minimizing the cross entropy with confusion matrix is equivalent to
minimizing the original CCE loss. This is because the right hand side of Eq. 2.1
is minimized when p(y = i | x_n, θ) = 1 for i = ỹ_n and 0 otherwise, ∀ n.
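The loss in Eq. 2.1 and the argument above can be checked numerically with a small sketch. The function name, the 0-indexed labels, and the two-class confusion matrix below are assumptions for illustration. Following the text, `confusion[k][i]` stores p(ỹ = k | y = i).

```python
import math


def noisy_label_nll(probs, noisy_labels, confusion):
    """Eq. 2.1: -(1/N) sum_n log( sum_i p(noisy = y~_n | true = i) * p(true = i | x_n) ).

    probs[n][i]     : model output p(y = i | x_n, theta)
    confusion[k][i] : p(noisy label = k | true label = i)
    """
    total = 0.0
    for p_n, y_tilde in zip(probs, noisy_labels):
        marginal = sum(confusion[y_tilde][i] * p_n[i] for i in range(len(p_n)))
        total += math.log(marginal)
    return -total / len(probs)


# Hypothetical diagonally dominant confusion matrix for two classes.
confusion = [[0.8, 0.2],
             [0.2, 0.8]]
noisy_labels = [0, 1]
loss_memorized = noisy_label_nll([[1.0, 0.0], [0.0, 1.0]], noisy_labels, confusion)
loss_uniform = noisy_label_nll([[0.5, 0.5], [0.5, 0.5]], noisy_labels, confusion)
```

Putting all probability mass on the observed noisy label gives the lower loss here, which illustrates why a sufficiently expressive network can still end up memorizing the noisy labels under this objective.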
In the context of support vector machines, several theoretically motivated noise-
robust loss functions like the ramp loss, the unhinged loss and the savage loss have
been introduced [18, 146, 207]. More generally, Natarajan et al. [155] presented a
way to modify any given surrogate loss function for binary classification to achieve
noise-robustness. However, little attention is given to alternative noise robust loss
functions for DNNs. Ghosh et al. [59, 60] proved and empirically demonstrated
that MAE is robust against noisy labels. This chapter can be seen as an extension
and generalization of their work.
Another popular approach attempts to clean up noisy labels. Veit et al. [208]
suggested using a label cleaning network in parallel with a classification network to
achieve more noise-robust prediction. However, their method requires a small set
of clean labels. Alternatively, one could gradually replace noisy labels by neural
network predictions [179, 200]. Rather than using predictions for training, North-
cutt et al. [161] offered to prune the correct samples based on softmax outputs. As
we demonstrate below, this is similar to one of our approaches. Instead of pruning
the dataset once, our algorithm iteratively prunes the dataset while training until
convergence.
Other approaches include treating the true labels as a latent variable and the
noisy labels as an observed variable so that EM-like algorithms can be used to
learn the true label distribution of the dataset [105,206,221]. Techniques to re-weight
confident samples have also been proposed. Jiang et al. [97] used an LSTM network
on top of a classification model to learn the optimal weights on each sample, while
Ren et al. [181] used a small clean dataset and put more weights on noisy samples
which have gradients closer to that of the clean dataset. In the context of binary
classification, Liu et al. [129] derived an optimal importance weighting scheme
for noise-robust classification. Our method can also be viewed as re-weighting
individual samples; instead of explicitly obtaining weights, we use the softmax
outputs at each iteration as the weightings. Lastly, Azadi et al. [3] proposed a reg-
ularizer that encourages the model to select reliable samples for noise-robustness.
Another method that uses knowledge distillation for noisy labels has also been
proposed [128]. Both of these methods also require a smaller clean dataset to
work.
2.3 Generalized Cross Entropy Loss for Noise-Robust
Classifications
2.3.1 Preliminaries
We consider the problem of c-class classification. Let X ⊂ R^d be the feature space
and Y = {1, · · · , c} be the label space. In an ideal scenario, we are given a clean
dataset D = {(x_i, y_i)}_{i=1}^n, where each (x_i, y_i) ∈ (X × Y). A classifier is a function
that maps the input feature space to the label space, f : X → R^c. In this chapter, we
consider the common case where the function is a DNN with a softmax output
layer. For any loss function L, the (empirical) risk of the classifier f is defined as
R_L(f) = E_D[L(f(x), y_x)], where the expectation is over the empirical distribution.
The most commonly used loss for classification is cross entropy. In this case, the
risk becomes:
R_L(f) = E_D[L(f(x; \theta), y_x)] = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} y_{ij} \log f_j(x_i; \theta), (2.2)
where θ is the set of parameters of the classifier, y_{ij} corresponds to the j-th
element of the one-hot encoded label of the sample x_i, y_i = e_{y_i} ∈ {0, 1}^c such that
1^T y_i = 1 ∀ i, and f_j denotes the j-th element of f. Note that \sum_{j=1}^{c} f_j(x_i; θ) = 1
and f_j(x_i; θ) ≥ 0, ∀ j, i, θ, since the output layer is a softmax. The parameters of
the DNN can be optimized with empirical risk minimization.
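A minimal sketch of the softmax output layer and the cross entropy risk of Eq. 2.2 follows. The logits stand in for the pre-softmax outputs of a DNN and, like the function names and the 0-indexed labels, are assumptions of this example. With one-hot labels, the double sum collapses to the negative log-probability of the labeled class.

```python
import math


def softmax(logits):
    """Map raw scores to non-negative values that sum to one."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cce_risk(batch_logits, labels):
    """Eq. 2.2 with one-hot labels: -(1/n) sum_i log f_{y_i}(x_i)."""
    total = 0.0
    for logits, y in zip(batch_logits, labels):
        total += -math.log(softmax(logits)[y])
    return total / len(labels)


outputs = softmax([2.0, 1.0, -1.0])
risk = cce_risk([[0.0, 0.0, 0.0]], [1])
```

For uniform outputs over three classes the risk equals log 3, the value obtained by a classifier that carries no information about the label.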
We denote a dataset with label noise by D_η = {(x_i, ỹ_i)}_{i=1}^n, where the ỹ_i's are
the noisy labels with respect to each sample such that p(ỹ_i = k | y_i = j, x_i) =
η_{jk}^{(x_i)}. In this chapter, we make the common assumption that noise is conditionally
independent of inputs given the true labels so that
p(ỹ_i = k | y_i = j, x_i) = p(ỹ_i = k | y_i = j) = η_{jk}.
In general, this noise is defined to be class dependent. Noise is uniform with noise
rate η if η_{jk} = 1 − η for j = k, and η_{jk} = η / (c − 1) for j ≠ k. The risk of the classifier with
respect to the noisy dataset is then defined as R_L^η(f) = E_{D_η}[L(f(x), ỹ_x)].
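The uniform noise model can be sketched by building the matrix of η_jk values and sampling a noisy label for each clean label from the corresponding row. The function names, the number of classes, and the noise rate below are hypothetical, and labels are 0-indexed in the code.

```python
import random


def uniform_noise_matrix(num_classes, eta):
    """eta_jk = 1 - eta if j == k, else eta / (c - 1)."""
    off_diagonal = eta / (num_classes - 1)
    return [[1.0 - eta if j == k else off_diagonal for k in range(num_classes)]
            for j in range(num_classes)]


def corrupt_labels(labels, noise_matrix, rng):
    """Sample a noisy label for each true label j from row j of the matrix."""
    classes = list(range(len(noise_matrix)))
    return [rng.choices(classes, weights=noise_matrix[y])[0] for y in labels]


rng = random.Random(0)
eta_matrix = uniform_noise_matrix(num_classes=10, eta=0.4)
clean = [i % 10 for i in range(20000)]
noisy = corrupt_labels(clean, eta_matrix, rng)
observed_rate = sum(c != n for c, n in zip(clean, noisy)) / len(clean)
```

The fraction of flipped labels concentrates around the chosen noise rate. Class-dependent noise could be simulated with the same sampler by supplying a matrix whose rows differ.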
Let f* be the global minimizer of the risk R_L(f). Then, the empirical risk
minimization under loss function L is defined to be noise tolerant [145] if f* is a
global minimum of the noisy risk R_L^η(f).
A loss function is called symmetric if, for some constant C,
\sum_{j=1}^{c} L(f(x), j) = C, ∀ x ∈ X, ∀ f. (2.3)
The main contribution of Ghosh et al. [60] is the proof that if the loss function
is symmetric and η < (c − 1)/c, then under uniform label noise, for any f, R_L^η(f*) −
R_L^η(f) ≤ 0. Hence, f* is also the global minimizer for R_L^η and L is noise tolerant.
Moreover, if R_L(f*) = 0, then L is also noise tolerant under class dependent noise.
Being a nonsymmetric and unbounded loss function, CCE is sensitive to label
noise. On the contrary, MAE, as a symmetric loss function, is noise robust. For
DNNs with a softmax output layer, MAE can be computed as:
L_MAE(f(x), e_j) = ||e_j − f(x)||_1 = 2 − 2 f_j(x). (2.4)
With this particular configuration of DNN, the proposed MAE loss is, up to a
constant of proportionality, the same as the unhinged loss L_unh(f(x), e_j) = 1 −
f_j(x) [207].
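The symmetry condition of Eq. 2.3 can be checked numerically for MAE and CCE on softmax-like probability vectors. The two probability vectors and the function names below are hypothetical, and labels are 0-indexed in the code.

```python
import math


def mae_loss(probs, label):
    """L_MAE(f(x), e_j) = ||e_j - f(x)||_1 for a probability vector f(x)."""
    return sum(abs((1.0 if k == label else 0.0) - p) for k, p in enumerate(probs))


def cce_loss(probs, label):
    """Categorical cross entropy for a single sample."""
    return -math.log(probs[label])


def sum_over_labels(loss, probs):
    """Left-hand side of Eq. 2.3: the loss summed over all possible labels."""
    return sum(loss(probs, j) for j in range(len(probs)))


peaked = [0.7, 0.2, 0.1]
flat = [1.0 / 3.0] * 3
mae_sums = (sum_over_labels(mae_loss, peaked), sum_over_labels(mae_loss, flat))
cce_sums = (sum_over_labels(cce_loss, peaked), sum_over_labels(cce_loss, flat))
```

Both MAE sums equal 2c − 2 = 4, independent of the prediction, whereas the CCE sums depend on the prediction, so CCE does not satisfy the symmetry condition.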
2.3.2 Lq Loss for Classification
In this section, we will argue that MAE has some drawbacks as a classification
loss function for DNNs, which are normally trained on large scale datasets us-
ing stochastic gradient based techniques. Let’s look at the gradient of the loss
functions:
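\sum_{i=1}^{n} \frac{\partial L(f(x_i; \theta), y_i)}{\partial \theta} =
\begin{cases}
\sum_{i=1}^{n} -\frac{1}{f_{y_i}(x_i; \theta)} \nabla_\theta f_{y_i}(x_i; \theta) & \text{for CCE} \\
\sum_{i=1}^{n} -\nabla_\theta f_{y_i}(x_i; \theta) & \text{for MAE/unhinged loss.}
\end{cases} (2.5)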
Thus, in CCE, samples with softmax outputs that are less congruent with provided
labels, and hence smaller f_{y_i}(x_i; θ) or larger 1/f_{y_i}(x_i; θ), are implicitly weighed
more than samples with predictions that agree more with provided labels in the
gradient update. This means that, when training with CCE, more emphasis is
20
put on difficult samples. This implicit weighting scheme is desirable for training
with clean data, but can cause overfitting to noisy labels. Conversely, since the
1/fy (xi;θ) term is absent in its gradient, MAE treats every sample equally, whichi
makes it more robust to noisy labels. However, as we demonstrate empirically, this
can lead to significantly longer training time before convergence. Moreover, with-
out the implicit weighting scheme to focus on challenging samples, the stochasticity
involved in the training process can make learning difficult. As a result, classifica-
tion accuracy might suffer.
To demonstrate this, we conducted a simple experiment using ResNet [79]
optimized with the default setting of Adam [109] on the CIFAR datasets [111]. Fig.
2.1(a) shows the test accuracy curve when trained with CCE and MAE respectively
on CIFAR-10. As illustrated clearly, it took significantly longer to converge when
trained with MAE. In agreement with our analysis, there was also a compromise
in classification accuracy due to the increased difficulty of learning useful features.
These adverse effects become much more severe when using a more difficult dataset,
such as CIFAR-100 (see Fig. 2.1(b)). Not only do we observe significantly slower
convergence, but also a substantial drop in test accuracy when using MAE. In fact,
the maximum test accuracy achieved after 2000 epochs, a long time after training
using CCE has converged, was 38.29%, while CCE achieved a higher accuracy
of 39.92% after merely 7 epochs! Despite its theoretical noise-robustness, because
of the training difficulties that accompany this robustness, we conclude that
MAE is not suitable for DNNs with challenging datasets like ImageNet.
Figure 2.1: (a), (b) Test accuracy against number of epochs for training with
CCE (orange) and MAE (blue) loss on clean data with (a) CIFAR-10 and (b)
CIFAR-100 datasets. (c) Average softmax prediction for correctly (solid) and
wrongly (dashed) labeled training samples, for CCE (orange) and Lq (q = 0.7,
blue) loss on CIFAR-10 with uniform noise (η = 0.4).

To exploit the benefits of both the noise-robustness provided by MAE and the
implicit weighting scheme of CCE, we propose using the negative Box-Cox
transformation [15] as a loss function:
\[
L_q(f(x), e_j) = \frac{1 - f_j(x)^q}{q}, \tag{2.6}
\]
where q ∈ (0, 1]. Using L’Hôpital’s rule, it can be shown that the proposed loss
function converges to CCE in the limit q → 0, and becomes MAE/unhinged
loss when q = 1. Hence, this loss is a generalization of CCE and MAE. Relatedly,
Ferrari and Yang [44] viewed the maximization of Eq. 2.6 as a generalization of
maximum likelihood and termed the loss function Lq, which we also adopt.
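The two limiting cases of Eq. 2.6 are easy to confirm numerically. The snippet below is a minimal sketch with a hypothetical function name `lq_loss`; it evaluates the loss for a very small q, which should approach CCE, and for q = 1, which should equal the unhinged loss 1 − f_j(x).

```python
import math


def lq_loss(f_j, q):
    """L_q loss of Eq. 2.6 for the softmax output f_j of the labeled class."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    return (1.0 - f_j ** q) / q


f_j = 0.3
cce_value = -math.log(f_j)   # categorical cross entropy for the same output
unhinged_value = 1.0 - f_j   # unhinged loss, i.e. half of the MAE in Eq. 2.4

near_cce = lq_loss(f_j, 1e-6)  # q close to 0
at_one = lq_loss(f_j, 1.0)     # q = 1

print(near_cce, cce_value)
print(at_one, unhinged_value)
```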
Theoretically, for any input x, the sum of Lq loss with respect to all classes is
bounded by:
\[
\frac{c - c^{(1-q)}}{q} \le \sum_{j=1}^{c} \frac{1 - f_j(x)^q}{q} \le \frac{c - 1}{q}. \tag{2.7}
\]
Using this bound and under uniform noise with η ≤ 1 − 1/c, we can show (see
Appendix)
\[
A \le R_{L_q}(f^*) - R_{L_q}(\hat{f}) \le 0, \tag{2.8}
\]
where A = η[1 − c^(1−q)] / (q(c − 1 − ηc)) < 0, f∗ is the global minimizer of R_{Lq}(f),
and f̂ is the global minimizer of R^η_{Lq}(f). The larger the q, the larger the constant A,
and the tighter the bound of Eq. 2.8. In the extreme case of q = 1 (i.e., for MAE),
A = 0 and R_{Lq}(f̂) = R_{Lq}(f∗). In other words, for q values approaching 1, the optimum
of the noisy risk will yield a risk value (on the clean data) that is close to that of f∗,
which implies noise tolerance. It can also be shown that the difference
(R^η_{Lq}(f∗) − R^η_{Lq}(f̂)) is bounded under class dependent noise, provided
R_{Lq}(f∗) = 0 and q_ij < q_ii ∀ i ≠ j (see Thm 2 in Appendix).
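The bound of Eq. 2.7 and the behavior of the constant A can be explored numerically. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the softmax vectors are random normalized draws rather than network outputs, and A is evaluated for η strictly below 1 − 1/c so that the denominator is positive.

```python
import random


def lq_sum(probs, q):
    """Sum of the L_q loss over all classes for one softmax vector."""
    return sum((1.0 - p ** q) / q for p in probs)


def lq_sum_bounds(c, q):
    """Lower and upper bounds of Eq. 2.7."""
    return (c - c ** (1.0 - q)) / q, (c - 1.0) / q


def noise_constant_A(eta, c, q):
    """Constant A of Eq. 2.8; assumes eta < 1 - 1/c."""
    return eta * (1.0 - c ** (1.0 - q)) / (q * (c - 1.0 - eta * c))


def random_softmax(c, rng):
    """Random probability vector (a stand-in for a softmax output)."""
    raw = [rng.random() + 1e-12 for _ in range(c)]
    total = sum(raw)
    return [r / total for r in raw]


rng = random.Random(0)
lower, upper = lq_sum_bounds(c=10, q=0.7)
sums = [lq_sum(random_softmax(10, rng), 0.7) for _ in range(5)]
print(lower, sums, upper)

for q in (0.3, 0.7, 1.0):
    print(q, noise_constant_A(eta=0.4, c=10, q=q))
```

The printed values of A move toward zero as q grows, in line with the statement that larger q gives a tighter bound.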
The compromise on noise-robustness when using Lq over MAE prompts an
easier learning process. Let’s look at the gradients of Lq loss to see this:
\[
\frac{\partial L_q(f(x_i;\theta), y_i)}{\partial \theta}
= f_{y_i}(x_i;\theta)^q \left( -\frac{1}{f_{y_i}(x_i;\theta)} \nabla_\theta f_{y_i}(x_i;\theta) \right)
= -f_{y_i}(x_i;\theta)^{q-1} \nabla_\theta f_{y_i}(x_i;\theta),
\]
where f_{y_i}(x_i; θ) ∈ [0, 1] ∀ i and q ∈ (0, 1). Thus, relative to CCE, Lq loss weighs
each sample by an additional f_{y_i}(x_i; θ)^q so that less emphasis is put on samples with
weak agreement between softmax outputs and the labels, which should improve
robustness against noise. Relative to MAE, a weighting of f_{y_i}(x_i; θ)^(q−1) on each
sample can facilitate learning by giving more attention to challenging datapoints
with labels that do not agree with the softmax outputs. On one hand, larger q
leads to a more noise-robust loss function. On the other hand, too large of a q can
make optimization strenuous. Hence, as we will demonstrate empirically below,
it is practically useful to set q between 0 and 1, where a tradeoff equilibrium is
achieved between noise-robustness and better learning dynamics.
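To make the implicit weighting concrete, the per-sample factor f_{y_i}(x_i; θ)^(q−1) that multiplies −∇θ f_{y_i}(x_i; θ) can be tabulated for a few softmax values. This is a small sketch with a hypothetical helper name; q = 0 is used here only as the limiting CCE weight 1/f, and q = 1 gives the MAE/unhinged weight of 1.

```python
def sample_weight(f_y, q):
    """Factor multiplying -grad f_y in the per-sample gradient: f_y^(q-1).

    q = 0 gives the CCE weight 1/f_y; q = 1 gives the MAE/unhinged weight 1.
    """
    return f_y ** (q - 1.0)


for f_y in (0.05, 0.5, 0.95):
    weights = {q: round(sample_weight(f_y, q), 3) for q in (0.0, 0.7, 1.0)}
    print("f_y =", f_y, "weights by q:", weights)
```

For a poorly fit sample (small f_y), the weight under q = 0.7 lies between the large CCE weight and the constant MAE weight, which is the tradeoff described above.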
2.3.3 Truncated Lq Loss
Since a tighter bound on \sum_{j=1}^{c} L(f(x), j) would imply stronger noise tolerance, we
propose the truncated Lq loss:
\[
L_{trunc}(f(x), e_j) =
\begin{cases}
L_q(k) & \text{if } f_j(x) \le k \\
L_q(f(x), e_j) & \text{if } f_j(x) > k
\end{cases}
\tag{2.9}
\]
where 0 < k < 1, and Lq(k) = (1− kq)/q. Note that, when k → 0, the truncated
Lq loss becomes the normal Lq loss. Assuming k ≥ 1/c, the sum of truncated Lq
loss with respect to all classes is bounded by (see Appendix):
\[
d\,L_q\!\left(\tfrac{1}{d}\right) + (c - d)\,L_q(k) \le \sum_{j=1}^{c} L_{trunc}(f(x), e_j) \le c\,L_q(k), \tag{2.10}
\]
where d = max(1, (1 − q)^(1/q) / k). It can be verified that the difference between upper
and lower bounds for the truncated Lq loss, Lq(k), is smaller than that for the Lq
loss of Eq. 2.7, if
\[
d\left[L_q(k) - L_q\!\left(\tfrac{1}{d}\right)\right] < \frac{c^{(1-q)} - 1}{q}. \tag{2.11}
\]
As an example, when k ≥ 0.3, the above inequality is satisfied for all q and c.
When k ≥ 0.2, the inequality is satisfied for all q and c ≥ 10. Since the derived
bounds in Eq. 2.7 and Eq. 2.10 are tight, introducing the threshold k can thus lead
to a more noise tolerant loss function.
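The truncated loss of Eq. 2.9 and the condition of Eq. 2.11 can be written down directly. The following is a minimal sketch with hypothetical helper names, intended only to show how the quantities fit together; it evaluates the condition for individual (q, k, c) triples and does not replace the general argument.

```python
def lq(value, q):
    """L_q evaluated at a softmax value, (1 - value^q) / q."""
    return (1.0 - value ** q) / q


def truncated_lq(f_j, q, k):
    """Truncated L_q loss of Eq. 2.9: constant L_q(k) once f_j <= k."""
    return lq(k, q) if f_j <= k else lq(f_j, q)


def truncation_tightens_bound(q, k, c):
    """Evaluate the inequality of Eq. 2.11 for one choice of q, k and c."""
    d = max(1.0, (1.0 - q) ** (1.0 / q) / k)
    gap_truncated = d * (lq(k, q) - lq(1.0 / d, q))  # upper minus lower, Eq. 2.10
    gap_lq = (c ** (1.0 - q) - 1.0) / q              # upper minus lower, Eq. 2.7
    return gap_truncated < gap_lq


print(truncated_lq(0.2, q=0.7, k=0.5), truncated_lq(0.9, q=0.7, k=0.5))
print(truncation_tightens_bound(q=0.7, k=0.5, c=10))
```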
If the softmax output for the provided label is below a threshold, truncated Lq
loss becomes a constant. Thus, the loss gradient is zero for that sample, and it
does not contribute to learning dynamics. While Eq. 2.10 suggests that a larger
threshold k leads to tighter bounds and hence more noise-robustness, too large of
a threshold would precipitate too many discarded samples for training. Ideally, we
would want the algorithm to train with all available clean data and ignore noisy
labels. Thus the optimal choice of k would depend on the noise in the labels.
Hence, k can be treated as a (bounded) hyper-parameter and optimized. In our
experiments, we set k = 0.5, which yields a tighter bound for the truncated Lq loss
and which we observed to work well empirically.
A potential problem arises when training directly with this loss function. When
the threshold is relatively large (e.g., k = 0.5 in a 10-class classification problem),
at the beginning of the training phase, most of the softmax outputs can be sig-
nificantly smaller than k, resulting in a dramatic drop in the number of effective
samples. Moreover, it is suboptimal to prune samples based on softmax values at
the beginning of training. To circumvent the problem, observe that, by definition
of the truncated Lq loss:
\[
\operatorname*{argmin}_{\theta} \sum_{i=1}^{n} L_{trunc}(f(x_i;\theta), y_i)
= \operatorname*{argmin}_{\theta} \sum_{i=1}^{n} v_i L_q(f(x_i;\theta), y_i) + (1 - v_i) L_q(k), \tag{2.12}
\]
where v_i = 0 if f_{y_i}(x_i) ≤ k and v_i = 1 otherwise, and θ represents the parameters
of the classifier. Optimizing the above loss is the same as optimizing the following:
\[
\operatorname*{argmin}_{\theta} \sum_{i=1}^{n} v_i L_q(f(x_i;\theta), y_i) - v_i L_q(k)
= \operatorname*{argmin}_{\theta,\, w \in [0,1]^n} \sum_{i=1}^{n} w_i L_q(f(x_i;\theta), y_i) - L_q(k) \sum_{i=1}^{n} w_i, \tag{2.13}
\]
because for any θ, the optimal w_i is 1 if Lq(f(x_i; θ), y_i) ≤ Lq(k) and 0 if
Lq(f(x_i; θ), y_i) > Lq(k). Hence, we can optimize the truncated Lq loss by optimizing
the right hand side of Eq. 2.13. If Lq is convex with respect to the parameters θ,
optimizing Eq. 2.13 is a biconvex optimization problem, and the alternate convex
search (ACS) algorithm [7] can be used to find the global minimum. ACS iteratively
optimizes θ and w while keeping the other set of parameters fixed. Despite
the high non-convexity of DNNs, we can apply ACS to find a local minimum. We
refer to the update of w as "pruning". At every step of iteration, pruning can be
carried out easily by computing f(x_i; θ^(t)) for all training samples. Only samples
with f_{y_i}(x_i; θ^(t)) ≥ k and Lq(f(x_i; θ), y_i) ≤ Lq(k) are kept for updating θ during
that iteration (and hence w_i = 1). The additional computational complexity from
the pruning steps is negligible. Interestingly, the resulting algorithm is similar to
that of self-paced learning [117].

Algorithm 1: ACS for Training with Lq Loss
Input: Noisy dataset D_η, total iterations T, threshold k
Output: Optimized NN parameters θ
Initialize w_i^(0) = 1 ∀ i;
Update θ^(0) = argmin_θ \sum_{i=1}^{n} w_i^(0) Lq(f(x_i; θ), y_i) − Lq(k) \sum_{i=1}^{n} w_i^(0);
while t < T do
    Update w^(t) = argmin_w \sum_{i=1}^{n} w_i Lq(f(x_i; θ^(t−1)), y_i) − Lq(k) \sum_{i=1}^{n} w_i;   // Pruning Step
    Update θ^(t) = argmin_θ \sum_{i=1}^{n} w_i^(t) Lq(f(x_i; θ), y_i) − Lq(k) \sum_{i=1}^{n} w_i^(t)
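The alternating structure of Algorithm 1 can be sketched on a toy problem. The code below is an illustration under strong assumptions and is not the implementation used for the reported experiments: the classifier is a hypothetical 1-D logistic model rather than a DNN, the argmin over θ is approximated by a fixed number of full-batch gradient steps, and all function names are made up for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lq(value, q):
    return (1.0 - value ** q) / q


def label_prob(theta, x, y):
    """f_y(x; theta) for a 1-D logistic model with theta = (slope, bias)."""
    p1 = sigmoid(theta[0] * x + theta[1])
    return p1 if y == 1 else 1.0 - p1


def update_theta(theta, xs, ys, w, q, steps=200, lr=0.5):
    """Approximate argmin over theta of the weighted L_q loss (w held fixed)."""
    a, b = theta
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y, w_i in zip(xs, ys, w):
            p1 = sigmoid(a * x + b)
            f_y = p1 if y == 1 else 1.0 - p1
            sign = 1.0 if y == 1 else -1.0
            # dL_q/dz = -f_y^(q-1) * df_y/dz, with df_y/dz = sign * p1 * (1 - p1)
            dz = -w_i * f_y ** (q - 1.0) * sign * p1 * (1.0 - p1)
            grad_a += dz * x
            grad_b += dz
        a -= lr * grad_a / len(xs)
        b -= lr * grad_b / len(xs)
    return (a, b)


def prune(theta, xs, ys, q, k):
    """Pruning step: w_i = 1 if L_q(f_y) <= L_q(k), i.e. f_y >= k, else 0."""
    return [1.0 if lq(label_prob(theta, x, y), q) <= lq(k, q) else 0.0
            for x, y in zip(xs, ys)]


def acs(xs, ys, q=0.7, k=0.5, rounds=3):
    w = [1.0] * len(xs)                              # initialize all weights to 1
    theta = update_theta((0.0, 0.0), xs, ys, w, q)   # initial fit on all samples
    for _ in range(rounds):
        w = prune(theta, xs, ys, q, k)               # update w with theta fixed
        theta = update_theta(theta, xs, ys, w, q)    # update theta with w fixed
    return theta, w


# Toy data: the first sample (x = -3) carries a flipped label.
xs = [-3.0, -2.0, -1.5, -1.0, 1.0, 1.5, 2.0, 3.0]
ys = [1, 0, 0, 0, 1, 1, 1, 1]
theta, weights = acs(xs, ys)
print(theta, weights)
```

In this toy setting the mislabeled point receives weight zero after pruning, so later θ updates ignore it.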
2.4 Experiments
The following setup applies to all of the experiments conducted. Noisy datasets
were produced by artificially corrupting true labels. 10% of the training data was
retained for validation. To realistically mimic a noisy dataset while justifiably
analyzing the performance of the proposed loss function, only the training and
validation data were contaminated, and test accuracies were computed with respect
to true labels. A mini-batch size of 128 was used. All networks used ReLUs in
the hidden layers and softmax layers at the output. All reported experiments were
repeated five times with random initialization of neural network parameters and
randomly generated noisy labels each time. We compared the proposed functions
with CCE, MAE and also the confusion matrix-corrected CCE, as shown in Eq. 2.1.
Following [168], we term this "forward correction". All experiments were conducted
with identical optimization procedures and architectures, changing only the loss
functions.
2.4.1 Toward a Better Understanding of Lq Loss
To better grasp the behavior of Lq loss, we implemented different values of q and
uniform noise at different noise levels, and trained ResNet-34 with the default set-
ting of Adam on CIFAR-10. As shown in Fig. 2.2, when trained on clean dataset,
increasing q not only slowed down the rate of convergence, but also lowered the
classification accuracy. More interesting phenomena appeared when trained on
noisy data. When CCE (q = 0) was used, the classifier first learned predictive
patterns, presumably from the noise-free labels, before overfitting strongly to the
noisy labels, in agreement with Arpit et al.’s observations [2]. Training with in-
creased q values delayed overfitting and attained higher classification accuracies.
One interpretation of this behavior is that the classifier could learn more about
predictive features before overfitting. This interpretation is supported by our plot
of the average softmax values with respect to the correctly and wrongly labeled
samples on the training set for CCE and Lq (q = 0.7) loss, and with 40% uniform
noise (Fig. 2.1(c)). For CCE, the average softmax for wrongly labeled samples
remained small at the beginning, but grew quickly when the model started overfit-
ting. Lq loss, on the other hand, resulted in significantly smaller softmax values for
wrongly labeled data. This observation further serves as an empirical justification
for the use of truncated Lq loss as described in section 2.3.3.
We also observed that there was a threshold of q beyond which overfitting never
kicked in before convergence. When η = 0.2 for instance, training with Lq loss with
q = 0.8 produced an overfitting-free training process. Empirically, we noted that,
the noisier the data, the larger this threshold is. However, too large of a q hampers
the classification accuracy, and thus a larger q is not always preferred. In general,
q can be treated as a hyper-parameter that can be optimized, say via monitoring
validation accuracy. In remaining experiments, we used q = 0.7, which yielded a
good compromise between fast convergence and noise robustness (no overfitting
was observed for η ≤ 0.5).
Table 2.1: Average test accuracy and standard deviation (5 runs) on experiments
with closed-set noise. We report accuracies of the epoch where validation accuracy
is maximum. Forward T and T̂ represent forward correction with the true and esti-
mated confusion matrices, respectively [168]. q = 0.7 was used for all experiments
with Lq loss and truncated Lq loss. Best 2 accuracies are bold faced.
Columns 3 to 6: uniform noise at noise rate η. Columns 7 to 10: class dependent noise at noise rate η.

| Dataset | Loss function | Uniform, η = 0.2 | Uniform, η = 0.4 | Uniform, η = 0.6 | Uniform, η = 0.8 | Class dep., η = 0.1 | Class dep., η = 0.2 | Class dep., η = 0.3 | Class dep., η = 0.4 |
|---|---|---|---|---|---|---|---|---|---|
| FASHION-MNIST | CCE | 93.24 ± 0.12 | 92.09 ± 0.18 | 90.29 ± 0.35 | 86.20 ± 0.68 | 94.06 ± 0.05 | 93.72 ± 0.14 | 92.72 ± 0.21 | 89.82 ± 0.31 |
| FASHION-MNIST | MAE | 80.39 ± 4.68 | 79.30 ± 6.20 | 82.41 ± 5.29 | 74.73 ± 5.26 | 74.03 ± 6.32 | 63.03 ± 3.91 | 58.14 ± 0.14 | 56.04 ± 3.76 |
| FASHION-MNIST | Forward T | 93.64 ± 0.12 | 92.69 ± 0.20 | 91.16 ± 0.16 | 87.59 ± 0.35 | 94.33 ± 0.10 | 94.03 ± 0.11 | 93.91 ± 0.14 | 93.65 ± 0.11 |
| FASHION-MNIST | Forward T̂ | 93.26 ± 0.10 | 92.24 ± 0.15 | 90.54 ± 0.10 | 85.57 ± 0.86 | 94.09 ± 0.10 | 93.66 ± 0.09 | 93.52 ± 0.16 | 88.53 ± 4.81 |
| FASHION-MNIST | Lq | 93.35 ± 0.09 | 92.58 ± 0.11 | 91.30 ± 0.20 | 88.01 ± 0.22 | 93.51 ± 0.17 | 93.24 ± 0.14 | 92.21 ± 0.27 | 89.53 ± 0.53 |
| FASHION-MNIST | Trunc Lq | 93.21 ± 0.05 | 92.60 ± 0.17 | 91.56 ± 0.16 | 88.33 ± 0.38 | 93.53 ± 0.11 | 93.36 ± 0.07 | 92.76 ± 0.14 | 91.62 ± 0.34 |
| CIFAR-10 | CCE | 86.98 ± 0.44 | 81.88 ± 0.29 | 74.14 ± 0.56 | 53.82 ± 1.04 | 90.69 ± 0.17 | 88.59 ± 0.34 | 86.14 ± 0.40 | 80.11 ± 1.44 |
| CIFAR-10 | MAE | 83.72 ± 3.84 | 67.00 ± 4.45 | 64.21 ± 5.28 | 38.63 ± 2.62 | 82.61 ± 4.81 | 52.93 ± 3.60 | 50.36 ± 5.55 | 45.52 ± 0.13 |
| CIFAR-10 | Forward T | 88.63 ± 0.14 | 85.07 ± 0.29 | 79.12 ± 0.64 | 64.30 ± 0.70 | 91.32 ± 0.21 | 90.35 ± 0.26 | 89.25 ± 0.43 | 88.12 ± 0.32 |
| CIFAR-10 | Forward T̂ | 87.99 ± 0.36 | 83.25 ± 0.38 | 74.96 ± 0.65 | 54.64 ± 0.44 | 90.52 ± 0.26 | 89.09 ± 0.47 | 86.79 ± 0.36 | 83.55 ± 0.58 |
| CIFAR-10 | Lq | 89.83 ± 0.20 | 87.13 ± 0.22 | 82.54 ± 0.23 | 64.07 ± 1.38 | 90.91 ± 0.22 | 89.33 ± 0.17 | 85.45 ± 0.74 | 76.74 ± 0.61 |
| CIFAR-10 | Trunc Lq | 89.7 ± 0.11 | 87.62 ± 0.26 | 82.70 ± 0.23 | 67.92 ± 0.60 | 90.43 ± 0.25 | 89.45 ± 0.29 | 87.10 ± 0.22 | 82.28 ± 0.67 |
| CIFAR-100 | CCE | 58.72 ± 0.26 | 48.20 ± 0.65 | 37.41 ± 0.94 | 18.10 ± 0.82 | 66.54 ± 0.42 | 59.20 ± 0.18 | 51.40 ± 0.16 | 42.74 ± 0.61 |
| CIFAR-100 | MAE | 15.80 ± 1.38 | 9.03 ± 1.54 | 7.74 ± 1.48 | 3.76 ± 0.27 | 13.38 ± 1.84 | 11.50 ± 1.16 | 8.91 ± 0.89 | 8.20 ± 1.04 |
| CIFAR-100 | Forward T | 63.16 ± 0.37 | 54.65 ± 0.88 | 44.62 ± 0.82 | 24.83 ± 0.71 | 71.05 ± 0.30 | 71.08 ± 0.22 | 70.76 ± 0.26 | 70.82 ± 0.45 |
| CIFAR-100 | Forward T̂ | 39.19 ± 2.61 | 31.05 ± 1.44 | 19.12 ± 1.95 | 8.99 ± 0.58 | 45.96 ± 1.21 | 42.46 ± 2.16 | 38.13 ± 2.97 | 34.44 ± 1.93 |
| CIFAR-100 | Lq | 66.81 ± 0.42 | 61.77 ± 0.24 | 53.16 ± 0.78 | 29.16 ± 0.74 | 68.36 ± 0.42 | 66.59 ± 0.22 | 61.45 ± 0.26 | 47.22 ± 1.15 |
| CIFAR-100 | Trunc Lq | 67.61 ± 0.18 | 62.64 ± 0.33 | 54.04 ± 0.56 | 29.60 ± 0.51 | 68.86 ± 0.14 | 66.59 ± 0.23 | 61.87 ± 0.39 | 47.66 ± 0.69 |
Figure 2.2: The test accuracy and validation loss against number of epochs for
training with Lq loss at different values of q.
2.4.2 Datasets
CIFAR-10/CIFAR-100: ResNet-34 was used as the classifier optimized with the
loss functions mentioned above. Per-pixel mean subtraction, horizontal random flip
and 32×32 random crops after padding with 4 pixels on each side were performed as
data preprocessing and augmentation. Following [90], we used stochastic gradient
descent (SGD) with 0.9 momentum, a weight decay of 10−4 and learning rate of
0.01, and divided it by 10 after 40 and 80 epochs (120 in total) for CIFAR-10, and
after 80 and 120 (150 in total) for CIFAR-100. To ensure a fair comparison, the
identical optimization scheme was used for truncated Lq loss. We trained with the
entire dataset for the first 40 epochs for CIFAR-10 and 80 for CIFAR-100, and
started pruning and training with the pruned dataset afterwards. Pruning was
done every 10 epochs. To prevent overfitting, we used the model at the optimal
epoch based on maximum validation accuracy for pruning. Uniform noise was
generated by mapping a true label to a random label through uniform sampling.
Following Patrini et al. [168], class dependent noise was generated by mapping
TRUCK → AUTOMOBILE, BIRD → AIRPLANE, DEER → HORSE, and CAT
↔ DOG with probability η for CIFAR-10. For CIFAR-100, we simulated class-
dependent noise by flipping each class into the next circularly with probability
η.
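The label corruption procedures described above can be sketched as follows. This is a hedged illustration and not the code used for the experiments: the helper names are hypothetical, the class indices follow the standard CIFAR-10 ordering, and for uniform noise we assume that a corrupted label is drawn uniformly from the other c − 1 classes, which matches the uniform noise definition η_jk = η/(c − 1) given earlier.

```python
import random

# Standard CIFAR-10 class order: index 0 = airplane, ..., index 9 = truck.
CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]

# TRUCK -> AUTOMOBILE, BIRD -> AIRPLANE, DEER -> HORSE, CAT <-> DOG
CIFAR10_FLIPS = {9: 1, 2: 0, 4: 7, 3: 5, 5: 3}


def uniform_noise(labels, eta, num_classes, rng):
    """With probability eta, replace a label by a different class chosen uniformly."""
    noisy = []
    for y in labels:
        if rng.random() < eta:
            other = rng.randrange(num_classes - 1)
            y = other if other < y else other + 1  # skip the true class
        noisy.append(y)
    return noisy


def class_dependent_noise(labels, eta, flips, rng):
    """With probability eta, map a label to its designated confusable class."""
    return [flips[y] if y in flips and rng.random() < eta else y for y in labels]


def circular_flips(num_classes):
    """CIFAR-100 style mapping: each class flips into the next one circularly."""
    return {y: (y + 1) % num_classes for y in range(num_classes)}


rng = random.Random(0)
clean = [rng.randrange(10) for _ in range(10000)]
noisy = uniform_noise(clean, eta=0.4, num_classes=10, rng=rng)
observed_rate = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
print("observed uniform noise rate:", observed_rate)
```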
We also tested noise-robustness of our loss function on open-set noise using
CIFAR-10. For a direct comparison, we followed the identical setup as described
in [213]. For this experiment, the classifier was trained for only 100 epochs. We
observed validation loss plateaued after about 10 epochs, and hence started pruning
the data afterwards at 10-epoch intervals. The open-set noise was generated by
using images from the CIFAR-100 dataset. A random CIFAR-10 label was assigned
to these images.

Table 2.2: Average test accuracy on experiments with CIFAR-10. We replicated
the exact experimental setup as in [213]. The reported accuracies are the average
last epoch accuracies after training for 100 epochs. η = 40%. CCE, Forward and
method by Wang et al. are adapted for direct comparison.

| Noise type | CCE [213] | Forward [213] | Wang et al. [213] | MAE | Lq | Trunc Lq |
|---|---|---|---|---|---|---|
| CIFAR-10 + CIFAR-100 (open-set noise) | 62.92 | 64.18 | 79.28 | 75.06 | 71.10 | 79.55 |
| CIFAR-10 (closed-set noise) | 62.38 | 77.81 | 78.15 | 74.31 | 64.79 | 79.12 |
FASHION-MNIST: ResNet-18 was used. The identical data preprocessing,
augmentation, and optimization procedure as in CIFAR-10 was deployed for train-
ing. To generate a realistic class dependent noise, we used the t-SNE [138] plot
of the dataset to associate classes with similar embeddings, and mapped BOOT
→ SNEAKER, SNEAKER → SANDALS, PULLOVER → SHIRT, COAT ↔
DRESS with probability η.
2.4.3 Results and Discussion
Experimental results with closed-set noise are summarized in Table 2.1. For uniform
noise, proposed loss functions outperformed the baselines significantly, including
forward correction with the ground truth confusion matrices. In agreement with
our theoretical expectations, truncating the Lq loss enhanced results. For class
dependent noise, in general Forward T offered the best performance, as it relied
on the knowledge of the ground truth confusion matrix. Truncated Lq loss pro-
duced similar accuracies as Forward T̂ for FASHION-MNIST and better results
for CIFAR datasets, and outperformed the other baselines at most noise levels
for all datasets. While using Lq loss improved over baselines for CIFAR-100, no
improvements were observed for FASHION-MNIST and CIFAR-10 datasets. We
believe this is in part because very similar classes were grouped together for the
confusion matrices and consequently the DNNs might falsely put high confidence
on wrongly labeled samples.
In general, classification accuracy for both uniform and class dependent noise
would be further improved relative to baselines with optimized q and k values and
more training epochs. Based on the experimental results, we believe the proposed
approach would work well when correctly labeled data can be differentiated from
wrongly labeled data based on softmax outputs, which is often the case with large-
scale data and expressive models. We also observed that MAE performed poorly
for all datasets at all noise levels, presumably because DNNs like ResNet struggled
to optimize with MAE loss, especially on challenging datasets such as CIFAR-100.
Table 2.2 summarizes the results for open-set noise with CIFAR-10. Following
Wang et al. [213], we reported the last-epoch test accuracy after training for 100
epochs. We also repeated the closed-set noise experiment with their setup. Using
Lq loss noticeably prevented overfitting, and using truncated Lq loss achieved bet-
ter results than the state-of-the-art method for open-set noise reported in [213].
Moreover, our method is significantly easier to implement. Lastly, note that the
poor performance of Lq loss compared to MAE is due to the fact that test accu-
racy reported here is long after the model started overfitting, since a shallow CNN
without data augmentation was deployed for this experiment.
CHAPTER 3
IMPROVING CONFIDENCE CALIBRATION FOR
CONVOLUTIONAL NEURAL NETWORKS WITH STRUCTURED
DROPOUT
While neural networks achieve impressive classification accuracy across differ-
ent tasks, they can suffer from poor calibration of their predictions. A Bayesian
perspective has suggested that dropout, a regularization strategy often used dur-
ing training, can be employed to obtain better probabilistic predictions at test
time [51]. However, empirical results so far have not been encouraging, particu-
larly with convolutional networks. In this chapter, through the lens of ensemble
learning, we associate this unsatisfactory performance with the correlation be-
tween the models sampled with regular dropout. Motivated by this, we explore
the use of various structured dropout techniques to promote model diversity and
improve calibration of predictions. We also propose an omnibus dropout strategy
that combines various structured dropout methods. Using the SVHN, CIFAR-10
and CIFAR-100 datasets, we empirically demonstrate the superior performance of
omnibus dropout, and show its merit in a Bayesian active learning application.
3.1 Introduction
Deep neural networks (NNs) achieve state-of-the-art classification accuracy in
many applications. However, in real world scenarios, like medical diagnosis and
autonomous driving, reliable probabilistic predictions are often crucial and need
to be considered in assessing performance. Most modern NNs are trained with
maximum likelihood to produce point estimates that are often over-confident [70].
Bayesian techniques can be used with neural networks to obtain well-calibrated
predictions [139, 157]. Monte Carlo (MC) dropout [51], a cheap approximate in-
ference technique which obtains uncertainty by performing dropout [195] at test
time, is a popular Bayesian method for uncertainty estimates.
Despite its improvement, MC dropout can still produce over-confident pre-
dictions [121], particularly with convolutional architectures. In this chapter, we
propose a simple yet effective solution to this problem. Inspired by the recent
success of explicit ensembles of neural networks obtained using random initial-
izations [10], we reiterate the original notion of dropout as "an extreme form of
model combination with extensive parameter sharing" [195], and interpret MC
dropout as an ensemble of models. Borrowing machinery from ensemble learn-
ing, we then attribute the poor performance of MC dropout to its limited model
diversity compared to that of explicit ensembles. This perspective reveals how
structured dropout methods [58, 204] can improve performance by promoting di-
versity. While the importance of diversity has been demonstrated by others, prior
works consider explicit ensembles of different models. To the best of our knowl-
edge, this is the first chapter to examine structured dropout as a way to enhance
diversity in an ensemble obtained from a single model. As discussed below, we also
propose to combine different structured dropout methods, which we call omnibus
dropout. We empirically verify that omnibus dropout can yield models with su-
perior performance on the SVHN, CIFAR-10 and CIFAR-100 datasets compared
to not only MC dropout, but also some of the most widely adopted baselines like
temperature scaling [70]. Furthermore, we demonstrate its merit in a Bayesian
active learning experiment [54].
Summary of Contributions.
• We experimentally illustrate that the poor performance of standard MC
dropout is primarily attributable to lack of model diversity among predic-
tions.
• We show that using structured dropout can significantly boost model diver-
sity, and hence the calibration of NNs.
• We propose the omnibus dropout, a simple combination of different types
of structured dropout methods, to further enhance the performance of MC
dropout.
• We demonstrate the effectiveness of the proposed method with various bench-
mark datasets.
3.2 Related Work
Dropout was first introduced as a stochastic regularization technique for NNs [195].
Inspired by the success of dropout, numerous variants have been proposed [55,58,
65, 90, 191, 204, 211]. Unlike regular dropout, most of these methods drop parts
of the NNs in a structured manner. For instance, DropBlock [58] applies dropout
to small patches of the feature map in convolutional networks, SpatialDrop [204]
drops out entire channels, and Stochastic Depth Net [90] drops out entire ResNet
blocks. These methods were proposed to boost test time accuracy. In this chapter,
we show that these structured dropout techniques can be successfully applied to
obtain better uncertainty estimates as well.
Dropout can be thought of as performing approximate Bayesian inference [52]
and offer estimates of uncertainty. Many other approximate Bayesian inference
techniques have also been proposed for NNs [110, 134]. However, these meth-
ods can demand a sophisticated implementation, are often harder to scale, and
can suffer from sub-optimal performance [12]. Another popular alternative to ap-
proximate the intractable posterior is Markov Chain Monte Carlo (MCMC) [157].
More recently, stochastic gradient versions of MCMC were also proposed to allow
scalability [63,137,216]. Nevertheless, these methods are often computationally ex-
pensive, and sensitive to the choice of hyper-parameters. Lastly, there have been
efforts to approximate the posterior with Laplace approximation [139, 183]. A
related approach, the SWA-Gaussian [140] is another technique for Gaussian pos-
terior approximation using the Stochastic Weight Averaging (SWA) algorithm [95].
There are also non-Bayesian techniques to obtain calibrated uncertainty. For
instance, temperature scaling [70] has been empirically shown to be effective in
calibrating the predictions. A related line of work uses an ensemble of several
randomly-initialized NNs [121]. The method, known as deep ensembles, requires
training and saving multiple NNs. An ensemble of snapshots of the trained model
at different iterations can help obtain better uncertainty estimates [56]. Compared
to an explicit ensemble, this approach requires training only one model. Never-
theless, models at different iterations must all be saved to deploy the algorithm,
which can be computationally demanding.
3.3 An Analysis of the Performance of MC Dropout
3.3.1 MC Dropout as Ensembles of Dropout Models
Assume a dataset D = {(x_i, y_i)}_{i=1}^{n}, where each (x_i, y_i) ∈ (X × Y) is i.i.d. We
consider the problem of k-class classification, and let X ⊆ R^d be the input space
and Y = {1, ..., k} be the label space¹. A classifier is a function that maps input
features to labels f : X → R^k. We restrict our attention to NN functions f_w(x) :
X → R^k, where w = {W_i}_{i=1}^{L} corresponds to the parameters of a network with
L layers, and W_i corresponds to the weight matrix in the i-th layer. We define
a likelihood model p(y|x, w) = softmax(f_w(x)). Maximum likelihood estimation
can be performed to compute point estimates for w.
Recently, Gal and Ghahramani [51] proposed a novel viewpoint of dropout
as approximate Bayesian inference (See Appendix A for a brief review). This
perspective offers a simple way to marginalize out model weights at test time to
obtain better calibrated predictions:
\[
p(y = c \mid x, D_{train}) = \int p(y = c \mid x, w)\, p(w \mid D_{train})\, dw
\approx \frac{1}{T} \sum_{t=1}^{T} p(y = c \mid x, w^{(t)}), \tag{3.1}
\]
where w^(t) ∼ q(w|D_train) is assumed to be independently drawn layer-wise weight
matrices: W_i^(t) ∼ Ŵ_i · diag(Bernoulli(p)), Ŵ_i is the parameter matrix learned during
training, and p is the dropout rate. In this chapter, we view each dropout sample
w(t) in Equation 3.1 as an individual model such that MC dropout is performing
ensemble averaging. As we will see in the following sections, this ensemble learning
perspective on dropout inference provides us with principled ways to enhance MC
dropout. Lastly, note using structured dropout in lieu of regular dropout amounts
to only a change of the approximate distribution q(w|Dtrain) in Equation 3.1, so
that we are performing Bayesian variational inference with a different class of
approximate distributions. For instance, in the channel-level dropout, we sample
one Bernoulli random variable for each channel.
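The ensemble view of Equation 3.1 can be illustrated with a toy single-layer classifier. The sketch below rests on several assumptions that are not taken from the text: the model is a plain linear layer followed by softmax, the mask zeroes an input unit with probability `drop_rate`, no rescaling of the kept units is applied, and `group_size` is a hypothetical stand-in for structured (for example channel-level) dropout in which one Bernoulli draw is shared by a contiguous group of units.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_mask(num_units, drop_rate, rng, group_size=1):
    """Keep-mask of 0/1 entries; one Bernoulli draw per group of `group_size` units."""
    mask = []
    for _ in range(0, num_units, group_size):
        keep = 1.0 if rng.random() >= drop_rate else 0.0
        mask.extend([keep] * group_size)
    return mask[:num_units]


def predict(weights, x, mask):
    """One ensemble member: softmax(W_hat diag(mask) x)."""
    logits = [sum(w * m * v for w, m, v in zip(row, mask, x)) for row in weights]
    return softmax(logits)


def mc_dropout_predict(weights, x, drop_rate, num_samples, rng, group_size=1):
    """Average the predictions of `num_samples` dropout samples (Eq. 3.1)."""
    members = [predict(weights, x, sample_mask(len(x), drop_rate, rng, group_size))
               for _ in range(num_samples)]
    average = [sum(p[c] for p in members) / num_samples
               for c in range(len(weights))]
    return average, members


rng = random.Random(0)
W_hat = [[1.0, -1.0, 0.5, 0.0], [0.0, 2.0, -0.5, 1.0], [0.5, 0.5, 0.5, 0.5]]
x = [1.0, 2.0, 3.0, 4.0]
average, members = mc_dropout_predict(W_hat, x, drop_rate=0.3, num_samples=20, rng=rng)
print(average)
```

Each entry of `members` plays the role of one model in the ensemble, and `average` is the ensemble prediction.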
¹ Extension to regression tasks is straightforward but left out of this chapter.
Figure 3.1: From left to right: (1) accuracy of MC dropout and deep ensemble; (2)
the relative improvements in accuracy of deep ensemble and MC dropout; (3) Brier
score of MC dropout and deep ensemble against number of models; (4) the relative
improvements in Brier score of deep ensemble and MC dropout against number
of models. The x-axis of every panel is the number of models.
Figure 3.2: Interrater Agreement (IA) of models with different types of dropout
with 0.1 dropout rate on the SVHN, CIFAR-10 and -100 datasets (one panel per
dataset). The lower the IA, the more diverse the predictions of the models. Y-axis
indicates different methods (dropout, dropBlock, dropChannel, dropLayer,
dropOmnibus, deep ensemble); x-axis is the Interrater Agreement. MC dropout
produces models with much larger IA, hence less model diversity, than structured
dropout techniques in most of the cases.
3.3.2 Decomposing the Performance of Ensembles
First proposed by Krogh and Vedelsby [114], the error-ambiguity decomposition
enables one to quantify the performance of ensembles with respect to individual
models. Let {h_t}_{t=1}^{T} be an ensemble of T classifiers, and let H(x) = (1/T) \sum_t h_t(x)
be the ensemble prediction. In classification problems, h_t(x) is often a probability
vector such that h_t^i(x) = p(y = i | x, w_t). In MC dropout, h_t(x) = p(y | x, w^(t)). Model
ambiguity can be then defined as:
\[
\alpha(h_t \mid x) = \| h_t(x) - H(x) \|_2^2,
\]
which quantifies the difference between an individual model and the ensemble
average.
The Brier score measures both the accuracy and calibration of probabilistic
classifications, and is proportional to mean squared error (MSE), which can be
decomposed as:
\[
\mathrm{MSE}(H) = \mathbb{E}_x[\mathrm{MSE}(H \mid x)]
= \mathbb{E}_x[\overline{\mathrm{MSE}}(h \mid x)] - \mathbb{E}_x[\bar{\alpha}(h \mid x)], \tag{3.2}
\]
where MSE(h_t|x) = ||y − h_t(x)||^2, y is the one-hot encoded vector of the correct
label y,
\[
\overline{\mathrm{MSE}}(h \mid x) = \frac{1}{T} \sum_{t} \mathrm{MSE}(h_t \mid x), \qquad
\bar{\alpha}(h \mid x) = \frac{1}{T} \sum_{t} \alpha(h_t \mid x) \tag{3.3}
\]
correspond to the average MSE, and ensemble diversity (average ambiguity), re-
spectively. Equation 3.2 suggests that the more accurate and the more diverse
the models, the better performance will be achieved by the ensemble. We use
MSE instead of the negative log likelihood (NLL), another commonly used mea-
sure for quality of uncertainty estimates, due to mathematical convenience. The
two metrics are closely related, and insights obtained from MSE carry over to
NLL. In general, MSE or NLL can be seen as comprehensive measures influenced
by both the accuracy and the calibration of the model. We give a brief discussion
in Appendix B on the relationship between these metrics.
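The decomposition in Equation 3.2 holds pointwise for every input x, which can be verified numerically. The following is a small sketch with hypothetical helper names; the ensemble members are random probability vectors rather than real network outputs.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def decompose(members, one_hot):
    """Return (ensemble MSE, average member MSE, average ambiguity) for one input."""
    T = len(members)
    ensemble = [sum(h[c] for h in members) / T for c in range(len(one_hot))]
    average_mse = sum(squared_distance(one_hot, h) for h in members) / T
    average_ambiguity = sum(squared_distance(h, ensemble) for h in members) / T
    ensemble_mse = squared_distance(one_hot, ensemble)
    return ensemble_mse, average_mse, average_ambiguity


def random_probability_vector(k, rng):
    raw = [rng.random() + 1e-12 for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


rng = random.Random(0)
members = [random_probability_vector(5, rng) for _ in range(8)]
one_hot = [0.0, 0.0, 1.0, 0.0, 0.0]

ensemble_mse, average_mse, average_ambiguity = decompose(members, one_hot)
print(ensemble_mse, average_mse - average_ambiguity)
```

The two printed numbers agree up to floating point error, so for a fixed average member error a larger ambiguity directly lowers the ensemble error.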
3.3.3 Performance of MC Dropout and Model Diversity
The discussion of the previous section provides us with a potential recipe to en-
hance MC dropout. To illustrate the importance of diversity, we conduct an ex-
periment using ResNet-50 on CIFAR-10 to compare MC dropout with an explicit
ensemble of five NNs (details can be found in Section 3.4). As we see from Fig-
ure 3.1, individual models in deep ensemble, on average, perform better than the
ones in MC dropout, likely because of the reduced effective capacity of the latter.
Furthermore, the performance of the ensembles improves with more models. Yet
the improvement is larger for deep ensemble because of increased ensemble diver-
sity, since we know from Equation 3.2 that the decrease in the Brier score in this
analysis is attributable to the increase in ensemble diversity. The lack of diver-
sity among MC dropout models is largely because neighboring pixel features are
often correlated in convolutional layers. Thus, even with dropout, similar infor-
mation propagates through the network in every iteration, leading to very similar
predictions. Although model diversity can be encouraged naively by increasing
dropout rates, doing so often leads to increased MSE of individual models, thereby
hampering ensemble performance, as can be seen from Equation 3.2.
3.3.4 Omnibus Dropout
While model diversity can be promoted via explicit ensembles, such ensembles demand more
computational resources, which can be prohibitively expensive. Although dropout-based
methods typically need a larger number of samples at test time, unlike
deep ensembles, dropout uncertainty can be obtained sequentially, which has a
lower memory requirement.
In this chapter, we hypothesize that the main cause in the lack of diversity with
MC dropout for CNNs is the locally similar features present in image data. As
such, in order to enhance diversity in an ensemble obtained from a single model, we
examine the use of structured dropout, which drops information from contiguous
regions of feature maps so that more divergent information is propagated to subsequent
layers during training at each iteration. Specifically, we compare dropout at
the patch level, which randomly drops out small patches of feature maps [58], the
channel level, which drops out entire channels of feature maps at random [204], and
the layer level, which drops out entire layers of CNNs at random [90]. We denote these
as dropBlock, dropChannel and dropLayer, respectively. We refer to the test-time
sampling of models trained with the aforementioned structured dropout methods
as MC dropBlock, MC dropChannel and MC dropLayer. Intuitively, using such structured
dropout techniques in lieu of standard dropout can be a simple yet
effective way to enhance diversity in an ensemble sampled through dropout. For
instance, at the channel level, since entire channels in the hidden layers are randomly
dropped out each time, there would be much more diversity associated with
the resulting predictions.
[Figure 3.3 plots: Brier Score (left) and Accuracy (right) against Number of Models in the Ensemble; curves: dropout, dropBlock, dropChannel, dropLayer, dropOmnibus]
Figure 3.3: Test Brier score (left) and accuracy (right) against number of models
for ensemble prediction at test time on CIFAR-10. This corresponds to the number
of different MC dropout instantiations at test time of the same model. The model
trained with omnibus dropout achieves the best accuracy and Brier score.
As we demonstrate experimentally in the following section, the increased diver-
sity of structured dropouts can come at the cost of reduced performance of individ-
ual models. Moreover, given considerable choices of dropout strategies available, it
can be hard to pick the best one. Therefore, we propose a simple omnibus dropout
strategy, which combines all the aforementioned methods. The implementation of
omnibus dropout involves the sequential execution of the nested group of dropout
methods: dropLayer, dropChannel, dropBlock and regular dropout. As such, for
each of the layers in the CNN, we first sample a Bernoulli random variable to
determine if this particular layer is dropped out. If the layer is retained, we then
proceed to determine the channel-level dropout, and then the patch-level dropout,
followed lastly by standard dropout. In practice, however, all the random variables
at different levels are sampled independently from one another. In order to calibrate
the overall dropout rate of omnibus dropout, given a predetermined overall
dropout rate (e.g. 0.1), we can calculate the corresponding individual dropout rate
at each level so that overall only 10% of the features are dropped out on average.
We use the same dropout rate for all the dropout methods. Empirically, we find
this simple choice to work well. As our results show, omnibus dropout yields good
performance by promoting model diversity without hampering individual model
performance.
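The nested sampling described above can be sketched for a single layer's feature map as follows. This is a minimal illustration, not the implementation used in the experiments. The function names `per_level_rate` and `omnibus_mask` are hypothetical, the per-level rate assumes the four levels act independently, and the patch-level seed rate is a rough approximation that ignores block overlap and borders rather than the exact rate matching of [58].

```python
import numpy as np

def per_level_rate(overall_rate, n_levels=4):
    """Per-level drop rate q such that 1 - (1 - q)**n_levels equals overall_rate
    (assumes the levels drop features independently)."""
    return 1.0 - (1.0 - overall_rate) ** (1.0 / n_levels)

def omnibus_mask(shape, q, block_size=3, rng=None):
    """Sample a binary keep-mask of shape (C, H, W) for one layer's output."""
    rng = np.random.default_rng() if rng is None else rng
    c, h, w = shape
    # Layer level: a single Bernoulli draw decides whether the whole layer is dropped.
    if rng.random() < q:
        return np.zeros(shape)
    # Channel level: drop entire channels.
    channel_keep = (rng.random((c, 1, 1)) >= q).astype(float)
    # Patch level: each seed point zeroes a block_size x block_size patch.
    seeds = rng.random((c, h, w)) < q / block_size ** 2   # approximate seed rate
    block_keep = np.ones(shape)
    r = block_size // 2
    for ci, yi, xi in zip(*np.nonzero(seeds)):
        block_keep[ci, max(yi - r, 0):yi + r + 1, max(xi - r, 0):xi + r + 1] = 0.0
    # Element level: standard dropout on individual features.
    element_keep = (rng.random(shape) >= q).astype(float)
    return channel_keep * block_keep * element_keep

q = per_level_rate(0.1)                      # per-level rate for an overall rate of 0.1
mask = omnibus_mask((8, 16, 16), q, rng=np.random.default_rng(0))
```

An all-zero mask corresponds to dropping the layer, which is only meaningful when a skip connection carries the input past it. All four random draws are made independently, matching the remark that the levels are sampled independently in practice.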
3.3.5 Enhanced Ensemble Diversity with Structured Dropout
We empirically investigate model diversity achieved with various forms of afore-
mentioned dropout methods. For a fair comparison, we fix the dropout rate for all
methods to 0.1 so that all models have the same effective number of parameters.
There are numerous measures to quantify diversity of ensembles [239]. We use
Interrater Agreement (IA) [118], defined as:
$$\kappa = 1 - \frac{\frac{1}{T}\sum_{k=1}^{n} \rho(x_k)\,(T - \rho(x_k))}{n\,(T-1)\,\bar{p}\,(1-\bar{p})}, \tag{3.4}$$
where T is the number of individual classifiers, n is the number of test samples,
ρ(xk) is the number of models that classify the k-th sample correctly, and p̄ is
the average classification accuracy across classifiers. When all classifiers perfectly agree
on the test set, κ = 1, and smaller values indicate more diverse predictions. Figure
3.2 summarizes IA for sampled models trained on different datasets with different
dropout methods. We also compare the results with deep ensemble. The number
of models used to compute IA, T , is fixed to five for all approaches. In general,
IA for MC dropout is much higher than structured dropout techniques. On the
other hand, structured dropout can yield ensembles that are as diverse as the
computationally expensive method of deep ensemble, confirming our expectation
that dropping out correlated information can produce sampled models with more
ambiguity. Note that the large IA for MC dropLayer on SVHN is likely caused
by a relatively small model used for that problem - an 18-layer ResNet. Lastly,
note that while MC omnibus-dropout yields models much more diverse than MC
dropout, it is often not the most diverse one either.
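Equation 3.4 is straightforward to compute from a table of per-model correctness. The sketch below is one possible implementation. The function name `interrater_agreement` and the boolean input layout are assumptions made for illustration.

```python
import numpy as np

def interrater_agreement(correct):
    """Interrater agreement (Equation 3.4).

    correct: array of shape (T, n); correct[t, k] is truthy when model t
    classifies test sample k correctly.
    """
    correct = np.asarray(correct, dtype=float)
    T, n = correct.shape
    rho = correct.sum(axis=0)            # number of models correct on each sample
    p_bar = correct.mean()               # average individual accuracy
    numerator = (rho * (T - rho)).sum() / T
    denominator = n * (T - 1) * p_bar * (1.0 - p_bar)
    return 1.0 - numerator / denominator

# Five identical classifiers agree on every sample, so kappa is 1.
identical = np.tile([1, 1, 0, 1], (5, 1))
kappa_identical = interrater_agreement(identical)

# Two classifiers that are each right exactly where the other is wrong.
opposed = np.array([[1, 0], [0, 1]])
kappa_opposed = interrater_agreement(opposed)
```

Note that the measure is undefined when the average accuracy is exactly 0 or 1, since the denominator vanishes.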
The moderate diversity of MC omnibus-dropout, we believe, is the key to its
effectiveness. To better understand its behavior, we study the performance metrics
as a function of the number of sampled models in the ensemble. Figure 3.3 shows the
Brier score (left) and accuracy (right) against the number of models for the CIFAR-10
dataset (similar results are observed for SVHN and CIFAR-100; see Appendix C).
Firstly, as seen from Figure 3.3 (left), while the performance of individual models
sampled from MC dropout is among the best, the gain in Brier score with a larger
number of test-time MC samples is much smaller than for structured dropout
techniques. On the other hand, though a larger diversity indeed leads to much
sharper improvements as the number of sampled models increases, the Brier scores
(hence MSE) of individual models sampled from MC dropBlock, MC dropChannel
and MC dropLayer are much larger than that of MC dropout, suggesting a trade-off
between diversity and the performance of individual sampled models. MC omnibus-dropout,
which enjoys the benefits of both structured and regular dropout, is
able to achieve not only good performance with one sampled model (with a Brier
score close to that of MC dropout), but also good model diversity, as evidenced by a
significantly larger decrease in Brier score as the number of models increases. Similar
observations can be made from the accuracy plot of Figure 3.3 (right).
[Figure 3.4 plots: panels SVHN, CIFAR-10 and CIFAR-100; x-axis: Confidence; y-axis: Confidence - Accuracy; curves: deterministic, tempScaling, dropout, dropBlock, dropChannel, dropLayer, dropOmnibus, Deep Ensemble]
Figure 3.4: Reliability diagrams of predictions produced by different models.
Table 3.1: Results on benchmark datasets comparing accuracy and uncertainty
estimates produced by different types of methods. The top performing result for
each metric is bold-faced. MC omnibus-dropout is consistently the best method.
The numbers in brackets next to dropout methods correspond to the optimal
drop rate found by grid search.
Dataset    Method                   Accuracy ↑    NLL ↓            Brier ↓ (×10⁻³)   ECE ↓ (×10⁻²)
SVHN       Temp Scaling             95.7 ± 0.1    0.163 ± 0.002    6.62 ± 0.10       0.995 ± 0.160
SVHN       Dropout (0.35)           96.7 ± 0.1    0.128 ± 0.001    5.11 ± 0.06       0.934 ± 0.045
SVHN       DropBlock (0.1)          96.8 ± 0.1    0.133 ± 0.002    5.19 ± 0.07       1.26 ± 0.14
SVHN       DropChannel (0.2)        96.7 ± 0.1    0.130 ± 0.001    5.15 ± 0.06       0.799 ± 0.032
SVHN       DropLayer (0.25)         96.3 ± 0.1    0.144 ± 0.002    5.69 ± 0.05       0.846 ± 0.250
SVHN       Omnibus dropout (0.15)   96.9 ± 0.1    0.127 ± 0.001    4.97 ± 0.09       1.15 ± 0.06
CIFAR10    Temp Scaling             93.9 ± 0.1    0.189 ± 0.002    9.06 ± 0.08       0.905 ± 0.114
CIFAR10    Dropout (0.2)            93.1 ± 0.1    0.224 ± 0.003    10.2 ± 0.1        1.64 ± 0.07
CIFAR10    DropBlock (0.1)          93.4 ± 0.1    0.203 ± 0.003    9.89 ± 0.10       0.743 ± 0.116
CIFAR10    DropChannel (0.15)       93.7 ± 0.1    0.193 ± 0.002    9.34 ± 0.9        0.812 ± 0.104
CIFAR10    DropLayer (0.1)          94.0 ± 0.2    0.206 ± 0.001    9.09 ± 0.17       0.941 ± 0.068
CIFAR10    Omnibus dropout (0.1)    94.4 ± 0.1    0.173 ± 0.001    8.38 ± 0.10       0.607 ± 0.078
CIFAR100   Temp Scaling             74.5 ± 0.3    1.00 ± 0.01      3.57 ± 0.04       4.02 ± 0.62
CIFAR100   Dropout (0.2)            74.1 ± 0.4    1.18 ± 0.01      3.71 ± 0.05       9.18 ± 0.23
CIFAR100   DropBlock (0.15)         73.7 ± 0.5    1.04 ± 0.02      3.66 ± 0.05       4.46 ± 0.97
CIFAR100   DropChannel (0.15)       74.9 ± 0.5    0.996 ± 0.02     3.46 ± 0.04       3.17 ± 0.11
CIFAR100   DropLayer (0.25)         75.7 ± 0.2    1.01 ± 0.01      3.42 ± 0.03       2.90 ± 0.24
CIFAR100   Omnibus dropout (0.25)   75.3 ± 0.2    0.929 ± 0.005    3.40 ± 0.02       1.65 ± 0.21
3.4 Experiments
We empirically evaluate the performance of MC dropBlock, MC dropChannel, MC
dropLayer and MC omnibus-dropout, and compare them to MC dropout and tem-
perature scaling. We include in Appendix C further experiments with explicit
dropout ensembles and their comparison to deep ensembles.
Model. Layer-level dropout requires skip connections so that there is still
information flow through the network after dropping out an entire layer.
Examples include FractalNet [122] and ResNet [79]. We use the
PreAct-ResNet [80] for all our experiments. We refer to the PreAct-ResNet trained
without dropout as a deterministic model. MC dropout, MC dropBlock and MC
dropChannel models are implemented by inserting the corresponding dropout
layers with a constant p before each convolutional layer. A block size of 3×3 is used
for MC dropBlock. We follow [58] to match the effective dropout rate of MC
dropBlock to the desired dropout rate p. MC dropLayer is implemented by
randomly dropping out entire ResNet blocks at a constant rate p. We empirically
observe that dropping out downsampling ResNet blocks during testing is harmful
to the quality of uncertainty estimates. This is in agreement with the experiments
of [209]². Hence, downsampling blocks are only dropped out during training. MC
omnibus-dropout is implemented by including all types of aforementioned dropout,
each with the same dropout rate. For a full Bayesian treatment, we also insert
a dropout layer before the fully connected layer at the end of the NNs. For all
models with dropout of any type, we sample 30 times at test time for Monte Carlo
estimation.
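The test-time Monte Carlo estimate amounts to averaging class probabilities over repeated stochastic forward passes. The sketch below illustrates this with a toy linear "network". The names `mc_predict` and `toy_dropout_net` and the toy weights are hypothetical stand-ins, not the PreAct-ResNet models used here.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_predict(stochastic_logits, x, n_samples=30):
    """Average class probabilities over n_samples stochastic forward passes."""
    probs = np.stack([softmax(stochastic_logits(x)) for _ in range(n_samples)])
    return probs.mean(axis=0), probs

# Toy stand-in for a network whose dropout stays active at test time.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))

def toy_dropout_net(x, p=0.1):
    keep = rng.random(x.shape) >= p          # fresh Bernoulli mask on every call
    return (x * keep / (1.0 - p)) @ W

x = rng.normal(size=(4, 8))                  # batch of 4 inputs with 8 features
mean_probs, all_probs = mc_predict(toy_dropout_net, x)
```

Keeping the per-pass probabilities (`all_probs`) alongside their mean is convenient, since disagreement-based quantities need the individual passes.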
Datasets. We conduct experiments using the SVHN [158], CIFAR-10 and
CIFAR-100 [111] datasets with the standard train/test split. Validation sets of
10000 and 5000 samples are used for SVHN and the CIFAR datasets, respectively. We use the 18-, 50-
and 101-layer PreAct-ResNet for SVHN, CIFAR-10 and CIFAR-100, respectively.
Training. We perform preprocessing and data augmentation using per-pixel
mean subtraction, horizontal random flip and 32× 32 random crops after padding
with 4 pixels on each side. We use stochastic gradient descent (SGD) with 0.9
momentum, a weight decay of 10⁻⁴ and a learning rate of 0.01, and divide it by 10
after 125 and 190 epochs (250 in total) for SVHN and CIFAR-10, and after 250
and 375 (500 in total) for CIFAR-100.
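The step schedule described above can be written as a small helper. This is a sketch: the function name `learning_rate` is hypothetical, and it assumes zero-indexed epochs with each decay taking effect from the milestone epoch onward.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(125, 190), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone epoch."""
    n_decays = sum(epoch >= m for m in milestones)
    return base_lr * factor ** n_decays

# SVHN / CIFAR-10 schedule (250 epochs) and CIFAR-100 schedule (500 epochs).
short_schedule = [learning_rate(e) for e in range(250)]
long_schedule = [learning_rate(e, milestones=(250, 375)) for e in range(500)]
```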
Evaluation. All the results are computed on the test set using the model at the
optimal epoch based on validation accuracy. We use the Brier score, negative log-likelihood
(NLL), expected calibration error (ECE), and classification accuracy
to evaluate performance (see Appendix B for definitions). Following [154], we
partition predictions into 20 equally spaced bins and take a weighted average of the
bins' accuracy and confidence difference to estimate ECE. To visualize calibration
performance, we also plot the reliability diagrams [140], which are plots of the
difference between accuracy and confidence against confidence. The closer the
curve to the X-axis, the more calibrated the model predictions.
² In their experiments, ResNet blocks are only dropped out during testing, but not training.
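The ECE estimate with 20 equally spaced confidence bins can be sketched as below, together with a Brier score. The exact definitions used in this chapter are given in Appendix B, so two choices here are assumptions: the Brier score is averaged over both samples and classes, and each bin is half-open on the left.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=20):
    """Weighted average of |accuracy - confidence| over equally spaced bins."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap       # weight by the fraction of samples in the bin
    return ece

def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and one-hot labels
    (assumed normalization: mean over samples and classes)."""
    one_hot = np.eye(probs.shape[1])[labels]
    return np.mean((probs - one_hot) ** 2)

# Two predictions made with 83% confidence, only one of which is correct.
probs = np.array([[0.83, 0.17], [0.83, 0.17]])
labels = np.array([0, 1])
ece = expected_calibration_error(probs, labels)
brier = brier_score(probs, labels)
```

In this toy case the accuracy in the occupied bin is 0.5 while the confidence is 0.83, so the model is overconfident by 0.33.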
Results. Table 3.1 summarizes the performance of various models using met-
rics mentioned previously. To ensure a fair comparison, we treat the dropout rate
as a hyper-parameter and conduct a linear grid search with 0.05 interval for op-
timal dropout rate based on NLL. The optimal dropout rates are shown in the
table next to methods. Standard deviations are obtained on five models with
random initializations for all dropout models. As seen from Table 3.1 and Fig-
ure 3.4, all forms of structured dropout models offer better uncertainty estimates
than MC dropout in general. Overall, MC omnibus-dropout is consistently the
best performing model. Moreover, we also perform experiments with five explicit
ensembles of models trained together with all types of dropout for further com-
parison against deep ensembles, and most of the dropout models outperform deep
ensembles trained without dropout. Again, omnibus dropout is consistently one of
the best methods (See Appendix C). Lastly, as evident from moderately increased
classification accuracy over deterministic temperature scaling models, all types of
dropout methods can be incorporated into architectures with no accuracy penalty.
We believe the relatively good performance of MC dropout on SVHN compared
to CIFARs is because the former task is easier so that the model can still predict
accurately at an aggressive dropout rate of 0.35 at which even regular dropout
can produce acceptably diverse sampled models. In contrast, as observed in our
experiments, while using larger dropout rates for the more difficult CIFAR datasets
can lead to more calibrated predictions, accuracy and NLL suffer due to the increased
MSE of individual models (see Appendix C). Lastly, we believe the results for MC
dropBlock can be improved by optimizing the choice of block size. A pre-fixed
block size of 3 × 3 can be too small for the upstream convolutional layers where
the feature maps are much larger than the block size, and too large for the
last few downstream layers where the feature maps are comparable to the block
size, as supported by sharp increases in NLL after the optimal dropout rate.
3.5 Bayesian Active Learning
To further demonstrate the merit of omnibus dropout, we consider the downstream
task of Bayesian active learning on CIFAR-10. Active learning involves first train-
ing on a small amount of labeled data. Then, an acquisition function based on
the outputs of models is used to select a small subset of unlabeled data so that
an oracle can provide labels for these queried data. Samples that a model is the
least confident about are usually selected for labeling, in order to maximize the
information gain. The model is then retrained with the additional labeled data
that is provided. The above process can be repeated until a desired accuracy is
achieved or the labeling resources are exhausted.
[Figure 3.5 plots: panels BALD, Entropy and Variation; x-axis: Number of Training Samples (×10⁴); y-axis: Test Accuracy; curves: deterministic, dropout, dropBlock, dropChannel, dropLayer, dropOmnibus]
Figure 3.5: Left : Test accuracy against number of training samples for models
with different methods of dropout and Variation Ratios as the acquisition function
on CIFAR-10. Right : Relative improvements in test accuracy over that of the first
iteration with different methods of dropout.
In our experiment, we train models with structured dropout at different scales
using the same setup as described in Section 3.4, except
that only 2000 training samples are used initially. To match model capacity,
the dropout rate is set to 0.1 for all methods. We also compare against a deterministic
model. After the first iteration, we acquire 1000 samples from a pool
of "unlabeled" data, and combine the acquired samples with the original set of
labeled images to retrain the models. Following [54], we consider three acquisition
functions: Max Entropy,
$$H[y \mid x, D_{train}] = -\sum_{c} p(y = c \mid x, D_{train}) \log p(y = c \mid x, D_{train}),$$
the BALD metric (Bayesian Active Learning by Disagreement),
$$I[y, w \mid x, D_{train}] = H[y \mid x, D_{train}] - \mathbb{E}_{p(w \mid D_{train})}\big[H[y \mid x, w]\big],$$
and the Variation Ratios metric,
$$\text{variation-ratio}[x] = 1 - \max_{y} p(y \mid x, D_{train}).$$
We repeat the acquisition
process eight times so that in the last iteration, the training set contains 10000
images. To mimic a real-world scenario in which the number of labeled samples is
small, we do not use a validation set, and the accuracies reported for this experiment
correspond to the last-epoch accuracies. We repeat experiments five times
for consistency.
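Given class probabilities from several stochastic forward passes on the unlabeled pool, the three acquisition functions can be computed as in the sketch below. The function names and the `(S, N, C)` array layout are assumptions for illustration, and the small `eps` only guards against taking the logarithm of zero.

```python
import numpy as np

def acquisition_scores(mc_probs, eps=1e-12):
    """mc_probs: (S, N, C) class probabilities from S stochastic passes on N pool samples."""
    mean_probs = mc_probs.mean(axis=0)                                    # p(y | x, D_train)
    entropy = -(mean_probs * np.log(mean_probs + eps)).sum(axis=1)        # Max Entropy
    expected_entropy = -(mc_probs * np.log(mc_probs + eps)).sum(axis=2).mean(axis=0)
    bald = entropy - expected_entropy                                     # BALD (mutual information)
    variation_ratio = 1.0 - mean_probs.max(axis=1)                        # Variation Ratios
    return entropy, bald, variation_ratio

def select_queries(scores, n_queries=1000):
    """Indices of the n_queries pool samples with the highest acquisition score."""
    return np.argsort(scores)[::-1][:n_queries]

# Two passes over two pool samples: the passes agree on sample 0 and disagree on sample 1.
mc_probs = np.array([[[1.0, 0.0], [1.0, 0.0]],
                     [[1.0, 0.0], [0.0, 1.0]]])
entropy, bald, variation_ratio = acquisition_scores(mc_probs)
query = select_queries(bald, n_queries=1)
```

In the toy example each pass is individually confident, so the uncertainty on sample 1 comes entirely from disagreement between passes, which BALD isolates.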
Figure 3.5 shows the test accuracy against number of training samples for
different models. In general, MC omnibus-dropout yields the best performance by
far. Interestingly, MC omnibus-dropout is able to outperform all other methods
consistently by a significant margin after the first iteration when all samples are
randomly selected. In addition, it can be seen that, after the first iteration when all
2000 training images are randomly selected, the test accuracy using MC dropout is
on par with that of other structured dropout methods. However, as more labeled
data are added, the relative increase in accuracy is larger for models using
structured dropout than for models using regular dropout. This suggests that
the uncertainty estimates obtained with structured dropout are more useful for
assessing "what the model doesn't know", thereby allowing for the selection of
samples to be labeled in a way that better helps improve performance. Note also
that the comparative gain in accuracy by MC omnibus-dropout during the later
stages of the learning process is not as large. We suspect this can be caused by
the saturation effect on test accuracy.
CHAPTER 4
ENHANCING UNCERTAINTY ESTIMATES WITH EFFICIENT
NEURAL NETWORK ENSEMBLES
In the previous chapter, we presented several ways to improve upon Monte Carlo dropout
for uncertainty estimation. Monte Carlo (MC) dropout [51] is a simple and efficient
ensembling method that can improve the accuracy and confidence calibration of
high-capacity deep neural network models. However, MC dropout is not as ef-
fective as more compute-intensive methods such as deep ensembles [120]. This
performance gap can be attributed to the relatively poor quality of individual
models in the MC dropout ensemble and their lack of diversity. These issues can
in turn be traced back to the coupled training and substantial parameter shar-
ing of the dropout models. Motivated by this perspective, we propose a strategy
in this chapter to compute an ensemble of subnetworks, each corresponding to
a non-overlapping dropout mask computed via a pruning strategy and trained
independently. We show that the proposed subnetwork ensembling method can
perform as well as standard deep ensembles in both accuracy and uncertainty es-
timates, yet with a computational efficiency similar to MC dropout. Lastly, using
several computer vision datasets like CIFAR10/100, CUB200, and Tiny-Imagenet,
we experimentally demonstrate that subnetwork ensembling also consistently out-
performs recently proposed approaches that efficiently ensemble neural networks.
4.1 Introduction
An effective way to improve model accuracy and confidence calibration in deep
learning is ensembling. One efficient technique that leverages this idea is "Monte
Carlo (MC) dropout" [51], which extends the popular dropout technique used for
regularization during training [195]. In MC Dropout, test-time inference involves
multiple forward passes through the model, each executed with a different random
dropout mask, as during the training phase. This yields an ensemble of
predictions which are then averaged.
While MC dropout can improve a baseline model, it is still inferior to explicit
ensembles of neural networks trained independently with random initialization
(called deep ensembles) [120]. Using the perspective of the error-ambiguity de-
composition [238], we can attribute this performance gap to the relatively poor
performance of individual models and/or limited diversity in the MC dropout
ensemble. We further hypothesize these issues are largely due to the extensive
parameter sharing among MC dropout models.
With this perspective in mind, we explore the idea of creating an ensemble
of subnetworks in which a pre-determined number of non-overlapping dropout
masks are used. We present an easy-to-implement greedy optimization procedure
that sequentially computes dropout masks via a recent dropout-mask optimization
technique and trains each subnetwork independently. The resulting algorithm
enables us to obtain a diverse ensemble of non-overlapping subnetworks within
one neural network. That is, we are able to create many models out of one¹. We
demonstrate that subnetwork ensembling consistently outperforms MC dropout
and several other recently proposed approaches that efficiently ensemble neural
networks in terms of both accuracy and uncertainty estimates. We also show that
our proposed approach achieves results on par with those of deep ensembles, yet
with much better test-time computational efficiency.
¹ Hence our title: Ex uno plures.
Summary of Contributions.
1. We present the novel idea of ensembling non-overlapping subnetworks within
one neural network architecture.
2. We propose a simple sequential pruning based procedure to enhance the
performance of subnetwork ensembling.
3. We demonstrate and discuss the regularization effect achieved by training
pruned networks and using a randomized and frozen fully connected layer in
the network.
4. Our experiments demonstrate that subnetwork ensembling outperforms MC
dropout and several state-of-the-art methods for efficient ensembling.
4.2 Related Works
Ensemble Learning An ensemble of models has long been known to be an effective
method to boost the performance of machine learning models [37, 238]. More re-
cently, with the growing interest in deep learning, ensembles of neural networks
have gained much attention. Notably, Lakshminarayanan et al. [120] demonstrated
that simple ensembles of neural networks (NNs) trained independently (called
“Deep Ensembling”) can offer improved predictive uncertainty and accuracy. In
fact, somewhat surprisingly, deep ensembles often outperform more sophisticated
Bayesian NNs. Building on this, several recent works have attempted to under-
stand the unexpected effectiveness of deep ensembles [45,131,174,217].
To further enhance the performance of ensembles of neural networks, previous
methods have also pinpointed the importance of model diversity [98,113,234], and
explored ways to promote diversity in ensembles. For instance, Sinha et al. [192]
proposed an Information Bottleneck-based approach to explicitly stimulate diver-
sity among predictions. In related work, Jain et al. [96] utilized out-of-distribution
samples to encourage diversity among models. Enhanced diversity can also be
obtained by ensembling across different architectures of NNs through neural archi-
tecture search [230] or by varying the hyperparameters of models [219].
Our proposed method can be regarded as an approach for efficient ensembles
of NNs. Several techniques have been previously proposed in this direction. For
instance, BatchEnsemble [218] makes use of rank-one matrices to approximate
weight matrices in NNs for fast ensembling of models. Lately, a multi-input multi-
output (MIMO) [77] configuration was discovered to be an effective method to
utilize a single model’s capacity to train multiple subnetworks. Another recent
work [39], similar to ours, uses a set of pre-determined masks in place of the
stochastic sampling of MC dropout for improved uncertainty estimation. However,
their masks are randomly generated at the beginning of training and frozen. There
is also significant sharing of parameters among the models, potentially reducing
diversity and thus leading to sub-optimal performance.
Neural Network Pruning Network pruning aims to compress neural networks
by reducing the number of parameters present in the model [46,71,73,74,126,127,
130]. It often involves selecting and discarding parameters from a pre-trained net-
work, after which the compressed network is fine-tuned. While earlier approaches
choose parameters based on their magnitudes [46,73,74], numerous other selection
criteria have also been explored recently [25,186,187]. Inspired by recent success in
pruning, our proposed training procedure makes use of an importance score-based
optimization approach for dropout mask optimization [177, 187]. There have also
been efforts to compress a network at initialization, omitting pre-training [47,78].
However, post-training pruning methods typically outperform these methods. In
addition to network compression, network pruning has been used for goals like
multi-task learning [144] and continual learning [62]. In this work, we demonstrate
pruning can be an effective regularization strategy and can further be used to
obtain a subnetwork ensemble with a performance matching that of an explicit
ensemble of full networks.
Dropout First introduced as a technique to regularize networks, dropout in-
volves independent random removal of neurons or weights during training with a
pre-determined probability [195]. Later, Gal and Ghahramani [51] showed that
dropout can be applied at test-time, called Monte Carlo (MC) dropout, which
can be viewed as an approximate Bayesian technique and yields better estimates
of uncertainty. Several extensions have also been proposed to improve MC
dropout [1, 28, 53, 104, 236]. Unlike MC dropout which generates random masks
on the fly at every iteration, we demonstrate that pruning-derived dropout masks
can lead to significantly better performance.
4.3 Preliminaries
Problem Setup We consider the problem of k-class classification, though the
proposed method can be trivially extended to the regression setting. Suppose we
are given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ where each feature-label pair $(x_i, y_i) \in X \times Y$,
and $X \subseteq \mathbb{R}^d$ and $Y = \{1, \ldots, k\}$ denote the feature space and label space, respectively.
Typically, a neural network (NN) $f_w(x)$ parameterized by $w$ can be used
to map input features to corresponding labels for classification purpose. We define
a likelihood model p(y|x;w) = Cat (softmax (fw(x))), a categorical distribution
with parameters softmax (fw(x)) ∈ ∆(k). Here ∆(k) denotes the k-dimensional
probability simplex. Typically, maximum likelihood estimation (MLE) is performed
on the training dataset to obtain the optimal parameters for the NN. At test
time, p(y|x;w) is supposed to reflect the uncertainty in the predictions of the net-
work. However, modern NNs are often poorly calibrated, yielding overly-confident
predictions [70].
Deep Ensembles Previous work [120] demonstrates deep ensembles, or ensembles
of independently trained NNs, as an effective remedy to the calibration problem.
Given an ensemble of N models (each with its own parameters $w_i$), an aggregated
prediction can be obtained with
$$p(y \mid x) = \frac{1}{N}\sum_{i=1}^{N} p(y \mid x; w_i). \tag{4.1}$$
Leveraging the under-specification property [33] of modern neural networks to-
gether with stochasticity provided through random initialization of NNs and
stochastic optimization, deep ensembles often lead to a drastic improvement in
terms of both accuracy and quality of uncertainty estimates. However, a crucial
shortcoming of deep ensembles is the computational overhead; an ensemble of five
models would roughly cost five times more resources, including storage. This can
be prohibitive in real-world applications with computational constraints.
MC Dropout MC dropout can be used as an efficient alternative for ensembling.
Instead of training several models independently, one can train with dropout, where
a randomly sampled Bernoulli mask is applied to the weights during forward and
backward propagation. At test time, an ensemble of models can then be obtained
"for free" via multiple forward passes with random instantiations of the dropout
masks. Compared to deep ensembles, MC dropout incurs no additional memory
cost. Nevertheless, predictions produced through MC dropout are often outper-
formed by deep ensembles in both accuracy and uncertainty estimates.
4.4 Fixing MC Dropout
The error-ambiguity decomposition shows that the performance of an ensemble
is determined by two factors: the average performance of individual models that
make up the ensemble and the degree of diversity across model predictions [98,113].
Based on this perspective, the gap between MC dropout and deep ensemble can
be due to two reasons. Firstly, as we see in our Ablation Study below, individual
models sampled through dropout, on average, are no better than the independently
trained models in a deep ensemble. This is probably due to the common training
used for all models and the reduced effective capacity of the models. Moreover,
MC dropout exhibits significantly less diversity than deep ensembles, likely due
to the high degree of parameter sharing between models and, again, the common
training paradigm. In this section, we propose changes to MC dropout to obtain
an ensemble of non-overlapping and independently trained subnetworks.
4.4.1 Toward Enhancing Model Diversity
In a standard dropout training scheme, only a very small portion of the parameters
is dropped out (typically, dropout rates are set to around 10% to 20%). As a
direct consequence, there is extensive parameter sharing among models sampled
through MC dropout, leading to poor model diversity. While model diversity
can be naively increased by choosing a higher dropout rate, this often results in
much worse individual models, thus hampering the overall performance. Moreover,
compared to models in a deep ensemble which enjoy a completely independent
optimization procedure, in dropout, a random model is generated on the fly at
every training iteration. This can also increase the correlation of predictions among
MC dropout models.
Orthogonal Dropout To enhance model diversity, instead of drawing random
Bernoulli masks on the fly at each iteration, we propose to use a set of fixed,
non-overlapping dropout masks during training. We term this "orthogonal
dropout." These masks can be generated by randomly partitioning every
layer’s weights into k non-overlapping sets. For instance, we can randomly partition
a standard neural network into an ensemble of five subnetworks each with 20% of
the weights. Since the dropout masks are non-overlapping by construction, we can
completely decouple the training of the subnetworks. With orthogonal dropout,
each dropout subnetwork is effectively an independent model, thereby allowing us
to achieve much more model diversity. To further decouple the dropout models, we
can also maintain an independent set of batchnorm layers [94] for each subnetwork.
Similar to MC dropout, at inference, we can apply dropout masks to weights of NNs
before forward propagation and aggregate predictions of the subnetworks following
Equation 4.1.
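Generating the random, non-overlapping masks for one layer can be sketched as follows. This is an illustration of the random partitioning step only, with a hypothetical helper name `orthogonal_masks` and toy weights. It does not include the mask optimization or the per-subnetwork batchnorm layers discussed in this chapter.

```python
import numpy as np

def orthogonal_masks(weight_shape, k, rng=None):
    """Randomly partition a layer's weights into k non-overlapping binary masks."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(np.prod(weight_shape))
    owner = rng.permutation(n) % k       # each weight is assigned to exactly one subnetwork
    return [(owner == i).reshape(weight_shape).astype(np.float32) for i in range(k)]

masks = orthogonal_masks((64, 32), k=5, rng=np.random.default_rng(0))
weights = np.random.default_rng(1).normal(size=(64, 32)).astype(np.float32)
subnet_weights = [weights * m for m in masks]    # weights seen by each subnetwork
```

Because the masks sum to an all-ones array, every weight belongs to exactly one subnetwork, and with five subnetworks each holds roughly 20% of the layer's weights.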
Unlike MC dropout which contains effectively an infinite number of models
to be sampled from, with orthogonal dropout, there is a predetermined, fixed
number of dropout subnetworks for ensembling. Nevertheless, as we show in our
experiments, the gain from the decoupled training procedure and resulting diversity
offsets the limitation of the relatively small ensemble size.
We note that orthogonal dropout incurs some additional memory cost. Since
the dropout masks are fixed before training, we need to keep track of these. More-
over, for a NN with k subnetworks, we might need k sets of batchnorm layers and k
sets of bias terms in any fully connected (FC) layers 2. Nevertheless, compared to
the number of parameters in modern architectures, this additional memory cost is
negligible. The additional cost of batchnorm layers can also be mitigated through
batchnorm-free NNs [17]. Finally, as we present below, we can share FC layers
between subnetworks, which further reduces memory burden.
4.4.2 Enhancing Individual Model Performance
Naive orthogonal dropout implementation with randomly generated dropout masks
can cause lackluster individual model performance (see Ablation Study below).
Indeed, for an orthogonal dropout ensemble with 5 subnetworks, each dropout
subnetwork contains 20% of the parameters compared to a model in a deep en-
semble, likely causing an accuracy drop in individual models and hence the overall
ensemble performance.
Orthogonal Dropout Mask Optimization Our central goal is to maximize
orthogonal dropout subnetwork performance given a predetermined level of spar-
sity. A natural solution to this would be to optimize the dropout masks and
weights simultaneously. Let m_i ∈ {0, 1}^n for i = 1, ..., k denote binary dropout
masks applied to n weights. Then, a general optimization objective can be given
Footnote 2: We do not have bias terms for convolutional layers, as is common in many deep architectures.
Algorithm 2: Orthogonal Dropout Optimization
Input: NN parameters w; subnetwork size k.
Output: Optimized NN parameters w; dropout subnetwork masks {m_1, ..., m_k}.
Initialize w_0 = w, m_0 = 1 (an identity mask of all 1's);
for i = 1 : k do
    w_i = w_{i−1} ◦ (1 − m_{i−1});    // mask parameters already in use
    Randomly initialize w_i;
    Minimize E_{(x,y)∼D} [L(p(y|x; w_i), y)] w.r.t. w_i;    // pre-training step
    Apply the modified edge-pop algorithm to find optimal m_i given w_i;    // pruning step
    Minimize E_{(x,y)∼D} [L(p(y|x; w_i ◦ m_i), y)] w.r.t. w_i ◦ m_i;    // finetuning step
end
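by:
\[
\min_{w,\, m_1, \ldots, m_k} \; \mathbb{E}_{(x,y)\sim D}\left[ \frac{1}{k} \sum_{i=1}^{k} L\big(p(y \mid x;\, w \circ m_i),\, y\big) \right]
\quad \text{s.t.} \quad \|m_i\|_0 = \frac{n}{k} \ \text{and} \ m_i \perp m_j,\ \forall i \neq j \in \{1, \ldots, k\}, \tag{4.2}
\]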
where ◦ denotes the Hadamard product, ∥·∥0 denotes the number of non-zero
elements in a vector, and ⊥ indicates orthogonality (i.e., the inner product is
zero). The first condition ensures that all the dropout subnetworks contain the
same number of parameters while the second condition enforces the masks to be
non-overlapping.
The above optimization is infeasible to solve in practice. We make two observations
that allow us to propose an approximate solver. First, given the individual
masks mi, the problem reduces to k independent weight learning problems. Sec-
ond, optimizing for a dropout mask is similar to the problem of pruning a neural
network.
Based on these observations, we propose to simplify the optimization procedure
into a series of greedy optimizations of {w_i, m_i}, for i = 1, ..., k. To do this,
we adopt a three-step approach originally proposed for network pruning. Specifically,
at the i-th iteration, a pre-training step is first executed on all model
weights that have not been used by a previous subnetwork (i.e., those not yet
masked out). Given the pre-trained parameters, an optimization step is then
performed to determine the dropout mask m_i, followed by a final fine-tuning step
on only the retained weights w_i ◦ m_i.
The pre-training and finetuning steps are straightforward with stochastic gradi-
ent descent (SGD) given the dropout masks. On the other hand, the intermediate
step of binary dropout mask optimization of mi is worth focusing on. In this work,
we adopt the score-based edge-pop algorithm [177] to find the optimal mask m_i
given the pre-trained network weights. In essence, the edge-pop algorithm
transforms the discrete optimization problem into a differentiable problem where SGD
can be used. This is achieved by assigning a continuous score to each weight in
the NN indicating its relative importance. During a forward pass, a binary mask
m_i can be generated by ranking these scores and choosing the weights with the
largest scores. A gradient for this sort-and-choose layer can be approximated via a
relaxed backward pass. A detailed description of the algorithm can be found in the
Appendix.
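The forward and backward passes of such a score-based selection can be sketched
as follows. This is a minimal illustration in standard-library Python over a flat
list of weights; the function names and the `available` argument are assumptions
made for the sketch, and the backward rule shown is the straight-through
estimator, which we assume here as one common way to realize a relaxed backward
pass.

```python
def topk_mask(scores, keep, available=None):
    """Binary mask selecting the `keep` highest-scoring weights among the available ones."""
    n = len(scores)
    if available is None:
        available = [1] * n
    candidates = [i for i in range(n) if available[i]]
    # Rank candidate weights by score, highest first, and keep the top ones.
    chosen = sorted(candidates, key=lambda i: scores[i], reverse=True)[:keep]
    mask = [0] * n
    for i in chosen:
        mask[i] = 1
    return mask

def straight_through_score_grad(grad_wrt_masked_weights, weights):
    """Relaxed backward pass: treat the sort-and-choose step as identity w.r.t. scores.

    With effective weights w * m, dL/dm = dL/d(w * m) * w; the straight-through
    assumption passes this quantity to the scores unchanged.
    """
    return [g * w for g, w in zip(grad_wrt_masked_weights, weights)]

# Hypothetical example: keep the 2 highest-scoring of 4 weights.
example_mask = topk_mask([0.1, 0.9, 0.5, 0.7], keep=2)
```

Only the scores receive gradients in this step; the weights themselves are held
fixed while the mask is being searched.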
The original edge-pop algorithm is applied to a randomly initialized net-
work [177]. As such, the importance scores are also randomly initialized in their
setup. In our work, however, the edge-pop algorithm is applied on a pre-trained
network. In this scenario, we found it critical to initialize the scores proportional
to the magnitude of weights. This is inspired by the effectiveness of the weight
magnitude-based method for network pruning [46]. A similar observation was also
made previously by Sehwag et al. [187]. In practice, we use magnitudes of weights
divided by the layer-wise maximum as the initial values of scores so that all scores
have values in [0, 1]. Another modification we applied is that, at the i-th iteration,
weights that were used in a previous dropout iteration are masked out
and excluded from consideration by the edge-pop algorithm. Thus, we propose
to approximately solve the original optimization problem of Equation 4.2 with a
greedy, sequential optimization procedure. A summary of the proposed orthogonal
dropout algorithm can be found in Algorithm 2.
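The two modifications and the greedy, sequential structure can be illustrated
with the following simplified sketch. It is not the full procedure: the
pre-training and fine-tuning steps are omitted, the weights are held fixed across
iterations, and the learned pruning step is replaced by directly keeping the
top-scoring available weights. All names are hypothetical, and marking used
weights with a score of negative infinity is an assumed way of excluding them.

```python
def init_scores(weights, used):
    """Initial scores: |w| divided by the layer-wise maximum; used weights are excluded."""
    largest = max(abs(w) for w in weights) or 1.0
    # Weights already claimed by an earlier subnetwork get the lowest possible
    # score (an assumption of this sketch) so that they are never selected again.
    return [float("-inf") if u else abs(w) / largest for w, u in zip(weights, used)]

def greedy_masks(weights, k):
    """Sequentially carve k non-overlapping masks, each keeping n // k weights."""
    n = len(weights)
    keep = n // k
    used = [0] * n
    masks = []
    for _ in range(k):
        scores = init_scores(weights, used)
        # Stand-in for the learned pruning step: keep the top-scoring available weights.
        ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
        mask = [1 if i in ranked else 0 for i in range(n)]
        used = [u | m for u, m in zip(used, mask)]
        masks.append(mask)
    return masks

# Hypothetical layer with six weights split into three subnetworks.
layer = [0.5, -2.0, 1.0, -0.1, 0.3, 1.5]
layer_masks = greedy_masks(layer, k=3)
```

In this toy setting the first mask takes the two largest-magnitude weights and
each later mask chooses only among what remains, mirroring the shrinking pool of
weights available to later subnetworks.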
Fixing The Classification Layer Applying dropout to the final fully-connected
(i.e., classification) layer can significantly reduce the performance of each subnet-
work, hampering overall ensemble performance. One way around this is to not
apply dropout and have all subnetworks share the classification layer. With this in
mind, and inspired by some recent reports [85,205] in addition to our own empiri-
cal experience, we found randomly initializing and freezing a shared (no dropout)
classification layer to be very effective. Thus, this is our default implementation
for the proposed orthogonal dropout method in our experiments. The ablation study
below includes results without a fixed classification layer.
4.5 Experiments
To evaluate the performance of the proposed method, we conduct extensive experiments
with popular NN architectures on several benchmark datasets. We use the
CIFAR-10 and CIFAR-100 datasets [111], the CUB-200 dataset [215] and the
Tiny-Imagenet dataset [35] (https://www.kaggle.com/c/tiny-imagenet). We use the
ResNet18 [79] and the Wide-ResNet28-10 [229] for CIFAR datasets, and the ResNet50
model for the CUB-200 and the Tiny-Imagenet
Table 4.1: Results for ResNet models on various datasets. Best results for efficient
ensembles are highlighted in bold. Fixed classification layer is used for orthogonal
Dropout. See Table 4.3 and the Appendix for further ablation study on this.
Dataset / Model             Method                       Accuracy (↑)   NLL (↓)   ECE (↓)   Size
CIFAR10 / ResNet18          Deterministic                93.5%          0.296     0.0408    1×
                            MC Dropout                   94.4%          0.191     0.0202    1×
                            BatchEnsemble                94.8%          0.203     0.0269    ∼1×
                            MIMO Ensemble                94.3%          0.205     0.0180    ∼1×
                            Masksemble                   93.8%          0.202     0.0099    ∼1×
                            Orthogonal Dropout (Ours)    95.1%          0.157     0.0082    ∼1×
                            Deep Ensemble                94.8%          0.175     0.0110    5×
CIFAR100 / ResNet18         Deterministic                73.0%          1.28      0.141     1×
                            MC Dropout                   73.3%          1.11      0.0902    1×
                            BatchEnsemble                74.3%          1.05      0.0910    ∼1×
                            MIMO Ensemble                73.8%          1.09      0.0664    ∼1×
                            Masksemble                   73.7%          0.999     0.0224    ∼1×
                            Orthogonal Dropout (Ours)    77.7%          0.864     0.0191    ∼1×
                            Deep Ensemble                76.7%          0.921     0.0377    5×
CUB200 / ResNet50           Deterministic                50.1%          2.45      0.196     1×
                            MC Dropout                   50.2%          2.49      0.171     1×
                            BatchEnsemble                54.0%          2.33      0.128     ∼1×
                            MIMO Ensemble                52.0%          2.11      0.0879    ∼1×
                            Masksemble                   49.6%          2.32      0.118     ∼1×
                            Orthogonal Dropout (Ours)    61.4%          1.67      0.0335    ∼1×
                            Deep Ensemble                55.6%          1.98      0.0725    5×
Tiny-Imagenet / ResNet50    Deterministic                56.3%          2.18      0.164     1×
                            MC Dropout                   56.3%          1.96      0.0871    1×
                            BatchEnsemble                54.3%          2.31      0.185     ∼1×
                            MIMO Ensemble                54.0%          1.96      0.0461    ∼1×
                            Masksemble                   50.1%          2.07      0.0405    ∼1×
                            Orthogonal Dropout (Ours)    63.2%          1.55      0.0189    ∼1×
                            Deep Ensemble                61.4%          1.73      0.0336    5×
datasets. We use accuracy, the negative log-likelihood (NLL) [120] and the ex-
pected calibration error (ECE) [70] to measure performance. ECE, in particular,
is a measure of the quality of uncertainty.
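For concreteness, the two uncertainty-related metrics can be computed as in the
following standard-library sketch. The number of bins (15) and the equal-width
binning convention are assumptions of this illustration and may differ from the
settings behind the reported numbers.

```python
import math

def negative_log_likelihood(probs, labels):
    """Mean of -log p(true class) over the dataset."""
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)

def expected_calibration_error(probs, labels, n_bins=15):
    """Bin-size-weighted average of |accuracy - confidence| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        confidence = max(p)
        prediction = p.index(confidence)
        # A confidence of exactly 1.0 is placed in the last bin.
        b = min(int(confidence * n_bins), n_bins - 1)
        bins[b].append((confidence, 1.0 if prediction == y else 0.0))
    total = len(labels)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            avg_acc = sum(a for _, a in members) / len(members)
            ece += len(members) / total * abs(avg_acc - avg_conf)
    return ece

# Hypothetical predictions: confidence 0.75 on every input, 3 of 4 correct.
calibrated = expected_calibration_error([[0.75, 0.25]] * 4, [0, 0, 0, 1])
```

A model whose confidence matches its accuracy in every bin, as in the example,
has an ECE of zero, while a model that is always fully confident but only half
right has an ECE of 0.5.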
Baselines In addition to MC dropout and deep ensembles, we compare orthog-
onal dropout with several other recently proposed state-of-the-art methods for
efficient ensembles. These include BatchEnsembles [218], MIMO ensembles [77]
and Masksembles [39]. We use an ensemble of 5 models for all types of ensembling
methods except MIMO, for which we found an ensemble of 2 models gave the best
performance for ResNet models. Lastly, during inference, we do 30 forward passes
for MC dropout, which we observed to be sufficient to achieve its best performance.
Optimization For ResNet models, we use SGD with identical hyper-parameters
as originally used in the ResNet paper. We optimize the models for 150 epochs
during both the pre-training and finetuning step, and optimize dropout masks for
20 epochs during pruning step for our orthogonal dropout. In order to ensure a fair
comparison, we train baseline models for longer so that they are fully converged.
MC dropout, MIMO ensembles and the models in deep ensembles are each trained for
200 epochs. When training MIMO ensembles, we also use a batch repetition
of 4 to enhance the model performance, as suggested in the original paper.
We empirically observe that it takes longer for BatchEnsemble and Masksembles
to converge, and thus train these models for 500 epochs. For experiments using
Wide-ResNet28-10, we follow the identical training procedure used by Havasi et
al. [77] for a fair comparison against their experimental results.
4.5.1 Results
Experimental results for the ResNet models are summarized in Table 4.1. As
seen clearly from the table, our proposed method significantly outperforms other
recently proposed state-of-the-art (SOTA) methods of memory-efficient ensembles
in terms of both accuracy and quality of uncertainty estimates. Similar trends can
also be observed in experiments with Wide ResNet28-10; our proposed
Table 4.2: Results for Wide ResNet28-10. Asterisk symbol (*) represents results
adapted directly from [77]. Best results for efficient ensembles are highlighted in
bold.
                             CIFAR10                              CIFAR100
Method                       Accuracy (↑)   NLL (↓)   ECE (↓)     Accuracy (↑)   NLL (↓)   ECE (↓)
Deterministic*               96.0%          0.159     0.023       79.8%          0.875     0.086
MC Dropout*                  95.9%          0.160     0.024       79.6%          0.830     0.050
BatchEnsemble*               96.2%          0.143     0.021       81.5%          0.740     0.056
MIMO*                        96.4%          0.123     0.010       82.0%          0.690     0.022
Masksembles                  94.6%          0.173     0.008       76.7%          0.843     0.015
Orthogonal Dropout (Ours)    96.6%          0.122     0.005       82.8%          0.701     0.021
Deep Ensembles*              96.6%          0.114     0.010       82.7%          0.666     0.021
method consistently outperforms other SOTA methods.
More interestingly, for experiments with the ResNet models, we can see from
Table 4.1 that orthogonal dropout even produces performance better than that of
standard deep ensembles, which consume approximately five times more memory
during inference. This improvement is likely due to at least two
reasons. Firstly, as we discuss further below, the sequential dropout mask op-
timization serves as an additional vehicle for regularization. Secondly, as we will
further elaborate in Section 4.5.2, this improvement over deep ensembles is also due
to fixing the classification layer. Nevertheless, we emphasize that, even without
fixing the classification layer, our method consistently outperforms other recently
proposed SOTA efficient ensembling techniques. This can be confirmed by compar-
ing results for orthogonal dropout without the fixed classification layer summarized
in Table 4.3. Lastly, we note that the relative gap between orthogonal dropout and
deep ensembles becomes negligible for the experiments with Wide ResNet28-10 for
CIFARs. We conjecture that this is because a higher L2 regularization is used
for training of this model, following the exact training configuration of Havasi et
al. [77], thereby nullifying the regularization effect of fixing the classifier layers.
We leave it as future work to further understand this regularization effect.
[Figure panels: (a) CIFAR-10, (b) CIFAR-100, (c) CUB200, (d) Tiny-Imagenet. x-axis: subnetwork (1st to 5th); y-axis: accuracy.]
Figure 4.1: Bar plots of accuracy of individual orthogonal dropout subnetworks
of ResNet models. The "i-th" model represents the i-th subnetwork obtained
sequentially using Algorithm 2.
Individual Model Performance To gain further insights into orthogonal
dropout, we show in Figure 4.1 bar plots of the accuracy of the individual subnet-
works in ResNet models on various datasets. Firstly, it can be seen that for all
datasets except CUB200, subnetworks obtained later during the proposed greedy
optimization procedure, in general, exhibit poorer performance. Intuitively, this is
because later subnetworks have fewer parameters for learning. For instance, for an
orthogonal dropout ensemble with five subnetworks, during training of the third
subnetwork, only 60% of the weights are available for parameter and dropout mask
optimization. Interestingly, we see that the 2nd model is consistently the best per-
forming subnetwork out of the ensemble of five subnetworks. We hypothesize that
removing a small portion of parameters in neural networks can implicitly regularize
Table 4.3: Ablation study of the proposed method. Orthogonal dropout methods
are trained without dropout mask optimization unless marked with "MO". "MO"
corresponds to "mask optimization" and "FC" corresponds to "Fixed Classifier".
"Ind Acc" denotes the averaged individual model accuracy in an ensemble, while
"Ens Acc" represents the ensemble accuracy.
CIFAR10
Method                       Ind Acc (↑)   Ens Acc (↑)   NLL (↓)   ECE (↓)   IA (↓)
MC Dropout                   93.4%         94.4%         0.287     0.0415    0.708
Orthogonal Dropout           93.2%         94.3%         0.188     0.0138    0.597
Orthogonal Dropout+MO        93.8%         94.9%         0.176     0.0122    0.601
Orthogonal Dropout+MO+FC     93.9%         95.1%         0.157     0.0082    0.594
Deep Ensemble                93.4%         94.8%         0.175     0.0110    0.581
Deep Ensemble+FC             93.5%         95.1%         0.151     0.0091    0.580

CIFAR100
Method                       Ind Acc (↑)   Ens Acc (↑)   NLL (↓)   ECE (↓)   IA (↓)
MC Dropout                   71.3%         73.3%         1.11      0.0902    0.776
Orthogonal Dropout           71.4%         75.1%         0.977     0.0455    0.655
Orthogonal Dropout+MO        72.3%         76.3%         0.935     0.0328    0.648
Orthogonal Dropout+MO+FC     73.7%         77.7%         0.864     0.0191    0.638
Deep Ensemble                72.7%         76.7%         0.921     0.0377    0.652
Deep Ensemble+FC             74.1%         77.8%         0.858     0.0221    0.645
a network and account for these results. This could also help explain why for the
CUB200 dataset, even the 5th model outperformed the 1st by far. Compared to the
other three datasets, the CUB200 dataset is significantly smaller in size, consisting
of only approximately 6000 images for training. As such, aggressive regulariza-
tion can potentially significantly improve generalization performance. Lastly, we
note that, despite the significantly lower accuracy of later subnetworks, we found
including them in the ensemble still leads to a positive gain.
4.5.2 Ablation Study
In this section, we conduct additional experiments using CIFAR10/100 and the
ResNet18 model to decompose the contribution of each component of our proposed
method. Specifically, we compare standard MC dropout against (1) dropout
with randomly generated orthogonal masks, (2) orthogonal dropout with mask
optimization, and (3) orthogonal dropout with both mask optimization and a fixed
classification layer. To further demonstrate the effect of fixing the classification layer,
we also train a deep ensemble with a fixed classification layer. In addition to ac-
curacy, NLL and ECE, we also report the individual level accuracy and compute
the Inter-rater Agreement (IA) [118] between individual models in an ensemble, as
[Figure panels: (a), (c), (e) CIFAR-10; (b), (d), (f) CIFAR-100. x-axis: number of models (2 to 10); y-axis: accuracy in (a)-(b), NLL in (c)-(d), ECE in (e)-(f). Legend: Ours; Deep Ensemble with FC; Deep Ensemble w/o FC.]
Figure 4.2: Plot of accuracy/NLL/ECE against number of models in the ensembles.
For orthogonal dropout, number of models is varied by changing the size of each
subnetwork and all the orthogonal dropout ensembles are of the same size. "FC"
corresponds to "Fixed Classifier".
a measure of diversity in the ensemble to gain further insights. The lower the IA,
the more diverse the ensembles.
Results for this ablation study can be found in Table 4.3. Firstly, we note that
even orthogonal dropout without dropout mask optimization outperforms standard
MC dropout significantly for CIFAR100, despite the significantly smaller ensemble
size (for MC dropout, results are obtained with 30 forward passes whereas orthog-
onal dropout ensemble contains only five subnetworks). This can be explained by
increased diversity among models in ensembles, as evident from considerably lower
IA for orthogonal dropout. In fact, the amount of diversity in orthogonal dropout
is almost identical to that of deep ensembles.
Furthermore, we see that optimizing dropout masks and fixing the classifier layer
further boost the performance of orthogonal dropout substantially. The improvement
is primarily attributable to the increase in individual model performance. To
isolate the effect of fixing the classifier layer, we also train a deep ensemble of 5
models with a fixed classifier layer. Note that fixing the classification layer consis-
tently yields an ensemble with better performance in this particular experimental
setup. Remarkably, orthogonal dropout models with mask optimization and fixed
classifier layer perform as well as deep ensembles with 5 models with fixed classifier
layer in terms of both accuracy and calibration.
4.5.3 How Many Subnetworks Can We Fit?
We also investigate how many subnetworks we can fit into a ResNet18 model.
This can be achieved by adjusting the percentage of parameters each subnetwork
consumes. For instance, if each subnetwork consists of 50% of the parameters, we
can fit 2 subnetworks into a ResNet18 model, and if each subnetwork consists
of 10% of the parameters, the orthogonal dropout ensemble would contain 10
subnetworks. Nevertheless, increasing the ensemble size can decrease the quality
of individual model performance. As such, there is an inherent trade-off that needs
to be balanced.
In Figure 4.2, we plot model performance (ensemble accuracy, NLL and ECE)
against the number of subnetworks in a single ResNet18 model. Surprisingly, we
see that we can even fit 10 subnetworks into a ResNet18 model using orthogonal
dropout with an extremely competitive ensemble performance. This is in stark
contrast with the MIMO ensemble, which also aims at fitting multiple subnet-
works into one model, whose performance degrades drastically with 4 or more
subnetworks in the model [77]. Moreover, WRN28-10, a network with many more
parameters, was used in their experiment. With ResNet18, we empirically observe
that even 3 subnetworks in MIMO yield poor performance (see Appendix).
To further understand the quality of the orthogonal dropout ensemble, we also plot
the performance achievable by an explicit deep ensemble with the same number
of models in the ensemble. In general, we see that orthogonal dropout is capable
of outperforming even a standard deep ensemble of 10 models for this particular
experimental setup. Moreover, we see that orthogonal dropout matches a deep
ensemble of 8 models with fixed classification layers, giving us significant memory
savings.
CHAPTER 5
ACCELERATING UNCERTAINTY ESTIMATES COMPUTATION
WITH UNCERTAINTY-AWARE DISTRIBUTION DISTILLATION
Calibrated estimates of uncertainty are critical for many real-world computer vision
applications of deep learning. While there are several widely-used uncertainty es-
timation methods, dropout inference [51] stands out for its simplicity and efficacy.
This technique, however, requires multiple forward passes through the network
during inference and therefore can be too resource-intensive to be deployed in real-
time applications. We propose a simple, easy-to-optimize distillation method for
learning the conditional predictive distribution of a pre-trained dropout model for
fast, sample-free uncertainty estimation in computer vision tasks. We empirically
test the effectiveness of the proposed method on both semantic segmentation and
depth estimation tasks, and demonstrate our method can significantly reduce the
inference time, enabling real-time uncertainty quantification, while achieving im-
proved quality of both the uncertainty estimates and predictive performance over
the regular dropout model.
5.1 Introduction
Uncertainty exists in many machine learning problems due to noise in the observations
and incomplete coverage of the domain. How much can we trust a model
built upon limited and imperfect data? Reliable uncertainty estimates are crucial
for trustworthy applications such as medical diagnosis and autonomous driving.
Many algorithms have been proposed to estimate the uncertainty of neural net-
works (NN) [13,110,134,220]. Among these, the MC dropout [51] is arguably one
[Figure diagram labels. Teacher branch: Input, Teacher (Bayesian SegNet), Sample, Ground Truth, Loss. Student branch: Input, Student (SegNet), Mean, Variance, Samples from Teacher, Loss.]
Figure 5.1: An illustration of the proposed method. Given a trained teacher, a
deterministic student is used to approximately parameterize the predictive distri-
bution of the teacher model, enabling sample-free uncertainty estimation.
of the most popular approaches due to its simplicity and scalability. This approach
has been adopted in different computer vision tasks recently [11, 43]. A follow-up
work [104] further enhanced quality of uncertainty estimates by incorporating both
aleatoric and epistemic uncertainty into deep neural networks. Despite its success,
MC dropout requires test-time sampling to obtain uncertainty. This costly sam-
pling process can introduce severe latency in real-time prediction tasks such as the
perception system of self-driving vehicles and lead to undesired consequences.
In order to eliminate the expensive dropout sampling at test time, prior
work [21] has explored distilling knowledge from MC dropout samples of a teacher
model into a student network (Dropout Distillation or DD). Nevertheless, DD has
several limitations. Specifically, the student model only learns from the predictive
means of the dropout teacher model, and the dispersion of the teacher's predictions,
which carries important uncertainty information [142], is completely
neglected in their approach. To address the problem,
in this chapter, we propose an easy-to-optimize, generally applicable distillation
framework for fast, sample-free uncertainty estimation. Specifically, we approximate
the entire predictive distribution produced by an MC dropout teacher with
flexible parametric distributions. At test time, the parameters of the distribution
are output by a single deterministic student network to obtain reliable uncertainty
estimates in one forward pass. In addition, we show that our method can distill
both epistemic and aleatoric uncertainty with little extra computation.
We examine the effectiveness of the proposed method on regression and clas-
sification with high resolution, real-world datasets. For regression, we experiment
on monocular depth estimation using NYU Depth V2 [156] and KITTI [57]. For
classification, we experiment on semantic segmentation using CamVid [19] and
VOC2012 [42]. In addition to significantly faster inference, quantitative and
qualitative results show the student network produces uncertainty estimates of better
quality than those of the teacher model, i.e., the pre-trained MC dropout model. We
also demonstrate the predictive mean and uncertainty obtained with our method
are superior to those learned from DD [21].
5.2 Related Work
Uncertainty estimation can be obtained for deep learning in a principled manner
through Bayesian neural networks [139, 157]. However, they typically suffer from
significant computational burdens due to the intractability of posteriors. As such,
computational tractability has been a primary focus of research. One such direc-
tion is through Markov Chain Monte Carlo (MCMC) [157]. For instance, stochastic
gradient versions of MCMC have been proposed [29, 63, 137, 216] to scale MCMC
methods to large datasets. Nevertheless, these approaches can be difficult to scale
to high-dimensional data. An alternative solution is through variational inference
in which parametric distributions are used to approximate the intractable true
posteriors of the weights of neural networks [13,110,134,220]. However, variational
inference can suffer from sub-optimal performance [12].
There have also been non-Bayesian techniques for uncertainty estimation. For
instance, an ensemble of randomly-initialized NNs [121] has been shown to be effective.
However, it requires training and saving multiple NNs, which can be costly in practice.
Methods for obtaining ensembles more efficiently exist [56, 88], but these can
come at the cost of quality of uncertainty estimates. Most of the above-mentioned
methods require multiple forward passes of NNs at test time, which prohibits
their deployment in real-time computer vision systems. Several techniques have
been proposed to speed up uncertainty estimation. Postels et al. [172] proposed
a method for sampling-free uncertainty estimation through variance propagation.
However, the simplistic assumptions they made about the covariance matrices of
NN activation might lead to inaccurate approximations. Another approach speeds
up the sampling process by leveraging the temporal information in videos [91].
However, the method cannot be generically applied and results in a non-trivial drop
in predictive performance. Ilg et al. [93] proposed the use of a multi-hypotheses
NN with a novel loss function to obtain sample-free uncertainty estimates in the
optical flow estimation task. However, [93] uses an additional network to merge the
hypotheses, which incurs extra memory cost at inference time.
Distillation-based methods have also been explored. [21] proposed to distill
predictive means from a dropout teacher to a student network. As addressed above,
this leads to the loss of the epistemic uncertainty of the MC dropout teacher. Most
similar to our method is [143], which uses the Dirichlet distribution to approximate
the predictive distribution of an ensemble of networks. However, their method
requires a large ensemble to effectively train a student network, which can
be prohibitively expensive for challenging computer vision tasks. In addition,
their use of the Dirichlet distribution is not only hard to optimize in practice,
but also not applicable to regression tasks.
5.3 Method
Suppose we have a dataset D = (X, Y) = {(x_i, y_i)}_{i=1}^n, where each (x_i, y_i) ∈
(X × Y) is i.i.d. and X ⊆ R^d corresponds to the feature space. Most tasks
in computer vision can be considered as either regression or classification. For
regression, Y ⊆ R^k for some integer k, and in the context of k-class classification,
Y = {1, · · · , k} is the label space. We define f_w(x) to be a neural network such
that f : X → Y, and w = {W_i}_{i=1}^L corresponds to the parameters of the network
with L layers, where each W_i is the weight matrix in the i-th layer. We define
the model likelihood p(y|x, w) = p(y|f_w(x)). For regression tasks, it is common
to assume p(y|f_w(x)) = N(f_w(x), σ^2), for some noise term σ. For classification
tasks, p(y|f_w(x)) = Softmax(f_w(x)) is commonly assumed. To capture epistemic
uncertainty, we put a prior distribution on the weights of the network, p(w). A
common choice is the zero-mean Gaussian N(0, I). Bayes' theorem can then be
used to obtain the posterior p(w|X, Y) = p(Y|X, w)p(w)/p(Y|X), with which
the predictive distribution can be determined by
\[
p(y \mid x, D_{\text{train}}) = \int p(y \mid x, w)\, p(w \mid D_{\text{train}})\, dw. \tag{5.1}
\]
5.3.1 Preliminary: Dropout for Bayesian Deep Learning
The marginal distribution p(Y|X), and thus p(w|X, Y), are often intractable.
Variational inference uses a tractable family of distributions q_θ(w) parameterized
by θ to approximate the true posterior p(w|X, Y), thereby turning the problem
into a tractable optimization task. MC dropout, which casts the dropout
regularization as approximate Bayesian inference, is one such example [51]. It
involves training NNs with dropout after each weight layer. With an optimized
model, the approximate predictive distribution is given by q(y|x, D_train) =
∫ p(y|x, w) q_θ(w) dw. The integral can be approximated through performing
Monte Carlo integration over q_θ(w). This corresponds to dropout at test time.
In classification for example,
\[
p(y = c \mid x, D_{\text{train}}) \approx \frac{1}{T} \sum_{t=1}^{T} \mathrm{Softmax}(f_{w_t}(x)), \tag{5.2}
\]
where w_t ∼ q_θ(w) are dropout samples from the NN.
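The Monte Carlo average in Eq. 5.2 can be sketched as follows. The single linear
layer with dropout on its inputs is a hypothetical stand-in for a full network,
and the function names, dropout rate, and seed are illustrative assumptions.

```python
import math
import random

def softmax(logits):
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dropout_forward(x, weights, rate, rng):
    """One stochastic pass of a single linear layer with dropout on its inputs."""
    kept = [xi / (1.0 - rate) if rng.random() >= rate else 0.0 for xi in x]
    return [sum(w * xi for w, xi in zip(row, kept)) for row in weights]

def mc_dropout_predict(x, weights, rate=0.5, T=30, seed=0):
    """Average the softmax outputs of T stochastic forward passes (cf. Eq. 5.2)."""
    rng = random.Random(seed)
    samples = [softmax(dropout_forward(x, weights, rate, rng)) for _ in range(T)]
    mean = [sum(column) / T for column in zip(*samples)]
    return mean, samples

# Hypothetical two-class layer evaluated on one input.
toy_weights = [[0.5, -0.5], [0.1, 0.3]]
mean_probs, prob_samples = mc_dropout_predict([1.0, 2.0], toy_weights)
```

Each of the T passes draws a fresh dropout mask, so the cost of inference grows
linearly with T.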
Epistemic uncertainty can be computed with the approximate inference frame-
work as derived above. For regression, epistemic uncertainty is captured by the
predictive variance, which can be approximated by computing the variance of the
dropout approximate distribution:
\[
\sigma_y^2 \approx \frac{1}{T} \sum_{t=1}^{T} f_{w_t}(x)^{\top} f_{w_t}(x) - \mu_y^{\top} \mu_y, \tag{5.3}
\]
where µ_y = (1/T) ∑_{t=1}^T f_{w_t}(x). In the context of classification, numerous measures have
been proposed as uncertainty estimates [54]. In this chapter, we use the mutual
information between the predictions and the model posterior (BALD),
I [y,w|x,Dtrain] = H [y|x,Dtrain]− E [H [y|x,w]] (5.4)
as the uncertainty estimates in classification tasks.
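Given T softmax outputs from dropout sampling, the mutual information in Eq. 5.4
can be estimated as the entropy of the mean prediction minus the mean entropy of
the individual predictions. The sketch below uses hypothetical function names and
natural logarithms.

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution, skipping zero entries."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def bald(samples):
    """Mutual information estimate: H[mean prediction] - mean of per-sample entropies."""
    T = len(samples)
    mean = [sum(column) / T for column in zip(*samples)]
    return entropy(mean) - sum(entropy(p) for p in samples) / T

# Two confident but contradictory dropout samples give high epistemic uncertainty.
disagreement = bald([[1.0, 0.0], [0.0, 1.0]])
```

The estimate is zero when all dropout samples agree and reaches log 2 for two
classes when confident samples disagree completely.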
As proposed by Kendall and Gal [104], aleatoric uncertainty can also be incor-
porated into the dropout model for concurrent estimation of both the epistemic and
aleatoric uncertainty. To do so, an input-dependent observation noise parameter
σ̂^2 is output together with the prediction
\[
[\hat{\mu}, \hat{\sigma}^2] = f_w(x), \tag{5.5}
\]
where σ̂^2 is a vector with the same dimension as ŷ that represents a diagonal
covariance matrix. σ̂ can be optimized with maximum likelihood estimation by
assuming that aleatoric uncertainty follows a parametric distribution (e.g., Gaussian)
for regression. For classification, this Gaussian distribution is placed over the
logit space. Although the epistemic and aleatoric uncertainties are not mutually
exclusive, the total uncertainty can be approximated using
\[
\mathrm{Var}(y) \approx \frac{1}{T} \sum_{t=1}^{T} \hat{\mu}_t^{\top} \hat{\mu}_t - \mu_y^{\top} \mu_y + \frac{1}{T} \sum_{t=1}^{T} \hat{\sigma}_t^2, \tag{5.6}
\]
where µ_y = (1/T) ∑_{t=1}^T µ̂_t, and the first two terms and the last term in the above
expression approximate the epistemic and aleatoric uncertainties, respectively.
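For a scalar output (for example, the depth at a single pixel), the decomposition
of the total uncertainty can be sketched as follows. The function name is
hypothetical, and the inputs are assumed to be the T predicted means and T
predicted variances obtained from dropout sampling.

```python
def predictive_variance(means, variances):
    """Split the total variance of a scalar output into epistemic and aleatoric parts."""
    T = len(means)
    mu_y = sum(means) / T
    # Spread of the predicted means across dropout samples.
    epistemic = sum(m * m for m in means) / T - mu_y * mu_y
    # Average of the predicted observation-noise variances.
    aleatoric = sum(variances) / T
    return epistemic, aleatoric, epistemic + aleatoric

# Hypothetical example with two dropout samples.
epistemic, aleatoric, total = predictive_variance([1.0, 3.0], [0.5, 1.5])
```

Here the means 1 and 3 give an epistemic variance of 1, the averaged noise
variance is also 1, and the total is 2.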
5.3.2 A Teacher-Student Paradigm for Sample-free Uncertainty Estimation
Despite the success of the MC dropout, inferring uncertainty at test-time often
requires multiple forward passes to generate samples of prediction, limiting its
use in many time-sensitive applications. In this chapter, we propose to use
a deterministic neural network fϕ(x) to parameterize a distribution r(y|x,Dtrain)
that approximates the predictive distribution q(y|x,Dtrain) of the dropout model.
Specifically, fϕ(x) learns to directly output the parameters of r(y|x,Dtrain). When
trained, fϕ(x) only requires one forward pass to infer both predictive mean and
uncertainty from the parameterized distribution r(y|x,Dtrain), thus eliminating
expensive sampling processes at test time.
Training fϕ(x) is straightforward using a teacher-student paradigm similar to
knowledge distillation [84]. We first train a Bayesian neural network (BNN)
fw(x) (e.g. a dropout model) on Dtrain. We then generate samples of predictions
from the pre-trained fw(x). These samples serve as “observations” from the distri-
bution q(y|x,Dtrain) for fϕ(x) to learn the parameters of r(y|x,Dtrain) given each
input x ∈ Dtrain. Eventually fϕ(x) learns an efficient mapping from input images
to the parameters of the distribution r(y|x,Dtrain) that accurately approximates
q(y|x,Dtrain). For simplicity, in the following illustration we term the BNN fw(x)
as the teacher model and fϕ(x) as the student model.
Sampling from the Bayesian Teacher
As mentioned above, predictive samples $\{\hat{y}_t = f_{w_t}(x)\}_{t=1}^{m}$ are generated from the
teacher to train the student. In the more complicated scenario where aleatoric
uncertainty is modeled by the teacher, we incorporate aleatoric uncertainty into each
predictive sample with
\[
\hat{y}_t = \hat{\mu}_t + \hat{\sigma}_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \tag{5.7}
\]
where σ̂t is the aleatoric uncertainty output by the teacher given an input. In
practice, $\hat{\sigma}_t^2$ can be noisy. To stabilize training, instead of using σ̂t directly, we first compute the
empirical mean $\tilde{\sigma}^2 \triangleq \frac{1}{T}\sum_{t=1}^{T}\hat{\sigma}_t^2$ and use $\tilde{\sigma}^2$ to generate all samples $\{\hat{y}_t\}_{t=1}^{m}$.
The larger the number of samples, the more accurately the student can approximate
the teacher's predictive distribution. However, drawing a large number of
samples requires intensive computational resources. To cope with this challenge,
we generate a small number m of predictive samples from the teacher for each input
on the fly at each epoch during training. In order to learn aleatoric uncertainty,
we further draw k random samples from N(0, I) for each predictive pair (µ̂t, σ̃²)
(see Eq. 5.7). In practice we use m = 5 and k = 10. As we demonstrate in the
experimental section, small values of m and k per input are sufficient to learn
a student model with excellent performance.
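The on-the-fly sampling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: `teacher_forward` is a hypothetical callable standing in for one dropout-enabled pass of the teacher, and the variance is averaged over the same m passes that supply the means.

```python
import math
import random


def sample_from_teacher(teacher_forward, x, m=5, k=10, rng=random):
    """Draw m * k predictive samples for one input x (sketch of Eq. 5.7).

    teacher_forward is a hypothetical callable that runs one stochastic
    (dropout-enabled) pass and returns (mu, sigma2) as lists of floats.
    """
    passes = [teacher_forward(x) for _ in range(m)]
    dim = len(passes[0][0])
    # Average the noisy variance outputs over the m passes (assumption:
    # the empirical mean uses the same passes that provide the means).
    sigma2_mean = [sum(p[1][d] for p in passes) / m for d in range(dim)]
    std = [math.sqrt(v) for v in sigma2_mean]
    samples = []
    for mu, _ in passes:
        # k Gaussian noise draws per predictive mean
        for _ in range(k):
            samples.append([mu[d] + std[d] * rng.gauss(0.0, 1.0) for d in range(dim)])
    return samples
```

With the defaults m = 5 and k = 10, each input yields 50 target samples per epoch, regenerated every epoch rather than stored.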
Optimizing the Student
We use maximum likelihood estimation (MLE) to optimize fϕ(x). Given the
samples $\{\hat{y}_t\}_{t=1}^{m}$ generated by the teacher, we minimize the negative log-likelihood
for each input x:
\[
L_s = -\sum_{t} \log r(\hat{y}_t \mid x; \phi), \tag{5.8}
\]
where r(ŷt|x;ϕ) is parameterized by fϕ(x). In order to avoid division by zero and
enable unconstrained optimization of the variance, we use the log variance s = log(σ̂²)
as the output of the student. Thus, we have [µ̂, s] = fϕ(x).
For regression problems, we use the Laplace distribution to approximate the
variational predictive distribution. For simplicity, we assume independence among
all the output dimensions, so that the log variance s is a vector of the same dimension
as µ̂. Given the Laplace assumption, a numerically stable MLE training objective
can be derived from Eq. 5.8 as
\[
L_s = \frac{1}{N}\frac{1}{M}\sum_{i,t}\left[\sqrt{2}\,\exp\!\left(-\tfrac{1}{2}s_i\right)\lvert \hat{y}_{ti} - \hat{\mu}_i\rvert + \tfrac{1}{2}s_i\right], \tag{5.9}
\]
where i and t index the output space and the generated samples, respectively, and N and M are the
number of instances (e.g., pixels) in the output space and the number of predictive samples
generated from the teacher, respectively.
We choose the Laplace distribution over the Gaussian distribution because it
is more appropriate for modeling the variances of residuals under the ℓ1 loss, which usually
outperforms the ℓ2 loss in computer vision tasks.
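To make the origin of the √2 factor concrete: a Laplace distribution with scale b has variance 2b², so writing the variance as exp(s) gives b = exp(s/2)/√2, and the negative log-likelihood |y − µ|/b + log(2b) reduces to √2·exp(−s/2)·|y − µ| + s/2 up to an additive constant. The sketch below evaluates Eq. 5.9 in plain Python; the function name and the nested-list layout are assumptions for illustration.

```python
import math


def laplace_student_loss(samples, mu, s):
    """Evaluate Eq. 5.9 for one input.

    samples: M teacher samples, each a list of N values (hypothetical layout)
    mu, s:   student mean and log-variance outputs, each of length N
    """
    M, N = len(samples), len(mu)
    total = 0.0
    for y in samples:
        for i in range(N):
            # sqrt(2) * exp(-s_i / 2) * |y_ti - mu_i| + s_i / 2
            total += math.sqrt(2.0) * math.exp(-0.5 * s[i]) * abs(y[i] - mu[i]) + 0.5 * s[i]
    return total / (N * M)
```

Because the loss is expressed in terms of s rather than the variance itself, no division by a possibly tiny variance is ever performed.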
For classification problems, we use a logit-normal distribution to model the
teacher’s approximate predictive distribution q(y|x,Dtrain) on the simplex¹. In
practice, we use a Gaussian distribution with a diagonal covariance matrix to approximate
the teacher’s predictive distribution on the logit space. As a result, the
student model outputs µ̂i and si as the mean and log variance of the Gaussian for
each member of the logits. Similar to the regression set-up, we derive a numerically
stable Gaussian MLE training objective
\[
L_s = \frac{1}{N}\frac{1}{M}\sum_{i,t}\left[\tfrac{1}{2}\exp(-s_i)\,\lVert \hat{y}_{ti} - \hat{\mu}_i\rVert^{2} + \tfrac{1}{2}s_i\right], \tag{5.10}
\]
where ŷti are the predicted logits sampled from the teacher. Since a closed-form solution does
not exist for the moments of a logit-normal distribution, Monte Carlo sampling on
the logit space is performed at test time to obtain uncertainty estimates. This
incurs only a tiny computational overhead during inference, as it amounts to multiple
forward passes of one layer of the student network (the softmax function). As
shown in the experiments, the student model still has a large advantage in inference
time over its teacher, in addition to better performance.
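The classification objective and the test-time sampling step described above can be sketched as follows. Both function names are hypothetical, and the uncertainty returned is the mutual-information form of BALD (entropy of the mean prediction minus the mean entropy of the sampled predictions), which is assumed here to be the measure used for evaluation later in the chapter.

```python
import math
import random


def gaussian_student_loss(samples, mu, s):
    """Evaluate Eq. 5.10 for one input (samples: M lists of N logits)."""
    M, N = len(samples), len(mu)
    total = 0.0
    for y in samples:
        for i in range(N):
            total += 0.5 * math.exp(-s[i]) * (y[i] - mu[i]) ** 2 + 0.5 * s[i]
    return total / (N * M)


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    t = sum(e)
    return [v / t for v in e]


def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0.0)


def student_predict(mu, s, n_samples=50, rng=random):
    """Sample logits from N(mu, diag(exp(s))) and push them through softmax."""
    probs = []
    for _ in range(n_samples):
        z = [m_i + math.exp(0.5 * s_i) * rng.gauss(0.0, 1.0) for m_i, s_i in zip(mu, s)]
        probs.append(softmax(z))
    mean_p = [sum(p[j] for p in probs) / n_samples for j in range(len(mu))]
    bald = entropy(mean_p) - sum(entropy(p) for p in probs) / n_samples
    return mean_p, bald
```

Only the softmax is re-evaluated per sample, so the cost of the 50 draws is negligible next to a single forward pass of the network.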
We empirically observe that training solely with the above loss functions sometimes
leads to sub-optimal predictive performance. This may be due to the noisy
signal provided by the generated samples. Thus, we leverage ground-truth labels
in addition to predictive samples from the teacher to stabilize the training of the
student model. We use the loss function with which the teacher model is trained in
conjunction with Ls, leading to the total loss
Ltotal = Ls + λLt, (5.11)
where λ is a hyper-parameter to be tuned and Lt corresponds to the categorical
cross-entropy loss for classification tasks or the ℓ1 loss for regression tasks. We found
that λ = 1 generally performs well in our experiments.
¹ The Dirichlet distribution is an obvious alternative, but we empirically observe that training with
the logit-normal distribution is much more numerically stable.
Additional Augmentation
When the same training dataset is used for both the teacher and the student train-
ing, the student may underestimate the epistemic uncertainty of the teacher due
to overfitting of the teacher network to the training data. Ideally, in order to fully
capture the teacher predictive distribution, the dataset used to train the student
should not overlap with the one for the teacher. However, training using only
a subset of available samples can lead to sub-optimal performance. To alleviate
this problem, we perturb the training set during the training of the student us-
ing extra data augmentation methods unused when training the teacher, in order
to synthetically generate new samples unseen by the teacher model. We choose
color jittering as the augmentation method: when training the student, each image
is perturbed with random variations in the range [−0.2, 0.2] in four aspects:
brightness, contrast, hue, and saturation.
As we demonstrate below, this extra augmentation during student training
can be crucial for enhancing the quality of uncertainty estimates. We emphasize that
the additional gain in uncertainty estimates does not come directly from data
augmentation, but rather from teacher predictions that more closely correspond
to the test-time predictive distributions as a consequence of this augmentation. In
the experiments below, we show that the teacher model does not obtain the same
performance boost from the extra augmentation.
Figure 5.2: Example predictions on CamVid. Each uncertainty map shows the
sum of aleatoric and epistemic uncertainty. The same applies to all subsequent example
plots.
5.4 Experiments
We conduct experiments on two pixel-wise computer vision tasks: semantic seg-
mentation and depth regression. We compare the performance of the proposed
method with that of the teacher models using MC dropout. For a holistic evalu-
ation, we consider teacher networks trained both with and without the aleatoric
uncertainty. Following [104, 172], we use 50 samples for MC dropout to evaluate
the teacher’s performance and uncertainty. Architectures identical to those of the
teacher models, but without the dropout layers, are used as student models. As
discussed in the previous section, we use 50 samples from the logit space to evaluate
the uncertainty (BALD) of student models for classification tasks. To demonstrate the
general applicability of the proposed method, we also show its effectiveness when
the teacher network corresponds to a Deep Ensemble [121].
Evaluation Metrics
In addition to metrics that evaluate the predictive means of our models,
we measure both the Area Under the Sparsification Error curve (AUSE) [93]
and the expected calibration error (ECE) [70] to evaluate the quality
of uncertainty estimates. In essence, AUSE measures how well the estimated uncertainty
coincides with the true predictive errors. The Brier score and the mean absolute
error are used as predictive errors to compute AUSE for classification and regres-
sion tasks respectively. In the context of classification, ECE measures how much
the predictive means of probabilities from the softmax function are representative
of the true correctness of predictions. In the context of regression, we use ECE
described in [115] to quantify the amount of mismatch between the predictive dis-
tribution and the empirical CDFs. We follow [115] and compute ECE with the ℓ2
norm with a bin size of 30.
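For readers less familiar with these two metrics, the following sketch shows one common way to compute the classification ECE and an unnormalized AUSE in plain Python. The binning scheme, the default of 10 bins, the one-sample-at-a-time sparsification, and both function names are assumptions for illustration; the exact conventions of [70] and [93] may differ in such details.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and confidence per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1.0 for _, ok in b if ok) / len(b)
        ece += len(b) / n * abs(acc - avg_conf)
    return ece


def ause(uncertainties, errors):
    """Mean gap between the sparsification curve and the oracle curve."""
    n = len(errors)
    # Remove samples in order of decreasing predicted uncertainty ...
    pairs = sorted(zip(uncertainties, errors), key=lambda t: t[0], reverse=True)
    by_unc = [e for _, e in pairs]
    # ... versus the oracle, which removes them in order of decreasing error.
    by_err = sorted(errors, reverse=True)

    def curve(ordered):
        # mean error of what remains after removing the first r samples
        return [sum(ordered[r:]) / (n - r) for r in range(n)]

    return sum(u - o for u, o in zip(curve(by_unc), curve(by_err))) / n
```

An uncertainty estimate that ranks samples exactly by their true error gives an AUSE of zero under this definition, and larger values indicate a worse ranking.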
5.4.1 Semantic Segmentation
Bayesian SegNet [103], which contains dropout layers inserted after the central
four encoder and decoder units, was proposed to obtain uncertainty estimates for
semantic segmentation. In this work, we use the architecture with a dropout rate
of 0.5 as the teacher model for all our experiments. We use the CamVid and
VOC2012 datasets.
Figure 5.3: Example predictions on Pascal VOC2012.
For CamVid, following Kendall et al. [103], we use 11 generalized classes and
a downsampled image size of 360 × 480. For the teacher network, we train using
Stochastic Gradient Descent (SGD) with an initial learning rate of 10⁻³, a
momentum of 0.9, and a weight decay of 5 × 10⁻⁴ for 100,000 steps. In order to
achieve faster convergence, we initialize the student network using the weights of
the teacher network. Accordingly, a smaller initial learning rate of 5 × 10⁻⁴ is used
to train the student network for 80,000 steps. We employ a “poly” learning rate
policy for both the teacher and student networks, as done by Chen et al. [27]. We
use a batch size of 4 for both networks.
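The “poly” policy mentioned above decays the learning rate polynomially towards zero over the course of training. A minimal sketch is given below; the function name is hypothetical, and the exponent of 0.9 is an assumption (it is a commonly used value, but the text does not state which exponent was used).

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial decay: base_lr * (1 - step / max_steps) ** power."""
    return base_lr * (1.0 - step / max_steps) ** power


# Example with the CamVid teacher settings quoted above
# (initial rate 1e-3, 100,000 steps); the power is an assumed value.
schedule = [poly_lr(1e-3, s, 100_000) for s in (0, 50_000, 100_000)]
```

The rate starts at the initial value, decreases monotonically, and reaches zero at the final step.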
For VOC2012, we use the same augmented “train” and “val” splits as in [27].
Input images are resized to 224 × 224. For optimal performance of the teacher
model, SGD with a higher initial learning rate of 10⁻² is used instead, with a
batch size of 8 for 150,000 steps. Similarly, we initialize the student model with
the weights of the teacher. The student model is trained for 100,000 steps with an
initial learning rate of 10⁻³ and a batch size of 8. The performance on the
“val” split is reported in the results. We also include the results of the student
models trained using Dropout Distillation (DD) [21] as a baseline.
Evaluation
Results for both the teacher and the student are summarized in Table 5.1. On top
of a significant reduction in run-time, the student network also improves on the
teacher in most of the metrics evaluated. We believe the
improvements in both predictive performance and uncertainty estimates arise mainly
because learning the entire predictive distribution implicitly, through samples from
the teacher models and with the proposed optimization objective, can have the
loss-attenuation effect described in [104]. In contrast, Dropout Distillation (DD) [21],
which distills only the mean prediction of the teacher as in standard knowledge
distillation, yields students that perform worse than their teachers
in all accuracy and uncertainty metrics. This further demonstrates the benefit of distilling the entire
predictive distribution from the teacher.
Figures 5.2 and 5.3 show randomly selected examples from the validation sets of
CamVid and Pascal VOC, respectively. Visual examples suggest that the student
model can accurately capture both the predictive mean and uncertainty of the
teacher model. Furthermore, a closer comparison reveals the exceptional quality
of the uncertainty estimates produced by the student model. For instance, in
the second example from the CamVid dataset in Figure 5.2, a small part of the
ego vehicle is captured by the camera at the bottom of the figure. While the
teacher model confidently predicts the area as “road surface”, the student model
highlights this subtle anomaly with high uncertainty estimates. A similar contrast
is also observed in the top example of Figure 5.3, where the boundaries of people
are assigned much higher uncertainty by the student model. In addition, the bowls
and plates on the dining table, which are not in the list of labeled classes for the
dataset, also “confuse” the student model, but not the teacher.
Table 5.1: Results on the segmentation problem. “T”, “S”, and “AU” correspond
to the teacher model, the student model, and the aleatoric uncertainty, respectively.
“T+AU” corresponds to a teacher model trained with the aleatoric uncertainty.
“DD” corresponds to the student trained using Dropout Distillation [21]. Best
performing results for each teacher-student pair are bold-faced.

CamVid
Model             T      S      DD [21]  T+AU   S+AU
Accuracy ↑        0.906  0.907  0.903    0.907  0.909
Classwise Acc ↑   0.764  0.765  0.747    0.766  0.750
IOU ↑             0.645  0.650  0.642    0.645  0.650
ECE ↓ (×10⁻³)     3.78   2.23   6.73     3.67   2.86
AUSE ↓ (×10⁻²)    1.47   1.60   2.59     1.63   1.60
Runtime (s) ↓     1.6    0.078  0.078    2.1    0.078

Pascal VOC
Model             T      S      DD [21]  T+AU   S+AU
Accuracy ↑        0.834  0.851  0.828    0.831  0.848
Classwise Acc ↑   0.813  0.828  0.806    0.809  0.827
IOU ↑             0.697  0.727  0.691    0.693  0.722
ECE ↓ (×10⁻³)     62.7   59.0   67.5     63.0   59.0
AUSE ↓ (×10⁻²)    4.35   3.82   4.86     4.20   4.31
Runtime (s) ↓     0.51   0.028  0.028    0.68   0.028
Run-Time Comparison
Figure 5.4 (a)-(c) illustrate a comparison of running time and performance using
different numbers of samples for MC dropout. While the running time of MC
dropout can be shortened with fewer samples, it comes at the cost of quality
of prediction and uncertainty estimates. The running time of MC Dropout is
optimized by caching results before the first dropout layer for a fair comparison.
We further demonstrate the merit of the proposed method by comparing the
running time of the student with several other recently proposed sample-free
methods for uncertainty estimation. Figure 5.4 (d) illustrates the speed-up achieved by
different methods on the CamVid dataset with Bayesian SegNet. The ratios are
computed with respect to the same baseline of MC dropout with 50 samples at test
time. Our proposed method achieves a larger speed-up than previously
proposed methods for accelerating dropout inference, in addition to other
advantages such as wider applicability and improved predictive performance.
Figure 5.4: (a)-(c): Comparison of performance against running time for both
the teacher (with the aleatoric uncertainty) and the student model on the CamVid
dataset. (d): Speed-up ratios of uncertainty estimation on the CamVid dataset with
Bayesian SegNet compared to Huang et al. [91] and Postels et al. [172], two
other sample-free uncertainty estimation methods.
Performance under Distribution Shift
We also evaluate the performance of the proposed method under distribution
shift using models trained on the CamVid dataset. The Cityscapes dataset [31],
which contains street scenes collected from different cities, is an ideal dataset for
such an evaluation. We emphasize that neither the teacher nor the student sees images
from the Cityscapes dataset during training. The results, evaluated on the classes
shared by CamVid and Cityscapes, are summarized in Figure 5.5.
Surprisingly, while both the teacher and student models perform unsatisfactorily,
the student performs significantly better than the teacher on all of the
metrics evaluated, suggesting enhanced robustness against distribution shift
when trained with the proposed teacher-student pipeline. We hypothesize that, by
seeing the distribution of soft labels from a Bayesian teacher during the distillation
process, the student learns to produce less confident, more generalizable outputs.
We leave the investigation of the true cause to future work. Such robustness can be
important for many application domains with long-tail scenarios, such as autonomous driving.
Figure 5.5: Performance of models trained on CamVid and evaluated on
Cityscapes.
Figure 5.6: Top: Relative means of BALD for samples of seen and unseen classes
during training compared to the “Reference” models, which refer to models trained
with both seen and unseen classes. Bottom: Distribution of BALD for samples of
seen and unseen classes during training.
Outlier Detection
In addition, we examine the effectiveness of the uncertainty estimates for outlier
detection using the CamVid dataset. Following [172], we use “pedestrian” and
“bicyclist” as held-out classes and exclude them from training. Ideally, classes unseen
during training should have much higher uncertainty estimates than the
seen classes. In Figure 5.6, we compare the relative means of the uncertainty
estimates against those of “reference” models, which refer to models trained
with both seen and unseen classes, for both inlier and outlier classes. While both the
teacher and the student assign higher uncertainty to outlier classes than the
“reference” models on average, the relative mean is much higher for the student. To
further quantify the performance, we also compute the Jensen–Shannon distance
between the distributions of uncertainty estimates of inlier and outlier classes [140].
Again, the difference between the inlier and outlier distributions is larger for the student
network, suggesting its enhanced ability for outlier detection. Lastly, we show
in Figure 5.7 two randomly chosen examples to illustrate the difference between the
teacher and the student. Regions with pedestrians and bicyclists clearly have
higher uncertainty estimates when these classes are not present in training, for both the
teacher and the student. The magnitude is much larger for the student, as represented
by bright spots in the uncertainty plot.
Figure 5.7: Example predictions on CamVid when “pedestrian” and “bicyclist”
are held out during training. “Reference” refers to models trained with all classes.
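The Jensen–Shannon distance used in the outlier analysis above can be sketched as follows. The distance is the square root of the Jensen–Shannon divergence; the histogram helper, the bin count, and the choice of base-2 logarithms (which bounds the distance by 1) are assumptions made for this illustration.

```python
import math


def histogram(values, n_bins, lo, hi):
    """Normalized histogram of values over [lo, hi] (hypothetical helper)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * n_bins)
        counts[max(0, min(idx, n_bins - 1))] += 1
    return [c / len(values) for c in counts]


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two histograms."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a, b):
        # terms with a zero first argument contribute nothing
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0.0)

    # clamp at zero to guard against tiny negative rounding errors
    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))
```

Applied to histograms of per-pixel uncertainty for inlier and outlier classes, a larger value indicates that the two groups are easier to separate by their uncertainty alone.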
5.4.2 Pixel-Wise Depth Estimation
For pixel-wise depth estimation tasks, the NYU DEPTH V2 (NYU) and KITTI
Odometry (KITTI) datasets are used to conduct experiments. We use the same
ResNet-based architecture as [141] for both datasets in RGB-based
depth estimation, with dropout (p = 0.2) placed after each convolutional layer except
the final one. For NYU, we use the same train/test split as in [141], and for KITTI
we train our models on sequences 00-10 and evaluate them on sequences 11-21.
Identical procedures are used to train the teacher models for both NYU and KITTI.
During training, an SGD optimizer with an initial learning rate of 0.01, a momentum
of 0.9, a weight decay of 10⁻⁴, and a “poly” learning rate policy is adopted for a
total of 40 epochs. For NYU, we initialize the student model with the weights
of the teacher and train the student model for 30 epochs using a smaller learning
rate of 0.005. We empirically observe that initializing with the teacher model for
KITTI leads to overfitting to the training set, and thus we train the student model
from scratch with the same procedure as used for teacher training. We use a
batch size of 8 in all of the depth estimation experiments.
Evaluation
The quantitative performance of both the teacher and the student models is summarized
in Table 5.2. Similar to the segmentation tasks, the student model outperforms
the teacher in most of the evaluation metrics. Example predictions shown
in Figure 5.8 again illustrate that the student network is able to closely approximate
the uncertainty estimates produced by the teacher model. Moreover, as more
dropout layers are inserted into the NNs for the depth estimation experiments,
the relative speed-up ratio achieved by the student model is further
increased due to less cached computation for the teacher. For instance, the student
model achieves a speed-up ratio of 46 for the NYU dataset.

Table 5.2: Results on depth estimation. “T”, “S”, and “AU” correspond
to the teacher model, the student model, and the aleatoric uncertainty, respectively.
“T+AU” corresponds to a teacher model trained with the aleatoric uncertainty.

                  NYU                           KITTI
Model             T      S      T+AU   S+AU    T      S      T+AU   S+AU
RMSE ↓            0.542  0.540  0.548  0.548   4.80   4.75   4.83   4.81
REL ↓             0.155  0.152  0.158  0.154   0.123  0.122  0.117  0.117
log10 ↓           0.065  0.064  0.065  0.064   0.053  0.052  0.052  0.051
δ1 ↑              0.793  0.798  0.794  0.799   0.843  0.847  0.845  0.846
δ2 ↑              0.947  0.949  0.945  0.946   0.948  0.951  0.950  0.949
δ3 ↑              0.985  0.984  0.982  0.981   0.981  0.982  0.981  0.981
ECE ↓ (×10⁻²)     9.38   8.09   5.79   5.13    7.80   2.95   4.53   2.18
AUSE ↓ (×10⁻²)    6.01   6.06   5.88   5.82    0.701  0.660  0.597  0.595
Runtime (s) ↓     0.73   0.016  0.739  0.016   0.28   0.007  0.29   0.007

Table 5.3: Top four rows: Impact of adding augmentation in training on the quality of
uncertainty produced on the CamVid and NYU datasets. “T” and “S” represent
teacher and student models, and “AUG” corresponds to augmentation. Last row:
Uncertainty performance of the student model when a deep ensemble with five NNs is
used as the teacher model.

              CamVid                        NYU
              ECE (×10⁻³)  AUSE (×10⁻²)    ECE (×10⁻³)  AUSE (×10⁻²)
T w/o AUG     3.67         1.62            57.9         5.88
T w/ AUG      3.90         1.62            57.1         5.90
S w/o AUG     4.63         2.19            54.0         5.91
S w/ AUG      2.86         1.60            51.3         5.80
S w/ Ens T    2.96         1.91            56.3         5.93

Figure 5.8: Example predictions on NYU.
5.4.3 Ablation Study on Additional Augmentation
To demonstrate the importance of additional augmentation during student training,
we also summarize in Table 5.3 the results obtained when the student is trained without
extra augmentation. Using extra augmentation in the student training process, as
discussed in Section 5.3.2, helps the student produce much better uncertainty
estimates. We can also see that the same extra augmentation does not improve the
teacher’s uncertainty estimation, suggesting that the student
model benefits from seeing teacher predictions that are more closely aligned with the
test-time predictive distributions, rather than from data augmentation itself.
5.4.4 Distilling from Deep Ensemble
To examine the effectiveness of using deep ensembles as teachers [143], we train an
ensemble of deterministic neural networks with aleatoric uncertainty [121]. The
training details are identical to those described above. Due to limited computational
resources, we fix the number of models in the ensemble to five. Unlike [143], we do not
use a Dirichlet distribution to approximate the teacher’s predictive distribution for
classification, because we empirically found it to be numerically unstable, leading to
failure to converge. We show the uncertainty results in Table 5.3. Full results can
be found in the Appendix. As seen from Table 5.3, the student obtained from
the ensemble teacher has worse calibration performance than the student distilled
from MC dropout teachers. The gap is likely due to the difficulty of learning a
good predictive distribution from just 5 samples.
5.4.5 Discussion
Our experiments show that incorporating aleatoric uncertainty results in only minimal
improvements for both teachers and students. This could be caused by significant
overlap between the two types of uncertainty learned by the teacher model,
since they are not mutually exclusive and can coincide
significantly [104]. Nonetheless, aleatoric uncertainty can be beneficial for other
tasks and datasets. The goal of this chapter is to propose a general distillation
strategy that is also capable of incorporating aleatoric uncertainty. Using the proposed
approach, as seen from the experimental results, students can match or
surpass their teacher models in performance with or without aleatoric uncertainty.
We also stress that, since the student is supervised by both the ground truth
labels and the teacher’s predictions, there can be discrepancies in predictive dis-
tributions between the teacher and the student models. Nevertheless, we believe
these discrepancies can be beneficial and account for the improved performance of
the student. As demonstrated, student models produce well-calibrated uncertainty
maps that also semantically make sense without the need for expensive multiple
forward passes.
CHAPTER 6
TOWARDS A DEEPER UNDERSTANDING OF KNOWLEDGE
DISTILLATION
It has been recently demonstrated that multi-generational self-distillation can im-
prove generalization [49]. Despite this intriguing observation, reasons for the en-
hancement remain poorly understood. In this chapter, we first demonstrate ex-
perimentally that the improved performance of multi-generational self-distillation
is in part associated with the increasing diversity in teacher predictions. With this
in mind, we offer a new interpretation for teacher-student training as amortized
MAP estimation, such that teacher predictions enable instance-specific regular-
ization. Our framework allows us to theoretically relate self-distillation to label
smoothing, a commonly used technique that regularizes predictive uncertainty,
and suggests the importance of predictive diversity in addition to predictive un-
certainty. We present experimental results using multiple datasets and neural
network architectures that, overall, demonstrate the utility of predictive diversity.
Finally, we propose a novel instance-specific label smoothing technique that pro-
motes predictive diversity without the need for a separately trained teacher model.
We provide an empirical evaluation of the proposed method, which, we find, often
outperforms classical label smoothing.
6.1 Introduction
First introduced as a simple method to compress a high-capacity neural network into
a low-capacity counterpart for computational efficiency, knowledge distillation [84]
has since gained much popularity across various application domains, ranging from
computer vision to natural language processing [108,128,166,224,226], as an effective
method to transfer knowledge or features learned from a teacher network to
a student network. This empirical success is often justified with the intuition that
deeper teacher networks learn better representations thanks to greater model complexity,
and that the “dark knowledge” that teacher networks provide helps student
networks learn better representations and hence achieve enhanced generalization
performance. Nevertheless, it remains an open question how exactly student
networks benefit from this dark knowledge. The problem is made more puzzling
by the recent observation that even self-distillation, a special case of the teacher-student
training framework in which the teacher and student networks have identical
architectures, can lead to better generalization performance [49]. It was also
demonstrated that a repeated self-distillation process with multiple generations can
further improve classification accuracy.
In this work, we aim to shed some light on self-distillation. We start off
by revisiting the multi-generational self-distillation strategy, and experimentally
demonstrate that the performance improvement observed in multi-generational
self-distillation is correlated with increasing diversity in teacher predictions. In-
spired by this, we view self-distillation as instance-specific regularization on the
neural network softmax outputs, and cast the teacher-student training procedure
as performing amortized maximum a posteriori (MAP) estimation of the softmax
probability outputs. The proposed framework provides us with a new interpreta-
tion of the teacher predictions as instance-specific priors conditioned on the inputs.
This interpretation allows us to theoretically relate distillation to label smoothing,
a commonly used technique to regularize predictive uncertainty of NNs, and sug-
gests that regularization on the softmax probability simplex space in addition
to the regularization on predictive uncertainty can be the key to better gener-
alization. To verify the claim, we systematically design experiments to compare
teacher-student training against label smoothing. Lastly, to further demonstrate
the potential gain from regularization on the probability simplex space, we also de-
sign a new regularization procedure based on label smoothing that we term “Beta
smoothing.”
Our contributions can be summarized as follows:
1. We provide a plausible explanation for recent findings on multi-generational
self-distillation.
2. We offer an amortized MAP interpretation of the teacher-student training
strategy.
3. We attribute the success of distillation to regularization on both the label
space and the softmax probability simplex space, and verify the importance
of the latter with systematically designed experiments on several benchmark
datasets.
4. We propose a new regularization technique termed “Beta smoothing” that
improves upon classical label smoothing at little extra cost.
5. We demonstrate self-distillation can improve calibration.
6.2 Related Works
Knowledge distillation was first proposed as a way for model compression [4, 20,
84]. In addition to the standard approach in which the student model is trained
to match the teacher predictions, numerous other objectives have been explored
for enhanced distillation performance. For instance, distilling knowledge from
intermediate hidden layers was found to be beneficial [83, 92, 107, 184, 194, 224].
Recently, data-free distillation, a novel scenario in which the original data for the
teacher is unavailable to students, has also been extensively studied [22,26,148,225].
The original knowledge distillation technique for neural networks [84] has stim-
ulated a flurry of interest in the topic, with a large number of published improve-
ments and applications. For instance, prior works [5,188] have proposed Bayesian
techniques in which distributions are distilled with Monte Carlo samples into more
compact models like a neural network. More recently, there has also been work
on the importance of distillation from an ensemble of models [143], which provides
a complementary view on the role of predictive diversity. Lopez-Paz et al. [132]
combined distillation with the theory of privileged information, and offered a gen-
eralized framework for distillation. To simplify distillation, Zhu et al. [240] pro-
posed a method for one-stage online distillation. There have also been successful
applications of distillation for adversarial robustness [166].
Several papers have attempted to study the effect of distillation training on stu-
dent models. Furlanello et al. [49] examined the effect of distillation by comparing
the gradients of the distillation loss against that of the standard cross-entropy loss
with ground truth labels. Phuong et al. [171] considered a special case of distilla-
tion using linear and deep linear classifiers, and theoretically analyzed the effect
of distillation on student models. Cho and Hariharan [30] conducted a thorough
experimental analysis of knowledge distillation, and observed that larger models
may not be better teachers. Another experimentally driven work to understand
the effect of distillation was also done in the context of natural language process-
ing [237]. Most similar to our work is [227], in which the authors also established
a connection between label smoothing and distillation. However, our argument
comes from a different theoretical perspective and offers complementary insights.
Specifically, [227] does not highlight the importance of instance-specific regulariza-
tion. We also provide a general MAP framework and a careful empirical comparison
of label smoothing and self-distillation.
6.3 Preliminaries
We consider the problem of k-class classification. Let X ⊆ R^d be the feature space
and Y = {1, ..., k} be the label space. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where each
feature-label pair (xi, yi) ∈ X × Y, we are interested in finding a function
f : X → R^k that maps input features to corresponding labels. In this work, we
restrict the function class to the set of neural networks fw(x), where $w = \{W_i\}_{i=1}^{L}$
are the parameters of a neural network with L layers. We define a likelihood
model p(y|x;w) = Cat(softmax(fw(x))), a categorical distribution with parameters
softmax(fw(x)) ∈ ∆(k). Here ∆(k) denotes the probability
simplex over the k classes. Typically, maximum likelihood estimation (MLE) is performed. This
leads to the cross-entropy loss
\[
L_{cce}(w) = -\sum_{i=1}^{n}\sum_{j=1}^{k} y_{ij}\log p(y = j \mid x_i; w), \tag{6.1}
\]
where yij corresponds to the j-th element of the one-hot encoded label yi.
6.3.1 Teacher-Student Training Objective
Given a pre-trained model (teacher) $f_{w_t}$, the distillation loss can be defined as
\[
L_{dist}(w) = -\sum_{i=1}^{n}\sum_{j=1}^{k} \left[\mathrm{softmax}\!\left(f_{w_t}(x_i)/T\right)\right]_j \log p(y = j \mid x_i; w), \tag{6.2}
\]
where [·]j denotes the j-th element of a vector. A second network (student) fw can
then be trained with the following total loss:
L(w) = αLcce(w) + (1− α)Ldist(w), (6.3)
where α ∈ [0, 1] is a hyper-parameter, and T corresponds to the temperature
scaling hyper-parameter that flattens teacher predictions. In self-distillation, both
teacher and student models have the same network architecture. In the original
self-distillation experiments conducted by Furlanello et al. [49], α and T are set to
0 and 1, respectively, throughout the entire training process.
Note that temperature scaling is applied differently here than in previous
literature on distillation [84]. As addressed in Section 6.5, we apply
temperature scaling only to teacher predictions when computing the distillation loss.
We empirically observe that this yields results consistent with previous reports.
Moreover, as we show in Appendix D.2, performing temperature scaling only on
the teacher but not the student models can lead to significantly more calibrated
predictions.
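To make the objective concrete, the following is a minimal sketch of Eqs. 6.1 to 6.3 in plain Python, with temperature applied to the teacher logits only, as described above. The function names (`log_softmax`, `soft_cross_entropy`, `total_loss`) and the list-of-lists data layout are illustrative assumptions, not the implementation used for the experiments.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - log_norm for v in scaled]


def soft_cross_entropy(target_probs, student_logits):
    """-sum_j target_j * log p(y = j | x; w); the student always uses T = 1."""
    log_p = log_softmax(student_logits)
    return -sum(t * lp for t, lp in zip(target_probs, log_p))


def total_loss(student_logits, teacher_logits, labels, alpha, temperature):
    """Eq. 6.3 summed over a batch: alpha * L_cce + (1 - alpha) * L_dist."""
    loss = 0.0
    for s, t, y in zip(student_logits, teacher_logits, labels):
        one_hot = [1.0 if j == y else 0.0 for j in range(len(s))]
        # Temperature scaling is applied to the teacher predictions only.
        teacher_probs = [math.exp(lp) for lp in log_softmax(t, temperature)]
        loss += alpha * soft_cross_entropy(one_hot, s)
        loss += (1.0 - alpha) * soft_cross_entropy(teacher_probs, s)
    return loss
```

With `alpha=1.0` the function reduces to the cross-entropy loss of Eq. 6.1, and with `alpha=0.0` and `temperature=1.0` it matches the setting attributed to Furlanello et al. [49] above.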
6.4 Multi-Generation Self-Distillation: A Close Look
Self-distillation can be repeated iteratively such that during training of the i-th
generation, the model obtained at (i−1)-th generation is used as the teacher model.
This approach is referred to as multi-generational self-distillation, or “Born-Again
Networks” (BAN). Empirically it has been observed that student predictions can
consistently improve with each generation. However, the mechanism behind this
improvement has remained elusive. In this work, we argue that the main attribute
that leads to better performance is the increasing uncertainty and diversity in
teacher predictions. Similar observations that more “tolerant” teacher predictions
lead to better students were also made by Yang et al. [223]. Indeed, the negative
log-likelihood is monotone and convex, and the softmax element corresponding to
the true label class, p(y = yi|xi;w), is often much greater than those of the
other classes. Together with early stopping, this means that each subsequent model
will likely have increasingly unconfident softmax outputs corresponding to the
true label class.
6.4.1 Predictive Uncertainty
We use Shannon Entropy to quantify the uncertainty in instance-specific teacher
predictions p(y|x;wi), averaged over the training set, which we call “Average Pre-
dictive Uncertainty,” and define as:
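$$\mathbb{E}_x\left[H\left(p(\cdot|x;w_i)\right)\right] \approx \frac{1}{n}\sum_{j=1}^{n} H\left(p(\cdot|x_j;w_i)\right) = \frac{1}{n}\sum_{j=1}^{n}\sum_{c=1}^{k} -p(y_c|x_j;w_i) \log p(y_c|x_j;w_i). \tag{6.4}$$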
Note that previous literature [38,169] has also proposed to use the above measure
as a regularizer to prevent over-confident predictions. Label smoothing [169,198] is
a closely related technique that also penalizes over-confident predictions by explic-
itly smoothing out ground-truth labels. A detailed discussion on the relationship
between the two can be found in Appendix D.1.
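As an illustration of Eq. 6.4, the sketch below averages the Shannon entropy of per-sample teacher probability vectors. The helper names and the convention that `teacher_probs` is a list of softmax vectors are assumptions made for this example.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of one categorical probability vector."""
    # Terms with p = 0 contribute nothing, by the convention 0 * log 0 = 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def average_predictive_uncertainty(teacher_probs):
    """Eq. 6.4: mean entropy of teacher predictions over the training set."""
    return sum(shannon_entropy(p) for p in teacher_probs) / len(teacher_probs)
```

For example, a teacher that is uniform over four classes on one sample and one-hot on another has an Average Predictive Uncertainty of log(4)/2.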
6.4.2 Confidence Diversity
Average Predictive Uncertainty is insufficient to fully capture the variability asso-
ciated with teacher predictions. In this chapter, we argue it is also important to
consider the amount of spreading of teacher predictions over the probability sim-
plex among different (training) samples. For instance, two teachers can have very
similar Average Predictive Uncertainty values, but drastically different amounts
of spread on the probability simplex if the softmax predictions of one teacher
are much more diverse among different samples than the other. We coin this
population spread in predictive probabilities “Confidence Diversity.” As we show
below, characterizing the Confidence Diversity can be important for understanding
teacher-student training.
The differential entropy1 over the entire probability simplex is a natural mea-
sure to quantify the confidence diversity. However, accurate entropy estimation
can be challenging, and its computation is severely hampered by the curse of
dimensionality, particularly in applications with a large number of classes. To
alleviate the problem, in this chapter, we propose to measure only the entropy
of the softmax element corresponding to the true label class, thereby simplifying
the measure to a one-dimensional entropy estimation task. Mathematically, if we
denote $c = \phi(x, y) := [\mathrm{softmax}(f_w(x))]_y$, and let $p_C$ be the probability density
function of the random variable $C := \phi(X, Y)$ where $(X, Y) \sim p(x, y)$, then we quantify
Confidence Diversity via the differential entropy of C:
$$h(C) = -\int p_C(c) \log p_C(c)\, dc. \tag{6.5}$$
We use the KNN-based entropy estimator to compute h(C) over the training set [9].
In essence, the above measure quantifies the amount of spread associated with the
teacher predictions on the true label class. The smaller the value, the more similar
the softmax values are across different samples.
1This is distinct from the average predictive uncertainty discussed in the previous section,
which measures the average Shannon entropy of probability vectors.
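The text does not spell out the KNN-based estimator, so the following is only a sketch under the assumption that a Kozachenko-Leonenko style k-nearest-neighbour estimator is meant. The choice k = 3, the small `floor` guarding against duplicate values, and all function names are illustrative and are not taken from [9].

```python
import math

EULER_GAMMA = 0.5772156649015329


def digamma_int(m):
    """Digamma at a positive integer: psi(m) = -gamma + sum_{j=1}^{m-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, m))


def knn_entropy_1d(samples, k=3, floor=1e-12):
    """Kozachenko-Leonenko k-NN estimate of differential entropy (nats), 1-D."""
    xs = sorted(samples)
    n = len(xs)
    if n <= k:
        raise ValueError("need more samples than neighbours")
    log_dist_sum = 0.0
    for i, x in enumerate(xs):
        # In sorted 1-D data the k nearest neighbours lie within k positions.
        lo, hi = max(0, i - k), min(n, i + k + 1)
        dists = sorted(abs(x - xs[j]) for j in range(lo, hi) if j != i)
        log_dist_sum += math.log(max(dists[k - 1], floor))
    # log(2) is the log-volume of the 1-D unit ball (an interval of radius 1).
    return digamma_int(n) - digamma_int(k) + math.log(2.0) + log_dist_sum / n


def confidence_diversity(teacher_probs, labels, k=3):
    """Estimate h(C) of Eq. 6.5 from the softmax elements of the true class."""
    true_class_probs = [p[y] for p, y in zip(teacher_probs, labels)]
    return knn_entropy_1d(true_class_probs, k=k)
```

On samples drawn uniformly from [0, 1] the estimate should be close to the true differential entropy of 0, and shrinking the spread of the samples lowers the estimate, which matches the interpretation that smaller values mean more similar softmax values across samples.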
[Figure 6.1 plots omitted: four panels (Test Accuracy, Test NLL, Predictive Uncertainty, Confidence Diversity), each plotted against Generations (0 to 10).]
Figure 6.1: Results for sequential self-distillation over 10 generations are shown
above. Model obtained at the (i − 1)-th generation is used as the teacher model
for training at the i-th generation. Accuracy and NLL are obtained on the test
set using the student model, whereas the predictive uncertainty and confidence
diversity are evaluated on the training set with teacher predictions.
6.4.3 Sequential Self-Distillation Experiment
We perform sequential self-distillation with ResNet-34 on the CIFAR-100 dataset
for 10 generations. At each generation, we train the neural networks for 150 epochs
using the identical optimization procedure as in the original ResNet paper [79].
Following Furlanello et al. [49], α and T are set to 0 and 1 respectively throughout
the entire training process. Additional experiments with different values of T can
be found in Appendix D.3. Fig. 6.1 summarizes the results. As indicated by the
general increasing trend in test accuracy, sequential distillation indeed leads to
improvements. The entropy plots also support the hypothesis that subsequent
generations exhibit increasing diversity and uncertainty in predictions. Despite
the same increasing trend, the two entropy metrics quantify different things. The
increase in average predictive uncertainty suggests overall a drop in the confidence
of the categorical distribution, while the growth in confidence diversity suggests
an increasing variability in teacher predictions. Interestingly, we also see obvious
improvements in terms of NLL, suggesting in addition that BAN can improve
calibration of predictions [70].
To further study the apparent correlation between student performance and en-
tropy of teacher predictions over generations, we conduct a new experiment, where
we instead train a single teacher. This teacher is then used to train a single gen-
eration of students while varying the temperature hyper-parameter T in Eq. 6.3,
which explicitly adjusts the uncertainty and diversity of teacher predictions. For
consistency, we keep α = 0. Results are illustrated in Fig. 6.2. As expected,
increasing T leads to greater predictive uncertainty and diversity in teacher pre-
dictions. Importantly, we see this increase leads to drastic improvements in the
test accuracy of students. In fact, the gain is much greater than the best achieved
with 10 generations of BAN with T = 1 (indicated with the flat line in the plot).
The identified correlation is consistent with the recent finding that early-stopped
models, which typically have much larger entropy than fully trained ones, serve
as better teachers [30]. Lastly, we also see improvements in NLL with increasing
entropy of teacher predictions. However, an excessively high T leads to a subsequent
increase in NLL, likely due to teacher predictions that lack confidence.
A closer look at the entropy metrics of the above experiment reveals an impor-
tant insight. While the average predictive uncertainty is strictly increasing with T ,
the confidence diversity plateaus after T = 2.5. The plateau of confidence diversity
coincides closely with the stagnation of student test accuracy, hinting at the
importance of confidence diversity in teacher predictions. The apparent correlation
between accuracy and confidence diversity can also be seen in the additional
sequential self-distillation experiments found in Appendix D.3. This makes
intuitive sense. Given a training set, we would expect some of the samples
to be much more representative of the label class than others. Ideally, we
would hope to classify the typical examples with much greater confidence than an
ambiguous example of the same class. Previous results show that training with
such instance-specific uncertainty can indeed lead to better performance [170]. Our
view is that in self-distillation, the teacher provides the means for instance-specific
regularization.
[Figure 6.2 plots omitted: four panels (Test Accuracy, Test NLL, Predictive Uncertainty, Confidence Diversity), each plotted against temperature (1.0 to 4.0).]
Figure 6.2: Results with teacher predictions scaled by varying temperature T. The
flat lines in the plots correspond to the largest/smallest values achieved over 10
generations of sequential distillation with T = 1 in the previous experiments for
accuracy, predictive uncertainty and confidence diversity/NLL.
6.5 An Amortized MAP Perspective of Self-Distillation
The instance-specific regularization perspective on self-distillation motivates us
to recast the training procedure as performing Maximum a posteriori (MAP)
estimation on the softmax probability vector. Specifically, suppose now that
the likelihood p(y|x, z) = Cat(z) is a categorical distribution with parameter
z ∈ ∆(L) and the conditional prior p(z|x) = Dir(αx) is a Dirichlet distribution
with instance-specific parameter αx. Due to conjugacy of the Dirichlet prior,
a closed-form solution $\hat{z}_i = \frac{c_i + [\alpha_x]_i - 1}{\sum_j \left(c_j + [\alpha_x]_j - 1\right)}$, where $c_i$ corresponds to the number of
occurrences of the i-th category, can be easily obtained.
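The closed-form solution can be checked numerically with a few lines of Python. This is a sketch that assumes every numerator c_i + [αx]_i − 1 is non-negative and that they do not all vanish, so that the mode is well defined; the function name is illustrative.

```python
def dirichlet_map(counts, alpha):
    """MAP estimate of z under a Cat(z) likelihood and a Dir(alpha) prior."""
    numer = [c + a - 1.0 for c, a in zip(counts, alpha)]
    total = sum(numer)
    return [v / total for v in numer]


# One observed label (class 0) with a flat prior returns the one-hot label.
print(dirichlet_map([1, 0, 0], [1.0, 1.0, 1.0]))  # [1.0, 0.0, 0.0]
# A stronger symmetric prior pulls the estimate towards uniform.
print(dirichlet_map([1, 0, 0], [2.0, 2.0, 2.0]))  # [0.5, 0.25, 0.25]
```

The first call illustrates the limitation discussed next: with a single observation and an uninformative prior, the instance-level MAP solution simply reproduces the provided label.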
The above framework is not useful for classification when given a new sample x
without any observations y. Moreover, in the common supervised learning setup,
only one observation of label y is available for each sample x. The MAP solution
shown above merely relies on the provided label y for each sample x, without
exploiting the potential similarities among different samples xi in the entire
dataset for more accurate estimation. For example, we could have different samples
that are almost duplicates (cf. [6]), but have different yi’s, which could inform us
about other labels that could be drawn from zi. Thus, instead of relying on the
instance-level closed-form solution, we can train a (student) network to amortize
the MAP estimation $\hat{z}_i \approx \mathrm{softmax}(f_w(x_i))$ with a given training set, resulting in
an optimization problem of:
$$\max_w \sum_{i=1}^{n} \log p(z|x_i, y_i; w, \alpha_x) = \max_w \sum_{i=1}^{n} \log p(y = y_i|z, x_i; w) + \log p(z|x_i; w, \alpha_x)$$
$$= \max_w \underbrace{\sum_{i=1}^{n} \log[\mathrm{softmax}(f_w(x_i))]_{y_i}}_{\text{Cross entropy}} + \underbrace{\sum_{i=1}^{n}\sum_{c=1}^{k} \left([\alpha_{x_i}]_c - 1\right) \log[z]_c}_{\text{Instance-specific regularization}}. \tag{6.6}$$
Eq. 6.6 is an objective that provides us with a function to obtain a MAP solution
of z given an input sample x. Note that, we do not make any assumptions about
the availability or number of label observations of y for each sample x. This
enables us to find an approximate MAP solution to x at test-time when αx and
y are unavailable. The resulting framework can be generally applicable to various
scenarios like semi-supervised learning or learning from multiple labels per sample.
Nevertheless, in the following, we restrict our attention to supervised learning with
a single label per training sample.
6.5.1 Label Smoothing as MAP
The difficulty now lies in obtaining the instance-specific prior Dir(αx). A naive
independence assumption that p(z|x) = p(z) can be made. Under such an as-
sumption, a sensible choice of prior would be a uniform distribution across all
possible labels. Choosing $[\alpha_x]_c = [\alpha]_c = \frac{\beta}{k} + 1$ for all c ∈ {1, ..., k} for some
hyper-parameter β, the MAP objective becomes
$$\mathcal{L}_{LS} = -\sum_{i=1}^{n} \log[z]_{y_i} + \beta \sum_{i=1}^{n}\sum_{c=1}^{k} -\frac{1}{k} \log[z]_c. \tag{6.7}$$
As noted in prior work, this loss function is equivalent to the commonly used
label smoothing (LS) regularization [169, 198] (derivations can be found in Ap-
pendix D.1). Observe also that the training objective in essence promotes predic-
tions with larger predictive uncertainty, but not confidence diversity.
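One way to see the correspondence numerically (the full derivation is in Appendix D.1 and is not reproduced here) is to compare the per-sample term of Eq. 6.7 with cross-entropy against an explicitly smoothed label. The sketch assumes the common convention that ϵ-smoothing places 1 − ϵ + ϵ/k on the true class and ϵ/k on every other class, in which case the two agree up to a factor of 1 − ϵ when β = ϵ/(1 − ϵ). Function names are illustrative.

```python
import math


def ls_map_loss(z, y, beta):
    """Per-sample term of Eq. 6.7 for a predicted probability vector z."""
    k = len(z)
    return -math.log(z[y]) + beta * sum(-math.log(zc) / k for zc in z)


def soft_label_cross_entropy(z, y, eps):
    """Cross-entropy against an eps-smoothed one-hot label (assumed convention)."""
    k = len(z)
    q = [(1.0 - eps) * (1.0 if c == y else 0.0) + eps / k for c in range(k)]
    return -sum(qc * math.log(zc) for qc, zc in zip(q, z))


z, y, eps = [0.7, 0.2, 0.1], 0, 0.15
beta = eps / (1.0 - eps)
gap = (1.0 - eps) * ls_map_loss(z, y, beta) - soft_label_cross_entropy(z, y, eps)
print(abs(gap) < 1e-12)  # True
```

Because the same uniform term is added for every sample, the regularizer raises predictive uncertainty without distinguishing between samples, which is the point made in the paragraph above.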
6.5.2 Self-Distillation as MAP
A better instance-specific prior distribution can be obtained using a pre-trained
(teacher) neural network. Let us consider a network $f_{w_t}$ trained with the regular
MLE objective, by maximizing $p(y|x;w_t) = \mathrm{Cat}(\mathrm{softmax}(f_{w_t}(x)))$, where
$[\mathrm{softmax}(f_{w_t}(x))]_i = \frac{[\exp(f_{w_t}(x))]_i}{\sum_j [\exp(f_{w_t}(x))]_j}$. Now, due to conjugacy of the Dirichlet prior,
the marginal likelihood p(y|x;αx) is a Dirichlet-multinomial distribution [150]. In
the case of single label observation considered, the marginal likelihood reduces to
a categorical distribution. As such, we have $p(y|x;\alpha_x) = \mathrm{Cat}(\bar{\alpha}_x)$, where $\bar{\alpha}_x$ is
$\alpha_x$ normalized such that $[\bar{\alpha}_x]_i = \frac{[\alpha_x]_i}{\sum_j [\alpha_x]_j}$. We can thus interpret $\exp(f_{w_t}(x))$ as the
parameters of the Dirichlet distribution to obtain a useful instance-specific prior
on z. However, we observe that there is a scale ambiguity that needs resolving,
since any of the following will yield the same $\bar{\alpha}_x$:
$$\alpha_x = \beta \exp(f_{w_t}(x)/T) + \gamma, \tag{6.8}$$
where T = 1 and γ = 0, and β corresponds to some hyper-parameter. Using T > 1
and γ > 0 corresponds to flattening the prior distribution, which we found to be
useful in practice, an observation consistent with prior work. Note that in the limit
of T → ∞, the instance-specific prior reduces to a uniform prior corresponding to
classical label smoothing. Setting γ = 1 (we also explore the effect of varying γ
experimentally; see Appendix D.9 for details), we obtain
$$\alpha_x = \beta \exp(f_{w_t}(x)/T) + 1 = \beta \sum_j [\exp(f_{w_t}(x)/T)]_j \, \mathrm{softmax}(f_{w_t}(x)/T) + 1. \tag{6.9}$$
Plugging this into Eq. 6.6 yields
$$\mathcal{L}_{SD} = -\sum_{i=1}^{n} \log[z]_{y_i} + \beta \sum_{i=1}^{n} \omega_{x_i} \sum_{c=1}^{k} -[\mathrm{softmax}(f_{w_t}(x_i)/T)]_c \log[z]_c, \tag{6.10}$$
which is very similar to the distillation loss of Eq. 6.3, with an additional sample-specific
weighting term $\omega_{x_i} = \sum_j [\exp(f_{w_t}(x_i)/T)]_j$.
Despite the interesting result, we empirically observe that, with temperature
values T found to be useful in practice, the relative weightings of samples are too
close to yield a significant difference from regular distillation loss. Hence, for all of
our experiments, we still adopt the distillation loss of Eq. 6.3. However, we believe
that, with teacher models trained with an objective more appropriate than MLE,
the difference might be bigger. We hope to explore alternative ways of obtaining
teacher models to effectively utilize the sample re-weighted distillation objective
as future work.
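The relationship between the teacher logits, the Dirichlet parameters and the sample-specific weight can be sketched as follows. The helper names are illustrative, and the code exponentiates raw logits directly, which assumes they are small enough not to overflow.

```python
import math


def teacher_prior(teacher_logits, beta, temperature, gamma=1.0):
    """Dirichlet parameters alpha_x = beta * exp(f / T) + gamma (Eq. 6.8)."""
    return [beta * math.exp(v / temperature) + gamma for v in teacher_logits]


def sample_weight(teacher_logits, temperature):
    """Sample-specific weight: the sum over classes of exp(f_j / T)."""
    return sum(math.exp(v / temperature) for v in teacher_logits)


def tempered_softmax(teacher_logits, temperature):
    """softmax(f / T), written via the weight to mirror Eq. 6.9."""
    omega = sample_weight(teacher_logits, temperature)
    return [math.exp(v / temperature) / omega for v in teacher_logits]


logits, beta, T = [2.0, 0.5, -1.0], 0.3, 2.0
alpha_x = teacher_prior(logits, beta, T)
omega = sample_weight(logits, T)
# Eq. 6.9: alpha_x equals beta * omega * softmax(f / T) + 1, element-wise.
rebuilt = [beta * omega * p + 1.0 for p in tempered_softmax(logits, T)]
print(all(abs(a - r) < 1e-12 for a, r in zip(alpha_x, rebuilt)))  # True
```

Raising `T` pushes `tempered_softmax` towards the uniform vector, which is the label-smoothing limit mentioned above.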
The MAP interpretation, together with empirical experiments conducted in
Section 6.4, suggests that multi-generational self-distillation can in fact be seen as
an inefficient approach to implicitly flatten and diversify the instance-specific prior
distribution. Our experiments suggest that instead, we can more effectively tune
for hyper-parameters T and γ to achieve similar, if not better, results. Moreover,
from this perspective, distillation in general can be understood as a regularization
strategy. Some empirical evidence for this can be found in Appendix D.5 and D.6.
6.5.3 On the Relationship between Label Smoothing and
Self-Distillation
The MAP perspective reveals an intimate relationship between self-distillation
and label smoothing. Label smoothing increases the uncertainty of predictive
probabilities. However, as discussed in Section 6.4, this might not be enough to
prevent overfitting, as evidenced by the stagnant test accuracy despite increasing
uncertainty in Fig. 6.2. Indeed, the MAP perspective suggests that, ideally, each
sample should have a distinct probabilistic label. Instance-specific regularization
can encourage confidence diversity, in addition to predictive uncertainty.
While the predictive uncertainty can be explicitly used for regularization as
previously discussed in Section 6.4.1, we observe empirically that promoting con-
fidence diversity directly through the proposed measure in Section 6.4.2 can be
hard in practice, yielding unsatisfactory results. This could have been caused by
difficulty in estimating confidence diversity accurately using mini-batch samples.
Naively promoting confidence diversity during the early stage of training could
also have harmed learning. As such, we can view distillation as an indirect way of
achieving this objective. We leave it as future work to further explore alternative
techniques to enable direct regularization of confidence diversity.
6.6 Beta Smoothing Labels
Self-distillation requires training a separate teacher model. In this chapter, we
propose an efficient enhancement to label smoothing strategy where the amount of
smoothing will be proportional to the uncertainty of predictions. Specifically, we
make use of the exponential moving average (EMA) predictions of the model during
training, as implemented by Tarvainen and Valpola [201], and obtain a ranking
based on the confidence (the magnitude of the largest element of the softmax) of
predictions at each mini-batch, on the fly, from smallest to largest. Instead of
assigning uniform distributions $[\alpha_x]_c = \frac{\beta}{k} + 1$ for all c ∈ {1, ..., k} to all samples as
priors, during each iteration, we sample and sort a set of i.i.d. random variables
{b1 ≤ ... ≤ bm} from Beta(a, 1), where m corresponds to the mini-batch size and a
corresponds to the hyper-parameter associated with the Beta distribution. Then,
we assign $[\alpha_{x_i}]_{y_i} = \beta b_i + 1$ and $[\alpha_{x_i}]_c = \beta \frac{1 - b_i}{k - 1} + 1$ for all $c \neq y_i$ as the prior to
each sample xi, based on the ranking obtained. In this way, samples with larger
confidence obtained through the EMA predictions will receive less label
smoothing and vice versa. Thus, the amount of label smoothing applied to a
sample decreases with the confidence the model has about that
sample’s prediction. Those instances that are more challenging to classify will,
therefore, have more smoothing applied to their labels.
In practice, for consistency with distillation, Eq. 6.3 is used for training.
Beta-smoothed labels of $b_i$ on the ground truth class and $\frac{1 - b_i}{k - 1}$ on all other classes are
used in lieu of teacher predictions for each xi. Lastly, note that EMA predictions
are used in order to stabilize the ranking obtained at each iteration of training.
We empirically observe a significant performance boost with the EMA predictions.
We term this method Beta smoothing.
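The label-assignment step of Beta smoothing can be sketched for a single mini-batch as follows. This is an illustrative reading of the description above rather than the code used in the experiments: `ema_probs` is assumed to hold the EMA model's softmax outputs for the batch, and the function name and `rng` argument are hypothetical.

```python
import random


def beta_smoothed_labels(ema_probs, labels, a, rng=random):
    """Build Beta-smoothed soft labels for one mini-batch.

    ema_probs: per-sample softmax vectors from the EMA model (assumed input).
    labels:    ground-truth class indices.
    a:         shape parameter of Beta(a, 1).
    """
    m, k = len(ema_probs), len(ema_probs[0])
    # Sample m i.i.d. Beta(a, 1) variables and sort them, smallest first.
    b_sorted = sorted(rng.betavariate(a, 1.0) for _ in range(m))
    # Rank samples by confidence (largest softmax element), smallest first.
    order = sorted(range(m), key=lambda i: max(ema_probs[i]))
    targets = [None] * m
    for rank, i in enumerate(order):
        b = b_sorted[rank]
        off_value = (1.0 - b) / (k - 1)
        targets[i] = [b if c == labels[i] else off_value for c in range(k)]
    return targets
```

The returned vectors play the role of teacher predictions in Eq. 6.3: the least confident sample in the batch receives the smallest b and therefore the most smoothing.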
To better examine the role that EMA predictions play in Beta smoothing, we
conduct two ablation studies. Firstly, since the EMA predictions are used for
Beta smoothed labels, we compare the effectiveness of Beta smoothing against
self-training explicitly using the EMA predictions (see Appendix D.4 for details).
Moreover, to test the importance of ranking obtained from EMA predictions, we
include in Appendix D.7 an additional experiment in which random Beta
smoothing is applied to each sample.
Beta smoothing regularization implements an instance-specific prior that en-
courages confidence diversity, and yet does not require the expensive step of train-
ing a separate teacher model. We note that, due to the constantly changing prior
used at every iteration of training, Beta smoothing does not, strictly speaking,
correspond to the MAP estimation in Eq. 6.6. Nevertheless, it is a simple and
effective way to implement the instance-specific prior strategy. As we demon-
strate in the following section, it can lead to much better performance than label
smoothing. Moreover, unlike teacher predictions which have unique softmax values
for all classes, the difference between Beta and label smoothing only comes from
the ground-truth softmax element. This enables us to conduct more systematic
experiments to illustrate the additional gain from promoting confidence diversity.
6.7 Empirical Comparison of Distillation and Label
Smoothing
To further demonstrate the benefits of the additional regularization on the soft-
max probability vector space, we design a systematic experiment to compare self-
distillation against label smoothing. In addition, experiments on Beta smoothing
are also conducted to further verify the importance of confidence diversity, and
to promote Beta smoothing as a simple alternative that can lead to better per-
formance than label smoothing at little extra cost. We note that, while previous
works have highlighted the similarity between distillation and label smoothing from
another perspective [227], we provide a detailed empirical analysis that uncovers
additional benefits of instance-specific regularization.
6.7.1 Experimental Setup
We conduct experiments on CIFAR-100 [111], CUB-200 [215] and Tiny-Imagenet [35]
using ResNet [79] and DenseNet [89]. We follow the original optimization
configurations, and train the ResNet models for 150 epochs and DenseNet
models for 200 epochs. 10% of the training data is split as the validation set.
All experiments are repeated 5 times with random initialization. For simplicity,
label smoothing is implemented with explicit soft labels instead of the objective
in Eq. 6.7. We fix ϵ = 0.15 in label smoothing for all our experiments (additional
experiments with ϵ = 0.1, 0.3 can be found in the Appendix D.3.1). The hyper-
parameter α of Eq. 6.3 is taken to be 0.6 for self-distillation. Only one generation
of distillation is performed for all experiments. To systematically decompose the
effect of the two regularizations in self-distillation, given a pre-trained teacher and
α, we manually search for temperature T such that the average effective label of
the ground-truth class, $\alpha + (1-\alpha)[\mathrm{softmax}(f_{w_t}(x_i)/T)]_{y_i}$, is approximately equal
to 1 − ϵ = 0.85, matching the hyper-parameter ϵ chosen for label smoothing. Eq. 6.3 is also
used for Beta smoothing with α = 0.4. The parameter a of the Beta distribution
is set such that E[α + (1 − α)bi] = 1 − ϵ, to make the average probability of the ground
truth class the same as in ϵ-label smoothing.
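Both matching steps can be sketched in a few lines. For Beta(a, 1) the mean is a/(a + 1), so the Beta parameter follows in closed form once the target average ground-truth probability is fixed (0.85 in the setup above). The temperature search is described as manual in the text; the bisection below is a substitute that assumes the effective label decreases with T between `lo` and `hi`, which holds when the teacher's largest logit is on the true class. All names and the search bounds are illustrative.

```python
import math


def beta_shape_for_target(alpha, target):
    """Solve alpha + (1 - alpha) * a / (a + 1) = target for the shape a."""
    mean_b = (target - alpha) / (1.0 - alpha)
    if not 0.0 < mean_b < 1.0:
        raise ValueError("target is not reachable for this alpha")
    return mean_b / (1.0 - mean_b)


def true_class_prob(logits, y, temperature):
    """Softmax probability of class y after dividing logits by temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    return exps[y] / sum(exps)


def effective_label(teacher_logits, labels, alpha, temperature):
    """Average of alpha + (1 - alpha) * [softmax(f / T)]_y over the samples."""
    probs = [true_class_prob(l, y, temperature)
             for l, y in zip(teacher_logits, labels)]
    return alpha + (1.0 - alpha) * sum(probs) / len(probs)


def search_temperature(teacher_logits, labels, alpha, target,
                       lo=1.0, hi=100.0, iters=60):
    """Bisection on T, assuming the effective label decreases with T."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if effective_label(teacher_logits, labels, alpha, mid) > target:
            lo = mid  # teacher still too confident: raise the temperature
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With α = 0.4 and a target of 0.85, `beta_shape_for_target` gives a value of approximately 3, since the required mean of b is 0.75.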
We emphasize that the goal of the experiment is to methodically decompose the
gain from the two aforementioned regularizations of distillation. Note that, both α
and T can influence the amount of predictive uncertainty and confidence diversity
in teacher predictions at the same time. This coupled effect can make hyper-
parameter tuning hard. Due to limited computational resources, hyper-parameter
tuning is not performed, and the results for all methods can be potentially en-
hanced. Lastly, we also incorporate an additional distillation experiment in which
the deeper DenseNet model is used as the teacher model for comparison against
self-distillation. Results can be found in Appendix D.8.
6.7.2 Results
Test accuracies are summarized in the top row for each experiment in Fig. 6.3.
Firstly, all regularization techniques lead to improved accuracy compared to the
baseline model trained with cross-entropy loss. In agreement with previous results,
self-distillation performs better than label smoothing in all of the experiments
with our setup, in which the effective degree of label smoothing in distillation is,
on average, the same as that of regular label smoothing. The results suggest the
importance of confidence diversity in addition to predictive uncertainty. It is worth
noting that we obtain encouraging results with Beta smoothing. Outperforming
label smoothing in all but the CIFAR-100 ResNet experiment, it can even achieve
comparable performance to that of self-distillation for the CUB-200 dataset with no
separate teacher model required. The improvements of Beta smoothing over label
smoothing also serve as direct evidence of the importance of confidence diversity, as
the only difference between the two is the additional spreading of the ground truth
classes. We hypothesize that the gap in accuracy between Beta smoothing and
self-distillation is mainly due to better instance-specific priors set by a pre-trained
teacher network. The differences in the non-ground-truth classes between the two
methods could also account for the small gap in accuracy performance.
[Figure 6.3 plots omitted: six panels, (a) CIFAR-100, ResNet-34; (b) CIFAR-100, DenseNet-100-12; (c) CUB-200, ResNet-34; (d) CUB-200, DenseNet-121-12; (e) Tiny-Imagenet, ResNet-34; (f) Tiny-Imagenet, DenseNet-100-12. Each panel shows Accuracy (top) and ECE (bottom) for CE, LS, B and SD.]
Figure 6.3: Experimental results on the CIFAR-100, CUB-200 and
Tiny-Imagenet datasets. "CE", "LS", "B" and "SD" refer to "Cross Entropy",
"Label Smoothing", "Beta Smoothing" and "Self-Distillation" respectively. The
top rows of each experiment show bar charts of accuracy on the test set for each
experiment conducted, while the bottom rows are bar charts of expected calibration
error.
Results on calibration are shown in the bottom rows of Fig. 6.3, where we report
the expected calibration error (ECE) [70]. As anticipated, all regularization tech-
niques lead to enhanced calibration. Nevertheless, we see that the errors obtained
with self-distillation are much smaller in general compared to label smoothing.
As such, instance-specific priors can also lead to more calibrated models. Beta
smoothing again not only produces models with much more calibrated predictions
compared to label smoothing but compares favorably to self-distillation in a ma-
jority of the experiments.
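For reference, the expected calibration error can be computed with a short binning routine. The sketch below uses equal-width confidence bins; the choice of 15 bins and the half-open binning convention are assumptions for illustration, since the binning used for Fig. 6.3 is not stated here.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE with equal-width bins over [0, 1].

    confidences: max softmax probability of each prediction.
    correct:     booleans, True when the prediction matches the label.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1.0 for _, h in members if h) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

A model that predicts with confidence 0.9 and is right 9 times out of 10 has an ECE of approximately 0, while one that is fully confident but right only half the time has an ECE of 0.5.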
6.8 Discussion and Future Directions
Recent literature shows that label smoothing leads to better calibration perfor-
mance [153]. In this chapter, we demonstrate that distillation can also yield more
calibrated models. We believe this is a direct consequence of not performing tem-
perature scaling on student models during training. Indeed, with temperature
scaling also on the student models, the student logits are likely pushed larger
during training, leading to over-confident predictions.
More generally, we have only discussed the teacher-student training strategy
as MAP estimation. There have been other recently proposed techniques involv-
ing training with soft labels, which we can interpret as encouraging confidence
diversity or implementing instance-specific regularization. For instance, the mixup
regularization [233] technique creates label diversity by taking random convex com-
binations of the training data, including the labels. Recently proposed consistency-
based semi-supervised learning methods such as [119,201], on the other hand, uti-
lize predictions on unlabeled training samples as an instance-specific prior. We
believe this unifying view of regularization with soft labels can stimulate further
ideas on instance-specific regularization.
CHAPTER 7
A CASE STUDY OF DEEP LEARNING TO DIGITAL
PATHOLOGY IMAGE ANALYSIS
Machine Learning (ML) algorithms have been increasingly applied to a range of
histopathology tasks. An area that remains under-explored is the in-depth analysis
of the errors made by pathologists and ML models in these tasks. Furthermore,
little has been reported on hybrid approaches combining human and ML predic-
tions. In this final chapter, we present a detailed empirical analysis comparing
expert neuropathologists and ML models for the task of predicting IDH mutation
status in H&E-stained histology slides of infiltrating gliomas, independently and
synergistically. We find that errors made by neuropathologists and ML models
trained using the TCGA dataset are distinct, representing modest agreement be-
tween predictions (human-vs-human κ = 0.656; human-vs-ML model κ = 0.598).
While no ML model surpassed human performance on an independent institu-
tional test dataset (human AUC = 0.918, max ML AUC = 0.883), a hybrid model
aggregating human and ML predictions demonstrates predictive performance com-
parable to the consensus of two expert neuropathologists (hybrid classifier AUC
= 0.921 vs. two-neuropathologist consensus AUC = 0.920). In addition, we show
that models trained at different optical objective levels can exhibit different types
of errors, underscoring the significance of aggregating across spatial scales in the
ML approach. Finally, we present a detailed interpretation of our multi-scale ML
ensemble model which reveals that predictions are driven by human-identifiable
features at the patch-level.
7.1 Introduction
With the advancement of computer processing power and the demonstrated utility
of deep learning approaches across multiple data-rich domains, the adoption of
machine learning in medical diagnostics is anticipated to have a transformative effect
on patient care. Already, methylation-based machine learning (ML) approaches
to the classification of tumors of the central nervous system (CNS) have demon-
strated performance that can exceed traditional histology-based diagnosis [24],
and have allowed for the identification of novel entities [123] and molecular
subtypes within established classification systems [101, 123, 202]. Molecularly-defined
entities continue to emerge, many demonstrating overlapping histology with other
established tumor classes [180]. However, routine histopathologic examination
remains the mainstay of oncologic diagnosis due to its relatively low cost, ubiqui-
tous accessibility, the limited availability of advanced molecular assays, and estab-
lished robustness - particularly when performed by experienced subspecialty expert
histopathologists. Even in healthcare centers with access to advanced molecular
assays, the availability of subspecialty experts needed to perform organ-specific
histopathologic examination and integrate molecular results into the overall diag-
nostic picture may be lacking. Developing robust machine learning models that
leverage the immense, data-rich trove of existing and prospective histology slides
via digitally scanned whole slide images (WSI) and that reproduce or augment
subspecialist histopathology expertise can 1) help general pathologists render ac-
curate subspecialty diagnoses, 2) serve as a check on human sources of error by
acting as a highly reproducible and fatigue-free assistant, 3) help prioritize the
highest yield assays for a given specimen, reducing costs and tissue expenditure,
and 4) reveal discordant biases between ML models and human pathologists, which
when approached synergistically could increase the detection of clinically pertinent
biomarkers beyond what either achieves in isolation. Moreover, interrogating and understanding the
features that drive ML classification could reveal avenues for improvement in hu-
man expert assessments.
Infiltrating gliomas are the most common primary tumors of the CNS in
adults [149, 165], and despite significant advances in the understanding of their
biology, they are considered incurable by current standards of care, including sur-
gical gross total resection, radiotherapy, and chemotherapy [196]. Historically,
infiltrating gliomas were classified into the broad categories of astrocytoma and
oligodendroglioma on cytomorphological grounds, and assigned histologic grades
based on particular features including mitotic activity, necrosis, and microvascular
proliferation. The term ‘glioblastoma’ (GBM) was synonymous with the highest
grade variant of infiltrating astrocytoma (IV of IV) and such tumors carry a poor
prognosis with an average survival less than 2 years [16, 133]. With the discovery
of isocitrate dehydrogenase (IDH) mutation as a key driver of gliomagenesis in 25-
30% of infiltrating gliomas and its correlation with a favorable prognosis, recent
consensus guidelines regard IDH-mutant (IDHmut) tumors as biologically distinct
entities from IDH-wildtype (IDHwt) tumors, and indeed the term ‘glioblastoma’
is now only applied to IDHwt infiltrating astrocytomas with high grade histo-
logical and/or molecular features [16, 87, 133, 182, 222]. While IDHmut gliomas
are enriched for tumors with lower-grade histologic features, there is no known
definitive histologic standard for determining IDH status from histomorphology
alone, and immunohistochemical or molecular means are currently required for
such a determination; however, histomorphologic correlates of molecular alter-
ations are well-recognized in many tumor types, including infiltrating gliomas. As
noted by the WHO, certain histologic features have a stronger association with
IDHmut status, including gemistocytic and oligodendroglial-like cytomorphology,
while higher grade features such as palisading necrosis and microvascular prolifer-
ation are enriched in IDHwt tumors; however these features lack sensitivity and
specificity [162,163]. Our experience suggests that subspecialty neuropathologists
who review a high volume of infiltrating gliomas can predict the presence of IDH
mutation from routine HE stains with a relatively high degree of accuracy. There-
fore, we believed that histological prediction of IDH-status represented an ideal
prototype for the more general paradigm of designing computer vision models to
interrogate whole-slide images (WSI) to predict critical, clinically relevant tumor
biomarkers, and to combine these results with human assessment.
As discussed in previous chapters, convolutional neural networks (CNNs) are
a class of neural network architectures that have achieved state-of-the-art perfor-
mance in a range of computer vision problems. A challenge in the application of
CNNs to WSI processing is that there is a practical limit to the input image size
that can be handled (typically less than 1000x1000 pixels) by today’s hardware
resources, such as GPU compute power and memory. WSI often have in the range
of 10^5 pixels in each dimension, and key diagnostic features are usually seen only
in small foci, necessitating tiling of the source image into appropriately sized train-
ing patches, and aggregation of patch-level class predictions to generate slide-level
predictions. Previous work has shown that CNNs can be used to classify WSI
histology data, particularly in epithelial cancers, including the prediction of driver
mutations in some cancers [23, 32, 34, 36, 41, 86, 102, 135]. Furthermore, integrat-
ing CNN predictions from histology with genomic information has been found to
predict behavior in infiltrating gliomas better than traditional histologic grading
alone [152].
Prior studies have largely trained ML classifiers on image patches derived at a
single level of magnification without aggregating across scales. This is in contrast
to what pathologists typically do, which is to use a range of magnifications in assessing
tissue; i.e., pathologists scan slides at low magnification both to identify features
better appreciated at low power as well as to identify regions of interest for closer
examination at higher power. We therefore hypothesized that the accuracy of our
prototypical classification task would be magnification-level dependent, and that
ensembling multiple ML models trained at different scales would generate more
robust classification. Finally, we hypothesized that neuropathologists and ML
models would make different types of errors in classification, and that the aggregate
assessment of a hybrid pathologist/ML model would be superior to either human
or ML assessment alone.
7.2 Materials and Methods
7.2.1 Datasets
In this study, we used datasets from two cohorts of infiltrating glioma patients
obtained from The Cancer Genome Atlas (TCGA) [203] and Weill Cornell Medicine
(WCM). (1) TCGA: We downloaded HE-stained WSI along with gender and age
information from the TCGA-LGG and TCGA-GBM datasets. Only formalin-fixed
paraffin-embedded (FFPE) HE-stained diagnostic slides from primary tumor sites
were used. From these datasets, we obtained a total of 801 slide images (601 IDHwt
and 200 IDHmut) from 372 patients (261 IDHwt and 111 IDHmut) (Table 7.1).
We then split TCGA data into training, validation, and test sets, with all slides
from individual patients being sorted to the same subset. To ensure IDH class
balance during model evaluation for straightforward interpretation, we randomly
sampled 30 IDHwt slides and 30 IDHmut slides each in both the TCGA validation
and test sets. All other slides in the TCGA cohort were used for training. (2)
WCM: We queried the in-house clinical database at WCM for infiltrating gliomas
with available HE-stained slides and recorded IDH mutation and 1p19q codeletion
status, from 2011 to 2020. From these cases, a balanced dataset of IDHwt
and IDHmut gliomas (including both astrocytomas and oligodendrogliomas) was
scanned using the Aperio T2 system at 40X. This test dataset comprised 87 slides
from 74 patients with IDHwt gliomas and 87 slides from 67 patients with IDHmut
gliomas. The scanned
images were reviewed by author CS for adequacy, and the evaluating authors (BL
and DP) were blinded to all information about the cases beyond the scanned HE
slides. The WCM dataset was used as an independent external test set to evaluate
ML model robustness and generalizability and to compare the ML models with
human IDH prediction performance. A subset of the WCM test set was also created
to exclude confounding effects of age and gender by propensity score matching.
Specifically, we calculated the IDH mutation propensity score of each patient in
the WCM test set based on age and gender. We then stratified the propensity
scores, which range from 0 to 1, into 10 bins of equal width. From each bin, the
same number of samples was selected from the two IDH groups as matching pairs.
The unmatched samples left over in the IDHmut group were those with relatively
high propensity scores, and those left over in the IDHwt group had relatively low
propensity scores. This yielded a demographically balanced WCM test* dataset
with 36 patients in each IDH class.
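The bin-wise matching described above can be sketched as follows. This is a minimal illustration, not the procedure used to build the WCM test* set: it assumes the propensity scores have already been computed, and it assumes that matched samples within a bin are drawn at random, which the text does not specify. The function name and arguments are hypothetical.

```python
import random
from collections import defaultdict


def match_by_propensity_bins(scores, labels, n_bins=10, seed=0):
    """Return sorted indices of a subset with equal class counts in every bin.

    scores: propensity scores in [0, 1] (assumed precomputed).
    labels: 0 for IDHwt, 1 for IDHmut.
    """
    rng = random.Random(seed)
    bins = defaultdict(lambda: {0: [], 1: []})
    for idx, (score, label) in enumerate(zip(scores, labels)):
        # Equal-width bins on [0, 1]; a score of exactly 1.0 goes to the last bin.
        b = min(int(score * n_bins), n_bins - 1)
        bins[b][label].append(idx)

    selected = []
    for b in sorted(bins):
        wt, mut = bins[b][0], bins[b][1]
        k = min(len(wt), len(mut))  # same number from each class in this bin
        # Assumption: random choice within a bin; unmatched samples are dropped.
        selected.extend(rng.sample(wt, k))
        selected.extend(rng.sample(mut, k))
    return sorted(selected)
```

Bins that contain only one class contribute nothing, which is why high-propensity IDHmut samples and low-propensity IDHwt samples tend to be the ones left unmatched.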
Table 7.1: Summary of the demographics for the TCGA training, validation, and
test datasets and the WCM test datasets. No significant differences are seen in sex
between the IDHmut and IDHwt groups. IDH mutant gliomas show statistically
significant enrichment in younger patients, consistent with historic controls.
† indicates average simulation p-value: 140 IDH WT slides in the training dataset
were randomly sampled and a one-way ANOVA was then conducted. Simulations were
repeated 1000 times. * indicates propensity score matching accounting for age
and sex.
                                                     IDH Status
Variable           Dataset        Overall        WT             MUT            p value
Count (n),         Training       681 (312)      541 (232)      140 (80)
Slide (Patient)    Validation     60 (29)        30 (13)        30 (16)
                   Test           60 (31)        30 (16)        30 (15)
                   TCGA Overall   801 (372)      601 (261)      200 (111)
                   WCM Test       174 (141)      87 (74)        87 (67)
                   WCM Test*      85 (72)        41 (36)        44 (36)
Age (Years),       Training       52.5 (16.4)    58.0 (13.1)    36.5 (14.6)
Mean (SD)          Validation     41.5 (19.7)    59.9 (12.1)    26.5 (8.56)    0.131†
                   Test           47.5 (21.0)    62.1 (15.7)    32.0 (13.4)
                   TCGA Overall   51.2 (17.3)    58.4 (13.2)    34.5 (14.1)    <0.0001
                   WCM Test       52.4 (16.6)    62.7 (12.8)    41.1 (12.5)    <0.0001
                   WCM Test*      51.7 (12.05)   54.4 (12.5)    49.0 (11.1)    0.055
Female,            Training       115 (36.9)     84 (36.2)      31 (38.8)
n (%)              Validation     13 (44.8)      8 (61.5)       5 (31.3)       0.821†
                   Test           13 (41.9)      6 (37.5)       7 (46.7)
                   TCGA Overall   141 (37.9)     98 (37.5)      43 (38.7)      0.921
                   WCM Test       63 (44.7)      38 (51.4)      25 (37.3)      0.132
                   WCM Test*      33 (45.8)      18 (50)        15 (41.7)      0.636
7.2.2 Image Preprocessing
We first tiled all WSI into non-overlapping patches of size 256 by 256 pixels at
levels of down-sampling corresponding to 2.5X, 5X, 10X, and 20X magnification
(Figure 7.1). Pixel values ranging between 40 and 215 in greyscale space were
treated as informative tissue, and pixels outside this range were considered unin-
formative, either as background whitespace (> 215) or folded tissue (< 40). Only
patches with over 75% tissue percentage were kept for further training and testing.
Figure 7.1: A schematic for the end-to-end process of model training and
deployment. WSI are tiled into patches of 256x256 size at 2.5X, 5X, 10X, and 20X
magnification factors (A). In each training iteration (mini-batch), 200 randomly
selected and augmented patches from a single magnification of a single WSI were
passed to single-scale DenseNet-121 classifiers, initialized with ImageNet
pre-trained weights. Feature embedding vectors from each patch were then
aggregated using naïve averaging, and the resulting vector was then passed to a
final fully connected (linear) classifier (B). Following training, the predictions
of three versions of each single-scale model trained with different random seeds
were averaged to produce a single-scale ensemble, and the predictions from each
single-scale ensemble were averaged to produce the multiscale ensemble (MSE)
predictions (C).
All patches with significant blurriness or pen marks were excluded using
heuristically chosen RGB thresholds.
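The tissue filter described in this section can be sketched as below. This is an illustrative reading, not the original preprocessing code: it assumes a patch is given as rows of 8-bit greyscale values, that the bounds 40 and 215 are inclusive, and that "over 75%" means strictly greater than 0.75. The function names are hypothetical, and the blur and pen-mark filters are not covered because their thresholds are not reported.

```python
def tissue_fraction(grey_patch, low=40, high=215):
    """Fraction of pixels whose greyscale value lies in [low, high]."""
    total = 0
    tissue = 0
    for row in grey_patch:
        for value in row:
            total += 1
            # Below `low`: treated as folded tissue; above `high`: background.
            if low <= value <= high:
                tissue += 1
    return tissue / total if total else 0.0


def keep_patch(grey_patch, min_fraction=0.75):
    """Keep a patch only if more than `min_fraction` of it is tissue."""
    return tissue_fraction(grey_patch) > min_fraction
```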
7.2.3 Model Training
After the image preprocessing step, each WSI had four sets of patches correspond-
ing to magnifications of 2.5X, 5X, 10X, and 20X. Single-scale models were trained
for each scale. We used a pre-trained DenseNet-121 architecture [89], without the
last dense layer, as the feature extractor to generate patch-level embeddings of
length 1024. All patch-level embeddings from one slide generated in each iteration
were aggregated into slide-level embeddings using average pooling. A fully con-
nected layer with 1024 nodes was then implemented to take the aggregated slide-
level features as input and output slide-level IDH mutation probabilities. Due to
memory constraints, only 200 patches from one WSI were randomly selected and
passed to the network for each training step (Figure 7.1 B). If there were fewer
than 200 patches for one slide, we used all available patches in that mini-batch.
Note that each mini-batch consisted of a single WSI. To keep IDH classes balanced
during training, we randomly sampled 140 IDHwt slides and used all 140 IDHmut
slides in each training epoch. We used Adam to minimize the binary cross-entropy
loss [35,112] with a learning rate of 0.00001 and a maximum of 100 epochs [110].
Models from the epoch with the best validation loss were used. Three separate
single-scale models were trained using different random initial seeds.
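For reference, the single-scale slide-level model described above can be written compactly as follows. The notation is introduced here for illustration and does not appear in the original text: $f_\theta$ is the DenseNet-121 feature extractor, $x_i$ are the $N \le 200$ patches sampled from one slide, and $y \in \{0, 1\}$ is the slide label (1 for IDHmut). The sketch assumes the final fully connected layer maps the pooled 1024-dimensional embedding to a single logit followed by a sigmoid; the text does not state the output activation.

```latex
\begin{aligned}
e_i &= f_\theta(x_i) \in \mathbb{R}^{1024}, \qquad i = 1, \dots, N, \\
z &= \frac{1}{N} \sum_{i=1}^{N} e_i, \\
\hat{p} &= \sigma\left(w^\top z + b\right), \\
\mathcal{L} &= -\, y \log \hat{p} - (1 - y) \log\left(1 - \hat{p}\right).
\end{aligned}
```

Because the pooling is a plain mean, every sampled patch contributes equally to the slide-level prediction.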
7.2.4 Model Inference
The trained models from the last step can be used for predicting both patch-
level and slide-level IDH mutation status. We first averaged the three slide-level
probabilistic predictions at a given scale to compute single-scale predictions. A
multi-scale ensemble (MSE) was then computed by averaging all four single-scale
predictions (Figure 7.1 C). For patients with multiple slides, patient-level predic-
tions were computed by averaging slide-level predictions.
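The averaging steps above can be sketched as follows. The dictionary layout and function names are assumptions made for illustration; only the order of averaging (seeds, then scales, then slides of a patient) comes from the text.

```python
from statistics import mean


def multi_scale_ensemble(slide_probs):
    """Average per-seed probabilities within each scale, then across scales.

    slide_probs maps a scale name (e.g. "10X") to the list of slide-level
    IDHmut probabilities from the models trained with different seeds.
    """
    single_scale = {scale: mean(probs) for scale, probs in slide_probs.items()}
    mse = mean(list(single_scale.values()))
    return single_scale, mse


def patient_prediction(slide_level_scores):
    """Patient-level score as the mean of that patient's slide-level scores."""
    return mean(slide_level_scores)
```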
7.2.5 Pathologist Evaluation
The WCM test set was separately evaluated by two neuropathologists, blinded to
all patient information and ancillary testing beyond the WSI, to compare the model
predictions to human observers. For each case, both pathologists were asked to
issue a prediction for IDH status on a semiquantitative scale, normalized to the
range 0 to 1 (i.e., 0 for a prediction of IDHwt and 1 for IDHmut, with values close
to 0.5 for cases with low certainty). The pathologists’ predictions were then averaged to
generate a two-pathologist consensus score. The predictions from each pathologist
were averaged with the MSE prediction to generate hybrid classifier scores, and
the two-pathologist consensus score was averaged with the MSE predictions to
generate a two-pathologist consensus-hybrid model.
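A small sketch of how these scores combine is given below. The example values are made up for illustration and are not study data; the helper name is hypothetical.

```python
def average_scores(*score_lists):
    """Element-wise mean of equally long lists of scores in [0, 1]."""
    if len({len(s) for s in score_lists}) != 1:
        raise ValueError("all score lists must have the same length")
    return [sum(vals) / len(vals) for vals in zip(*score_lists)]


# Illustrative per-case scores (not study data).
p1 = [0.9, 0.2, 0.5]   # pathologist 1
p2 = [0.7, 0.4, 0.1]   # pathologist 2
mse = [0.6, 0.1, 0.8]  # multi-scale ensemble

consensus = average_scores(p1, p2)
hybrid_p1 = average_scores(p1, mse)
consensus_hybrid = average_scores(consensus, mse)
```

Note that averaging the consensus with the MSE gives the MSE a weight of one half and each pathologist a weight of one quarter, which differs from a three-way mean.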
7.2.6 Prediction Heatmap
Eight cases in the WCM test set were selected for visualization, covering
all possible IDH status combinations of ground-truth, pathologists’ ensemble, and
slide-level MSE predictions. We used a sliding-window strategy to generate an MSE
prediction heatmap. We set the window size to 256 × 256 and the step size to 256,
128, 64, and 32 for 20X, 10X, 5X, and 2.5X, respectively. Using this sliding-window
process, we passed patches containing greater than 50% tissue pixels through the
single-scale models. Pixel-level predictions were computed by averaging model
predictions for patches that contained that pixel, excluding patches below the 50%
tissue threshold. These regions were then manually examined by pathologists to
gain insights into the histologic features impacting predictions.
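The pixel-level averaging over overlapping windows can be sketched as below for a single magnification. This is an illustrative version under stated assumptions: each window is given as a hypothetical tuple (top, left, predicted probability, tissue fraction), and pixels covered by no retained window are returned as None. Combining the four scales into the MSE heatmap would additionally require mapping all scales to a common coordinate grid, which is omitted here.

```python
def pixel_heatmap(height, width, patches, window=256, min_tissue=0.5):
    """Average window-level predictions over every pixel they cover.

    patches: iterable of (top, left, prob, tissue_fraction) tuples.
    Windows with tissue_fraction <= min_tissue are excluded.
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top, left, prob, tissue in patches:
        if tissue <= min_tissue:
            continue
        for r in range(top, min(top + window, height)):
            for c in range(left, min(left + window, width)):
                total[r][c] += prob
                count[r][c] += 1
    return [
        [total[r][c] / count[r][c] if count[r][c] else None for c in range(width)]
        for r in range(height)
    ]
```

With a step size smaller than the window size, each pixel is covered by several windows, which smooths the resulting heatmap.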
7.2.7 UMAP Visualization
We randomly selected five 10X patches from each WSI in the WCM test set for
UMAP visualization [8, 147]. Patch embeddings extracted by the trained
convolutional base of the best-performing 10X classifier were used as patch
representations. We used the Python UMAP package with default hyper-parameters to
obtain the UMAP representations for each patch. For visualization purposes, the
first two dimensions of the UMAP projections were used as coordinates to
show the original input patches, ground-truth IDH mutation status, ground-truth
integrated molecular diagnosis (oligodendroglioma, IDHmut astrocytoma, IDHwt
astrocytoma), patch-level IDH prediction scores, and slide-level IDH prediction
scores from the classifier. The patches were then reviewed by the pathologists to
determine the presence of human-identifiable features in each cluster, and the
association of histomorphology with specific diagnoses.
7.2.8 Statistical Analysis and Software
All model training and inference were performed on 4 NVIDIA Titan X GPUs.
Image preprocessing, model training, and inference were conducted in Python,
version 3.7.4. OpenSlide Python was used for reading and tiling WSI. PyTorch was
used for training neural networks. All statistical analyses were performed in R,
version 4.0.3. Slide prediction heatmaps were plotted using the ComplexHeatmap
R package [69]. Age differences were evaluated using a t-test. A chi-square test
was used to test the gender difference between the two IDH status groups.
Confidence intervals of model performance metrics were estimated by bootstrap
resampling repeated 1000 times. All statistical tests were two-sided with a
significance threshold of p < 0.05.
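As an illustration of bootstrapped confidence intervals for a performance metric, the sketch below computes AUC and a percentile interval in Python. It is not the analysis code (the statistical analyses were performed in R), and the percentile method, the resampling unit, and the handling of single-class resamples are assumptions, since the text does not specify them.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties = 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (one common convention)."""
    if len(set(labels)) < 2:
        raise ValueError("both classes are required")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue  # assumption: redraw resamples that contain a single class
        stats.append(auc(sample_labels, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```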
7.2.9 Image Augmentation
To increase the model generalizability and reduce potential overfitting, we imple-
mented several image augmentation strategies during training. Since all patches
within each batch were from one WSI, color augmentations were performed on
slide level for each iteration, i.e., we only used one set of color augmentation pa-
rameters each iteration for all patches from each slide. We first transformed RGB
patches into HSV color space. Pixel values were then augmented channel-wise
as $I^{\mathrm{aug}}_c = \alpha_c I_c + \beta_c$, where $I_c$ denotes the pixel
values in channel $c$, and $\alpha_c$ and $\beta_c$ are channel-specific color
augmentation factors sampled from the uniform distributions
$U(1-\sigma, 1+\sigma)$ and $U(-\sigma, \sigma)$, respectively, for each slide.
We set $\sigma$ to 0.05 to control the degree of augmentation. In addition, each
patch had a 50% probability
of being flipped either vertically or horizontally and equal probability (25% each)
of being rotated by 0, 90, 180 or 270 degrees. Distinct augmentation parameters
were randomly generated during patch selection for each mini-batch.
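The augmentation scheme can be sketched as follows. This is a minimal illustration under several assumptions not stated in the text: HSV values are scaled to [0, 1], augmented values are clipped to that range, and the flip is read as a single 50% event whose direction is then chosen uniformly. All function names are hypothetical.

```python
import random


def sample_color_params(sigma=0.05, rng=random):
    """One (alpha, beta) pair per HSV channel, shared by all patches of a slide."""
    return [
        (rng.uniform(1 - sigma, 1 + sigma), rng.uniform(-sigma, sigma))
        for _ in range(3)
    ]


def augment_hsv_pixel(hsv, params):
    """Apply I_aug = alpha * I + beta per channel, clipped to [0, 1] (assumption)."""
    return tuple(
        min(1.0, max(0.0, alpha * value + beta))
        for value, (alpha, beta) in zip(hsv, params)
    )


def sample_geometry(rng=random):
    """Per-patch flip and rotation choices (one reading of the description)."""
    flip = None
    if rng.random() < 0.5:
        flip = rng.choice(["vertical", "horizontal"])
    rotation = rng.choice([0, 90, 180, 270])
    return flip, rotation
```

Sampling the color parameters once per slide and the geometric parameters once per patch mirrors the distinction drawn in the text between slide-level and patch-level augmentation.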
Figure 7.2: ROC curves for the ML classifiers, pathologists, and hybrid models on
the WCM test data. Figure A compares the model performance of the single-scale
ensembles and the multi-scale ensemble (MSE). The performance of the
semiquantitative predictions of two expert neuropathologists and the two-pathologist
averaged consensus are compared in Figure B. Figure C compares the predictions
of the top-performing neuropathologist with the MSE, and the hybrid model
generated by naïve averaging of pathologist and MSE predictions.
7.3 Results
7.3.1 ML Models Accurately Predict IDH Mutation Status
WSI obtained from the publicly available TCGA database were used for
training, including 801 (601 IDHwt and 200 IDHmut) slides (Table 7.1). These
were split into training, validation, and test sets. As an external validation set,
WSI from our institution (Weill Cornell Medicine) were used, comprising 174 (87
IDHwt and 87 IDHmut) slides. WSI were tiled into 256x256 pixel patches over
multiple down-sampled levels corresponding to 2.5X, 5X, 10X, and 20X
magnification (Figure 7.1; see methods). Single-scale models were trained using the
DenseNet-121 CNN architecture [89] and patch-level embeddings were aggregated
into slide-level embeddings via average pooling, which were then used to gener-
ate slide-level IDH mutation probabilities at output. 200 patches from each WSI
were randomly selected and passed to the network during each training step (Fig-
ure 7.1 B). A multi-scale ensemble (MSE) was then generated by averaging all the
predictions over the single-scale models (Figure 7.1C; see methods for detail).
Receiver operating characteristic (ROC) curves were generated for patient-level
predictions of IDH status evaluated on the WCM test dataset using 1) single-scale
models, 2) multiscale ensemble (MSE) ML model, 3) expert neuropathologist, and
4) hybrid neuropathologist-MSE scores. Within the ML model (Figure 7.2 A), a
trend towards higher accuracy at intermediate magnifications is noted, with the
highest accuracy achieved by the 10x classifier (AUC = 0.881, 95% confidence in-
terval = 0.88-0.883), with diminished AUCs seen in models using the lowest (2.5x)
and highest (20x) levels of magnification. No ML model demonstrated a supe-
rior AUC compared to neuropathologists (Figure 7.2 B), and consensus averaging
of the two neuropathologists’ semiquantitative predictions demonstrated a higher
AUC than each neuropathologist individually. Averaging the top performing neu-
ropathologist’s semiquantitative predictions with the MSE prediction scores to
generate a human-ML hybrid classifier (Figure 7.2 C) shows a higher AUC than
either the ML classifier or the pathologist alone and demonstrates performance
similar to that of the two-neuropathologist consensus (hybrid classifier AUC =
0.921, 95% confidence interval = 0.920-0.923 vs. neuropathologist consensus AUC
= 0.92, 95% confidence interval = 0.918-0.921). Additionally, averaging of two-
neuropathologist consensus with the ML model provides an incremental increase
in prediction accuracy (AUC = 0.928, 95% confidence interval 0.927-0.929).
[Figure 7.3, panel C: heatmap of predicted IDH mutant probability (color scale 0 to 1) per patient, grouped into IDH wild type and IDH mutant; rows: 2.5X, 5X, 10X, 20X, Multi-Scale Ensemble, P1, P2, P1+P2, P+MSE.]
Figure 7.3: Patient-level predictions in the WCM test data, for the pathologists
and ML models. Panel A compares the semiquantitative prediction scores of
the two neuropathologists (κ = 0.656, R = 0.767). Panel B compares the
two-neuropathologist consensus predictions to the multiscale classifier (κ = 0.598,
R = 0.674). Panel C shows all patient-level predictions using the single-scale
models, multiscale ensemble, individual pathologists (P1, P2), two-pathologist
consensus (P1+P2), and the hybrid classifier (P+MSE).
7.3.2 Single-Scale ML Models Make Distinct Errors Rela-
tive to Each Other and to Humans
Comparisons of patient-level predictions of the pathologists and classifiers using the
WCM data are shown in Figure 7.3. Figure 7.3 A shows a scatter plot comparing
the semiquantitative prediction scores of the two pathologists. Concordant predic-
tions are found in the yellow quadrants, while discordant predictions appear in the
pink quadrants. High densities of accurate predictions are located at the extremes
of the concordant regions, while inaccurate predictions are enriched in regions of
lower certainty. The Pearson coefficient r for the semiquantitative predictions of
the pathologists is 0.767, while the Cohen’s kappa for the binary predictions of the
pathologists is 0.656. Figure 7.3 B shows a scatter plot of the pathologist consen-
sus score (averaged semiquantitative predictions of the pathologists) compared to
the MSE predictions. The correlation between MSE and pathologist consensus is
less than between the two pathologists (Pearson coefficient r = 0.674), and corre-
spondingly there is a lower degree of concordance between the binary classifications
(Cohen’s kappa = 0.598). Among discordant cases, there is a slight enrichment
of IDHmut cases that are accurately predicted by the pathologists and missed by
the MSE, while there is slight enrichment of IDHwt cases accurately predicted by
the MSE and missed by the pathologists. Figure 7.3 C shows patient-level IDH
prediction scores from the single-scale and multi-scale ensemble classifiers, pathol-
ogists, and hybrid predictions, highlighting the orthogonal nature of errors made
at individual levels of magnification.
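The two agreement measures quoted above can be computed as in the sketch below. It assumes that binary predictions are obtained by thresholding the semiquantitative scores at 0.5, with scores of exactly 0.5 assigned to IDHwt; the text does not state how the binarization was done, and the function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def cohen_kappa(a, b):
    """Cohen's kappa for two lists of categorical ratings of the same cases."""
    n = len(a)
    observed = sum(1 for u, v in zip(a, b) if u == v) / n
    categories = set(a) | set(b)
    # Chance agreement from each rater's marginal category frequencies.
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)


def binarize(scores, threshold=0.5):
    """Assumed rule: scores above the threshold count as IDHmut (1)."""
    return [1 if s > threshold else 0 for s in scores]
```

Kappa is undefined when chance agreement equals 1 (both raters use a single identical category), so that case would need separate handling.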
Figure 7.4: Examples of the sliding-window visualizations, with representative
patches from regions of 3 example cases that provide insight into
features recognized by the classifier. (A) shows a low power HE image of a slide
that was accurately predicted as IDHmut by the neuropathologists, but was in-
correctly classified by the MSE. (B) shows a heatmap of average pixel-level IDH
mutation status predictions. Selected patches from image A demonstrate higher
IDHmut predictions in regions of solid tumor (C), with higher IDHwt predic-
tions in regions of minimally involved brain parenchyma (D). E and F show an
example of a slide from an IDHmut case, which was misclassified by both the
neuropathologists and the ML classifier. Regions from this slide containing tumor
with monomorphic gemistocytic cytomorphology (G) and regions of minimally
involved brain parenchyma with perineuronal and perivascular white space artifact
(H) were associated with a higher prediction for IDHmut, while areas of mini-
mally involved brain parenchyma without significant whitespace artifact (I) and
regions with more bizarre cytology (J) were associated with a higher prediction of
IDHwt status. Figures K and L show a slide from an IDHmut glioma which was
accurately predicted by the ML classifier, but inaccurately predicted by the neu-
ropathologists. Areas of mildly cellular tumor, both with and without whitespace
artifact (M and N respectively) were associated with higher IDHmut predictions,
while regions of necrosis (O) and regions of minimally involved brain parenchyma
(P) were associated with higher IDHwt predictions.
7.3.3 Patch-Level Predictions Reveal Features that Drive
Accurate and Inaccurate Predictions
To gain insight into (1) the decision-related morphological features of the ML mod-
els and (2) the types of errors made by both the classifiers and pathologists, sliding
patch-level IDH predictions were generated for selected slides using the MSE, three
of which will be examined in further detail here (Figure 7.4). In the first infor-
mative case (Figure 7.4 A-D), neuropathologists were correct in predicting IDH
mutation, but the case was inaccurately predicted by the MSE to be IDHwt at the
slide-level. Regions shown in yellow (Figure 7.4 C) were predicted by the MSE as
consistent with IDH mutation, and were also recognized by the neuropathologists
as harboring relatively hypercellular infiltrating tumor that was likely IDH-mutant.
In contrast, regions encoded in blue (Figure 7.4 D) drove the overall slide-level
misclassification of MSE. These regions were enriched in brain parenchyma with
minimal to no infiltration by tumor cells (as determined by human examination)
and were disregarded as non-contributory to the IDH classification task by the
neuropathologists. Thus, although the classifier was correct in determining that
these areas were not enriched for IDH-mutated tumor, the binary classification
task of determining the slide’s overall IDH status was evidently hampered by the
large presence of uninvolved brain.
In a second case (Figure 7.4 E-J), that of an IDHmut glioma that was in-
accurately classified by both the neuropathologists and the MSE, many regions
harbored a relatively monomorphic gemistocytic cytomorphology (Figure 7.4 G).
These regions were accurately interpreted by the classifier as consistent with
IDHmut status, and in retrospect also likely would have been favored to represent
IDH-mutated tumor to the neuropathologists if presented in isolation. However,
one region of marked nuclear pleomorphism (Figure 7.4 J) was considered by the
classifier as IDHwt, and this same region also drove the neuropathologists’
decision to classify the entire slide as likely IDHwt. Human-determined
‘uninformative regions’ again drove inaccurate MSE classification of particular
areas: regions of uninvolved brain with increased white-space around individual
neurons and vascular channels due to tissue processing (Figure 7.4 H) were
predicted as IDHmut by the MSE, while regions of relatively uninvolved brain
without significant intraparenchymal white-space (Figure 7.4 I) were again
erroneously predicted as IDHwt, similar to the first sample case.
The final example (Figure 7.4 K-P) illustrates an IDHmut glioma inaccurately
predicted by the neuropathologists as IDHwt, but correctly predicted by the MSE.
In this case, solid regions of tumor (Figure 7.4 M and N) were accurately predicted
by the ML classifier as areas with (IDH-mutated) tumor. A large area of necrosis
was present in this slide (Figure 7.4 O), which drove inaccurate prediction of
IDHwt by both neuropathologists, and this area in isolation was also classified
as IDHwt by the MSE. Once again, regions of minimally involved normal brain
(Figure 7.4 P) were predicted as correlating with IDHwt by the MSE.
7.3.4 Patch-Level Embedding Vectors Reflect Diagnosti-
cally Relevant Human-Identifiable Features
To gain further insight into the histological features encoded by our trained ML
models, 5 random patches were selected from each slide in the WCM dataset and
uniform manifold approximation and projection (UMAP) was performed on the
patch-level embedding vectors from the best performing 10x-scale classifier (Fig-
ure 7.5 A-B). Review of the histological features of the clustered patches revealed
remarkably consistent patterns across patches from distinct slides. Emergent
human-identifiable features included: 1) microcystic architecture (Figure 7.5C),
which is correlated with IDHmut status; 2) hypercellular regions of tumor with
round, monomorphic nuclei, reminiscent of oligodendrocytes, which were appro-
priately enriched for IDHmut tumors (Figure 7.5 D); 3) hypercellular tumor areas
Figure 7.5: UMAP coordinates of the feature embedding vector activations from
patches passed through the 10x classifier. A shows some example tiles in 2D UMAP
coordinates. B shows the patch-level IDH status prediction scores as predicted by
the 10x classifier. Tiles from region C demonstrate microcystic architecture. Tiles
from region D demonstrate hypercellular regions of infiltrating tumor, with round
cytology, enriched for tumors with oligodendroglial morphology. Tiles from region
E demonstrate hypercellular regions of tumor with a greater degree of nuclear
spindling/elongation and nuclear pleomorphism. Tiles from region F demonstrate
brain parenchyma without significant infiltration by tumor cells.
with spindled nuclei and greater pleomorphism, enriched for IDHwt tumors (Fig-
ure 7.5E); and 4) brain parenchyma without significant human-detectable involve-
ment by tumor (by HE), that were predicted by the classifier as harboring IDH
mutation irrespective of the ground truth slide-level class (Figure 7.5 F). Other
features apparently captured by the embedding vectors include patches with a sig-
nificant amount of whitespace (Figure 7.5 A, top-right) and regions with abundant
hemorrhage or necrosis (Figure 7.5 A, top center).
7.4 Discussion
Just as our understanding of the molecular biology and oncogenic drivers of neo-
plastic disease evolves quickly, iteratively impacting the diagnostic framework upon
which clinicians rely to treat their patients, the accessibility of computing power
and ML techniques is also evolving at a rapid pace. An open question in medical
diagnostics is whether existing but previously untapped data-rich resources,
such as histological images made computationally accessible through scanning and
digitization, effectively encode information that is comparable or even superior to
current standard-of-care diagnostic and/or prognostic testing modalities in guiding
effective patient care. To begin to attack this question, we selected a prototypical
problem in glioma diagnostics, endeavoring to predict one of the most prognosti-
cally significant molecular markers in glioma biology, IDH mutation, using CNN
models with HE-based histological information as the sole input. Moreover, we
compared the performance of this task over multiple magnification scales. As a
reference point, we compared the performance of the CNN models, trained on the
order of hours, with those of subspecialty-trained expert neuropathologists, trained
on the order of years to decades.
Comparison of ML model predictions to expert pathologists shows that while
similar degrees of accuracy are obtained on the classification task, the types of er-
rors made were distinct, and the combination of human pathologist predictions and
ML predictions results in greater classification robustness than either alone. Man-
ual interrogation of the patch-level predictions from accurately and inaccurately
classified slides demonstrates several likely confounders exploited by the ML mod-
els which, interestingly, were found to be reproducible at all levels of magnification.
The most striking sources of error in our models correlate with regions of human-
interpreted low informativity within the underlying tissue. In particular, regions
with abundant non-neoplastic brain tissue and/or without human-detectable evi-
dence of infiltrating tumor cells were most often classified as IDHwt, while regions
with increased white-space secondary to vacuolation, edema, and/or tissue arti-
facts, including histologic ‘cracks’ in the tissue and cautery artifact, were most
often classified as IDHmut. While areas lacking tumor are indeed IDHwt per se
given the putative absence of tumor cells, the task was built around slide-level
classification of de facto tumors. Our interpretation, therefore, is that classifying
these regions as IDHwt on-average drove the classifier to a higher degree of accu-
racy overall, despite the patch-level ‘uninformativeness’ as determined by human
observers. A future direction to address the possible confounding effects of such
regions is to explicitly annotate and train toward a third-class label, that of “non-
neoplastic brain” through the incorporation of autopsy and epilepsy histological
slides into the training set. Surprisingly, in the set of sliding window heatmaps an-
alyzed, the models were not clearly driven by features that pathologists often used
to predict IDH-class due to their enrichment in IDHwt tumors, such as well-formed
palisading necrosis and microvascular proliferation.
The presence of human-identifiable features as seen in the UMAP projections
demonstrates that the CNNs are capable of recognizing some of the features used
by humans in the classification of gliomas. In particular, we showed that patches
demonstrating microcystic architecture or oligodendroglial cytomorphology were
enriched for IDHmut classification while patches with increased spindled cells and
pleomorphism were enriched for IDHwt classification. In fact, histomorphologic
correlates of certain driver alterations have been previously identified, such as
giant-cell morphology in IDHwt glioblastomas harboring TP53 mutation, and ep-
ithelioid morphology in high grade gliomas harboring BRAF mutations; however,
given the heterogeneity of infiltrating gliomas, and particularly in IDHwt astrocy-
tomas/glioblastomas, these morphologic correlates as assessed by human pathol-
ogists have relatively poor predictive utility. The UMAP also clearly illustrated
that regions of human-interpreted low informational value relative to the task were
enriched for particular classes, such as normal appearing brain being enriched for
IDHwt class. Again, the identification of recurrent potential confounders across
these models suggests that strategies to devalue or exclude uninformative patches
could further improve classification accuracy, and expanding the number of avail-
able classes to include non-neoplastic samples, as alluded to above, may improve
ML performance. In addition, we believe that, given a sufficiently large dataset
of histologic data paired with RNA transcriptome and DNA methylation profiling,
histomorphologic correlates may be identified; however, further studies will be
necessary to assess this.
How to aggregate patch-level predictions into a slide-level classification is a
widely studied problem – often in the multiple-instance learning literature. For ex-
ample, attention mechanisms that increase the weight of highly informative patches
on the final classification prediction have been found to be useful in other cancer
types. However, the differing biological characteristics of tumor types that are
reflected in histology (for example that infiltrating gliomas typically have an ill-
defined border with respect to the surrounding non-neoplastic tissue, a feature that
differs significantly from that of epithelial cancers) are likely to impact the rela-
tive efficacy of any particular ML algorithm, and the strategies employed are not
likely to be universally applicable to models trained for relatively discrete diagnos-
tic tasks. In our experiments conducted with this dataset, attention mechanisms
did not provide a significant improvement in classification performance relative to
naïve averaging of embedding weights (data not shown). That said, as the number
of potential target outputs of the model increases, attention mechanisms may help
boost performance, but future studies using a broader variety of target classes are
necessary to better assess this.
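The two aggregation strategies compared above can be illustrated with a minimal NumPy sketch. It contrasts naive averaging of patch embeddings with attention-weighted pooling. The function names, the toy dimensions, and the single attention vector `w` are illustrative assumptions and do not necessarily reflect the implementation used in this study.

```python
import numpy as np

def mean_pool(patch_embeddings):
    """Naive averaging: every patch contributes equally to the slide embedding."""
    return patch_embeddings.mean(axis=0)

def attention_pool(patch_embeddings, w):
    """Attention pooling: patches are weighted by a softmax over per-patch scores.

    `w` is a hypothetical attention vector, assumed to have been learned elsewhere.
    """
    scores = patch_embeddings @ w          # one scalar score per patch
    scores = scores - scores.max()         # stabilise the softmax numerically
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ patch_embeddings, weights

rng = np.random.default_rng(0)
patches = rng.normal(size=(50, 8))         # 50 patches, 8-dim embeddings (toy sizes)
w = rng.normal(size=8)

slide_mean = mean_pool(patches)
slide_attn, attn_weights = attention_pool(patches, w)
```

With a zero attention vector the weights become uniform and attention pooling reduces to naive averaging, which is one way to see why the two strategies can perform similarly when no patch is much more informative than the others.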
Our results demonstrate that the level of magnification used for input images
does impact ML model accuracy, with the greatest levels of accuracy achieved at
intermediate levels of magnification (corresponding to 10x objective in our study).
One interpretation of this finding is that while lower levels of magnification provide
a larger field of view with a greater degree of overall tissue sampling and increased
architectural information, higher levels of magnification provide increased cytologic
detail yet with a smaller field of view. Intermediate levels of magnification may
represent a sweet spot that captures both low-power and high-power information
relevant to the task. Of practical importance, some of the errors made at different
levels of magnification were found to be orthogonal to each other, and to the errors
made by human observers, providing a rationale for multi-scale ML models and hy-
brid ML-human approaches to histological diagnosis. At the same time, we believe
that designing ML models to more explicitly recapitulate the human methodology
of examining the entire slide at lower power, and then selecting regions of interest
to interrogate at higher power could result in more robust model prediction accu-
racy while also reducing the computational cost necessary to interrogate an entire
image at high power. However, future studies will be necessary to confirm this.
This study has demonstrated that ML models can achieve near human-level
performance at predicting clinically relevant oncologic biomarkers using HE-based
histological information alone, even with a completely external test set and with
training times and slide exposure that are minimal compared to those needed to
train human subspecialty experts. Moreover, by analyzing single magnification and
multi-scale input models and by interrogating encoded features through heatmap
and UMAP visualizations of patch-level predictions, crucial insights into how to
iteratively improve the ML models can be obtained. This study represents a
proof-of-principle that ML models hold great promise in approaching and potentially
superseding human-level performance in critical biomarker detection via deep learning
of widely accessible HE slides, potentially unmasking a full-circle return to the HE
as the gold standard for oncological diagnostics and prognostication.
CHAPTER 8
CONCLUSION
As we have discussed in this thesis, reliability is crucial to the successful
application of deep neural networks to real-world problems. We proposed
several methods to address this important problem.
In Chapter 2, we proposed theoretically grounded and easy-to-use classes of
noise-robust loss functions, the Lq loss and the truncated Lq loss, for classification
with noisy labels that can be employed with any existing DNN algorithm. We
empirically verified noise robustness on various datasets with both closed- and
open-set noise scenarios.
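As a reminder of what these losses compute, the sketch below assumes the standard form of the Lq loss, L_q(f(x), e_j) = (1 - f_j(x)^q) / q with q in (0, 1], and a truncated variant that assigns the constant loss (1 - k^q) / q to samples whose predicted probability for the given label is at most k. The default values q = 0.7 and k = 0.5 and the toy inputs are illustrative assumptions only.

```python
import numpy as np

def lq_loss(probs, labels, q=0.7):
    """Lq loss (1 - p^q) / q, where p is the probability assigned to the given label."""
    p = probs[np.arange(len(labels)), labels]
    return (1.0 - p ** q) / q

def truncated_lq_loss(probs, labels, q=0.7, k=0.5):
    """Truncated variant: samples with p <= k receive the constant loss (1 - k^q) / q."""
    p = probs[np.arange(len(labels)), labels]
    return (1.0 - np.maximum(p, k) ** q) / q

# Toy softmax outputs for three samples; the last label disagrees with the prediction.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
labels = np.array([0, 1, 1])

print(lq_loss(probs, labels))
print(truncated_lq_loss(probs, labels))
```

Setting q = 1 gives 1 - p, which is proportional to the mean absolute error, while letting q approach 0 recovers the cross-entropy loss -log p, so q interpolates between a noise-robust loss and a fast-converging one.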
In Chapter 3, we reinterpreted MC dropout as an ensemble averaging strategy,
and attributed its poor performance in convolutional neural networks to a lack of
diversity of sampled models using the error-ambiguity decomposition of the Brier
score (or MSE), a widely used performance metric that captures both accuracy
and calibration of probabilistic outputs. As we demonstrated empirically, omnibus
dropout, which is simple to implement and computationally efficient, strikes
the right balance between diversity among sampled models and reasonable
performance of individual models, thereby consistently improving the
quality of the ensemble’s prediction.
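The error-ambiguity decomposition referred to above can be checked numerically. For a uniformly weighted ensemble, the Brier score of the averaged prediction equals the average Brier score of the members minus the average ambiguity, that is, the spread of the members around the ensemble mean. The sketch below uses random toy predictions; the function name and array shapes are illustrative assumptions.

```python
import numpy as np

def brier_ambiguity_decomposition(member_probs, onehot):
    """Return (ensemble error, average member error, average ambiguity).

    member_probs: array of shape (M, N, C) with M members, N samples, C classes.
    onehot:       array of shape (N, C) with one-hot labels.
    """
    ensemble = member_probs.mean(axis=0)                                  # (N, C)
    ens_err = ((ensemble - onehot) ** 2).sum(axis=-1).mean()
    avg_err = ((member_probs - onehot) ** 2).sum(axis=-1).mean()
    ambiguity = ((member_probs - ensemble) ** 2).sum(axis=-1).mean()
    return ens_err, avg_err, ambiguity

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 100, 3))                                     # 5 toy members
member_probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
onehot = np.eye(3)[rng.integers(0, 3, size=100)]

ens_err, avg_err, ambiguity = brier_ambiguity_decomposition(member_probs, onehot)
```

Because the ambiguity term is non-negative, the ensemble is never worse than the average member under this metric, and more diverse members yield a larger gain provided their individual errors do not grow too much.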
In Chapter 4, we proposed orthogonal dropout, an easy-to-implement tech-
nique that allows us to split a single high capacity neural network model into an
ensemble of subnetworks. In our experiments, we demonstrated and discussed the
regularization effect achieved by training pruned networks and using a randomized
and frozen fully connected layer in the network. Finally, we presented exhaus-
tive results that show that our method consistently outperforms several of the
recently proposed state-of-the-art methods for efficient ensembles. Furthermore,
our method achieved accuracy and uncertainty values matching those of an explicit
deep ensemble, while demanding significantly less storage. Lastly, we described
several shortcomings of the proposed method for potential future exploration.
In Chapter 5, we presented a two-stage teacher-student framework for fast
uncertainty estimates. The proposed student training procedure is not only capable
of producing uncertainty estimates at no extra cost but also leads to improved
predictive performance and more informative uncertainty estimates. We believe
the method gets us one step closer to the realm of trustworthy deep learning for
computer vision.
In Chapter 6, we provided empirical evidence that diversity in teacher predictions
is correlated with student performance in self-distillation. Inspired by this
observation, we offered an amortized MAP interpretation of the popular teacher-student
training strategy. This novel viewpoint provides insights into self-distillation
and suggests ways to improve it. For example, the results obtained with Beta
smoothing suggest that there may be better and/or more efficient ways
to obtain priors for instance-specific regularization.
Finally, in Chapter 7, we applied deep learning to a real-world digital whole-slide
histopathology image analysis problem. Specifically, we showed that combining an
expert pathologist’s assessments with ML model predictions can classify IDH mu-
tation status in infiltrating gliomas at a comparable level to two-expert consensus.
This serves as proof of principle for the broader application of neural network mod-
els in deriving clinically relevant molecular markers based on histopathology alone.
We also demonstrated that ML classification performance varies with level of mag-
nification, and that discordant errors are made across scales, suggesting value in
ensembling across levels of magnification.
BIBLIOGRAPHY
[1] Javier Antorán, James Urquhart Allingham, and José Miguel Hernández-
Lobato. Depth uncertainty in neural networks. arXiv preprint
arXiv:2006.08437, 2020.
[2] Devansh Arpit, Stanisław Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel
Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron
Courville, Yoshua Bengio, et al. A closer look at memorization in deep net-
works. arXiv preprint arXiv:1706.05394, 2017.
[3] Samaneh Azadi, Jiashi Feng, Stefanie Jegelka, and Trevor Darrell. Auxil-
iary image regularization for deep cnns with noisy labels. arXiv preprint
arXiv:1511.07069, 2015.
[4] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In
Advances in neural information processing systems, pages 2654–2662, 2014.
[5] Anoop Korattikara Balan, Vivek Rathod, Kevin P Murphy, and Max
Welling. Bayesian dark knowledge. In Advances in Neural Information Pro-
cessing Systems, pages 3438–3446, 2015.
[6] Björn Barz and Joachim Denzler. Do we train on test data? purging cifar of
near-duplicates. arXiv preprint arXiv:1902.00423, 2019.
[7] Mokhtar S Bazaraa, Hanif D Sherali, and Chitharanjan M Shetty. Nonlinear
programming: theory and algorithms. John Wiley & Sons, 2013.
[8] Etienne Becht, Leland McInnes, John Healy, Charles-Antoine Dutertre, Im-
manuel WH Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell.
Dimensionality reduction for visualizing single-cell data using umap. Nature
biotechnology, 37(1):38–44, 2019.
[9] Jan Beirlant, Edward J Dudewicz, László Györfi, and Edward C Van der
Meulen. Nonparametric entropy estimation: An overview. International
Journal of Mathematical and Statistical Sciences, 6(1):17–39, 1997.
[10] William H Beluch, Tim Genewein, Andreas Nürnberger, and Jan M Köhler.
The power of ensembles for active learning in image classification. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 9368–9377, 2018.
[11] Lorenzo Bertoni, Sven Kreiss, and Alexandre Alahi. Monoloco: Monocular
3d pedestrian localization and uncertainty estimation. In Proceedings of the
IEEE International Conference on Computer Vision, pages 6861–6871, 2019.
[12] Léonard Blier and Yann Ollivier. The description length of deep learning
models. In Advances in Neural Information Processing Systems, pages 2216–
2226, 2018.
[13] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wier-
stra. Weight uncertainty in neural network. In International Conference on
Machine Learning, pages 1613–1622, 2015.
[14] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for
large-scale machine learning. Siam Review, 60(2):223–311, 2018.
[15] George EP Box and David R Cox. An analysis of transformations. Journal
of the Royal Statistical Society. Series B (Methodological), pages 211–252,
1964.
[16] Daniel J Brat, Kenneth Aldape, Howard Colman, Dominique Figarella-Branger,
Gregory N Fuller, Caterina Giannini, Eric C Holland, Robert B
Jenkins, Bette Kleinschmidt-DeMasters, Takashi Komori, et al. cimpact-now
update 5: recommended grading criteria and terminologies for idh-mutant
astrocytomas. Acta neuropathologica, 139(3):603–608, 2020.
[17] Andrew Brock, Soham De, Samuel L Smith, and Karen Simonyan. High-
performance large-scale image recognition without normalization. arXiv
preprint arXiv:2102.06171, 2021.
[18] J Paul Brooks. Support vector machines with the ramp loss and the hard
margin loss. Operations research, 59(2):467–479, 2011.
[19] Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla.
Segmentation and recognition using structure from motion point clouds. In
ECCV (1), pages 44–57, 2008.
[20] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model com-
pression. In Proceedings of the 12th ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 535–541, 2006.
[21] Samuel Rota Bulò, Lorenzo Porzi, and Peter Kontschieder. Dropout dis-
tillation. In International Conference on Machine Learning, pages 99–107,
2016.
[22] Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W Mahoney,
and Kurt Keutzer. Zeroq: A novel zero shot quantization framework. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 13169–13178, 2020.
[23] Gabriele Campanella, Matthew G Hanna, Luke Geneslaw, Allen Miraflor,
Vitor Werneck Krauss Silva, Klaus J Busam, Edi Brogi, Victor E Reuter,
David S Klimstra, and Thomas J Fuchs. Clinical-grade computational
pathology using weakly supervised deep learning on whole slide images. Na-
ture medicine, 25(8):1301–1309, 2019.
[24] David Capper, David TW Jones, Martin Sill, Volker Hovestadt, Daniel
Schrimpf, Dominik Sturm, Christian Koelsche, Felix Sahm, Lukas Chavez,
David E Reuss, et al. Dna methylation-based classification of central nervous
system tumours. Nature, 555(7697):469–474, 2018.
[25] Shih-Kang Chao, Zhanyu Wang, Yue Xing, and Guang Cheng. Directional
pruning of deep neural networks. Advances in Neural Information Processing
Systems, 33, 2020.
[26] Hanting Chen, Yunhe Wang, Chang Xu, Zhaohui Yang, Chuanjian Liu,
Boxin Shi, Chunjing Xu, Chao Xu, and Qi Tian. Data-free learning of student
networks. In Proceedings of the IEEE International Conference on Computer
Vision, pages 3514–3522, 2019.
[27] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam.
Rethinking atrous convolution for semantic image segmentation. arXiv
preprint arXiv:1706.05587, 2017.
[28] Liyan Chen, Philip Gautier, and Sergul Aydore. Dropcluster: A structured
dropout for convolutional networks. arXiv preprint arXiv:2002.02997, 2020.
[29] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient hamil-
tonian monte carlo. In International conference on machine learning, pages
1683–1691, 2014.
[30] Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge dis-
tillation. In Proceedings of the IEEE International Conference on Computer
Vision, pages 4794–4802, 2019.
[31] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus
Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele.
The cityscapes dataset for semantic urban scene understanding. In Proc. of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2016.
[32] Nicolas Coudray, Paolo Santiago Ocampo, Theodore Sakellaropoulos,
Navneet Narula, Matija Snuderl, David Fenyö, Andre L Moreira, Narges
Razavian, and Aristotelis Tsirigos. Classification and mutation prediction
from non–small cell lung cancer histopathology images using deep learning.
Nature medicine, 24(10):1559–1567, 2018.
[33] Alexander D’Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak
Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein,
Matthew D Hoffman, et al. Underspecification presents challenges for credi-
bility in modern machine learning. arXiv preprint arXiv:2011.03395, 2020.
[34] Jeffrey De Fauw, Joseph R Ledsam, Bernardino Romera-Paredes, Stanislav
Nikolov, Nenad Tomasev, Sam Blackwell, Harry Askham, Xavier Glorot,
Brendan O’Donoghue, Daniel Visentin, et al. Clinically applicable deep learn-
ing for diagnosis and referral in retinal disease. Nature medicine, 24(9):1342–
1350, 2018.
[35] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Im-
agenet: A large-scale hierarchical image database. In Computer Vision and
Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–
255. IEEE, 2009.
[36] James A Diao, Jason K Wang, Wan Fung Chui, Victoria Mountain,
Sai Chowdary Gullapally, Ramprakash Srinivasan, Richard N Mitchell, Ben-
jamin Glass, Sara Hoffman, Sudha K Rao, et al. Human-interpretable image
features derived from densely mapped cancer pathology slides predict diverse
molecular phenotypes. Nature communications, 12(1):1–15, 2021.
[37] Thomas G Dietterich. Ensemble methods in machine learning. In Interna-
tional workshop on multiple classifier systems, pages 1–15. Springer, 2000.
[38] Abhimanyu Dubey, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik.
Maximum-entropy fine grained classification. In Advances in Neural Infor-
mation Processing Systems, pages 637–647, 2018.
[39] Nikita Durasov, Timur Bagautdinov, Pierre Baque, and Pascal Fua.
Masksembles for uncertainty estimation. arXiv preprint arXiv:2012.08334,
2020.
[40] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture
search: A survey. The Journal of Machine Learning Research, 20(1):1997–
2017, 2019.
[41] Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter,
Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of
skin cancer with deep neural networks. nature, 542(7639):115–118, 2017.
[42] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman.
The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.
http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[43] Di Feng, Lars Rosenbaum, and Klaus Dietmayer. Towards safe autonomous
driving: Capture uncertainty in the deep neural network for lidar 3d vehicle
detection. In 2018 21st International Conference on Intelligent Transporta-
tion Systems (ITSC), pages 3266–3273. IEEE, 2018.
[44] Davide Ferrari, Yuhong Yang, et al. Maximum lq-likelihood estimation. The
Annals of Statistics, 38(2):753–783, 2010.
[45] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles:
A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.
[46] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Find-
ing sparse, trainable neural networks. In International Conference on Learn-
ing Representations, 2018.
[47] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael
Carbin. Pruning neural networks at initialization: Why are we missing the
mark? arXiv preprint arXiv:2009.08576, 2020.
[48] Benoît Frénay and Michel Verleysen. Classification in the presence of label
noise: a survey. IEEE transactions on neural networks and learning systems,
25(5):845–869, 2014.
[49] Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and
Anima Anandkumar. Born again neural networks. In International Confer-
ence on Machine Learning, pages 1607–1616, 2018.
[50] Yarin Gal et al. Uncertainty in deep learning.
[51] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation:
Representing model uncertainty in deep learning. In international conference
on machine learning, pages 1050–1059. PMLR, 2016.
[52] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application
of dropout in recurrent neural networks. In Advances in neural information
processing systems, pages 1019–1027, 2016.
[53] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Proceedings of
the 31st International Conference on Neural Information Processing Systems,
pages 3584–3593, 2017.
[54] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active
learning with image data. In Proceedings of the 34th International Conference
on Machine Learning-Volume 70, pages 1183–1192. JMLR.org, 2017.
[55] Xavier Gastaldi. Shake-shake regularization. arXiv preprint
arXiv:1705.07485, 2017.
[56] Yonatan Geifman, Guy Uziel, and Ran El-Yaniv. Bias-reduced uncertainty
estimation for deep neural classifiers. International Conference on Learning
Representations, 2019.
[57] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision
meets robotics: The kitti dataset. International Journal of Robotics Research
(IJRR), 2013.
[58] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Dropblock: A regulariza-
tion method for convolutional networks. In Advances in Neural Information
Processing Systems, pages 10727–10737, 2018.
[59] Aritra Ghosh, Himanshu Kumar, and PS Sastry. Robust loss functions under
label noise for deep neural networks. In AAAI, pages 1919–1925, 2017.
[60] Aritra Ghosh, Naresh Manwani, and PS Sastry. Making risk minimization
tolerant to label noise. Neurocomputing, 160:93–107, 2015.
[61] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks
using a noise adaptation layer. 2016.
[62] Siavash Golkar, Michael Kagan, and Kyunghyun Cho. Continual learning
via neural pruning. arXiv preprint arXiv:1903.04476, 2019.
[63] Wenbo Gong, Yingzhen Li, and José Miguel Hernández-Lobato. Meta-
learning for stochastic gradient mcmc. International Conference on Learning
Representations, 2019.
[64] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and
harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[65] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and
Yoshua Bengio. Maxout networks. International Conference on Machine
Learning, 2013.
[66] Alex Graves. Practical variational inference for neural networks. In Advances
in neural information processing systems, pages 2348–2356, 2011.
[67] Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. A
survey of deep learning techniques for autonomous driving. Journal of Field
Robotics, 37(3):362–386, 2020.
[68] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy,
Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al. Recent
advances in convolutional neural networks. Pattern Recognition, 77:354–377,
2018.
[69] Zuguang Gu, Roland Eils, and Matthias Schlesner. Complex heatmaps reveal
patterns and correlations in multidimensional genomic data. Bioinformatics,
32(18):2847–2849, 2016.
[70] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of
modern neural networks. In International Conference on Machine Learning,
pages 1321–1330. PMLR, 2017.
[71] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for
efficient dnns. In Proceedings of the 30th International Conference on Neural
Information Processing Systems, pages 1387–1395, 2016.
[72] Bo Han, Jiangchao Yao, Gang Niu, Mingyuan Zhou, Ivor Tsang, Ya Zhang,
and Masashi Sugiyama. Masking: A new perspective of noisy supervision.
arXiv preprint arXiv:1805.08193, 2018.
[73] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing
deep neural networks with pruning, trained quantization and huffman coding.
arXiv preprint arXiv:1510.00149, 2015.
[74] Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights
and connections for efficient neural networks. In Proceedings of the 28th
International Conference on Neural Information Processing Systems-Volume
1, pages 1135–1143, 2015.
[75] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize bet-
ter: Stability of stochastic gradient descent. In International conference on
machine learning, pages 1225–1234. PMLR, 2016.
[76] Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The elements of
statistical learning: data mining, inference, and prediction, volume 2.
Springer, 2009.
[77] Marton Havasi, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu,
Jasper Snoek, Balaji Lakshminarayanan, Andrew M Dai, and Dustin Tran.
Training independent subnetworks for robust prediction. arXiv preprint
arXiv:2010.06610, 2020.
[78] Soufiane Hayou, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh.
Robust pruning at initialization.
[79] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 770–778, 2016.
[80] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity map-
pings in deep residual networks. In European conference on computer vision,
pages 630–645. Springer, 2016.
[81] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network
robustness to common corruptions and perturbations. arXiv preprint
arXiv:1903.12261, 2019.
[82] Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using
trusted data to train deep networks on labels corrupted by severe noise. arXiv
preprint arXiv:1802.05300, 2018.
[83] Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, and
Jin Young Choi. A comprehensive overhaul of feature distillation. In Pro-
ceedings of the IEEE International Conference on Computer Vision, pages
1921–1930, 2019.
[84] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a
neural network. arXiv preprint arXiv:1503.02531, 2015.
[85] Elad Hoffer, Itay Hubara, and Daniel Soudry. Fix your classifier: the
marginal value of training the last weight layer. In International Confer-
ence on Learning Representations, 2018.
[86] Todd C Hollon, Balaji Pandian, Arjun R Adapa, Esteban Urias, Akshay V
Save, Siri Sahib S Khalsa, Daniel G Eichberg, Randy S D’Amico, Zia U
Farooq, Spencer Lewis, et al. Near real-time intraoperative brain tumor
diagnosis using stimulated raman histology and deep neural networks. Nature
medicine, 26(1):52–58, 2020.
[87] Craig Horbinski, Julia Kofler, Lindsey M Kelly, Geoffrey H Murdoch, and
Marina N Nikiforova. Diagnostic use of idh1/2 mutation analysis in routine
clinical testing of formalin-fixed, paraffin-embedded glioma tissues. Journal
of Neuropathology & Experimental Neurology, 68(12):1319–1325, 2009.
[88] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E Hopcroft, and
Kilian Q Weinberger. Snapshot ensembles: Train 1, get m for free. arXiv
preprint arXiv:1704.00109, 2017.
[89] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger.
Densely connected convolutional networks. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pages 4700–4708, 2017.
[90] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger.
Deep networks with stochastic depth. In European Conference on Computer
Vision, pages 646–661. Springer, 2016.
[91] Po-Yu Huang, Wan-Ting Hsu, Chun-Yueh Chiu, Ting-Fan Wu, and Min
Sun. Efficient uncertainty estimation for semantic segmentation in videos.
In Proceedings of the European Conference on Computer Vision (ECCV),
pages 520–535, 2018.
[92] Zehao Huang and Naiyan Wang. Like what you like: Knowledge distill via
neuron selectivity transfer. arXiv preprint arXiv:1707.01219, 2017.
[93] Eddy Ilg, Ozgun Cicek, Silvio Galesso, Aaron Klein, Osama Makansi, Frank
Hutter, and Thomas Brox. Uncertainty estimates and multi-hypotheses net-
works for optical flow. In Proceedings of the European Conference on Com-
puter Vision (ECCV), pages 652–667, 2018.
[94] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift. arXiv preprint
arXiv:1502.03167, 2015.
[95] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and
Andrew Gordon Wilson. Averaging weights leads to wider optima and better
generalization. Conference on Uncertainty in Artificial Intelligence, 2018.
[96] Siddhartha Jain, Ge Liu, Jonas Mueller, and David Gifford. Maximizing
overall diversity for improved uncertainty estimates in deep ensembles. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 34,
pages 4264–4271, 2020.
[97] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Men-
tornet: Regularizing very deep neural networks on corrupted labels. arXiv
preprint arXiv:1712.05055, 2017.
[98] Zhengshen Jiang, Hongzhi Liu, Bin Fu, and Zhonghai Wu. Generalized am-
biguity decompositions for classification with applications in active learning
and unsupervised ensemble pruning. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 31, 2017.
[99] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I
Jordan. How to escape saddle points efficiently. In International Conference
on Machine Learning, pages 1724–1732. PMLR, 2017.
[100] Ishan Jindal, Matthew Nokleby, and Xuewen Chen. Learning deep networks
from noisy labels with dropout regularization. In Data Mining (ICDM), 2016
IEEE 16th International Conference on, pages 967–972. IEEE, 2016.
[101] Pascal D Johann, Serap Erkek, Marc Zapatka, Kornelius Kerl, Ivo Buch-
halter, Volker Hovestadt, David TW Jones, Dominik Sturm, Carl Hermann,
Maia Segura Wang, et al. Atypical teratoid/rhabdoid tumors are comprised
of three epigenetic subgroups with distinct enhancer landscapes. Cancer cell,
29(3):379–393, 2016.
[102] Jakob Nikolas Kather, Alexander T Pearson, Niels Halama, Dirk Jäger,
Jeremias Krause, Sven H Loosen, Alexander Marx, Peter Boor, Frank Tacke,
Ulf Peter Neumann, et al. Deep learning can predict microsatellite insta-
bility directly from histology in gastrointestinal cancer. Nature medicine,
25(7):1054–1056, 2019.
[103] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian segnet:
Model uncertainty in deep convolutional encoder-decoder architectures for
scene understanding. arXiv preprint arXiv:1511.02680, 2015.
[104] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep
learning for computer vision? In Proceedings of the 31st International Con-
ference on Neural Information Processing Systems, pages 5580–5590, 2017.
[105] Ashish Khetan, Zachary C Lipton, and Anima Anandkumar. Learning from
noisy singly-labeled data. arXiv preprint arXiv:1712.04577, 2017.
[106] Patrick Kidger and Terry Lyons. Universal approximation with deep narrow
networks. In Conference on learning theory, pages 2306–2327. PMLR, 2020.
[107] Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex net-
work: Network compression via factor transfer. In Advances in neural infor-
mation processing systems, pages 2760–2769, 2018.
[108] Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In
Proceedings of the 2016 Conference on Empirical Methods in Natural Lan-
guage Processing, pages 1317–1327, 2016.
[109] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic opti-
mization. arXiv preprint arXiv:1412.6980, 2014.
[110] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and
the local reparameterization trick. In Advances in Neural Information Pro-
cessing Systems, pages 2575–2583, 2015.
[111] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features
from tiny images. 2009.
[112] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classifi-
cation with deep convolutional neural networks. Advances in neural infor-
mation processing systems, 25, 2012.
[113] Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross valida-
tion and active learning. In Proceedings of the 7th International Conference
on Neural Information Processing Systems, pages 231–238, 1994.
[114] Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross vali-
dation, and active learning. In Advances in neural information processing
systems, pages 231–238, 1995.
[115] Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate un-
certainties for deep learning using calibrated regression. In International
Conference on Machine Learning, pages 2796–2804, 2018.
[116] Volodymyr Kuleshov and Percy S Liang. Calibrated structured prediction.
In Advances in Neural Information Processing Systems, pages 3474–3482,
2015.
[117] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning
for latent variable models. In Advances in Neural Information Processing
Systems, pages 1189–1197, 2010.
[118] Ludmila I Kuncheva and Christopher J Whitaker. Measures of diversity
in classifier ensembles and their relationship with the ensemble accuracy.
Machine learning, 51(2):181–207, 2003.
[119] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learn-
ing. arXiv preprint arXiv:1610.02242, 2016.
[120] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple
and scalable predictive uncertainty estimation using deep ensembles. arXiv
preprint arXiv:1612.01474, 2016.
[121] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Sim-
ple and scalable predictive uncertainty estimation using deep ensembles. In
Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
[122] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet:
Ultra-deep neural networks without residuals. International Conference on
Learning Representations, 2017.
[123] Maria Łastowska, Joanna Trubicka, Anna Sobocińska, Bartosz Wojtas, Mag-
dalena Niemira, Anna Szałkowska, Adam Kretowski, Agnieszka Karkucińska-
Wieckowska, Magdalena Kaleta, Maria Ejmont, et al. Molecular identifica-
tion of cns nb-foxr2, cns eft-cic, cns hgnet-mn1 and cns hgnet-bcor pediatric
brain tumors using tumor-specific signature genes. Acta neuropathologica
communications, 8(1):1–14, 2020.
[124] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature,
521(7553):436–444, 2015.
[125] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E
Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied
to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
[126] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. Snip: Single-shot
network pruning based on connection sensitivity. In International Conference
on Learning Representations, 2018.
[127] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf.
Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[128] Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and
Jia Li. Learning from noisy labels with distillation. arXiv preprint
arXiv:1703.02391, 2017.
[129] Tongliang Liu and Dacheng Tao. Classification with noisy labels by im-
portance reweighting. IEEE Transactions on pattern analysis and machine
intelligence, 38(3):447–461, 2016.
[130] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell.
Rethinking the value of network pruning. In International Conference on
Learning Representations, 2018.
[131] Ekaterina Lobacheva, Nadezhda Chirkova, Maxim Kodryan, and Dmitry
Vetrov. On power laws in deep ensembles. arXiv preprint arXiv:2007.08483,
2020.
[132] David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vap-
nik. Unifying distillation and privileged information. arXiv preprint
arXiv:1511.03643, 2015.
[133] David N Louis, Hiroko Ohgaki, Otmar D Wiestler, Webster K Cavenee,
Peter C Burger, Anne Jouvet, Bernd W Scheithauer, and Paul Kleihues.
The 2007 who classification of tumours of the central nervous system. Acta
neuropathologica, 114(2):97–109, 2007.
[134] Christos Louizos and Max Welling. Multiplicative normalizing flows for vari-
ational bayesian neural networks. In Proceedings of the 34th International
Conference on Machine Learning-Volume 70, pages 2218–2227. JMLR. org,
2017.
[135] Ming Y Lu, Tiffany Y Chen, Drew FK Williamson, Melissa Zhao, Maha
Shady, Jana Lipkova, and Faisal Mahmood. Ai-based pathology predicts
origins for cancers of unknown primary. Nature, 594(7861):106–110, 2021.
[136] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The
expressive power of neural networks: A view from the width. Advances in
neural information processing systems, 30, 2017.
[137] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochas-
tic gradient mcmc. In Advances in Neural Information Processing Systems,
pages 2917–2925, 2015.
[138] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.
Journal of machine learning research, 9(Nov):2579–2605, 2008.
[139] David JC MacKay. A practical bayesian framework for backpropagation
networks. Neural computation, 4(3):448–472, 1992.
[140] Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and An-
drew Gordon Wilson. A simple baseline for bayesian uncertainty in deep
learning. arXiv preprint arXiv:1902.02476, 2019.
[141] Fangchang Ma and Sertac Karaman. Sparse-to-dense: Depth prediction
from sparse depth samples and a single image. In 2018 IEEE International
Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.
[142] Andrey Malinin and Mark Gales. Predictive uncertainty estimation via prior
networks. In Advances in Neural Information Processing Systems, pages
7047–7058, 2018.
[143] Andrey Malinin, Bruno Mlodozeniec, and Mark Gales. Ensemble distribution
distillation. arXiv preprint arXiv:1905.00076, 2019.
[144] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a
single network by iterative pruning. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
[145] Naresh Manwani and PS Sastry. Noise tolerance under risk minimization.
IEEE transactions on cybernetics, 43(3):1146–1151, 2013.
[146] Hamed Masnadi-Shirazi and Nuno Vasconcelos. On the design of loss func-
tions for classification: theory, robustness to outliers, and savageboost. In
Advances in neural information processing systems, pages 1049–1056, 2009.
[147] Leland McInnes, John Healy, and James Melville. Umap: Uniform mani-
fold approximation and projection for dimension reduction. arXiv preprint
arXiv:1802.03426, 2018.
[148] Paul Micaelli and Amos J Storkey. Zero-shot knowledge transfer via adversar-
ial belief matching. In Advances in Neural Information Processing Systems,
pages 9551–9561, 2019.
[149] Kimberly D Miller, Quinn T Ostrom, Carol Kruchko, Nirav Patil, Tarik
Tihan, Gino Cioffi, Hannah E Fuchs, Kristin A Waite, Ahmedin Jemal,
Rebecca L Siegel, et al. Brain and other central nervous system tumor
statistics, 2021. CA: a cancer journal for clinicians, 71(5):381–406, 2021.
[150] Thomas Minka. Estimating a dirichlet distribution, 2000.
[151] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel
Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fid-
jeland, Georg Ostrovski, et al. Human-level control through deep reinforce-
ment learning. Nature, 518(7540):529, 2015.
[152] Pooya Mobadersany, Safoora Yousefi, Mohamed Amgad, David A Gutman,
Jill S Barnholtz-Sloan, José E Velázquez Vega, Daniel J Brat, and Lee AD
Cooper. Predicting cancer outcomes from histology and genomics using
convolutional networks. Proceedings of the National Academy of Sciences,
115(13):E2970–E2979, 2018.
[153] Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label
smoothing help? In Advances in Neural Information Processing Systems,
pages 4696–4705, 2019.
[154] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining
well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI
Conference on Artificial Intelligence, 2015.
[155] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Am-
buj Tewari. Learning with noisy labels. In Advances in neural information
processing systems, pages 1196–1204, 2013.
[156] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor
segmentation and support inference from rgbd images. ECCV, 2012.
[157] Radford M Neal. Bayesian learning for neural networks, volume 118.
Springer Science & Business Media, 2012.
[158] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and
Andrew Y Ng. Reading digits in natural images with unsupervised feature
learning. In NIPS workshop on deep learning and unsupervised feature learn-
ing, volume 2011, page 5, 2011.
[159] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolu-
tion network for semantic segmentation. In Proceedings of the IEEE Inter-
national Conference on Computer Vision, pages 1520–1528, 2015.
[160] Curtis G Northcutt, Anish Athalye, and Jonas Mueller. Pervasive label
errors in test sets destabilize machine learning benchmarks. arXiv preprint
arXiv:2103.14749, 2021.
[161] Curtis G Northcutt, Tailin Wu, and Isaac L Chuang. Learning with confident
examples: Rank pruning for robust classification with noisy labels. arXiv
preprint arXiv:1705.01936, 2017.
[162] Adriana Olar and Kenneth D Aldape. Using the molecular classification of
glioblastoma to inform personalized treatment. The Journal of pathology,
232(2):165–177, 2014.
[163] Adriana Olar, Khalida M Wani, Kristin D Alfaro-Munoz, Lindsey E Heath-
cock, Hinke F van Thuijl, Mark R Gilbert, Terri S Armstrong, Erik P Sul-
man, Daniel P Cahill, Elizabeth Vera-Bolanos, et al. Idh mutation status
and role of who grade and mitotic index in overall survival in grade ii–iii
diffuse gliomas. Acta neuropathologica, 129(4):585–596, 2015.
[164] Keiron O’Shea and Ryan Nash. An introduction to convolutional neural
networks. arXiv preprint arXiv:1511.08458, 2015.
[165] Quinn T Ostrom, Haley Gittleman, Gabrielle Truitt, Alexander Boscia,
Carol Kruchko, and Jill S Barnholtz-Sloan. Cbtrus statistical report: pri-
mary brain and other central nervous system tumors diagnosed in the united
states in 2011–2015. Neuro-oncology, 20(suppl 4):iv1–iv86, 2018.
[166] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram
Swami. Distillation as a defense to adversarial perturbations against deep
neural networks. In 2016 IEEE Symposium on Security and Privacy (SP),
pages 582–597. IEEE, 2016.
[167] Sejun Park, Chulhee Yun, Jaeho Lee, and Jinwoo Shin. Minimum width for
universal approximation. arXiv preprint arXiv:2006.08859, 2020.
[168] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock,
and Lizhen Qu. Making deep neural networks robust to label noise: a loss
correction approach. stat, 1050:22, 2017.
[169] Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geof-
frey Hinton. Regularizing neural networks by penalizing confident output
distributions. arXiv preprint arXiv:1701.06548, 2017.
[170] Joshua C Peterson, Ruairidh M Battleday, Thomas L Griffiths, and Olga
Russakovsky. Human uncertainty makes classification more robust. In Pro-
ceedings of the IEEE International Conference on Computer Vision, pages
9617–9626, 2019.
[171] Mary Phuong and Christoph Lampert. Towards understanding knowledge
distillation. In International Conference on Machine Learning, pages 5142–
5151, 2019.
[172] Janis Postels, Francesco Ferroni, Huseyin Coskun, Nassir Navab, and Fed-
erico Tombari. Sampling-free epistemic uncertainty estimation using ap-
proximated variance propagation. In Proceedings of the IEEE International
Conference on Computer Vision, pages 2931–2940, 2019.
[173] Ning Qian. On the momentum term in gradient descent learning algorithms.
Neural networks, 12(1):145–151, 1999.
[174] Rahul Rahaman and Alexandre H Thiery. Uncertainty quantification and
deep ensembles. arXiv preprint arXiv:2007.08792, 2020.
[175] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta,
Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya,
et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with
deep learning. arXiv preprint arXiv:1711.05225, 2017.
[176] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation
functions. arXiv preprint arXiv:1710.05941, 2017.
[177] Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi,
and Mohammad Rastegari. What’s hidden in a randomly weighted neural
network? In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 11893–11902, 2020.
[178] Muhammad Imran Razzak, Saeeda Naz, and Ahmad Zaib. Deep learning for
medical image processing: Overview, challenges and the future. Classification
in BioApps, pages 323–350, 2018.
[179] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru
Erhan, and Andrew Rabinovich. Training deep neural networks on noisy
labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
[180] Annekathrin Reinhardt, Damian Stichel, Daniel Schrimpf, Felix Sahm, An-
drey Korshunov, David E Reuss, Christian Koelsche, Kristin Huang, An-
nika K Wefers, Volker Hovestadt, et al. Anaplastic astrocytoma with piloid
features, a novel molecular class of idh wildtype glioma with recurrent mapk
pathway, cdkn2a/b and atrx alterations. Acta neuropathologica, 136(2):273–
291, 2018.
[181] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learn-
ing to reweight examples for robust deep learning. arXiv preprint
arXiv:1803.09050, 2018.
[182] David E Reuss, Yasin Mamatjan, Daniel Schrimpf, David Capper, Volker
Hovestadt, Annekathrin Kratz, Felix Sahm, Christian Koelsche, Andrey Ko-
rshunov, Adriana Olar, et al. Idh mutant diffuse and anaplastic astrocytomas
have similar age at presentation and little difference in survival: a grading
problem for who. Acta neuropathologica, 129(6):867–873, 2015.
[183] Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable laplace
approximation for neural networks. International Conference on Learning
Representations, 2018.
[184] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chas-
sang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets.
arXiv preprint arXiv:1412.6550, 2014.
[185] Sebastian Ruder. An overview of gradient descent optimization algorithms.
arXiv preprint arXiv:1609.04747, 2016.
[186] Victor Sanh, Thomas Wolf, and Alexander Rush. Movement pruning: Adap-
tive sparsity by fine-tuning. Advances in Neural Information Processing Sys-
tems, 33, 2020.
[187] Vikash Sehwag, Shiqi Wang, Prateek Mittal, and Suman Jana. Hydra: Prun-
ing adversarially robust neural networks. Advances in Neural Information
Processing Systems (NeurIPS), 7, 2020.
[188] Yichen Shen, Zhilu Zhang, Mert R Sabuncu, and Lin Sun. Learning the
distribution: A unified distillation paradigm for fast uncertainty estimation
in computer vision. arXiv preprint arXiv:2007.15857, 2020.
[189] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data aug-
mentation for deep learning. Journal of big data, 6(1):1–48, 2019.
[190] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[191] Saurabh Singh, Derek Hoiem, and David Forsyth. Swapout: Learning an
ensemble of deep architectures. In Advances in neural information processing
systems, pages 28–36, 2016.
[192] Samarth Sinha, Homanga Bharadhwaj, Anirudh Goyal, Hugo Larochelle,
Animesh Garg, and Florian Shkurti. Dibs: Diversity inducing information
bottleneck in model ensembles. arXiv preprint arXiv:2003.04514, 2020.
[193] Samuel Smith, Erich Elsen, and Soham De. On the generalization benefit of
noise in stochastic gradient descent. In International Conference on Machine
Learning, pages 9058–9067. PMLR, 2020.
[194] Suraj Srinivas and Francois Fleuret. Knowledge transfer with jacobian
matching. In International Conference on Machine Learning, pages 4723–
4731, 2018.
[195] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks
from overfitting. The journal of machine learning research, 15(1):1929–1958,
2014.
[196] Roger Stupp, Monika E Hegi, Mark R Gilbert, and Arnab Chakravarti.
Chemoradiotherapy in malignant glioma: standard of care and future di-
rections. Journal of Clinical Oncology, 25(26):4127–4136, 2007.
[197] Sainbayar Sukhbaatar and Rob Fergus. Learning from noisy labels with deep
neural networks. arXiv preprint arXiv:1406.2080, 2(3):4, 2014.
[198] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbig-
niew Wojna. Rethinking the inception architecture for computer vision. In
Proceedings of the IEEE conference on computer vision and pattern recogni-
tion, pages 2818–2826, 2016.
[199] Akinori Tanaka, Akio Tomiya, and Kōji Hashimoto. Deep Learning and
Physics. Springer, 2021.
[200] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa.
Joint optimization framework for learning with noisy labels. arXiv preprint
arXiv:1803.11364, 2018.
[201] Antti Tarvainen and Harri Valpola. Mean teachers are better role models:
Weight-averaged consistency targets improve semi-supervised deep learning
results. In Advances in neural information processing systems, pages 1195–
1204, 2017.
[202] Michael D Taylor, Paul A Northcott, Andrey Korshunov, Marc Remke,
Yoon-Jae Cho, Steven C Clifford, Charles G Eberhart, D Williams Parsons,
Stefan Rutkowski, Amar Gajjar, et al. Molecular subgroups of medulloblas-
toma: the current consensus. Acta neuropathologica, 123(4):465–472, 2012.
[203] Katarzyna Tomczak, Patrycja Czerwińska, and Maciej Wiznerowicz. The
cancer genome atlas (tcga): an immeasurable source of knowledge. Contem-
porary oncology, 19(1A):A68, 2015.
[204] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph
Bregler. Efficient object localization using convolutional networks. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 648–656, 2015.
[205] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior.
In Proceedings of the IEEE conference on computer vision and pattern recog-
nition, pages 9446–9454, 2018.
[206] Arash Vahdat. Toward robustness against label noise in training deep dis-
criminative neural networks. In Advances in Neural Information Processing
Systems, pages 5601–5610, 2017.
[207] Brendan Van Rooyen, Aditya Menon, and Robert C Williamson. Learning
with symmetric label noise: The importance of being unhinged. In Advances
in Neural Information Processing Systems, pages 10–18, 2015.
[208] Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and
Serge Belongie. Learning from noisy large-scale datasets with minimal su-
pervision. In The Conference on Computer Vision and Pattern Recognition,
2017.
[209] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks
behave like ensembles of relatively shallow networks. In Advances in neural
information processing systems, pages 550–558, 2016.
[210] Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive
regularization. Advances in neural information processing systems, 26, 2013.
[211] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regu-
larization of neural networks using dropconnect. In International conference
on machine learning, pages 1058–1066, 2013.
[212] Fei Wang, Lawrence Peter Casalino, and Dhruv Khullar. Deep learning
in medicine—promise, progress, and challenges. JAMA internal medicine,
179(3):293–294, 2019.
[213] Yisen Wang, Weiyang Liu, Xingjun Ma, James Bailey, Hongyuan Zha,
Le Song, and Shu-Tao Xia. Iterative learning with open-set noisy labels.
arXiv preprint arXiv:1804.00092, 2018.
[214] Sarah Webb. Deep learning for biology. Nature, 554(7690):555–558, 2018.
[215] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and
P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001,
California Institute of Technology, 2010.
[216] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient
langevin dynamics. In Proceedings of the 28th international conference on
machine learning (ICML-11), pages 681–688, 2011.
[217] Yeming Wen, Ghassen Jerfel, Rafael Muller, Michael W Dusenberry, Jasper
Snoek, Balaji Lakshminarayanan, and Dustin Tran. Combining ensem-
bles and data augmentation can harm your calibration. arXiv preprint
arXiv:2010.09875, 2020.
[218] Yeming Wen, Dustin Tran, and Jimmy Ba. Batchensemble: an alternative
approach to efficient ensemble and lifelong learning. In International Con-
ference on Learning Representations, 2019.
[219] Florian Wenzel, Jasper Snoek, Dustin Tran, and Rodolphe Jenatton. Hyper-
parameter ensembles for robustness and uncertainty quantification. arXiv
preprint arXiv:2006.13570, 2020.
[220] Anqi Wu, Sebastian Nowozin, Edward Meeds, Richard E Turner, José Miguel
Hernández-Lobato, and Alexander L Gaunt. Deterministic variational in-
ference for robust bayesian neural networks. International Conference on
Learning Representations, 2018.
[221] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning
from massive noisy labeled data for image classification. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages
2691–2699, 2015.
[222] Hai Yan, D Williams Parsons, Genglin Jin, Roger McLendon, B Ahmed
Rasheed, Weishi Yuan, Ivan Kos, Ines Batinic-Haberle, Siân Jones, Gregory J
Riggins, et al. Idh1 and idh2 mutations in gliomas. New England journal of
medicine, 360(8):765–773, 2009.
[223] Chenglin Yang, Lingxi Xie, Siyuan Qiao, and Alan L Yuille. Training deep
neural networks in generations: A more tolerant teacher educates better
students. In Proceedings of the AAAI Conference on Artificial Intelligence,
volume 33, pages 5628–5635, 2019.
[224] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from
knowledge distillation: Fast optimization, network minimization and trans-
fer learning. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 4133–4141, 2017.
[225] Jaemin Yoo, Minyong Cho, Taebum Kim, and U Kang. Knowledge extraction
with no observable data. In Advances in Neural Information Processing
Systems, pages 2705–2714, 2019.
[226] Ruichi Yu, Ang Li, Vlad I Morariu, and Larry S Davis. Visual relationship
detection with internal and external linguistic knowledge distillation. In
Proceedings of the IEEE international conference on computer vision, pages
1974–1982, 2017.
[227] Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, and Jiashi Feng. Re-
visit knowledge distillation: a teacher-free framework. arXiv preprint
arXiv:1909.11723, 2019.
[228] Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li. Adversarial examples: At-
tacks and defenses for deep learning. IEEE transactions on neural networks
and learning systems, 30(9):2805–2824, 2019.
[229] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv
preprint arXiv:1605.07146, 2016.
[230] Sheheryar Zaidi, Arber Zela, Thomas Elsken, Chris Holmes, Frank Hutter,
and Yee Whye Teh. Neural ensemble search for performant and calibrated
predictions. arXiv preprint arXiv:2006.08573, 2020.
[231] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol
Vinyals. Understanding deep learning requires rethinking generalization.
arXiv preprint arXiv:1611.03530, 2016.
[232] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol
Vinyals. Understanding deep learning (still) requires rethinking generaliza-
tion. Communications of the ACM, 64(3):107–115, 2021.
[233] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-
Paz. mixup: Beyond empirical risk minimization. arXiv preprint
arXiv:1710.09412, 2017.
[234] Shaofeng Zhang, Meng Liu, and Junchi Yan. The diversified ensemble neural
network. Advances in Neural Information Processing Systems, 33, 2020.
[235] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. Deep learning based recom-
mender system: A survey and new perspectives. ACM Computing Surveys
(CSUR), 52(1):1–38, 2019.
[236] Zhilu Zhang, Adrian V Dalca, and Mert R Sabuncu. Confidence calibration
for convolutional neural networks using structured dropout. arXiv preprint
arXiv:1906.09551, 2019.
[237] Chunting Zhou, Graham Neubig, and Jiatao Gu. Understanding knowl-
edge distillation in non-autoregressive machine translation. arXiv preprint
arXiv:1911.02727, 2019.
[238] Zhi-Hua Zhou. Ensemble learning. Encyclopedia of biometrics, 1:270–273,
2009.
[239] Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. Chapman
and Hall/CRC, 2012.
[240] Xiatian Zhu, Shaogang Gong, et al. Knowledge distillation by on-the-fly
native ensemble. In Advances in neural information processing systems, pages
7517–7527, 2018.
APPENDIX A
SUPPLEMENTARY MATERIAL FOR “LABEL-NOISE ROBUST
LEARNING OF NEURAL NETWORKS WITH GENERALIZED
CROSS ENTROPY LOSS”
Lemma 1. $\lim_{q \to 0} L_q(f(x), e_j) = L_C(f(x), e_j)$, where $L_q$ represents the $L_q$ loss,
and $L_C$ represents the categorical cross entropy loss.
Proof. From equation 2.6, and using L'Hôpital's rule,
$$
\begin{aligned}
\lim_{q \to 0} L_q(f(x), e_j)
&= \lim_{q \to 0} \frac{1 - f_j(x)^q}{q}
= \lim_{q \to 0} \frac{\frac{d}{dq}\left(1 - f_j(x)^q\right)}{\frac{d}{dq}\, q} \\
&= \lim_{q \to 0} -f_j(x)^q \log(f_j(x)) = -\log(f_j(x)) = L_C(f(x), e_j).
\end{aligned}
$$
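The limit in Lemma 1 can be checked numerically. The following is a minimal sketch, not part of the original derivation: the probability value 0.3, the grid of q values, and the helper names `lq_loss` and `ce_loss` are illustrative assumptions.

```python
import math


def lq_loss(p, q):
    """L_q loss (1 - p**q) / q for the probability p assigned to the labeled class."""
    return (1.0 - p ** q) / q


def ce_loss(p):
    """Categorical cross entropy for the probability p assigned to the labeled class."""
    return -math.log(p)


p = 0.3  # hypothetical softmax output f_j(x) for the labeled class
q_values = (1.0, 0.1, 0.01, 0.001)

# Absolute gap between the L_q loss and cross entropy for decreasing q.
gaps = [abs(lq_loss(p, q) - ce_loss(p)) for q in q_values]

for q, gap in zip(q_values, gaps):
    print(f"q = {q:<6} |L_q - L_C| = {gap:.6f}")
```

Under these assumptions the gap shrinks monotonically as q decreases, which is the behavior the lemma predicts.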
Lemma 2. For any $x$ and $q \in (0, 1]$, the sum of $L_q$ loss with respect to all classes
is bounded by:
$$
\frac{c - c^{(1-q)}}{q} \le \sum_{j=1}^{c} \frac{1 - f_j(x)^q}{q} \le \frac{c - 1}{q}. \tag{A.1}
$$
Proof. Observe that, since we have a softmax layer at the end, $f_j(x) \le 1$ for
all $j$, and $\sum_{j=1}^{c} f_j(x) = 1$. Now, since $q \in (0, 1]$, we have $f_j(x) \le f_j(x)^q$, and
$(1 - f_j(x)) \ge (1 - f_j(x)^q)$. Hence,
$$
\sum_{j=1}^{c} \frac{1 - f_j(x)^q}{q} \le \sum_{j=1}^{c} \frac{1 - f_j(x)}{q} = \frac{c - \sum_{j=1}^{c} f_j(x)}{q} = \frac{c - 1}{q}.
$$
Moreover, since $\sum_{j=1}^{c} f_j(x)^q \le \sum_{j=1}^{c} (1/c)^q$ for all $x$ and $q \in (0, 1]$, we have
$\sum_{j=1}^{c} (1 - f_j(x)^q) \ge \sum_{j=1}^{c} (1 - (1/c)^q)$, and
$$
\sum_{j=1}^{c} \frac{1 - f_j(x)^q}{q} \ge \sum_{j=1}^{c} \frac{1 - (1/c)^q}{q} = \frac{c - c^{(1-q)}}{q}.
$$
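The two bounds of Lemma 2 can be sanity-checked on random softmax outputs. The sketch below is illustrative only: the number of classes, the value of q, the logit distribution, and all function names are assumptions made for the example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lq_sum(probs, q):
    """Sum over all classes j of (1 - f_j(x)**q) / q."""
    return sum((1.0 - p ** q) / q for p in probs)


def lemma2_bounds(c, q):
    """Lower and upper bounds of equation (A.1)."""
    return (c - c ** (1.0 - q)) / q, (c - 1.0) / q


random.seed(0)
c, q = 10, 0.7  # hypothetical number of classes and loss parameter
lower, upper = lemma2_bounds(c, q)

# Evaluate the summed loss on random softmax vectors.
sums = [
    lq_sum(softmax([random.gauss(0.0, 3.0) for _ in range(c)]), q)
    for _ in range(1000)
]

# The lower bound is attained by the uniform prediction f_j(x) = 1/c.
uniform_sum = lq_sum([1.0 / c] * c, q)

print(lower, min(sums), max(sums), upper)
```

The uniform prediction attains the lower bound, while a one-hot prediction attains the upper bound, so both inequalities are tight.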
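Theorem 1. Under uniform noise with $\eta \le 1 - \frac{1}{c}$,
$$
0 \le \left(R^{\eta}_{L_q}(f^*) - R^{\eta}_{L_q}(\hat{f})\right) \le A, \tag{A.2}
$$
and
$$
A' \le R_{L_q}(f^*) - R_{L_q}(\hat{f}) \le 0, \tag{A.3}
$$
where $A = \frac{\eta[c^{(1-q)} - 1]}{q(c-1)} \ge 0$, $A' = \frac{\eta[1 - c^{(1-q)}]}{q(c - 1 - \eta c)} < 0$, $f^*$ is the global minimizer of $R_{L_q}(f)$,
and $\hat{f}$ is the global minimizer of $R^{\eta}_{L_q}(f)$.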
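Proof. Recall that for any softmax output $f$,
$$
R_{L_q}(f) = \mathbb{E}_D[L_q(f(x), y_x)] = \mathbb{E}_{x, y_x}[L_q(f(x), y_x)],
$$
and since for uniform noise with noise rate $\eta$, $\eta_{jk} = 1 - \eta$ for $j = k$, and $\eta_{jk} = \frac{\eta}{c-1}$
for $j \ne k$, we have
$$
\begin{aligned}
R^{\eta}_{L_q}(f) &= \mathbb{E}_D[L_q(f(x), \tilde{y}_x)] = \mathbb{E}_{x, \tilde{y}_x}[L_q(f(x), \tilde{y}_x)] \\
&= \mathbb{E}_x \mathbb{E}_{y_x | x} \mathbb{E}_{\tilde{y}_x | y_x, x}[L_q(f(x), \tilde{y}_x)] \\
&= \mathbb{E}_x \mathbb{E}_{y_x | x}\Big[(1 - \eta) L_q(f(x), y_x) + \frac{\eta}{c-1} \sum_{i \ne y_x} L_q(f(x), i)\Big] \\
&= \mathbb{E}_x \mathbb{E}_{y_x | x}\Big[(1 - \eta) L_q(f(x), y_x) + \frac{\eta}{c-1} \Big(\sum_{i=1}^{c} L_q(f(x), i) - L_q(f(x), y_x)\Big)\Big] \\
&= (1 - \eta) R_{L_q}(f) - \frac{\eta}{c-1} R_{L_q}(f) + \frac{\eta}{c-1} \mathbb{E}_x \mathbb{E}_{y_x | x}\Big[\sum_{i=1}^{c} L_q(f(x), i)\Big] \\
&= \Big(1 - \frac{\eta c}{c-1}\Big) R_{L_q}(f) + \frac{\eta}{c-1} \mathbb{E}_x \mathbb{E}_{y_x | x}\Big[\sum_{i=1}^{c} L_q(f(x), i)\Big].
\end{aligned}
$$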
Now, from Lemma 2, we have:
$$
\Big(1 - \frac{\eta c}{c-1}\Big) R_{L_q}(f) + \frac{\eta[c - c^{(1-q)}]}{q(c-1)} \le R^{\eta}_{L_q}(f) \le \Big(1 - \frac{\eta c}{c-1}\Big) R_{L_q}(f) + \frac{\eta}{q}.
$$
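We can also write the inequality in terms of $R_{L_q}(f)$:
$$
\frac{R^{\eta}_{L_q}(f) - \frac{\eta}{q}}{1 - \frac{\eta c}{c-1}} \le R_{L_q}(f) \le \frac{R^{\eta}_{L_q}(f) - \frac{\eta[c - c^{(1-q)}]}{q(c-1)}}{1 - \frac{\eta c}{c-1}}.
$$
Thus, for $\hat{f}$,
$$
R^{\eta}_{L_q}(f^*) - R^{\eta}_{L_q}(\hat{f}) \le A + \Big(1 - \frac{\eta c}{c-1}\Big)\big(R_{L_q}(f^*) - R_{L_q}(\hat{f})\big) \le A,
$$
or equivalently,
$$
R_{L_q}(f^*) - R_{L_q}(\hat{f}) \ge A' + \frac{R^{\eta}_{L_q}(f^*) - R^{\eta}_{L_q}(\hat{f})}{1 - \frac{\eta c}{c-1}} \ge A',
$$
where $A = \frac{\eta[c^{(1-q)} - 1]}{q(c-1)} \ge 0$ and $A' = \frac{\eta[1 - c^{(1-q)}]}{q(c - 1 - \eta c)}$, since $\eta \le \frac{c-1}{c}$, and $f^*$ is a minimizer of
$R_{L_q}(f)$. Lastly, since $\hat{f}$ is the minimizer of $R^{\eta}_{L_q}(f)$, we have that $R^{\eta}_{L_q}(f^*) - R^{\eta}_{L_q}(\hat{f}) \ge 0$,
or $R_{L_q}(f^*) - R_{L_q}(\hat{f}) \le 0$. This completes the proof.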
Remark. Note that, when $q = 1$, $A = 0$, and $f^*$ is also a minimizer of the risk under
uniform noise.
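The key identity in the proof of Theorem 1, which rewrites the noisy risk in terms of the clean risk plus a sum over all classes, can be verified pointwise for a single sample. The sketch below is a hedged illustration: the softmax vector, the noise rate, and the function names are assumptions chosen for the example, and the constants follow the expressions for $A$ and $A'$ stated in the theorem.

```python
def lq_loss(p, q):
    """L_q loss (1 - p**q) / q."""
    return (1.0 - p ** q) / q


def noisy_expected_loss(probs, y, eta, q):
    """Expected L_q loss at one sample when label y is flipped uniformly with rate eta."""
    c = len(probs)
    kept = (1.0 - eta) * lq_loss(probs[y], q)
    flipped = sum(lq_loss(probs[i], q) for i in range(c) if i != y)
    return kept + eta / (c - 1) * flipped


def closed_form(probs, y, eta, q):
    """Same quantity written as (1 - eta*c/(c-1)) * L_q(y) + eta/(c-1) * sum_i L_q(i)."""
    c = len(probs)
    total = sum(lq_loss(p, q) for p in probs)
    return (1.0 - eta * c / (c - 1)) * lq_loss(probs[y], q) + eta / (c - 1) * total


def theorem1_constants(c, eta, q):
    """Constants A and A' from Theorem 1 (requires eta < 1 - 1/c)."""
    A = eta * (c ** (1.0 - q) - 1.0) / (q * (c - 1))
    A_prime = eta * (1.0 - c ** (1.0 - q)) / (q * (c - 1 - eta * c))
    return A, A_prime


probs = [0.6, 0.25, 0.1, 0.05]  # hypothetical softmax output, c = 4
lhs = noisy_expected_loss(probs, 0, 0.3, 0.7)
rhs = closed_form(probs, 0, 0.3, 0.7)

A, A_prime = theorem1_constants(c=4, eta=0.3, q=0.7)
A_q1, _ = theorem1_constants(c=4, eta=0.3, q=1.0)  # the q = 1 case of the remark
print(lhs, rhs, A, A_prime, A_q1)
```

For these illustrative values the two forms of the noisy loss agree, $A$ is positive, $A'$ is negative, and $A$ vanishes at $q = 1$, consistent with the remark.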
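Theorem 2. Under class dependent noise when $\eta_{ij} < (1 - \eta_i)$, $\forall j \ne i$, $\forall i, j \in [c]$,
where $\eta_{ij} = p(\tilde{y} = j | y = i)$, $\forall j \ne i$, and $(1 - \eta_i) = p(\tilde{y} = i | y = i)$, if $R_{L_q}(f^*) = 0$,
then
$$
0 \le \left(R^{\eta}_{L_q}(f^*) - R^{\eta}_{L_q}(\hat{f})\right) \le B, \tag{A.4}
$$
where $B = \frac{c^{1-q} - 1}{q} \mathbb{E}_D(1 - \eta_{y_x}) \ge 0$, $f^*$ is the global minimizer of $R_{L_q}(f)$, and $\hat{f}$ is
the global minimizer of $R^{\eta}_{L_q}(f)$.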
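Proof. For class dependent noise, from Lemma 2, for any softmax output function
$f$ we have
$$
\begin{aligned}
R^{\eta}_{L_q}(f) &= \mathbb{E}_D[(1 - \eta_{y_x}) L_q(f(x), y_x)] + \mathbb{E}_D\Big[\sum_{i \ne y_x} \eta_{y_x i} L_q(f(x), i)\Big] \\
&\le \mathbb{E}_D\Big[(1 - \eta_{y_x})\Big(\frac{c-1}{q} - \sum_{i \ne y_x} L_q(f(x), i)\Big)\Big] + \mathbb{E}_D\Big[\sum_{i \ne y_x} \eta_{y_x i} L_q(f(x), i)\Big] \\
&= \frac{c-1}{q} \mathbb{E}_D(1 - \eta_{y_x}) - \mathbb{E}_D\Big[\sum_{i \ne y_x} (1 - \eta_{y_x} - \eta_{y_x i}) L_q(f(x), i)\Big],
\end{aligned}
$$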
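and
$$
R^{\eta}_{L_q}(f) \ge \frac{c - c^{1-q}}{q} \mathbb{E}_D(1 - \eta_{y_x}) - \mathbb{E}_D\Big[\sum_{i \ne y_x} (1 - \eta_{y_x} - \eta_{y_x i}) L_q(f(x), i)\Big].
$$
Hence,
$$
\left(R^{\eta}_{L_q}(f^*) - R^{\eta}_{L_q}(\hat{f})\right) \le \frac{c^{1-q} - 1}{q} \mathbb{E}_D(1 - \eta_{y_x}) + \mathbb{E}_D \sum_{i \ne y_x} (1 - \eta_{y_x} - \eta_{y_x i})\big[L_q(\hat{f}(x), i) - L_q(f^*(x), i)\big].
$$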
Now, from our assumption that $R_{L_q}(f^*) = 0$, we have $L_q(f^*(x), y_x) = 0$. This
is satisfied iff $f^*_i(x) = 1$ when $i = y_x$, and $f^*_i(x) = 0$ if $i \ne y_x$. Hence,
$L_q(f^*(x), i) = 1/q$ $\forall i \ne y_x$. Moreover, by our assumption, we have $(1 - \eta_{y_x} - \eta_{y_x i}) > 0$.
As a result, to derive an upper bound for the expression above, we need to
maximize the second term. Note that by definition of the $L_q$ loss, $L_q(\hat{f}(x), i) \le 1/q$
$\forall i \in [c]$, and hence the second term is maximized iff $L_q(\hat{f}(x), i) = 1/q$ $\forall i \ne y_x$.
This implies that the maximum of the second term is non-positive, so we have
$$
\left(R^{\eta}_{L_q}(f^*) - R^{\eta}_{L_q}(\hat{f})\right) \le \frac{c^{1-q} - 1}{q} \mathbb{E}_D(1 - \eta_{y_x}).
$$
Lastly, since $\hat{f}$ is the minimizer of $R^{\eta}_{L_q}(f)$, we have that $R^{\eta}_{L_q}(f^*) - R^{\eta}_{L_q}(\hat{f}) \ge 0$.
This completes the proof.
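To make the bound $B$ of Theorem 2 concrete, the sketch below evaluates it for a small, hypothetical noise transition matrix and class prior. The matrix entries, the uniform prior, and the function name `theorem2_bound` are assumptions for illustration; the code also checks the theorem's condition $\eta_{ij} < (1 - \eta_i)$ before computing the bound.

```python
def theorem2_bound(noise, priors, q):
    """Compute B = (c**(1-q) - 1) / q * E_D[1 - eta_y].

    noise[i][j] is p(noisy label = j | true label = i), so noise[i][i] = 1 - eta_i.
    priors[i] is the assumed probability of true class i.
    """
    c = len(noise)
    for i in range(c):
        for j in range(c):
            # Theorem 2 assumes every off-diagonal rate is below the diagonal rate.
            if j != i and not noise[i][j] < noise[i][i]:
                raise ValueError("assumption eta_ij < 1 - eta_i is violated")
    expected_keep = sum(priors[i] * noise[i][i] for i in range(c))
    return (c ** (1.0 - q) - 1.0) / q * expected_keep


# Hypothetical class dependent noise for c = 3 classes.
noise = [
    [0.8, 0.2, 0.0],
    [0.0, 0.7, 0.3],
    [0.1, 0.0, 0.9],
]
priors = [1.0 / 3.0] * 3

bounds = {q: theorem2_bound(noise, priors, q) for q in (0.3, 0.7, 1.0)}
print(bounds)
```

In this example the bound tightens as q grows and reaches zero at $q = 1$, mirroring the behavior of $A$ in Theorem 1.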
Lemma 3. For any $x$ and $q \in (0, 1)$, assuming $1/c \le k < 1$ where $c$ represents
the number of classes, the sum of truncated $L_q$ loss with respect to all classes is
bounded by:
$$
\tilde{d}\, L_q\Big(\frac{1}{\tilde{d}}\Big) + (c - \tilde{d}) L_q(k) \le \sum_{j=1}^{c} L_{trunc}(f(x), e_j) \le c L_q(k), \tag{A.5}
$$
where $\tilde{d} = \max\Big(1, \frac{(1-q)^{1/q}}{k}\Big)$.
Proof. For the upper bound, by definition of truncated $L_q$, $L_{trunc}(f(x), e_j) \le L_q(k)$
for any $x$ and $j$. Hence, $\sum_{j=1}^{c} L_{trunc}(f(x), e_j) \le c L_q(k)$.
For the lower bound, it can be verified that,
$$
\sum_{j=1}^{c} L_{trunc}(\tilde{f}(x), e_j) \le \sum_{j=1}^{c} L_{trunc}(f(x), e_j)
$$
where $\tilde{f}(x) = (p, \cdots, p, 0, \cdots, 0)$, with $p = 1/d \ge k$ and $d$ is the number of
elements in $f(x)$ with a value $\le k$. Note that since $p \ge k$, $1 \le d \le 1/k$:
$$
\sum_{j=1}^{c} L_{trunc}(\tilde{f}(x), e_j) = d L_q(p) + (c - d) L_q(k) = d L_q\Big(\frac{1}{d}\Big) + (c - d) L_q(k).
$$
We can get a universal lower bound (that does not depend on $f$) by minimizing
the above function with respect to $d$. To do so, we treat $d$ as continuous. By the
definition of the $L_q$ loss, and recalling that $0 < q < 1$, the minimizer is unchanged if we
drop the constant $c L_q(k)$ and the positive factor $1/q$:
$$
\operatorname*{arg\,min}_{d \in [1, 1/k]} \; d L_q\Big(\frac{1}{d}\Big) + (c - d) L_q(k)
= \operatorname*{arg\,min}_{d \in [1, 1/k]} \; d\Big[\frac{1 - (\frac{1}{d})^q}{q} - \frac{1 - k^q}{q}\Big]
= \operatorname*{arg\,min}_{d \in [1, 1/k]} \; d\Big[k^q - \Big(\frac{1}{d}\Big)^q\Big].
$$
We can verify using the second derivative test that the above objective function is
convex. As a result, we can find the minimum by taking its derivative. Doing so,
we find that $d = \frac{(1-q)^{1/q}}{k}$ minimizes the above objective function. Hence, the lower
bound is
$$
\tilde{d}\, L_q\Big(\frac{1}{\tilde{d}}\Big) + (c - \tilde{d}) L_q(k) \le \sum_{j=1}^{c} L_{trunc}(f(x), e_j),
$$
where $\tilde{d} = \max\Big(1, \frac{(1-q)^{1/q}}{k}\Big)$.
Remark. Using Lemma 3, we can prove that the proposed truncated loss leads to
more noise-robust training following the same arguments as in Theorems 1 and 2.
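The bounds of Lemma 3 can also be checked numerically. The sketch below assumes the truncated loss takes the value $L_q(k)$ whenever $f_j(x) \le k$ and $L_q(f_j(x))$ otherwise; this definition is restated here as an assumption since it is introduced outside this appendix. The class count, q, the thresholds k, and all function names are likewise illustrative.

```python
import math
import random


def lq(p, q):
    """L_q loss (1 - p**q) / q."""
    return (1.0 - p ** q) / q


def truncated_lq(p, q, k):
    """Assumed truncated loss: constant L_q(k) for p <= k, L_q(p) otherwise."""
    return lq(k, q) if p <= k else lq(p, q)


def lemma3_bounds(c, q, k):
    """Lower and upper bounds of equation (A.5)."""
    d = max(1.0, (1.0 - q) ** (1.0 / q) / k)
    lower = d * lq(1.0 / d, q) + (c - d) * lq(k, q)
    upper = c * lq(k, q)
    return lower, upper


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
c, q = 10, 0.7  # hypothetical number of classes and loss parameter
results = {}
for k in (0.1, 0.5):  # thresholds satisfying 1/c <= k < 1
    lower, upper = lemma3_bounds(c, q, k)
    sums = [
        sum(truncated_lq(p, q, k) for p in softmax([random.gauss(0.0, 3.0) for _ in range(c)]))
        for _ in range(1000)
    ]
    results[k] = (lower, min(sums), max(sums), upper)

print(results)
```

Every sampled sum stays inside the interval given by the lemma for both thresholds, under the stated assumptions.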
APPENDIX B
SUPPLEMENTARY MATERIAL FOR “IMPROVING
CONFIDENCE CALIBRATION FOR CONVOLUTIONAL NEURAL
NETWORKS WITH STRUCTURED DROPOUT”
B.1 Brief Review of Dropout As Bayesian Approximation
Let us assume a dataset $D = (X, Y) = \{(x_i, y_i)\}_{i=1}^{n}$, where each $(x_i, y_i) \in (\mathcal{X} \times \mathcal{Y})$
is i.i.d. In this chapter, we consider the problem of $k$-class classification, and let
$\mathcal{X} \subseteq \mathbb{R}^d$ be the feature space and $\mathcal{Y} = \{1, \cdots, k\}$ be the label space. A classifier
is a function that maps the input feature space to the label space, $f : \mathcal{X} \to \mathbb{R}^k$.
We restrict our attention to functions that can be implemented as a DNN, and
denote it by $f_w(x)$, where $w = \{W_i\}_{i=1}^{L}$ corresponds to the parameters of a network
with $L$ layers, and $W_i$ corresponds to the weight matrix in the $i$-th layer.
We define a likelihood model $p(y|x, w) = \mathrm{softmax}(f_w(x))$. It is common practice
to perform maximum likelihood estimation to compute point estimates for $w$.
Uncertainty estimates can be obtained through Bayesian DNNs by first assuming
a prior distribution on the weights, $p(w)$. A common choice is the zero mean
Gaussian $\mathcal{N}(0, I)$. Bayes' Theorem can then be used to obtain the posterior
$p(w|X, Y) = p(Y|X, w)p(w)/p(Y|X)$, with which inference can be carried out:
$$
p(y = c | x, D_{train}) = \int p(y = c | x, w)\, p(w | D_{train})\, dw. \tag{B.1}
$$
The marginal distribution $p(Y|X)$, and thus $p(w|X, Y)$, are often intractable.
Variational inference uses a tractable family of distributions $q_\theta(w)$ parameterized
by $\theta$ to approximate the true posterior $p(w|X, Y)$, by minimizing the Kullback-
Leibler divergence $KL(q_\theta(w) \,\|\, p(w|X, Y))$, which is equivalent to optimizing a
bound on the true objective [66]. To interpret dropout as a variational inference
strategy [51], the approximate distribution is defined as:
$$W_i = \Theta_i \cdot \mathrm{diag}([z_{i,j}]_{j=1}^{K_i}), \tag{B.2}$$
$$z_{i,j} \sim \mathrm{Bernoulli}(p_i) \quad \text{for } i = 1, \cdots, L,\; j = 1, \cdots, K_{i-1}, \tag{B.3}$$
where $\theta = \{\Theta_i\}_{i=1}^L$ are variational parameters to be optimized and $\{p_i\}_{i=1}^L$ are
user-defined hyper-parameters that correspond to layerwise dropout rates. Minimizing
the KL-divergence is mathematically equivalent to maximizing the following
objective:
$$\mathcal{L}_{VI}(\theta) = \sum_{i=1}^n \int q_\theta(w) \log p(y_i \mid x_i, w)\, dw - \mathrm{KL}(q_\theta(w) \,\|\, p(w)). \tag{B.4}$$
Using Monte Carlo integration with one sample $w_i \sim q_\theta(w)$ for each training
datum $(x_i, y_i)$ to approximate the integral in the above equation, and optimizing
over mini-batches of size m, the approximated objective becomes:
$$\hat{\mathcal{L}}_{VI}(\theta) = \frac{n}{m} \sum_{i=1}^m \log p(y_i \mid x_i, w_i) - \mathrm{KL}(q_\theta(w) \,\|\, p(w)). \tag{B.5}$$
As shown in [51], there is a direct correspondence between optimizing the above
objective and regular dropout training for DNNs. Furthermore, uncertainty estimates
can be obtained through marginalizing and performing Monte Carlo integration
over the approximate distribution $q_\theta(w)$. This corresponds to dropout at test time:
$$p(y = c \mid x, D_{\mathrm{train}}) \approx \int p(y = c \mid x, w)\, q_\theta(w)\, dw \approx \frac{1}{T} \sum_{t=1}^T p(y = c \mid x, w_t), \tag{B.6}$$
where $w_t \sim q_\theta(w)$ are dropout samples from the NN. This is referred to as
MC dropout.
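The test-time average in (B.6) can be illustrated with a minimal sketch for a single linear layer followed by a softmax. The function names, the toy weights and the choice to treat the Bernoulli parameter as the probability of keeping an input unit are assumptions made for illustration only; they are not taken from the experiments in this chapter.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout_forward(x, theta, keep_prob, rng):
    """One stochastic pass: sample z_j ~ Bernoulli(keep_prob) per input unit,
    form W = Theta * diag(z) and return softmax(W x)."""
    z = [1.0 if rng.random() < keep_prob else 0.0 for _ in x]
    logits = [sum(t * zj * xj for t, zj, xj in zip(row, z, x)) for row in theta]
    return softmax(logits)


def mc_dropout_predict(x, theta, keep_prob, T, seed=0):
    """Average the class probabilities of T dropout samples, as in (B.6)."""
    rng = random.Random(seed)
    avg = [0.0] * len(theta)
    for _ in range(T):
        probs = dropout_forward(x, theta, keep_prob, rng)
        avg = [a + q / T for a, q in zip(avg, probs)]
    return avg


# Hypothetical 2-class model with 3 input features.
theta = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
x = [0.3, 0.7, -1.2]
print(mc_dropout_predict(x, theta, keep_prob=0.8, T=100))
```

With `keep_prob=1.0` every mask is all ones, so the average reduces to the usual deterministic softmax prediction.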
B.2 Relationship between Different Performance Metrics
The Brier score, negative log-likelihood (NLL) and the expected calibration error (ECE)
are three of the most commonly used metrics for evaluating the quality of
uncertainty estimates. In this section, we discuss the relationship between them.
As we noted in Section 3.3, the Brier score is equal to the normalized MSE in
the context of classification. Recall that the ECE is defined as:
$$\mathrm{ECE}(H) = \mathbb{E}_x[(\mathbb{E}_y[y \mid H(x)] - H(x))^2], \tag{B.7}$$
which measures the expected difference between the true class probability and the
confidence of the model [116]. In addition to the error-ambiguity decomposition
that we have discussed, MSE can also be decomposed as:
$$\mathrm{MSE}(H) = \mathbb{E}_x[(y - H(x))^2] \tag{B.8}$$
$$= \mathbb{E}_x[(y - \mathbb{E}_y[y \mid H(x)])^2] + \mathrm{ECE}(H) \tag{B.9}$$
$$= \mathrm{Var}_x[y] - \mathrm{Var}_x[\mathbb{E}_y[y \mid H(x)]] + \mathrm{ECE}(H), \tag{B.10}$$
where $\mathbb{E}_y[y \mid H(x)]$ corresponds to the true probability of y = 1 conditioned on
H(x). $\mathrm{Var}_x[\mathbb{E}_y[y \mid H(x)]]$ measures the variation of the true class probabilities
across the level-sets of the ensemble model H [116]. Thus for this metric, the
numeric values of H(x) are not important. It is minimized if H(x) is a constant
and maximized when H(x) = f(y), for any bijective function f. One can therefore
view $\mathrm{Var}_x[\mathbb{E}_y[y \mid H(x)]]$ as a weak metric of accuracy that is not sensitive to
calibration. Note that $\mathrm{Var}_x[y]$ does not depend on the models. The Brier score can thus
be seen as a metric that is influenced by both the accuracy and the ECE of the
models. Similarly, NLL is a metric closely related to the Brier score on a log scale.
Consequently, sometimes better uncertainty estimates in terms of NLL or Brier score
[Figure B.1 plot area. Panels: SVHN (top row) and CIFAR-100 (bottom row); left column: test Brier score, right column: test accuracy. x-axis: Number of Models in the Ensemble (0 to 30). Legend: dropout, dropBlock, dropChannel, dropLayer, dropOmnibus.]
Figure B.1: Test Brier score (left) and accuracy (right) against the number of models
used for ensemble prediction at test time on SVHN and CIFAR-100. The number of
models corresponds to the number of different MC dropout instantiations of the same
model at test time. The model trained with omnibus dropout achieves the best
accuracy and Brier score.
can lead to slight drops in accuracy, as the reduction in calibration error outweighs
the increase in classification error. This phenomenon is indeed observed in practice.
Figure B.2 plots both NLL and accuracy against dropout rate
for all dropout methods considered in the chapter. For instance, while increasing
the dropout rate for the MC dropout model on the CIFAR-100
dataset from 0.1 to 0.2 leads to a reduction in NLL, it also causes a significant
drop in classification accuracy. Similar trends can be seen for MC dropChannel
on CIFAR-10. Nevertheless, the trade-off is not always present. For example,
increasing the dropout rate of MC dropout on the SVHN dataset also leads
to an increase in accuracy. In conclusion, when tuning for the optimal
dropout rate in practice, it can be beneficial to look at several metrics together
for a holistic view.
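The decomposition in (B.8)-(B.10) can be checked numerically on a small binary example. The sketch below is an illustration under stated assumptions: the conditional expectation E[y | H(x)] is estimated by grouping samples that share the same predicted value, all expectations are empirical averages, and the helper names and toy data are hypothetical.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance (divides by the number of samples)."""
    mu = mean(values)
    return mean([(v - mu) ** 2 for v in values])


def brier_decomposition(preds, labels):
    """Return (mse, var_y, var_cond, ece) for binary labels and predicted
    probabilities, where ece is the squared-difference form of (B.7)."""
    groups = defaultdict(list)
    for h, y in zip(preds, labels):
        groups[h].append(y)
    # Empirical E[y | H(x)] for every sample, via its level set of H.
    cond = [mean(groups[h]) for h in preds]
    mse = mean([(y - h) ** 2 for h, y in zip(preds, labels)])
    var_y = variance(labels)
    var_cond = variance(cond)
    ece = mean([(c - h) ** 2 for c, h in zip(cond, preds)])
    return mse, var_y, var_cond, ece


preds = [0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8, 0.5]
labels = [0, 0, 1, 1, 1, 1, 0, 1]
mse, var_y, var_cond, ece = brier_decomposition(preds, labels)
# The two sides of (B.10) agree up to floating-point error.
print(mse, var_y - var_cond + ece)
```

Because Var[y] is fixed by the data, any change in MSE between two models on this example must come from the accuracy-like term or from the calibration term.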
Table B.1: Results comparing accuracy and uncertainty estimates obtained using
a single model when drop rate = 0.1 for all models. The top performing result
for each metric is bold-faced. MC omnibus-dropout is the best method in general.

Dataset    Method            Accuracy ↑    NLL ↓            Brier ↓ (×10^-3)   ECE ↓ (×10^-2)
SVHN       Temp Scaling      95.7 ± 0.1    0.163 ± 0.002    6.62 ± 0.10        0.995 ± 0.160
SVHN       Dropout           96.4 ± 0.1    0.179 ± 0.004    5.68 ± 0.07        1.34 ± 0.10
SVHN       DropBlock         96.8 ± 0.1    0.133 ± 0.002    5.19 ± 0.07        1.26 ± 0.14
SVHN       DropChannel       96.5 ± 0.1    0.148 ± 0.002    5.41 ± 0.04        0.663 ± 0.050
SVHN       DropLayer         96.2 ± 0.1    0.154 ± 0.002    5.94 ± 0.10        1.13 ± 0.10
SVHN       Omnibus dropout   96.8 ± 0.1    0.133 ± 0.003    5.07 ± 0.07        0.616 ± 0.077
CIFAR10    Temp Scaling      93.9 ± 0.1    0.189 ± 0.002    9.06 ± 0.08        0.905 ± 0.114
CIFAR10    Dropout           93.8 ± 0.1    0.226 ± 0.008    9.44 ± 0.10        2.30 ± 0.09
CIFAR10    DropBlock         93.4 ± 0.1    0.203 ± 0.003    9.89 ± 0.10        0.743 ± 0.116
CIFAR10    DropChannel       93.7 ± 0.1    0.196 ± 0.006    9.20 ± 0.136       0.970 ± 0.171
CIFAR10    DropLayer         94.0 ± 0.2    0.206 ± 0.001    9.09 ± 0.17        0.941 ± 0.068
CIFAR10    Omnibus dropout   94.4 ± 0.1    0.173 ± 0.001    8.38 ± 0.10        0.607 ± 0.078
CIFAR100   Temp Scaling      74.5 ± 0.3    1.00 ± 0.01      3.57 ± 0.04        4.02 ± 0.62
CIFAR100   Dropout           74.8 ± 0.4    1.21 ± 0.01      3.71 ± 0.05        11.1 ± 0.4
CIFAR100   DropBlock         75.6 ± 0.2    1.04 ± 0.01      3.46 ± 0.02        6.98 ± 0.19
CIFAR100   DropChannel       75.3 ± 0.2    1.02 ± 0.01      3.43 ± 0.03        5.57 ± 0.08
CIFAR100   DropLayer         75.8 ± 0.3    1.04 ± 0.02      3.46 ± 0.04        7.42 ± 0.32
CIFAR100   Omnibus dropout   76.3 ± 0.1    1.00 ± 0.01      3.37 ± 0.02        7.11 ± 0.20
B.3 Additional Results
Supplementary Results on Diversity of Dropout Models. In Figure B.1, we
show plots of Brier score and accuracy against the number of models used for prediction
on the SVHN and CIFAR-100 datasets. As discussed in Section 3.3.5, patterns similar
to the plots obtained on the CIFAR-10 dataset in Figure 3 are also observed
consistently here. The only exception is the MC dropLayer model on the SVHN
dataset, which obtains better performance as an individual model but much smaller
improvements in both Brier score and test accuracy compared to the other
dropout methods. We highlight that this seemingly contradictory
result is likely caused by the shallow network used, an 18-layer ResNet. As no
down-sampling layers are dropped out for layer dropout, the effective number of
ResNet blocks that can be dropped is very small, leading to a much smaller dropout
rate compared to the other methods. This is not an issue with deeper models
in which the number of non-downsampling blocks is much larger than that of
downsampling ones.
[Figure B.2 plot area. Panels: SVHN (top row), CIFAR-10 (middle row) and CIFAR-100 (bottom row); left column: test NLL, right column: test accuracy. x-axis: Dropout Rate. Legend: dropout, dropBlock, dropChannel, dropLayer, dropOmnibus.]
Figure B.2: Plots of test time NLL (left) and accuracy (right) against dropout rate
for models trained with different types of dropout on the SVHN, CIFAR-10 and
CIFAR-100 datasets.
We also investigate the sensitivity of the methods to the choice of dropout
rate. To that end, we report the results obtained with a single model for each
method, with a fixed dropout rate of 0.1, a reasonable default value for dropout rate
in general. The results are shown in Table B.1. Possibly due to the combination
of all dropout methods, omnibus dropout also seems to be relatively insensitive to
the choice of dropout rate, performing well in all the experiments.
Results on Tuning the Dropout Rate. Figure B.2 shows plots of
NLL and accuracy against dropout rate for all models on all of the datasets. As
discussed in Section B.2, conflicts between NLL and accuracy can sometimes occur.
Interestingly, the NLL drastically increases after its minimum on all three datasets for
dropBlock, suggesting that the block size for dropBlock may be
too large towards later convolutional layers, where the size of the feature maps is
comparable to the block size.
APPENDIX C
SUPPLEMENTARY MATERIAL FOR ”ENHANCING
UNCERTAINTY ESTIMATES WITH EFFICIENT NEURAL
NETWORK ENSEMBLES”
C.1 A Brief Review of the Edge-Pop Algorithm
We give a brief review of the Edge-Pop algorithm in this section. For simplicity,
we describe the algorithm with a fully connected neural network. The algorithm
can be easily extended to the case of CNNs.
Suppose we have an L-layer fully connected NN with parameters
$w = \{W^{(1)}, ..., W^{(L)}\}$. If we let $x = x^{(0)}$ be the input to the NN and $x^{(h)}$ be
the h-th hidden layer of the NN, then a standard NN can be defined recursively
by
$$x^{(h)} = \sigma\left(W^{(h)} x^{(h-1)}\right), \quad 1 \le h \le L,$$
where $\sigma$ denotes a non-linear activation function such as the ReLU activation
function.
Now, in order to select a subset of weights from w, we learn a popup score for
each weight in the parameters w. We denote the popup
scores by $s = \{S^{(1)}, ..., S^{(L)}\}$. Note that each score matrix $S^{(h)}$ has the same
dimension as $W^{(h)}$. Then, given the set of score matrices, a set of binary
masks $m = \{M^{(1)}, ..., M^{(L)}\}$ can be generated. Specifically, for each score matrix
$S^{(h)}$, we sort the popup scores by magnitude within each layer.
With a pre-determined ratio k%, $M_{ij}^{(h)} = f(S_{ij}^{(h)}) = 1$ if $|S_{ij}^{(h)}|$ is among the top
k% highest scores in the h-th layer, and $M_{ij}^{(h)} = 0$ otherwise. Then, during the
forward pass of the NN with the Edge-Pop algorithm, the binary masks are applied to
the weight matrices before forward propagation:
$$x^{(h)} = \sigma\left(\left(M^{(h)} \circ W^{(h)}\right) x^{(h-1)}\right), \quad 1 \le h \le L,$$
where $\circ$ denotes the Hadamard product.
During the entire learning procedure of the Edge-Pop algorithm, the weight
matrices stay fixed, and only the score matrices are updated with gradient descent.
Note that, due to the use of binary masks, direct computation of the gradient is
impossible. As such, the straight-through gradient estimator is used, so
that the thresholding function f(·) is replaced by the identity function in the backward pass.
This allows us to approximate the gradient for $S_{ij}^{(h)}$ by
$$\frac{\partial L}{\partial x_i^{(h)}} \frac{\partial x_i^{(h)}}{\partial S_{ij}^{(h)}} = \frac{\partial L}{\partial x_i^{(h)}} W_{ij}^{(h)} x_j^{(h-1)},$$
where L denotes the cross-entropy loss. Given the gradient estimator, the popup
scores can then be updated via stochastic gradient descent.
Lastly, we note that a naive random initialization of popup scores as proposed
by Ramanujan et al. [177] can lead to significantly worse performance. To this
end, we instead choose to initialize the popup scores based on the weights of the
trained NNs. This is inspired by the recent success of magnitude-based pruning
techniques. As such, for each layer h, we initialize the scores by
$$S_{ij}^{(h)} = \frac{W_{ij}^{(h)}}{\max(|W^{(h)}|)},$$
where $\max(|W^{(h)}|)$ denotes the maximum magnitude of the matrix $W^{(h)}$, so that all
popup scores are normalized to lie in [−1, 1]. A similar approach was also adopted
by Sehwag et al. [187].
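The masking, forward pass and score initialization reviewed above can be sketched for a single fully connected layer as follows. This is a minimal illustration, not the implementation used in the experiments: the function names are hypothetical, ties at the threshold keep slightly more than the requested fraction of weights, the weight matrix is assumed to contain at least one non-zero entry, and the index convention (row i is an output unit, column j an input unit) is an assumption.

```python
def init_scores(weights):
    """Magnitude-based initialization: S_ij = W_ij / max|W| (assumes max|W| > 0)."""
    max_abs = max(abs(w) for row in weights for w in row)
    return [[w / max_abs for w in row] for row in weights]


def top_k_mask(scores, ratio):
    """M_ij = 1 if |S_ij| is among the top `ratio` fraction of the layer, else 0."""
    flat = sorted((abs(s) for row in scores for s in row), reverse=True)
    keep = max(1, int(round(ratio * len(flat))))
    threshold = flat[keep - 1]
    return [[1.0 if abs(s) >= threshold else 0.0 for s in row] for row in scores]


def masked_layer(weights, mask, x):
    """Forward pass of one layer: ReLU((M o W) x)."""
    return [max(0.0, sum(m * w * v for m, w, v in zip(mrow, wrow, x)))
            for mrow, wrow in zip(mask, weights)]


def score_gradient(upstream, weights, x):
    """Straight-through estimate: dL/dS_ij ~ upstream_i * W_ij * x_j."""
    return [[g * w * v for w, v in zip(wrow, x)]
            for g, wrow in zip(upstream, weights)]


W = [[0.5, -2.0], [1.0, 0.1]]
S = init_scores(W)
M = top_k_mask(S, ratio=0.5)
print(M)
print(masked_layer(W, M, [1.0, 1.0]))
```

Only the scores would be updated during training; the weights `W` stay fixed, which is why the mask is recomputed from the scores before every forward pass.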
C.2 Additional Ablation Studies
Optimal Number of Subnetworks in MIMO Networks. We experimentally
investigate the number of subnetworks that can be fit into a MIMO network with
the ResNet18 model using CIFAR10 and CIFAR100. We use the identical training
procedure for all models as described in Section 5, varying only the number of
subnetworks, as a direct comparison against our proposed orthogonal dropout method.
Note that for MIMO models, changing the number of subnetworks amounts to
simply adjusting the number of input images and the number of linear classification
layers; three subnetworks correspond to a network with three inputs and three
outputs.
Results are summarized in Figure C.1. A direct comparison of MIMO networks
against our proposed orthogonal dropout strategy reveals that we can fit many
more models into a network of the same capacity. Indeed, for the ResNet18 model,
having even three subnetworks in the MIMO networks can significantly degrade
accuracy. We hypothesize that this is due to the way subnetworks are
implemented in MIMO networks. During the training of MIMO networks, multiple
inputs are concatenated together and fed into the networks simultaneously. When
the number of subnetworks grows, the number of channels in the concatenated
inputs also grows proportionally, thereby making the simultaneous training of the
subnetworks harder. After all, each input in the stack of inputs is independent of
the others. Yet, there is no explicit constraint or regularization in MIMO networks
to enforce such independence. As such, when the number of subnetworks becomes
large, it can be hard for the networks to capture such independence between the input
images by themselves.
[Figure C.1 plot area. Panels (a)-(c): CIFAR-10; panels (d)-(f): CIFAR-100. y-axes: accuracy, NLL and ECE. x-axis: Number of Models (1 to 5). Legend: MIMO, Ours.]
Figure C.1: Plot of accuracy/NLL/ECE against number of models in the ensem-
bles. For the proposed method, the number of models is varied by changing the
size of each subnetwork and all the orthogonal dropout ensembles are of the same
size. For MIMO networks, the number of models is varied by changing the number
of inputs and outputs (classifier layers) of the networks.
Table C.1: Comparison against deep ensembles with reduced convolutional kernel
size, so that deep ensemble has the same number of parameters as orthogonal
dropout. ”FC” stands for fixed classification.

                            CIFAR10                              CIFAR100
Method                      Accuracy (↑)   NLL (↓)   ECE (↓)     Accuracy (↑)   NLL (↓)   ECE (↓)
rescaled Ensemble + FC      94.6%          0.174     0.0104      76.0%          0.911     0.0256
Orthogonal Dropout + FC     95.1%          0.157     0.0082      77.7%          0.864     0.0191

Table C.2: Comparison against baseline methods when all methods have fixed
classification layer.

                            CIFAR10                              CIFAR100
Method                      Accuracy (↑)   NLL (↓)   ECE (↓)     Accuracy (↑)   NLL (↓)   ECE (↓)
Dropout + FC                94.5%          0.187     0.0185      73.6%          1.12      0.0863
Batch Ensemble + FC         94.6%          0.210     0.0303      74.6%          1.02      0.0898
MIMO + FC                   94.6%          0.182     0.0146      75.1%          0.988     0.0384
Masksemble + FC             93.5%          0.203     0.0090      73.6%          0.969     0.0143
Orthogonal Dropout + FC     95.1%          0.157     0.0082      77.7%          0.864     0.0191
Additional Experiments with Deep Ensembles. To further demonstrate the
effectiveness of the proposed method, we conduct a comparison against an ensemble
of 5 independently trained networks using ResNet, each with the number of convolutional
filters reduced by a factor of $1/\sqrt{5}$, so that in total, the size of this explicit
ensemble is the same as that of an orthogonal dropout model. We report the comparison
in Table C.1. Our proposed method outperforms this ensemble on every metric
reported, on both CIFAR10 and CIFAR100.
Additional Experiments with Fixed Classifier Layer. To further demonstrate
that fixing the classification layer is not the main source of increase in
performance, we conduct an additional experiment and fix the classification layer
for all baseline methods using ResNet-18 and the CIFAR datasets. Results of the
experiments are summarized in Table C.2. The proposed method outperforms all
other SOTA methods in accuracy and NLL on both datasets and attains the lowest
ECE on CIFAR10; only Masksemble attains a lower ECE on CIFAR100.
Table C.3: Comparison against other types of dropout.

                            CIFAR10                              CIFAR100
Method                      Accuracy (↑)   NLL (↓)   ECE (↓)     Accuracy (↑)   NLL (↓)   ECE (↓)
Dropout                     94.4%          0.191     0.0202      73.3%          1.11      0.0902
DropChannel                 94.3%          0.239     0.0343      73.9%          1.21      0.136
DropBlock                   94.5%          0.163     0.0139      73.3%          1.226     0.137
Orthogonal Dropout          95.1%          0.157     0.0082      77.7%          0.864     0.0191
Additional Experiments with Dropout. We conduct an additional ablation
study to compare against more advanced forms of dropout such as
DropChannel and DropBlock. Results of the experiments are summarized in
Table C.3.
APPENDIX D
SUPPLEMENTARY MATERIAL FOR ”TOWARDS A DEEPER
UNDERSTANDING OF KNOWLEDGE DISTILLATION”
D.1 On Label Smoothing and Predictive Uncertainty Regularization
We first give a derivation on the equivalence of label smoothing regularization and
Eq. 6.7. With some simple rearrangement of the terms,
$$\mathcal{L}_{LS} = \sum_{i=1}^n -\log[z]_{y_i} + \beta \sum_{i=1}^n \sum_{c=1}^k -\frac{1}{k} \log[z]_c$$
$$= -(1 + \beta) \sum_{i=1}^n \left( \frac{k + \beta}{k(1 + \beta)} \log[z]_{y_i} + \sum_{c \neq y_i} \frac{\beta}{k(1 + \beta)} \log[z]_c \right).$$
The above objective is clearly equivalent to the label smoothing regularization with
$1 - \epsilon = \frac{k + \beta}{k(1 + \beta)}$, up to a constant factor of $(1 + \beta)$.
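The rearrangement can be checked numerically for a single sample. The sketch below is only an illustration: the function names and the example probabilities are hypothetical, and [z] is taken to be a softmax output vector with strictly positive entries.

```python
import math


def loss_regularized_form(probs, y, beta):
    """-log z_y + beta * sum_c -(1/k) log z_c for one sample."""
    k = len(probs)
    return -math.log(probs[y]) - (beta / k) * sum(math.log(p) for p in probs)


def loss_smoothed_form(probs, y, beta):
    """(1 + beta) times the cross-entropy against the smoothed soft label."""
    k = len(probs)
    on = (k + beta) / (k * (1 + beta))   # weight on the true class, i.e. 1 - epsilon
    off = beta / (k * (1 + beta))        # weight on each of the other k - 1 classes
    target = [on if c == y else off for c in range(k)]
    cross_entropy = -sum(t * math.log(p) for t, p in zip(target, probs))
    return (1 + beta) * cross_entropy


probs = [0.7, 0.2, 0.1]
print(loss_regularized_form(probs, 0, beta=0.5))
print(loss_smoothed_form(probs, 0, beta=0.5))
```

The soft-label weights sum to one for any k and β, which is what makes the second form a valid label smoothing target.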
Label smoothing regularizes predictive uncertainty. The amount of regulariza-
tion is controlled by the amount of smoothing applied. Evidently, the objective
does not regularize confidence diversity. Indeed, assuming a NN with capacity
capable of fitting the entire training data, predictions on training data will be
pushed arbitrarily close to the smoothed soft label. Empirical evidence for this
form of overfitting can be seen from experiments done by Müller et al. [153], in
which the authors demonstrated that applying label smoothing leads to hampered
distillation performance. The authors hypothesize that this is likely due to erasure
of ”relative information between logits” when label smoothing is applied, hinting
at the overfitting of predictions to the smoothed labels.
A closely related regularization technique is to explicitly regularize on predictive
uncertainty:
$$\mathcal{L}_{PU} = -\sum_{i=1}^n \log[z]_{y_i} + \beta \frac{1}{n} \sum_{j=1}^n \sum_{c=1}^k [z]_c \log[z]_c.$$
Prior papers [38, 169] have demonstrated that directly regularizing predictive
uncertainty can lead to better performance than label smoothing. However, we note
that the above objective does not regularize confidence diversity either. In fact,
it can be shown, with the method of Lagrange multipliers, that the optimum
of the objective above is achieved when
$$[z]_{y_i} = \frac{1}{\beta W(\exp(-1/\beta)(k - 1)/\beta) + 1},$$
where W corresponds to the Lambert W function, and $[z]_c = \frac{1 - [z]_{y_i}}{k - 1}$ for all $c \neq y_i$, for all
sample pairs $(x_i, y_i)$. As such, the global optimum obtained by directly regularizing
predictive uncertainty is identical to that of label smoothing. In practice,
differences between the two can arise due to the details of the optimization procedure
(like early stopping), and/or due to model capacity.
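The closed-form optimum can be sanity-checked numerically. The sketch below assumes the per-sample objective −log z_y + β Σ_c z_c log z_c with the probability mass outside the true class shared equally (an assumption about how the regularizer is normalized), under which the stationarity condition reads log(z_y / z_c) = 1 / (β z_y). The Lambert W function is evaluated with a few Newton iterations; all function names are hypothetical.

```python
import math


def lambert_w(x, iters=50):
    """Principal branch of the Lambert W function for x >= 0 (Newton's method)."""
    w = math.log1p(x)  # starting point at or above the root for x >= 0
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - x) / (ew * (w + 1.0))
    return w


def optimal_true_class_prob(beta, k):
    """z_y = 1 / (beta * W(exp(-1/beta) (k - 1) / beta) + 1)."""
    arg = math.exp(-1.0 / beta) * (k - 1) / beta
    return 1.0 / (beta * lambert_w(arg) + 1.0)


def stationarity_residual(z_y, beta, k):
    """log(z_y / z_c) - 1 / (beta * z_y); zero at the stationary point."""
    z_c = (1.0 - z_y) / (k - 1)
    return math.log(z_y / z_c) - 1.0 / (beta * z_y)


z = optimal_true_class_prob(beta=0.5, k=10)
print(z, stationarity_residual(z, beta=0.5, k=10))
```

Since the optimal prediction puts a fixed weight on the true class and spreads the remainder uniformly, it has the same shape as a label-smoothed target.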
D.2 Additional Experiments with Temperature Scaling on
Student Models
To examine the effect of not applying temperature scaling on student models, we
conduct an experiment to compare models trained with and without temperature
scaling on student models for the distillation loss with the ResNet-34 on the
CIFAR-100 dataset, using the training objective of Eq. 6.3. On top of the hyper-parameter
α = 0.4 used for experiments in Section 6.7, we also include results with α = 0.1, a
widely used value for knowledge distillation in prior work [30]. We vary the amount
of temperature scaling applied to illustrate the effect that different temperatures have
on student models.
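The two variants compared in this section can be sketched as follows. This is an illustration under stated assumptions rather than the exact form of Eq. 6.3: the loss is taken to be a convex combination of the hard-label cross-entropy and a soft-label cross-entropy, with α weighting the distillation term (the weighting convention is an assumption), and all function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


def distillation_loss(student_logits, teacher_logits, label, alpha, T, scale_student):
    """scale_student=True: 'Scale both'; False: 'Scale teacher only'."""
    k = len(student_logits)
    one_hot = [1.0 if c == label else 0.0 for c in range(k)]
    hard = cross_entropy(one_hot, softmax(student_logits))
    soft_target = softmax(teacher_logits, T)
    student_T = T if scale_student else 1.0
    soft = cross_entropy(soft_target, softmax(student_logits, student_T))
    return (1.0 - alpha) * hard + alpha * soft  # assumed weighting convention


student = [2.0, 0.5, -1.0]
teacher = [3.0, 1.0, -2.0]
print(distillation_loss(student, teacher, 0, alpha=0.4, T=4.0, scale_student=True))
print(distillation_loss(student, teacher, 0, alpha=0.4, T=4.0, scale_student=False))
```

At T = 1 the two variants coincide; they differ only in whether the student's own predictions are softened before being matched to the softened teacher.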
[Figure D.1 plot area. Panels (a), (b): alpha=0.1; panels (c), (d): alpha=0.4. x-axis: Temperature (2.0 to 5.0). Legend: Scale both, Scale teacher only.]
Figure D.1: Left: Test accuracies of ResNet-34 models on the CIFAR-100 dataset
when varying temperature. Right: ECE of ResNet-34 models on the CIFAR-
100 dataset when varying temperature. ”Scale both” corresponds to the origi-
nally proposed distillation objective in which both teacher and student models are
temperature-scaled during training. ”Scale teacher only” corresponds to only tem-
perature scaling teacher models during distillation. The green flat line represents
the performance achieved by the teacher model trained with cross-entropy loss.
Plots of test accuracy and ECE against the amount of temperature scaling applied
are shown in Fig. D.1. Firstly, we observe that models trained with student scaling
have ECE almost identical to that of the teacher models. In direct contrast, we
see that the student models trained without student scaling are in general much
better calibrated than their teachers. Note that the relatively
large ECE when α = 0.4 and T > 3 is likely due to overly unconfident teacher
predictions. In addition, we highlight that, with the optimal hyper-parameters of
α and T used, student models trained without student scaling can also outperform
significantly in terms of test accuracy. We acknowledge that there can be conflicts
between ECE and accuracy, as seen from the superior test accuracy
but poor ECE achieved for α = 0.4 and T = 4.0. In practice, we can use the
negative log likelihood, a metric influenced by both ECE and accuracy, to find
the optimal α and T. Lastly, we note that both α and T alter the amount of
predictive uncertainty and confidence diversity in teacher predictions at the same
time. This coupled effect could be the reason for the observed conflict between
ECE and accuracy. We leave it as future work to explore alternative ways to
decouple the two measures for more efficient and effective parameter search. We
believe a decoupled set of parameters can lead to models with better calibration
and accuracy at the same time.
D.3 Additional Experiments on Sequential Self-Distillation
with Different Temperatures
[Figure D.2 plot area. Two rows of four panels each: Test Accuracy, Test NLL, Predictive Uncertainty, Confidence Diversity. x-axis: Generations (0 to 5).]
Figure D.2: Results for sequential self-distillation over 5 generations are shown
above for different temperatures. Top: temperature T = 2.0; Bottom: tempera-
ture T = 3.0. The same temperatures are used throughout the entire sequential
distillation process. Model obtained at the (i − 1)-th generation is used as the
teacher model for training at the i-th generation. Accuracy and NLL are obtained
on the test set using the student model, whereas the predictive uncertainty and
confidence diversity are evaluated on the training set with teacher predictions.
To further verify the observation on predictive uncertainty and confidence
diversity made empirically in Section 4, we conduct additional sequential self-distillation
experiments with different values of temperature. Figure D.2 summarizes the
results when the temperature is 2 (top) and 3 (bottom) respectively. As seen clearly,
test accuracy and NLL correlate strongly with confidence
diversity, further demonstrating the importance of confidence diversity for greater
generalizability in neural networks.
D.3.1 Additional Experiments with Different Amount of
Label Smoothing ϵ
In order to verify that the conclusions drawn from our empirical experiments hold
more generally, we conduct additional experiments varying the amount of label
smoothing ϵ. Additional smoothing parameters of ϵ = 0.1 and ϵ = 0.3 are used.
For a fair comparison, given the label smoothing parameter ϵ, hyper-parameters
for Beta smoothing and self-distillation are adjusted so that the average amount of
smoothing applied to samples is the same as that of label smoothing.
Experimental results are summarized in Figure D.3. Observe that the general trend in
terms of both accuracy and calibration holds across different values of ϵ.
[Figure D.3 plot area. Panels (a), (b): CIFAR-100, ResNet-34; panels (c), (d): CUB-200, ResNet-34; panels (e), (f): Tiny-ImageNet, ResNet-34. Each panel shows accuracy (top) and ECE (bottom) bar charts for LS, B and SD.]
Figure D.3: Experimental results on the CIFAR-100, CUB-200 and
Tiny-Imagenet datasets with different amounts of label smoothing. Left: ϵ = 0.1,
Right: ϵ = 0.3. ”CE”, ”LS”, ”B” and ”SD” refer to ”Cross Entropy”, ”Label
Smoothing”, ”Beta Smoothing” and ”Self-Distillation” respectively. The top rows
of each experiment show bar charts of accuracy on the test set for each experiment
conducted, while the bottom rows are bar charts of expected calibration error.
[Figure D.4 plot area. Panel (a): CIFAR-100, ResNet-34; panel (b): CUB-200, ResNet-34; panel (c): Tiny-ImageNet, ResNet-34. Each panel shows accuracy (top) and ECE (bottom) bar charts for B and ST.]
Figure D.4: Additional results to compare Beta smoothing against self-training
explicitly with the EMA predictions. ”B” and ”ST” refer to ”beta smoothing”
and ”self-training” respectively. The top rows of each experiment show bar charts
of accuracy on the test set for each experiment conducted, while the bottom rows
are bar charts of expected calibration error.
D.4 Additional Experiments with Self-Training Using
EMA-Predictions
The proposed beta smoothing involves the use of EMA predictions to rank the
confidence of samples within each minibatch during training in order to achieve
instance-specific regularization. To further demonstrate that the gain in accuracy
and calibration obtained through beta smoothing mainly comes from instance-
specific regularization, we compare Beta smoothing against explicit self-training
using the EMA predictions in which the EMA predictions are directly used as soft
labels to compute cross-entropy loss. We follow the training procedure as described
in [201] for self-training with EMA predictions. Results using ResNet for all the
datasets considered in this chapter are summarized in Figure D.4. Beta smoothing
outperforms self-training using EMA predictions on all of the experiments con-
ducted in terms of both accuracy and calibration. As such, while EMA predictions
can be used as a reliable proxy to rank the relative confidence of samples, the
predictions themselves are sub-optimal when used as teachers directly.
D.5 Additional Experiments with CIFAR-10 When Vary-
ing Trainset Size
Recent results show relatively small gains when performing knowledge distillation
on the CIFAR-10 dataset [30, 49]. Our perspective of distillation as regularization
provides a plausible explanation for this observation. Like all other forms
of regularization, its effect diminishes as the size of the training data increases.
We experimentally verify the claim by training ResNet-34 models with a varying
number of training samples. The experiments are repeated 3 times. Fig. D.5
summarizes the results. As expected, increasing the sample size leads to an increase in
test accuracy for both of the models. Nevertheless, the relative improvement in
the accuracy of the student model compared to the teacher decreases as the size of
the training set increases, indicating that distillation is a form of regularization.
[Figure D.5 plot area. Panel (a): x-axis: Training Set Size (5000 to 45000); legend: Teacher, Student. Panel (b): plot not recoverable from the source.]
Figure D.5: Left: Test accuracies of ResNet-34 models on the CIFAR-10 dataset
for the teacher and student models when the training set size is varied. Right: The
relative improvements in accuracy when the training set size is varied.
D.6 Additional Experiments with CIFAR-100 When Vary-
ing Weight Decay
To further demonstrate that distillation is a regularization process, we also
conduct an additional experiment on the CIFAR-100 dataset using ResNet-34, varying
only the weight decay hyper-parameter. Intuitively, larger weight decay
regularization makes NNs less prone to overfitting, which should, in turn, reduce the
additional benefits obtainable from self-distillation, if it is indeed a form of
regularization. To keep the quality of priors identical across all student models, we
[Figure D.6 plot area. Panel (a): Test Accuracy; legend: Student, Teacher. Panel (b): accuracy improvement. x-axis: Weight Decay (0.0000 to 0.0010).]
Figure D.6: Left: Test accuracies of ResNet-34 models on the CIFAR-100 dataset
for the teacher and student models when the weight decay hyper-parameter is
varied. Right: The relative improvements in accuracy when the weight decay
hyper-parameter is varied.
use the same teacher model, obtained with a weight decay of $10^{-4}$, for all
distillation runs. Our results are summarized in Fig. D.6. It is evident that
increasing the weight decay hyper-parameter leads to a much smaller improvement in test
accuracy. Interestingly, we see a noticeable gain in accuracy for baseline models
trained with cross-entropy when adjusting the weight decay term, contradicting
some of the recent findings that weight decay is ineffective for neural networks.
D.7 Additional Experiments on Beta Smoothing
[Figure D.7 plot area. Panels: (a) CIFAR-100, ResNet-34; (b) CIFAR-100, DenseNet-100-12; (c) CUB-200, ResNet-34; (d) CUB-200, DenseNet-121-12; (e) Tiny-Imagenet, ResNet-34; (f) Tiny-Imagenet, DenseNet-100-12. Each panel shows accuracy (top) and ECE (bottom) bar charts for LS, RB and B.]
Figure D.7: Ablation study on Beta smoothing. ”LS”, ”RB” and ”B” refer to
”Label Smoothing”, ”Random Beta Smoothing” and ”Beta Smoothing”
respectively. The top rows of each experiment show bar charts of accuracy on the test
set for each experiment conducted, while the bottom rows are bar charts of
expected calibration error.
We conduct an ablation study on the proposed Beta smoothing regularization
in order to demonstrate the importance of relative ranking. To do so, we run
experiments with the identical setup as described in Section 6.7 for Beta smoothing,
but with soft label noise from the Beta distribution assigned to samples completely at random.
We term this “random Beta smoothing”. Results are shown in Fig. D.7. For
convenience, we also include results obtained with regular label smoothing as a
benchmark comparison. As seen clearly, the proposed Beta smoothing with ranking
obtained from EMA predictions leads to much better results in general in terms
of both accuracy and ECE, suggesting that naively encouraging confidence
diversity does not lead to significant improvements, and that the relative confidence among
different samples is also an important aspect in order to obtain better student
models. This ablation study also serves as indirect evidence for why self-distillation
still outperforms Beta smoothing: with a pre-trained model, much more reliable
relative confidence among training samples can be obtained.
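The difference between ranked and random Beta smoothing can be sketched as below. This is an illustrative reading of the ablation rather than the exact recipe of Section 6.7: the Beta(a, 1) parameterization, the soft-label form (a sampled weight on the true class with the remainder spread uniformly) and the direction of the ranking (more confident samples receive larger true-class weights) are all assumptions, and the function names are hypothetical.

```python
import random


def soft_label(true_class, weight, k):
    """Put `weight` on the true class and spread the rest uniformly."""
    rest = (1.0 - weight) / (k - 1)
    return [weight if c == true_class else rest for c in range(k)]


def beta_smoothing_targets(labels, ema_confidence, k, a=4.0, ranked=True, seed=0):
    """Soft labels for one mini-batch.

    ranked=True : sampled weights are matched to samples by EMA confidence rank.
    ranked=False: the same weights are assigned to samples at random.
    """
    rng = random.Random(seed)
    weights = sorted(rng.betavariate(a, 1.0) for _ in labels)
    if ranked:
        order = sorted(range(len(labels)), key=lambda i: ema_confidence[i])
    else:
        order = list(range(len(labels)))
        rng.shuffle(order)
    targets = [None] * len(labels)
    for w, i in zip(weights, order):
        targets[i] = soft_label(labels[i], w, k)
    return targets


labels = [0, 1, 2, 1]
ema_confidence = [0.9, 0.2, 0.6, 0.4]  # hypothetical EMA confidences
ranked = beta_smoothing_targets(labels, ema_confidence, k=3, ranked=True)
shuffled = beta_smoothing_targets(labels, ema_confidence, k=3, ranked=False)
print(ranked)
print(shuffled)
```

Both variants draw the same multiset of smoothing weights, so they inject the same amount of confidence diversity; only the ranked variant ties that diversity to the relative confidence of the samples.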
D.8 Additional Experiments on the Effect of Quality of
Teachers
We also perform an additional experiment, with the same setup as described in
Section 6.7, on cross-distillation of the ResNet and DenseNet models: a
ResNet-34 teacher is used to train the DenseNet-100 student and vice versa, in
an attempt to examine the effect of better or worse priors in self-distillation.
Hyperparameters are fixed in this case such that the predictive uncertainty and
diversity associated with the label predictions remain the same as for
self-distillation. Results are summarized in Fig. D.8. As seen from the
consistently better performance of cross-distillation for ResNet and the
consistently worse performance for DenseNet, better teachers lead to better
performance. Thus, in addition to diversity among teacher predictions, the
[Figure D.8 panels: (a) CIFAR-100, ResNet-34; (b) CIFAR-100, DenseNet-100-12; (c) CUB-200, ResNet-34; (d) CUB-200, DenseNet-121-12; (e) Tiny-Imagenet, ResNet-34; (f) Tiny-Imagenet, DenseNet-100-12. Each panel shows accuracy (top) and ECE (bottom) for SD and CD.]
Figure D.8: Additional results on cross-distillation. “SD” and “CD” refer to
“self-distillation” and “cross-distillation”, respectively. The top rows of each
experiment show bar charts of accuracy on the test set for each experiment
conducted, while the bottom rows are bar charts of expected calibration error.
quality of the instance-specific prior used is also important for better generalization
performance. Lastly, we also see an apparent benefit in terms of model calibration
when a better teacher model is used.
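For reference, the calibration metric reported in the bottom rows of these
figures can be sketched as below. This is a minimal version of the commonly used
binned estimator of expected calibration error, not the evaluation code used for
the experiments. The number of bins and the equal-width binning are assumptions.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: the sample-weighted average, over confidence bins, of
    |accuracy in bin - mean confidence in bin|.

    confidences: predicted probability of the predicted class, in [0, 1].
    correct:     1 if the prediction was right, else 0.
    num_bins:    number of equal-width bins (assumed; not from the thesis).
    """
    n = len(confidences)
    conf_sum = [0.0] * num_bins
    hit_sum = [0.0] * num_bins
    for conf, hit in zip(confidences, correct):
        # Equal-width bins [j/num_bins, (j+1)/num_bins); conf == 1.0 goes
        # into the last bin.
        j = min(int(conf * num_bins), num_bins - 1)
        conf_sum[j] += conf
        hit_sum[j] += hit
    # |sum of hits - sum of confidences| / n equals
    # (bin size / n) * |bin accuracy - bin mean confidence|.
    return sum(abs(h - c) for h, c in zip(hit_sum, conf_sum)) / n
```

A model that predicts with confidence 0.9 but is right only half of the time
has a gap of about 0.4 in its single occupied bin, while a model whose
confidence matches its accuracy in every bin has an ECE of zero.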
The interpretation of distillation as sample-specific regularization provides a
reasonable explanation of why deeper NNs are potentially better teachers. With
greater capacity, deeper networks can learn better representations that capture
more closely the true underlying relative confidence among samples, thereby
generating better priors and hence better performance. When overly expressive
models are used, however, there can be so much overfitting to the ground truth
labels that the meaningful rankings are destroyed, despite better accuracy.
Recent findings experimentally corroborate our argument [30]. Similar
observations were also made when label smoothing is applied [153]. From the
regularization perspective, distillation can also be applied to very deep
networks for potential improvements, and shallower models can also serve as
teachers for deeper student networks.
D.9 Additional Experiments on Varying γ
In addition, we consider a simple variation of the distillation loss obtained by
varying γ. However, directly adjusting γ can be problematic in practice. To
understand the effect of changing γ, suppose we have some γ such that
$[\alpha_x]_c - 1 < 0$ for some $c \in \{1, \ldots, k\}$. Since the minimization
objective with respect to this class is $-([\alpha_x]_c - 1)\log([z]_c)$, the
closer $[z]_c$ is to 0, the smaller the loss. This leads to numerical issues, as
the overall loss can be pushed to negative infinity by forcing $[z]_c$
arbitrarily close to zero.
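The unboundedness described above can be checked numerically. The snippet below
is only an illustration of the single-class term of the objective; the value 0.5
for $[\alpha_x]_c$ and the function name are hypothetical choices for the
example.

```python
import math


def class_term(alpha_c, z_c):
    """Contribution -(alpha_c - 1) * log(z_c) of a single class c."""
    return -(alpha_c - 1.0) * math.log(z_c)


alpha_c = 0.5  # illustrative value satisfying alpha_c - 1 < 0
# Evaluate the term as the predicted probability z_c shrinks toward zero.
shrinking = [class_term(alpha_c, z) for z in (1e-1, 1e-4, 1e-8, 1e-16)]

# For comparison, with alpha_c - 1 > 0 the same term grows as z_c -> 0,
# so the optimizer is penalized, not rewarded, for collapsing z_c.
growing = [class_term(2.0, z) for z in (1e-1, 1e-4, 1e-8, 1e-16)]
```

With the coefficient negative, each entry of `shrinking` is smaller than the
one before it and the sequence has no lower bound, which is the failure mode
that motivates the pruning scheme.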
To circumvent this numerical problem during optimization, we make the
observation that the above objective is essentially equivalent to setting the
particular element with $[\alpha_x]_c - 1 < 0$ to zero. As such, adjusting the
threshold γ enables
[Figure D.9 panels: (a) CIFAR-100, ResNet-34; (b) CIFAR-100, DenseNet-100-12; (c) CUB-200, ResNet-34; (d) CUB-200, DenseNet-121-12; (e) Tiny-Imagenet, ResNet-34; (f) Tiny-Imagenet, DenseNet-100-12. Each panel shows accuracy (top) and ECE (bottom) for SD and PD.]
Figure D.9: Additional results on pruned distillation. “SD” and “PD” refer to
“self-distillation” and “pruned-distillation”, respectively. The top rows of
each experiment show bar charts of accuracy on the test set for each experiment
conducted, while the bottom rows are bar charts of expected calibration error.
us to prune out the smallest elements of the teacher predictions. To further force
the pruned elements to zero, a new softmax probability vector is computed with
the remaining elements. In practice, setting the optimal γ can be challenging.
We instead choose to prune out a fixed percentage of classes for all samples. For
instance, pruning 50% of the classes for a 100-class classification problem
amounts to using only the 50 most confident classes to compute the softmax and
setting the remaining ones to zero. We term this method pruned-distillation.
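The pruning step can be sketched as follows. This is a minimal illustration
under stated assumptions, not the code used for the experiments: the function
name and the `prune_fraction` parameter are hypothetical, the teacher is
represented by a plain list of logits for one sample, and any temperature
scaling is omitted.

```python
import math


def pruned_teacher_probs(logits, prune_fraction=0.5):
    """Keep the most confident classes of one teacher prediction, recompute
    the softmax over them, and set every pruned class to exactly zero."""
    k = len(logits)
    num_keep = max(1, k - int(k * prune_fraction))
    # Indices of the num_keep largest logits.
    keep = sorted(range(k), key=lambda c: logits[c], reverse=True)[:num_keep]
    # Softmax restricted to the kept classes (max-shifted for stability).
    top = max(logits[c] for c in keep)
    exps = {c: math.exp(logits[c] - top) for c in keep}
    total = sum(exps.values())
    probs = [0.0] * k
    for c in keep:
        probs[c] = exps[c] / total
    return probs


# Hypothetical 4-class teacher output; half of the classes are pruned.
targets = pruned_teacher_probs([2.0, 0.0, 1.0, -1.0], prune_fraction=0.5)
```

Because the same fraction of classes is pruned for every sample, no per-sample
threshold has to be tuned, and the resulting target is still a valid
probability vector.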
Fig. D.9 shows some preliminary results for pruned-distillation with 50% of the
classes pruned during distillation. While the overall performance remains
roughly the same on the CIFAR-100 and Tiny-Imagenet datasets, a slight
improvement can be seen on CUB-200 in terms of both accuracy and ECE. This
suggests that the method is an easy-to-implement adjustment that does not harm
performance in these experiments.