TOWARDS ROBUST VISUAL PERCEPTION SYSTEMS IN REAL-WORLD ENVIRONMENTS A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Hubert Lin May 2022 ©c 2022 Hubert Lin ALL RIGHTS RESERVED TOWARDS ROBUST VISUAL PERCEPTION SYSTEMS IN REAL-WORLD ENVIRONMENTS Hubert Lin, Ph.D. Cornell University 2022 Computer vision models are conventionally trained on, and benchmarked against, curated datasets. While very high performance can be achieved in these settings, it is more challenging to deploy these models in the real world. When computer vision models are used in new environments, they must over- come distribution shift between their training data and their test environment. We consider computer vision models to be robust when they can perform well on a variety of images captured from different environments. This dissertation explores several research problems towards building perception systems that are robust in the real world. Building and labeling datasets is a key step for training a strong computer vision system. Capturing and representing a large diverse set of images allows our models to have experience with images from a variety of environments. However, data annotation is difficult, so efficiently labeling data is an important challenge to consider. It is also useful to consider the human visual system as a gold-standard point of reference for a robust visual system, as humans can readily understand new images. Paintings are an interesting type of image which are created by humans for human consumption – a key property of artwork lies in its ability to convey perceptual realism without necessarily being physically realistic. We would like our computer vision models to be able to understand paintings as well, and learning from paintings may allow computer vision models to be more robust. Lastly, even our best efforts to build robust models will not lead to a perfect model, and it is important to reason about failures and uncertainties when these models are used. We touch upon each of these challenges in the dissertation. First, a direct way to overcome distribution shift is to as closely mimic the real world distribution of (image, label) pairs as possible in the datasets we use. However, labeling data is very expensive and time-consuming. In the first work, we propose a new efficient annotation method for semantic segmentation. Our pipeline divides the traditional per-pixel annotation task for an entire image into per-pixel annotation for image subregions. This task is more palatable for annotators and leads to an increase in label quality for a lower cost. Furthermore, we find that only annotating a small number of image subregions is sufficiently informative for models to outperform conventional annotation and inexpensive weakly-supervised annotation methods. Next, we explore paintings as a medium for studying perception systems, and their utility as an alternative source of training data from natural images. The image distribution of paintings are different from the distribution of natural photographs, but paintings often depict meaningful objects and scenery that parallel those found in the real world. Common computer vision models are developed for natural photographs, and most existing labeled datasets focus on natural images. We explore several use cases of such models applied to paintings instead. Furthermore, paintings may emphasize features or invariances that humans utilize for robust perception, as they can be perceptually realistic without being physically realistic. Indeed, for finegrained fabric classification, we find evidence that models trained on paintings focus on cues that are both more interpretable and generalizable. The next work extends our previous findings by systematically exploring the invariances encoded in paintings for natural image recognition. We study how models behave when trained with real paintings and style transfer. Style trans- fer is a type of data augmentation that promises to create painting-like images from natural photographs. We train models on natural photographs, paintings, and/or stylized images, and evaluate them on test data representing real-world distribution shifts. Perception models must overcome such shifts to successfully understand different environments – for example, the test photographs may contain noise, or be drawn from another dataset which was sampled from differ- ent viewpoints than the training images. Our results show that learning from a combination of natural photographs and paintings leads to models that are far more robust than learning from natural photographs alone. Interestingly, we also find that style transfer does not capture the same invariances as paintings, and that paintings are unique among various artforms in enabling recognition models to learn useful invariances for natural image recognition. Finally, we conclude with a real world case study in using visual perception for autonomous navigation. Even the most accurate perception model will not be perfect, and it is important to account for failures when using these models as a building block in an autonomous sytem. We propose a planning pipeline that reasons about uncertainty in semantic segmentations to find a safe path in unknown environments. Given predictions of terrain and obstacles from a view of a scene, the autonomous agent determines which subsequent views of the environment to capture. By taking multiple views of the environment, the agent increases its confidence in successfully selecting a viable path across safe terrain to a goal location. Our results show that this pipeline allows safe paths to be planned when deployed in real world environments with noisy and inaccurate model predictions. These works represent meaningful steps towards tackling some key chal- lenges in building robust perception systems: acquiring high-quality and diverse training data, learning robust features for recognition, and reasoning about perception uncertainties within broader autonomous systems. However, the problem of building a robust perception system is multifaceted in its complexities and challenges, and many exciting research problems remain to be explored in future work. BIOGRAPHICAL SKETCH Hubert Lin was born in Illinois, USA in 1994, and grew up in Michigan, USA and Ontario, Canada. After completing an Honours Bachelor of Science in Physics and Computer Science at the University of Toronto in 2016, he has pursued a PhD in Computer Science at Cornell University. iii This page intentionally left blank. iv ACKNOWLEDGEMENTS My research was supported in part by NSERC (PGS-D 516803 2018), NSF (CHS-1617861, CHS-1513967, CHS-1900783, CHS-1930755), and PERISCOPE MURI (N00014-17-1-2699). I would like to thank my advisor Kavita Bala (KB) for her kind yet firm mentorship. The last several years have been an important period of growth for me, both as a researcher and as an individual. KB’s guidance has been invaluable to me in both of these aspects. I would also like to thank my committee members Bharath Hariharan and David Bindel for their gentle feedback as I approached important milestones during the PhD. All of my research have been the result of successful collaborations with peers and colleagues, and I thank them for their collaborative spirit, insightful discussions, and many hours spent hammering away at problems. I would like to thank Mitchell van Zuijlen, Maarten Wijntjes, and Sylvia Pont for their friendly presence and expertise over the years, all while wrestling with the challenges of meeting across timezones and interdisciplinary research. I spent many nights working with Yutao Han, Jacopo Banfi, and Hadi AlZayer on a variety of interesting robotics problems, and it was always just a little bit harder than we expected to get the robots to finally run. Balazs Kovacs and Paul Upchurch played pivotal roles in mentoring me throughout my early projects, and I thank them for their sharing experience as senior students when I joined the group and throughout my time here. I would like to thank the members of the group for their friendly attitudes and diverse research interests which made for interesting group meeting discussions: Scott Wehrwein, Paul Upchurch, Balazs Kovacs, Fujun Luan, Utkarsh Mall, Hadi AlZayer, Aaron Gokaslan, and many others who were here for a shorter while. I also enjoyed the quiet respectful work environment built by the students in the broader v graphics/vision lab. My friends and family have offered unconditional support over the years. They remind me to look at life beyond work, while also encouraging me to push towards my goals in research. I’m grateful for the friends that I have met here – thank you for all the shared meals, rock climbing sessions, board game nights, laughter, and moments of commiseration. I look forward to many more in the years to come. To my friends from home who have kept in close touch over many many years: you know who you are, and how important you are to me. Thank you for all the video calls, silly group messages, serious discussions, nights of gaming, and for reminding me of home even as we venture out on our own paths in different parts of the world. I would like to thank my partner, Nicole Wang, for her loving support, caring heart, and fun sense of humor. I am inspired by her resilience as a scientist and I cannot wait to see where she chooses to go as she approaches the completion of her PhD. I’d like to conclude by mentioning three important dogs in my life. First, my family’s two miniature schnauzers, Noble and Jake, were with me through many milestones in my life and saw me off to Ithaca for my first day at Cornell. I wish they both could have been here to celebrate this step with me. Finally, Nicole’s westie, Momo – a bundle of youthful energy who always gets into just enough trouble to keep us on our toes, and reminds me to enjoy the little things. vi CONTENTS Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv 1 Introduction 1 2 Efficient Image Annotation for Semantic Segmentation 9 2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Related Work and Background . . . . . . . . . . . . . . . . . . . . 13 2.4 Block Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4.1 Annotation Interface . . . . . . . . . . . . . . . . . . . . . . 15 2.4.2 Quality of Block Annotation . . . . . . . . . . . . . . . . . . 16 2.4.3 Viability of Real-World Block Annotation . . . . . . . . . . 20 2.4.4 Annotation Cost and Worker Feedback . . . . . . . . . . . 22 2.4.5 Block Selection . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.4.6 Compatibility with Existing Annotation Methods . . . . . 24 2.5 Segmentation Performance . . . . . . . . . . . . . . . . . . . . . . 25 2.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . 25 2.5.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.5.3 Weakly Supervised Segmentation Comparison . . . . . . . 29 2.6 Block-Inpainting Annotations . . . . . . . . . . . . . . . . . . . . . 32 2.6.1 Block-Inpainting Model . . . . . . . . . . . . . . . . . . . . 32 2.6.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.6.3 Ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.7 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . 38 3 Analyzing and Learning from Depictions of Materials in Paintings 40 3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.3 The Materials in Paintings (MIP) Dataset . . . . . . . . . . . . . . 43 3.4 Using Computer Vision to Analyze Paintings . . . . . . . . . . . . 44 3.4.1 Extracting Polygon Segments with Interactive Segmentation 45 3.4.2 Detecting Materials in Unlabeled Paintings . . . . . . . . . 46 3.5 Using Paintings to Build Better Recognition Systems . . . . . . . 50 3.5.1 Learning Robust Cues for Finegrained Fabric Classification 51 3.5.2 Benchmarking Unsupervised Domain Adaptation . . . . . 55 3.6 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . 58 vii 4 Learning Robust Natural Image Recognition from Paintings 61 4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.3 Related Work and Background . . . . . . . . . . . . . . . . . . . . 64 4.4 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.4.1 Evaluating Robustness . . . . . . . . . . . . . . . . . . . . . 66 4.4.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.4.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.5 Style Transfer as Data Augmentation . . . . . . . . . . . . . . . . . 69 4.5.1 Are Painting Style Images Necessary? . . . . . . . . . . . . 70 4.5.2 The Role of Style Diversity . . . . . . . . . . . . . . . . . . 71 4.6 Paintings as Perceptual Data Augmentation . . . . . . . . . . . . . 73 4.6.1 Learning Robust Natural Image Recognition From Paint- ings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.6.2 Paintings vs. Other Visual Artforms . . . . . . . . . . . . . 77 4.7 Do Stylized Images and Paintings Induce Similar Invariances? . . 78 4.7.1 Probing Learned Invariances . . . . . . . . . . . . . . . . . 80 4.7.2 The Role of High Frequency Signals . . . . . . . . . . . . . 83 4.8 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . 85 5 Uncertainty-Aware Planning with Semantic Scene Understanding 88 5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.3 Related Work and Background . . . . . . . . . . . . . . . . . . . . 92 5.4 Approach Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.5 Technical Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.5.1 Predicting Semantic Labels . . . . . . . . . . . . . . . . . . 98 5.5.2 Associating Semantic Labels to a Point Cloud . . . . . . . 99 5.5.3 RRT-Based Multi-hypothesis Planner . . . . . . . . . . . . 100 5.5.4 Next-Best-View (NBV) Planning . . . . . . . . . . . . . . . 102 5.6 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.6.1 Validation Scenes and Overview . . . . . . . . . . . . . . . 104 5.6.2 Multipath Planner Evaluation . . . . . . . . . . . . . . . . . 105 5.6.3 NBV evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.6.4 Path Safety Evaluation . . . . . . . . . . . . . . . . . . . . . 108 5.7 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . 110 6 Conclusion 112 A Appendix: Efficient Image Annotation for Semantic Segmentation 115 A.1 Deeplabv3+ and Mobilenetv2 . . . . . . . . . . . . . . . . . . . . . 115 A.1.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 A.1.2 Training Procedure . . . . . . . . . . . . . . . . . . . . . . . 116 A.1.3 Evaluation Procedure . . . . . . . . . . . . . . . . . . . . . 117 viii A.2 Block-Inpainting Model . . . . . . . . . . . . . . . . . . . . . . . . 117 A.2.1 Architectural Modifications . . . . . . . . . . . . . . . . . . 118 A.2.2 Training Details . . . . . . . . . . . . . . . . . . . . . . . . . 118 A.2.3 Inference Details . . . . . . . . . . . . . . . . . . . . . . . . 118 A.3 Additional Visualizations . . . . . . . . . . . . . . . . . . . . . . . 119 A.3.1 Crowdsourced Annotations (SUNCG/CGIntrinsics) . . . 119 A.3.2 Crowdsourced Annotations (Cityscapes) . . . . . . . . . . 120 A.3.3 Block-Inpainted Labels . . . . . . . . . . . . . . . . . . . . 120 B Appendix: Learning Robust Natural Image Recognition from Paint- ings 125 B.1 Materials Dataset Details . . . . . . . . . . . . . . . . . . . . . . . 125 B.2 Classification Parameters . . . . . . . . . . . . . . . . . . . . . . . 127 B.3 Style Transfer Parameters . . . . . . . . . . . . . . . . . . . . . . . 128 B.4 Visualizations of Stylized Photos . . . . . . . . . . . . . . . . . . . 129 B.5 Style Distance vs Robustness . . . . . . . . . . . . . . . . . . . . . 129 B.6 Biases of Stylization Algorithms . . . . . . . . . . . . . . . . . . . 130 B.7 Power Spectra of Different Image Types . . . . . . . . . . . . . . . 132 B.8 Domain-Invariant Feature Learning . . . . . . . . . . . . . . . . . 132 B.9 Additional Architectures . . . . . . . . . . . . . . . . . . . . . . . . 135 C Appendix: Uncertainty-Aware Planning with Semantic Scene Under- standing 141 C.1 Outdoor Navigation Segmentation Dataset . . . . . . . . . . . . . 141 C.2 Network Architecture, Training, and Inference . . . . . . . . . . . 147 Bibliography 148 ix LIST OF TABLES 2.1 Block vs Full Annotation. Average cost and error statistics per image. Error is measured via IoU against ground truth segments in the synthetic SUNCG/CGIntrinsics dataset. . . . . . . . . . . . 18 2.2 Real-world cost of annotation. Cost evaluated on Cityscapes. Each block is annotated by MTurk workers. Full-image is anno- tated by experts in [28]. Note: [28] annotates instance segments. See table 2.1 for crowd-to-crowd comparison. . . . . . . . . . . . 22 2.3 Block annotation worker feedback. Free-form responses are ag- gregated over SUNCG and Cityscapes experiments, and collected at most once per worker. All 24 sentiments across all 19 worker responses are summarized. . . . . . . . . . . . . . . . . . . . . . . 23 2.4 Semantic segmentation performance when trained on all im- ages. Training with block annotations uses fewer annotated pixels than full annotation but achieves equivalent performance. . . . 28 2.5 Weakly-supervised segmentation performance. Evaluated on Pascal VOC 2012 validation set. Original table from [95]. Blocks (N%) indicates N% of image pixels (N pseudo-checkerboard blocks) are labelled. . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.6 Weakly-supervised segmentation performance given equal an- notation time. For time comparison of scribbles against other methods, please refer to [95]. . . . . . . . . . . . . . . . . . . . . . 30 2.7 Block-inpainting with different types of hints. “Every other pixel” annotations represents an ideal sampling of pixels to an- notate as hints – this sampling strategy is infeasible in practice. Relative performance of inpainting by utilizing hints with respect to “every other pixel”-hints is shown. Checkerboard sampling with fully annotated blocks outperform no hints, random blocks (with only semantic boundaries annotated within blocks), and random blocks (with fully annotated blocks). . . . . . . . . . . . 38 3.1 Segmentation Performance. Grabcut Extr is based on [123] with small modifications: (a) minimum cost boundary is computed with the negative log probability of a pixel belonging to an edge; (b) in addition to clamping the morphological skeleton, the ex- treme points centroid and extreme points are clamped; (c) GC is computed directly on the RGB image. DEXTR [110] is pretrained on Pascal-SBD and COCO. Note that Pascal-SBD and COCO are natural image datasets of objects, but DEXTR transfers surpris- ingly well across both visual domain (paintings vs. photos) and annotation categories (materials vs. objects). . . . . . . . . . . . . 47 x 3.2 Image-level Detection Accuracy. Bounding boxes are detected with FasterRCNN trained on paintings. Because the dataset is not exhaustively annotated spatially, image-level accuracy is reported instead of box precision and recall. Overall, images are tagged with the correct materials with high accuracy. . . . . . . . . . . . 49 3.3 Classifier Generalization. Classifiers are trained to distinguish cotton/wool from silk/satin. One classifier is trained on pho- tographs and another classifier is trained on paintings. Both classifiers perform similarly well on images of the same type they were trained on, but the classifier trained on paintings performs better on photographs than vice versa. This suggests that the features learned from paintings are more generalizable for this task on this set of data. . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.4 Human Agreement with Classifier Cues. On average, humans prefer the cues used by the painting-trained classifier to make its predictions over the cues used by the photo-trained classi- fier. Interestingly, the human judgements also indicate that the painting-trained classifier uses cues that are just as good to the cues used by the photo-trained classifier for silk/satin photos despite never seeing a silk/satin photo during training (column 2). A pictorial representation of the results is given in Fig. 3.6. . 55 3.5 Effect of Dataset Size. UDA from photo (source) to painting (tar- get) and painting (source) to photo (target). Source-only refers to a reference baseline where no adaptation is used. The gap be- tween source-only and UDA decreases as data samples increases from 1K images per class to 6K images per class. Furthermore, in contrast to behavior found on existing benchmark datasets, the class-conditional method of CDD does not necessarily outper- form the class-agnostic counterpart MMD. . . . . . . . . . . . . . 58 3.6 Effect of Class Label Estimation. Reducing the reliance class label estimation improves class-conditional UDA when label esti- mation for target data is poor. MMD does not require class label estimation, and so its performance is relatively good here. Due to poor label estimation, we find that IntraCDD (which considers only intraclass discrepancy) outperforms CDD (which considers both intraclass and interclass discrepancy) as IntraCDD relies less on accurately estimated class labels. (green) Assuming per- fect class label estimation using ground truth (GT) labels, CDD recovers performance gains over intraCDD and MMD. . . . . . . 59 4.1 Robustness from Different Artforms. Paintings improve model robustness while more abstract artforms can reduce robustness. (+)/(−) indicate whether an artform improves/reduces model robustness. ± indicates standard deviation over 3 runs. . . . . . 79 xi 4.2 Per-Corruption Accuracy. (blue) SACL generally outperforms both AdaIN and paintings, particularly on noise. (red) Paintings can outperform AdaIN on some corruptions with a large dataset (Materials), but underperform when fewer images are available (PACS). See main text for discussion. ± indicates standard devia- tion over 3 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.3 Learning from Stylization and Paintings. Training with both stylized images and paintings improves average robustness to image corruptions and out-of-distribution photos, indicating that the invariances learned from these images are complementary. ± indicates standard deviation over 3 runs. . . . . . . . . . . . . . 83 4.4 Robustness without High Frequency Signals. “LF” denotes fil- tered low frequency images. Photos are always unfiltered. Filter- ing invisible high frequency components mainly impacts noise robustness. (blue) Filtering stylized photos significantly reduces noise robustness while (red) filtering paintings has a relatively smaller effect. ± indicates standard deviations over 3 runs. . . . 85 5.1 Mean and standard deviation of the number of paths found over 300 trials. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.2 500 trials of path safety evaluation. The columns are the path planning methods used: B1 is the planner based on [79], B2 in- cludes semantic reasoning without any next-best-views (NBVs), and XN is DeepSemanticHPPC (ours) with X NBVs. The rows are the metrics: Safe is the number of trials where the final se- lected path is safe, Unsafe is the number of trials where the final selected path is unsafe (lower is better), CS is the number of trials where a safe path is confirmed with sufficiently high confidence prior to selection, CN is the number of trials where all multipaths are confirmed as unsafe (so no paths are selected). . . . . . . . . 109 B.1 Training datasets are sampled to be as class-balanced as possible. ** indicates that all training samples of that category are included in the training set, and no further samples exist. Natural-10K is a subset of Natural-60K. The test set contains 200 samples of each category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 B.2 Style (Gram Matrix) Distance. Gram matrices computed from ImageNet pretrained ResNet18 features on PACS. Mean distance between (image, stylized image) pairs is reported. ↑ dis- tance implies ↑ style difference. ± denotes standard deviation across 1.5K pairs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 xii B.3 Effect of Stylization Biases. Per-corruption accuracy for mod- els trained on photos plus photos stylized by themselves. Self- stylization reveals stylization biases in arbitrary style transfer models. Notice that the robustness of models differs between different style transfer methods when self-stylization is applied. 131 B.4 Effect of Domain-Invariant Features. “DA” refers to feature learning with an adversarial domain discriminator loss [46]. Learning domain-invariant features (red) reduces robustness rel- ative to unrestricted feature learning from paintings (blue), but still improves robustness over photo-only. . . . . . . . . . . . . . 134 B.5 Per-Corruption Accuracy (Additional Architectures). Trends across different architectures are generally consistent. For ex- ample, SACL (blue) greatly outperforms AdaIN and paintings (red) for noise robustness. ± indicates standard deviation over 3 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 xiii LIST OF FIGURES 2.1 Block Annotation Overview. (a) Sub-image block annotations are more effective to gather than full-image annotations (b) Train- ing on sparse block annotations enables semantic segmentation performance equivalent to full-image annotations (c) Block labels can be inpainted with high-quality labels. . . . . . . . . . . . . . . 10 2.2 Block Annotation UI. Annotators are given one highlighted block to annotate with the remainder of the image as context. . . 16 2.3 SUNCG/CGIntrinsics annotation. (a) Ground truth. (b) Block annotation (zoomed-in) (c) Full annotation (zoomed-in). White dotted box highlights an example where block annotation quali- tatively outperforms full annotation. More in appendix. . . . . . 17 2.4 Annotation error rate for block and full annotation for differently sized ground truth segments. Lower is better. . . . . . . . . . . . 17 2.5 Annotation error rate for block and full annotation. Each point represents one image. The same set of images are both block annotated and full-image annotated. The stars represent the centroid (median). Cost/time include estimated cost/time to assign labels for each segment [10]. Lower-left is better. With block annotation, workers (a) choose to work for lower wages and (b) segment more regions for less pay per region. The overall quality is higher for block annotation. . . . . . . . . . . . . . . . . 18 2.6 Crowdsourced vs expert segments. Crowdsourced block- annotated segments are compared to expert Cityscapes segments. Crowdsourced segments are colored for easier comparison. Top- left is a high-quality example. See appendix for more. . . . . . . . 21 2.7 Semantic segmentation performance. Training images are anno- tated with different pixel budgets. Pseudo-checkerboard block annotation outperforms checkerboard and full annotation. . . . . 28 2.8 Block-inpainted labels. Example of human labels vs human Block-50% + inpainted labels. Void labels are masked out. . . . . 35 2.9 Block-Inpainting Model uncertainty versus human pixel-wise agreement for inpainted labels. Curves for different pixel budgets shown for comparison. . . . . . . . . . . . . . . . . . . . . . . . . 36 2.10 Block-Inpainting Model uncertainty versus pixel coverage for human checkerboard + automatic labels. x-axis truncated at 0.05 on left. Curves for different pixel budgets shown for comparison. 37 3.1 Year Distribution of Paintings in Dataset. Each bin equals 20 years. There are peaks in the paintings in the 1700s and 1900s. The former corresponds to the European golden ages; it is less clear what explains the latter peak. . . . . . . . . . . . . . . . . . . 43 xiv 3.2 Examples of Annotated Bounding Boxes. Left to Right: Liquid, Fabric, Ceramic, Metal, and Food. . . . . . . . . . . . . . . . . . . 44 3.3 Extreme Click Segmentations. Left to right: Original Image, Ground Truth Segment, Grabcut Extr Segment, DEXTR COCO Segment. Both Grabcut and DEXTR use extreme points as input. For evaluation, the extreme points are generated synthetically from the ground truth segments. In practice, extreme clicks can be crowdsourced. Bottom-right corner shows the IOU for each segmentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4 Detected materials in Unlabeled Paintings. Automatically de- tecting materials can be useful for content retrieval and for filter- ing online galleries by viewer interests. . . . . . . . . . . . . . . . 49 3.5 Classifier Cues. Left to Right: Original Image, Masked Image (Painting Classifier), and Masked Image (Photo Classifier). The unmasked regions represent evidence used by the classifiers for predicting “silk/satin” in this particular image. . . . . . . . . . . 53 3.6 Human Agreement with Classifier Cues. Pictorial representa- tion of user study results from Table 3.4. The y-axis represents how often humans prefer the cues from a classifier trained on the same domain as the test images. It is clear that humans prefer the painting classifier for paintings more than they prefer the photo classifier for photos. Interestingly, the painting and photo classifiers are equally preferred for silk/satin photos despite the painting classifier never seeing a photo during training (bar 2). . 55 4.1 What invariances are learned from real and fake paintings? Left: Natural photographs (black), paintings (magenta), and styl- ized photographs (olive/red/blue) from the Materials dataset (Section 4.4.2), Right: Relative robustness to various types of trans- formations for models trained with different sets of images with respect to a model trained on only natural photos. Stylization algorithms can transform photographs into painting-like images, but it is not clear that models will learn the same invariances from these images. This chapter explores a series of hypotheses to un- derstand the different ways in which style transfer and paintings improve model robustness. . . . . . . . . . . . . . . . . . . . . . . 63 4.2 Image Corruptions. Top-Left to Bottom-Right: Noise(×3), Blur(×4), Weather(×4), Digital(×4). . . . . . . . . . . . . . . . . . 67 xv 4.3 Stylization: Painting vs Photo Styles. Left: PACS, Right: Ma- terials. In general, intradomain stylization (red/green/yellow) improves robustness over no stylization (blue). Further, when sufficient data is available (Materials), intradomain stylization (dashed lines) results in similar robustness gains to conventional painting stylization (solid lines). This means that paintings are not uniquely responsible for robustness gains from stylization. . 71 4.4 Stylization: Unrestricted vs Intraclass Styles. Left: PACS, Right: Materials. Across both datasets, restricting style images to the class as content images (dashed lines) results in smaller robust- ness gains compared to unrestricted stylization (solid lines). This reduction in robustness is explained by the reduction in diversity between content images and style images. . . . . . . . . . . . . . 73 4.5 Learning from Paintings. Left: Clean Accuracy, Right: Corrup- tion Accuracy. Domain-specific classifiers (green) result in the highest robustness while also improving clean accuracy. “LR normalized” refers to fixed effective learning rates to account for additional gradients from the extra classifier head. Even without accounting for domain shifts, training with paintings improves robustness (red/yellow). Results are on Materials. . . . . . . . . 77 4.6 Trade-off Between Photos and Paintings. Left: PACS, Right: Ma- terials. For a fixed annotation budget, learning from both photos and paintings (25%/50% paintings) results in higher robustness than photos alone (0% paintings), with a bit over 1 of the total 3 number of data samples required to be annotated to match the maximal robustness achieved by only photos. . . . . . . . . . . . 77 4.7 Out-of-Distribution Accuracy. Left: PACS, Right: Materials. Training with paintings (red) improves robustness to out-of- distribution photos while training with stylized photos (pur- ple/yellow) hurts robustness. Paintings can improve invariance to viewpoints and lighting by encouraging models to focus on objects / materials of interest over background context. Styliza- tion encourages overfitting, an effect which can be exacerbated with more training samples. . . . . . . . . . . . . . . . . . . . . . 82 4.8 Reducing High-Frequency Signals. Top: Original Image, Bot- tom: Low Frequency Image. Columns 1 and 3 are stylized photos; columns 2 and 4 are artist-created paintings. Reducing the mag- nitude of sufficiently high frequency components from images does not alter perceptual quality of images. At a glance, the top and bottom images are perceived to be identical. . . . . . . . . . 85 xvi 5.1 The DeepSemanticHPPC pipeline. (1) Given an initial view and scene geometry, a multi-hypothesis graph of possible paths is generated. (2) The uncertainty in the scene is iteratively reduced by selecting next-best-views and path costs are updated. (3) This iterative uncertainty-reduction stage is terminated early if a safe path is confirmed or all considered paths are confirmed as unsafe. (4) Finally, a path is selected. . . . . . . . . . . . . . . . . . . . . . 91 5.2 An example point cloud. (a) Image view of a portion of the envi- ronment. (b) Point cloud colored with the most likely class pre- dicted from image (a) (bright green: “grass”; dark green: “tree”; purple: “sidewalk”; dark grey: “road”; light grey: no information available). All the classes except “tree” belong to the set S. The re- gion around the tree is actually mulch/woodchips, which should be classified as “dirt” (belonging to U ). (c) Point cloud colored to show safe (white), unsafe (black), and unclear regionsR (random colors). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.3 An example graph G = (V,A), with poses. Vertices with c(v) = 0 are in green, while vertices with 0 < c(v) ≤ 1 are in red (the darker, the closer to 1). The blue, green, and red axes indicate the pose of the robot. . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.4 (a) Image from Cass Park in Ithaca. There are multiple different terrains in the scene including grass, mud, and water. The point cloud is annotated with safe (blue) and unsafe (red) regions. (b) Image from Mann Library in Cornell with similarly annotated safe/unsafe regions. . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.5 Boxplots of the number of paths found at different number of iterations of the RRT over 300 trials are shown. The outliers are shown as black circles. . . . . . . . . . . . . . . . . . . . . . . . . 106 5.6 Change in uncertainty of path vertices (y-axis) as the number of NBV measurements increase (x-axis). . . . . . . . . . . . . . . . . 108 5.7 500 trials of path safety evaluation. B1 is the planner based on [79], B2 includes semantic reasoning without any next-best-views (NBVs), and XN is DeepSemanticHPPC (ours) with X NBVs. Green represents trials where a safe path is selected; Red repre- sents trials where an unsafe path is selected; and Blue represents trials where all paths are determined to be unsafe (so no paths are selected). More green and blue is better. . . . . . . . . . . . . 109 xvii A.1 SUNCG/CGIntrinsics Annotation Samples. Top to bottom: (Row 1) Crowdsourced blocks (boundaries). (Row 2) Crowdsourced blocks (synthetic labels). (Row 3) Crowdsourced full (boundaries). (Row 4) Crowdsourced full (synthetic labels). (Row 5) Ground truth. NOTE: Synthetic labels are the majority ground truth label for pixels in each segment. This means finely segmented crowd- sourced segments (such as cushions on couches) will be lost in visualization. White dotted boxes highlight examples where block annotation qualitatively outperforms full annotation. . . . . . . . 121 A.2 Cityscapes Annotation Samples. Top to bottom: (Row 1) Crowd- sourced (boundaries). (Row 2) Crowdsourced (randomly col- ored). (Row 3) Crowdsourced (synthetic labels). (Row 4) Expert Cityscapes. NOTE: Synthetic labels are the majority expert label for pixels in each segment. This means finely segmented crowd- sourced segments (such as sky between leaves) will be lost in visualization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 A.3 Block-Inpainting Cityscapes Samples. Top to bottom: (Row 1) Original image. (Row 2) Human labels. (Row 3) Inpainted labels (all). (Row 4) Agreement (row 3 vs row 2). (Row 5) Inpainted labels (<20% relative uncertainty). (Row 6) Agreement (row 5 vs row 2). Void labels and rejected inpainted labels are masked out. 123 A.4 Block-Inpainting ADE20K Samples. Top to bottom: (Row 1) Orig- inal image. (Row 2) Human labels. (Row 3) Inpainted labels (all). (Row 4) Agreement (row 3 vs row 2). (Row 5) Inpainted labels (<40% relative uncertainty). (Row 6) Agreement (row 5 vs row 2). Void labels and rejected inpainted labels are masked out. . . . . . 124 B.1 Arbitrary Stylization Biases. Left to Right: Original image, Im- age stylized by itself using AdaIN, ETNet, and TPFR. Ideally, an image stylized by itself should not change. Notice that style transfer introduces artifacts, shifts in color, and other biases. . . . 131 B.2 Power Spectrum of Images. Left: PACS, Right: Materials. The plots depict the mean power spectrum for different sets of images. Photos stylized by SACL have larger magnitude high frequency components than natural photos or natural paintings. . . . . . . 133 B.3 Stylized Photos (PACS) (1/2). Intradomain refers to stylization with photos as style images instead of paintings as style images. SACL is a learned style transfer method that is applied with different models pretrained to transfer the style of different artists. (Continued on next page) . . . . . . . . . . . . . . . . . . . . . . . . 137 B.3 Stylized Photos (PACS) (2/2). Intradomain refers to stylization with photos as style images instead of paintings as style images. SACL is a learned style transfer method that is applied with different models pretrained to transfer the style of different artists.138 xviii B.4 Stylized Photos (Materials) (1/2). Intradomain refers to styliza- tion with photos as style images instead of paintings as style images. SACL is a learned style transfer method that is applied with different models pretrained to transfer the style of different artists. (Continued on next page) . . . . . . . . . . . . . . . . . . . . 139 B.4 Stylized Photos (Materials) (2/2). Intradomain refers to styliza- tion with photos as style images instead of paintings as style images. SACL is a learned style transfer method that is applied with different models pretrained to transfer the style of different artists. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 xix CHAPTER 1 INTRODUCTION Humans perceive the world around them through many senses: touch, taste, sound, smell, and sight. Each of these senses serve to shape our understanding of the world, and for many of us, visual perception plays a uniquely important role in our lives. From a young age, we learn about the world from observing how people interact, how objects move, and how light bounces off different surfaces in a room. With experience, we learn to scan our environment for important visual cues that tell us where objects might move, which surfaces are safe to walk on, and what obstacles we need to be aware of. Through visual media, whether in the form of photographs, videos, artwork, or written text, we are able to communicate new ideas, find entertainment, and understand different cultures. Can a machine meaningfully perceive the visual world? The field of computer vision aims to understand how we can build computer models that can under- stand visual information. A decade ago, deep learning unveiled its potential in computer vision – by “learning” with a series of interconnected artificial neurons, the seminal AlexNet model boasted a 37% reduction in error rate relative to the best performing method at the time (Table 2 of [78]) on a challenging image recognition task. At the time of writing this dissertation, the original AlexNet pa- per [78] has accumulated over 98000 citations (To the reader: how many citations does AlexNet have now?). Deep learning quickly took over as the paradigm for building high-performing computer vision systems. Today, these systems are widely used, playing significant roles in common consumer products such as cell-phones and social media, and in cutting edge developments in self-driving, 1 medical diagnostics, and drug discovery. Training a deep neural network amounts to optimizing the hundreds of thou- sands to billions parameters that define its artificial neurons; this optimization requires a large amount of data to meaningfully fit the model. Historically, deep learning models have been developed on – and benchmarked against – curated datasets, allowing for controlled scientific comparisons between advancements in new deep learning techniques. On the now-famous ImageNet-1K dataset, computer vision models have long surpassed human performance [59] – a super- human feat (literally!). Of course, the story is not that simple. Early work revealed a weakness of deep models to seemingly innocuous patterns of imperceptible noise – noise so weak in magnitude that human observers could not identify the existence of these patterns in manipulated images. Yet, these patterns would cause deep networks to confidently misidentify the content of manipulated images [156], such as classifying an image of a “bus” as an “ostrich”. This brittleness of deep learning to such adversarial examples gives evidence that the superhuman nature of deep networks hinges on carefully controlled laboratory conditions provided by benchmark datasets. Even without the presence adversarial manipulations designed to break deep learning models, computer vision systems can still struggle to perform well across data drawn different sources. Scientists have studied the ability of models to generalize to different visual domains [168, 193] – can a model trained on photographs learn to recognize objects in paintings?; or, can a model that has only seen images of one city recognize attributes of scenes in another city? In many cases, deep computer vision models are unable to transfer its performance between these different settings. 2 These failures arise from distribution shift between the training data and test environment [136, 182]. Many characterizations of distribution shift have been studied, with a popular assumption being covariate shift in which the distribution of image features are different between training and testing. Suppose a classifier is trained to distinguish cats from dogs. However, imagine all of the training images of cats are of orange cats and all of the training images of dogs are of black dogs. The classifier may place a very high weight on the color of the animal to minimize its loss during training, but this will cause it to perform quite poorly in reality where it may be exposed to cats or dogs of different colors. As deep learning becomes ever more integrated in our lives, and as it con- tinues to play a more significant role in multidisciplinary scientific research, it is important to consider the behavior of these models in different real-world environments. Failures of these models can lead to unintended consequences for human safety or societal fairness. Therefore, there is a clear need for robust perception systems – systems that perform well in different real-world envi- ronments, and hold their own against the naturally-ocurring distribution shifts between the data they are trained on and the environments they are applied in. This dissertation aims to address several facets of this problem. One way to mitigate distribution shift in the real world is to simply learn from more data. Unfortunately, the world is a big place; and the amount of images produced today is never-ending. Proprietary industry datasets, like JFT300M [63, 154] and the Instagram hashtag dataset [105], contain labels for millions to billions of images, but these only represent a fraction of images in the world. What is the best way to take advantage of this data? Although methods for learning directly from unlabeled data are promising, the best performing 3 models are those trained with some amount of annotations [37]. How can we efficiently annotate data, which data samples should be annotated, and what kind of annotations should we strive for? Instead of directly modeling the image distribution by labeling a representive dataset, we can also focus on learning the most useful features for recognition. How do we encourage model to focus on useful features in images? A simple but very effective solution is enforcing robust model invariances via data augmentation. Data augmentations are transformations applied to images that preserve its semantic content [148]. For example, humans are able to recognize most objects in a mirror. Applying a left-right reflection to an image is a common form of data augmentation that corresponds to this. By learning from images and their flipped versions, models are encouraged to be invariant to such mirroring transformations when recognizing objects. Which invariances should be learned by our models? As mentioned in the opening, humans are able to understand the visual world quite well, and it could be beneficial for our computer vision systems to learn similar invariances to the human perception system. Note that data augmentation can also be viewed as expanding or smoothing the empirical distribution of the training dataset through semantically-invariant transformations, and so it is closely related to the data annotation approach discussed previously. Finally, even the best efforts in mitigating distribution shift will never be perfect, and models will inevitably make incorrect predictions. When visual perception models are used in a larger system, how can we account for model failures? This is especially critical for systems where safety is of concern, such as in autonomous navigation. 4 This dissertation explores some of the above questions in the following chap- ters. First, Chapter 2 describes a new data annotation method to efficiently su- pervise perception models [96]. A direct way to overcome distribution shift is to as closely mimic the real world distribution of (image, label) pairs as possible in the datasets we use. However, labeling data is very expensive and time-consuming. For tasks like semantic segmentation, which aims to assign labels to every pixel, annotators must spend minutes to hours to label just a single image [28, 192]. Efficient data labeling can be achieved by simplifying the annotation task through task size reduction; focusing on providing partially- complete, yet informative, labels; or better (semi-automated) labeling tools. We propose to divide the traditional per-pixel annotation task for an entire image into per-pixel annotation for image subregions. This task is more palatable for annotators and leads to an increase in label quality for a lower cost. Furthermore, we find that only annotating a small number of image subregions is sufficiently informative for models to outperform conventional and recent weak annotation methods with any fixed budget. This work was published in ICCV 2019 [96]. Next, Chapter 3 explores paintings as a medium for studying perception systems, and their utility as an alternative source of data from natural images for training models [98]. The image distribution of paintings are different from the distribution of natural photographs, but paintings often depict meaningful objects and scenery that parallel those found in the real world. The human visual system is robust and can readily understand the depictions found in paintings. In recent work, we created a new large-scale dataset of depictions of materials in paintings spanning more than 500 years [162]. Here, we present some studies of 5 how common computer vision models perform when applied to the recognition of such depicted materials, like wood or fabric. Although common computer vision models are developed for natural photographs, our results show that they may be used out of the box or quickly finetuned with labeled data for material understanding in paintings. However, adapting models without labels is difficult, and we found that standard domain adaptation algorithms may reduce performance when applied to this data. Furthermore, an interesting attribute of paintings is that they encode quirks of human visual system through their ability to convey realism without necessarily conforming to physical reality [19, 107]. This suggsts that paintings may emphasize features or invariances that are useful for robust human-like perception. For finegrained fabric classification, we find evidence that models trained on paintings focused on cues that were both more interpretable and generalizable. This work was published in the International Workshop on Fine Art Pattern Extraction and Recognition, ICPR 2020 [98], and builds upon work that was eventually published in PLOS One 2021 [162]. Chapter 4 extends our previous findings by systematically exploring the in- variances encoded in paintings for natural image recognition [97]. We study how models behave when trained with real paintings and style transfer. Style trans- fer is a type of data augmentation that promises to create painting-like images from natural photographs. We train models on natural photographs, paintings, and/or stylized images, and evaluate them on test data representing real-world distribution shifts. Perception models must overcome such shifts to successfully understand different environments – for example, the test photographs may contain noise, or be drawn from another dataset which was sampled from differ- ent viewpoints than the training images. Our results show that learning from a combination of natural photographs and paintings leads to models that are 6 far more robust than learning from natural photographs alone. Interestingly, we also find that style transfer does not capture the same invariances as paintings, and that paintings are unique among various artforms in enabling recognition models to learn useful invariances for natural image recognition. This work was published in CVPR 2021 [97]. Finally, Chapter 5 presents a real world case study in using visual perception for autonomous navigation [55]. Even the most accurate perception model will not be perfect, and it is important to account for failures when using these models as a building block in an autonomous sytem. We propose a planning pipeline that reasons about uncertainty in semantic segmentations to find a safe path in unknown environments. Given predictions of terrain and obstacles from a view of a scene, the autonomous agent determines which subsequent views of the environment to capture. By taking multiple views of the environment, the agent increases its confidence in successfully selecting a viable path across safe terrain to a goal location. Our results show that this pipeline allows safe paths to be planned when deployed in real world environments with noisy and inaccurate model predictions. This work was published in ICRA 2020 [55]. This dissertation touches upon several challenges of building robust percep- tion systems: better data annotation pipelines for acquiring high-quality and diverse training data, learning robust features for recognition, and reasoning about perception uncertainties when using these visual recognition models are used in autonomous systems. The directions we explore in this work represent only a small fraction of the number of exciting research directions towards solv- ing each of these challenges. In Chapter 6, we conclude with a brief discussion of fruitful directions for future work and relevant lines of concurrent research. 7 These include data labeling and sample selection, learning from partially-labeled or unlabeled data, building test benchmarks that represent diverse image dis- tributions, learning robust features with data augmentation or different model architectures, and adapting models to failures on the fly. 8 CHAPTER 2 EFFICIENT IMAGE ANNOTATION FOR SEMANTIC SEGMENTATION 2.1 Overview Semantic segmentation models are tasked with predicting labels for every pixel in an image. Image datasets with high-quality pixel-level annotations are valu- able for learning this task, and datasets with fully-annotated images in which every pixel in an image are labeled ensure that rare classes and small objects are modeled during training. However, full-image annotations are expensive, with experts spending minutes to hours per image. As such, better annotation pipelines are needed. Cost-efficient annotation should satisfy one or more of the following properties: (a) it produces labels that contain highly informative signals for model learning (so that fewer annotations are required to train a strong perception model), (b) the task should be easy or fun to perform (so that the annotation task can be outsourced to crowdworkers without demanding high skill prequisites), and/or (c) the task should be fast to perform (so that more labels can be acquired in a shorter time period). We propose a novel annotation method which produces informative, high quality labels while also being easy to perform. This chapter presents a novel block sub-image annotation as a replacement for conventional full-image annotation. In other words, an image is sub-divided into smaller regions, and a worker is tasked with annotating such regions from various images rather than entire images at once. Despite the attention cost of frequent task switching, we find that block annotations can be crowdsourced at higher quality compared to full-image annotation with equal monetary cost 9 using existing annotation tools developed for full-image annotation. Surprisingly, we find that 50% pixels annotated with blocks allows semantic segmentation to achieve equivalent performance to 100% pixels annotated. This is because sub-image blocks are still densely annotated with valuable semantic boundary information, which is crucial for learning high performing semantic segmentation models. In fact, as little as 12% of pixels annotated allows performance as high as 98% of the performance with dense annotation. In weakly-supervised settings, block annotation outperforms existing methods by 3-4% (absolute) given equivalent annotation time. To recover the necessary global structure for applications such as characterizing spatial context and affordance relationships, we propose an effective method to inpaint block-annotated images with high- quality labels without additional human effort. As such, fewer annotations can also be used for these applications compared to full-image annotation. The work in this chapter was published in ICCV 2019 as “Block Annota- tion: Better Image Annotation with Sub-Image Decomposition” [96] with Paul Upchurch and Kavita Bala. 2.2 Introduction Figure 2.1: Block Annotation Overview. (a) Sub-image block annotations are more effective to gather than full-image annotations (b) Training on sparse block annotations enables semantic segmentation performance equivalent to full-image annotations (c) Block labels can be inpainted with high-quality labels. 10 Recent large-scale computer vision datasets place a heavy emphasis on high- quality fully dense annotations (in which over 90% of the pixels are labelled) for hundreds of thousands of images. Dense annotations are valuable for both semantic segmentation and applications beyond segmentation such as char- acterizing spatial context and affordance relationships [17, 58]. The long-tail distribution of classes means it is difficult to gather annotations for rare classes, especially if these classes are difficult to segment. Annotating every pixel in an image ensures that pixels corresponding to rare classes or small objects are labelled. Dense annotations also capture pixels that form the boundary between classes. For applications such as understanding spatial context between classes or affordance relationships, dense annotations are required for principled conclu- sions to be drawn. In the past, polygon annotation tools have enabled partially dense annotations (in which small semantic regions are densely annotated) to be crowdsourced at scale with public crowd workers. These tools paved the way for the cost-effective creation of large-scale partially dense datasets such as [10, 99]. Despite the success of these annotation tools, fully dense datasets have relied extensively on expensive expert annotators [186, 28, 116, 192, 119] and private crowdworkers [17]. We propose annotation of small blocks of pixels as a stand-in replacement for full-image annotation (figure 2.1). We find that these annotations can be effectively gathered by crowdworkers, and that annotation of a sparse number of blocks per image can train a high performance segmentation network. We further show these sparsely annotated images can be extended automatically to full-image annotations. We show block annotation has: 11 • Wide applicability. (Section 2.4) Block annotations can be effectively crowd- sourced at higher quality compared to full annotation. It is easy to implement and works with existing advances in image annotation. • Cost-efficient Design. (Section 2.4) Block annotation reflects a cost-efficient design paradigm (while current research focuses on reducing annotation time). This is reminiscent of gamification and citizen science where enjoyable tasks lead to low-cost high-engagement work. • Complex Region Annotation. (Section 2.4) Block annotation shifts focus from labeling regions of specific semantic categories (e.g., “label all dogs in this image”) to spatial regions (e.g., “label all regions in this part of the image”). When annotating categorical regions, workers segment simple objects before complex objects. With spatial regions, informative complex regions are forced to be annotated. • Weakly-Supervised Performance. (Section 2.5) Block annotation is compet- itive in weakly-supervised settings, outperforming existing methods by 3-4% (absolute) given equivalent annotation time. • Scalable Performance. (Section 2.5) Full-supervision performance is achieved by annotating 50% of blocks per image. Thus, blocks can be annotated until desired performance is achieved, in contrast to methods such as scribbles. • Scalable Structure. (Section 2.6) Block-annotated images can be effectively inpainted with high quality labels without additional human effort. 12 2.3 Related Work and Background In this section we review recent works on pixel-level annotation in three areas: human annotation, and human-machine annotation, and dense segmentation with weak supervision. Human Annotation. Manual labeling of every pixel is impractical for large- scale datasets. A successful method is to have crowdsource workers segment polygonal regions to click on boundaries. Employing crowdsource workers offers its own set of challenges with quality control and task design [173, 10, 160]. Although large-scale public crowdsourcing can be successful [99] recent bench- mark datasets have resorted to in-house expert annotators [28, 118]. Annotation time can be reduced through improvements such as autopan, zoom [10] and shared polygon boundaries [186]. Polygon segmentation can be augmented by painted labels on superpixel groups [17] and Bezier curves [186]. Pixel-level labels for images can also be obtained by (1) constructing a 3D scene from an image collection, (2) grouping and labeling 3D shapes and (3) propagating shape labels to image pixels [111]. In our work, we investigate sub-image polygon annotation, which can be further combined with other methods (sec. 2.4.) Human-Machine Annotation. Complex boundaries are time-consuming to trace manually. In these cases the cost of pixel-level annotation can be reduced by automating a portion of the task. Matting and object selection [139, 88, 89, 8, 180, 179, 16, 84, 181] generate tight boundaries from loosely annotated boundaries or few inside/outside clicks and scribbles. [124, 110] introduced a predictive method which automatically infers a foreground mask from 4 boundary clicks, and was extended to full-image segmentation in [3]. The number of boundary 13 clicks was further reduced to as few as one by [1]. Predictive methods require an additional human verification step since the machine mutates the original human annotation. The additional step can be avoided with an online method. However, online methods (e.g., [1, 84, 3]) have higher requirements since the algorithm must be translated into the web browser setting and the worker’s machine must be powerful enough to run the algorithm1. Alternatively, automatic proposals can be generated for humans to manipulate: [6] generates segments, [5] generates a set of matting layers, [188] generates superpixel labels, and [134] generates boundary fragments. In our work, we show that human-annotated blocks can be extended automatically into dense annotations (sec. 2.6), and we discuss how other human-machine methods can be used with blocks (sec. 2.4.6). Weakly-Supervised Dense Segmentation. There are alternatives to training with high-quality densely annotated images which substitute quantity for label quality and/or richness. Previous works have used low-quality pixel-level an- notations [195], bounding boxes [125, 75, 137], point-clicks [9], scribbles [9, 95], image-level class labels [125, 147, 4], image-level text descriptions [64] and unla- beled related web videos [64] to train semantic segmentation networks. Combin- ing weak annotations with small amounts of high-quality dense annotation is another strategy for reducing cost [11, 66]. [145] proposes a two-stage approach where image-level class labels are automatically converted into pixel-level masks which are used to train a semantic segmentation network. We find a small num- ber of sub-image block annotations is a competitive form of weak supervision (sec. 2.5.3). 1Offloading online methods onto a cloud service offers a different landscape of higher costs (upfront development and ongoing operation costs). 14 2.4 Block Annotation Sub-image block annotation is composed of three stages: (1) Given an image I , select a small spatial region I ′; (2) Annotate I ′ with pixel-level labels; (3) Repeat (with different I ′) until I is sufficiently annotated. In this chapter, we explore the case where I ′ is rectangular, and focus on the use of existing pixel-level annotation tools. Can block annotations be gathered as effectively as full-image annotations with existing tools? In section 2.4.1, we show our annotation interface. In section 2.4.2, we explore the quality of block annotation versus full-image annotation. In section 2.4.3, we examine block annotation for a real-world dataset. In section 2.4.4, we discuss the cost of block annotation and show worker feedback. In section 2.4.5, we discuss how blocks for annotation can be selected in practice. Finally, in section 2.4.6 we discuss the compatibility of block annotation with existing annotation methods. 2.4.1 Annotation Interface Our block annotation interface is given in figure 2.2 and implemented with existing tools [10]. For full image annotation, the highlighted block covers the entire image. Studies are deployed on Amazon Mechanical Turk. 15 (a) Highlighted block. (b) Finished block annotation. Figure 2.2: Block Annotation UI. Annotators are given one highlighted block to annotate with the remainder of the image as context. 2.4.2 Quality of Block Annotation We explore the quality of block annotations compared to full-image annotations on a synthetic dataset. How does the quality and cost compare between block and full annotations? We find that the average quality for block-annotated images is higher while the total monetary cost is about the same. The average quality of block annotations is consistently higher, including in small regions (figure 2.3 shows an example). The overall block annotation error is 12% lower than the error from full annotation; for regions smaller than 0.5% of the image, the block annotation error is 6% lower. Figure 2.4 presents the annotation error for regions where the ground truth segments are within some range of sizes proportional to the full image size (480×640 px). Block annotations consistently have a lower error rate in these ranges, indicating that block annotation is advantageous regardless of the scale of segments. For completeness, the rightmost bars show the error rate for segments of all sizes. In figure 2.5, the cost and quality of block versus full image annotation is shown. Remarkably, we find that workers are willing to work on block annotation tasks for a significantly lower hourly wage. This indicates that block annotation is 16 (a) (b) (c) Figure 2.3: SUNCG/CGIntrinsics annotation. (a) Ground truth. (b) Block anno- tation (zoomed-in) (c) Full annotation (zoomed-in). White dotted box highlights an example where block annotation qualitatively outperforms full annotation. More in appendix. Figure 2.4: Annotation error rate for block and full annotation for differently sized ground truth segments. Lower is better. more intrinsically palatable for crowdworkers, in line with [67] which shows task design can influence quality of work. Moreover, workers are more likely to over-segment objects with respect to ground truth (e.g. individual cushions on a couch, handles on cabinets) with block annotation tasks. Note that block boundaries may also divide semantic regions. Table 2.1 contains additional statistics. Despite similar costs to annotate an image in blocks or in full, we show 17 in section 2.5 that competitive performance is achieved with less than half of the blocks annotated per image. (a) (b) Figure 2.5: Annotation error rate for block and full annotation. Each point represents one image. The same set of images are both block annotated and full-image annotated. The stars represent the centroid (median). Cost/time include estimated cost/time to assign labels for each segment [10]. Lower-left is better. With block annotation, workers (a) choose to work for lower wages and (b) segment more regions for less pay per region. The overall quality is higher for block annotation. . Block Full Error 0.253 0.286 Error (small regions) 0.636 0.677 $ / hr $1.40 / hr $3.12 / hr Total cost $2.00 $2.05 Total cost (median) $1.99 $2.23 # segments 95.68 38.95 $ / segment $0.0215 $0.0595 Table 2.1: Block vs Full Annotation. Average cost and error statistics per im- age. Error is measured via IoU against ground truth segments in the synthetic SUNCG/CGIntrinsics dataset. Study Details. For these experiments, we chose to use a synthetic dataset. While human annotations may contain mistakes, synthetic datasets are gen- erated with known ground truth labels with which annotation error can be 18 computed. The CGIntrinsics dataset [94] contains physically-based renderings of indoor scenes from the SUNCG dataset [151, 190]. We use the more realistic CGIntrinsics renderings and the known semantic labels from SUNCG. The labels are categorized according to the NYU40[52] semantic categories. Due to the nature of indoor scenes, the depth and field of view of each image is smaller than outdoor datasets. The reduced complexity means that crowdworkers are able to produce good full-image annotations for this dataset. We select MTurk workers who are skilled at both full-image annotation and block annotation in a pilot study (a standard quality control practice [10]). The final pool consists of 10 workers. Image difficulty is estimated by counting the maximum number of ground truth segments in a fixed-size sliding window. Windows, mirrors, and void regions are masked out in the images so that workers do not expend effort on visible content for which ground truth labels do not exist (such as objects seen through a window or mirror). We manually cull images that include transparent glass tables which are not visible in the renderings, or doorways through which visible content can be seen but no ground truth labels exist. After filtering, twenty of the one hundred most difficult images are selected. We choose a block size so that an average of 3.5 segments are in each block. This results in 16 blocks per image. For each task, a highlighted rectangle outlines the block to be annotated. We find that workers will annotate up to the inner edge of the highlighted boundary. Therefore, we ensure the edges of the rectangle do not overlap with the region to be annotated. Workers are paid $0.062 per block annotation task and $0.96 per full-image task. Bonuses up to 1.5 times the base pay are awarded to attempt to raise the effective hourly wage for difficult tasks to $4 / hr. Our results show that 2$ refers to USD in throughout this chapter. 19 workers are willing to work on block annotation tasks beyond the time threshold for bonuses, effectively producing work for an hourly wage significantly lower than the intended $4 / hr. On the other hand, workers do not often exhibit this behavior with full annotation tasks. Different workers may work on different blocks belonging to the same image. We use two forms of quality control: (1) annotations must contain a number of segments greater than 25% of the known number of ground-truth segments for that task and (2) annotations cannot be submitted until at least 10 seconds / 3 minutes (block / full) have passed. All submissions satsifying these conditions are accepted during the user study. For an overview of QA methods, please refer to [135]. Labels are assigned by majority ground-truth voting, with cost estimated from [10]. To evaluate the quality of annotations in an image with K classes, we measure the class-balanced error rate (class-balanced Jaccard distance): K 1 ∑ (FPc + FNc) error rate = (2.1) K (TPc + FP + FN )c=1 c c = 1−mIOU 2.4.3 Viability of Real-World Block Annotation How does block annotation fare with a real-world non-synthetic dataset? To study the viability of block annotating real-world datasets with scalable crowd- sourcing, we ask crowdworkers to annotate blocks from images in Cityscapes [28]. We choose Cityscapes for the annotation complexity of its scenes – 1.5 hours of expert annotation effort is required per image. In contrast, other datasets such as [116, 17] require less than 20 minutes of annotation effort per image. 20 We expect crowd work to be worse than expert work, so it is a surprisingly positive result that the quality of the crowdsourced segments are visually com- parable to the expert Cityscapes segments (figure 2.6). Some crowdsourced segments are very high quality. segments (20% have fewer segments, and 33% have the same # of segments). A summary of the cost is given in table 2.2 which compares public crowdworkers to trained experts. It is feasible block annotation time will decrease with expert training. Given 100 uniformly sized blocks per image, we ask an expert to create equal-quality block and full annotations; we find one block is 1.56% of the effort of a full image. Figure 2.6: Crowdsourced vs expert segments. Crowdsourced block-annotated segments are compared to expert Cityscapes segments. Crowdsourced segments are colored for easier comparison. Top-left is a high-quality example. See ap- pendix for more. Study Details. We searched for workers who produce high-quality work in a pilot study and found a set of 7 workers. These workers were found within a hundred pilot HITs (for a total cost of $4). We approved all of their submissions during the user study. We do not restrict workers from annotating outside of the 21 Block (Crowd) Full (Expert [28, 119]) $ / Task $0.13 - Time / Task 2 min 1.5 hr Table 2.2: Real-world cost of annotation. Cost evaluated on Cityscapes. Each block is annotated by MTurk workers. Full-image is annotated by experts in [28]. Note: [28] annotates instance segments. See table 2.1 for crowd-to-crowd comparison. block, and we do not force workers to densely annotate the block. We do not include the use of sentinels or tutorials as in [10]. Thirteen randomly selected validation images from Cityscapes are annotated by crowdworkers. Each image is divided into 100 uniformly shaped blocks. A total of 650 (50 per image) are annotated in random order. Workers are paid $0.06 per task. Workers are automatically awarded bonuses so that the effective hourly wage at least $5 for each block, with bonuses capped at $0.24 to prevent abuse. For one block, the total base payout is $0.06 with an average of $0.0636 in bonuses over 93 seconds of active work. On average, each annotated block contains 3.5 segments. Assigning class labels will cost an additional $0.01 and 26 seconds [10]. To be consistent with Cityscapes, we instruct workers to not segment windows, powerlines, or small regions of sky between leaves. However, workers will occasionally choose to do so and submit higher quality segments than required. 2.4.4 Annotation Cost and Worker Feedback Our costs (tables 2.1, 2.2) are aligned with existing large-scale studies. Large-scale datasets [10, 99] show that cultivating good workers produces high quality data at low cost. Table 2 of [56] reports a median wage of $1.77/hr to $2.11/hr; the 22 median MTurk wage in India is $1.43/hr [57]. For “image transcription”, the median wage is $1.13/hr over 150K tasks. Workers gave overwhelmingly positive feedback for block annotation (table 2.3), and we found that some workers would reserve hundreds of block annotation tasks at once. Only 3 out of the 57 workers who successfully completed at least one pilot or user study task requested higher pay; all other given feedback was positive or neutral. In contrast, our pilot studies showed that workers are unwilling to accept full-image annotation tasks if the payment is reduced to match the wage of block annotation. We conjecture that task enjoyment leads to long term high-quality output (c.f. [67]). “Nice” “Good” “Fun”“Happy” “Easy” “Okay” Release Increase “Great” More HITs Pay # 8 5 4 2 2 3 Table 2.3: Block annotation worker feedback. Free-form responses are aggre- gated over SUNCG and Cityscapes experiments, and collected at most once per worker. All 24 sentiments across all 19 worker responses are summarized. 2.4.5 Block Selection Our experiments show that workers are comfortable annotating between 3 to 6 segments per block. Therefore, block size can be selected by picking a size such that the average number of segments per block falls in this range. For a novel dataset, this can be done fully labelling several samples and producing an estimate from the fully labelled samples. Without priors on spatial distri- bution of rare classes or difficult samples within an image, a checkerboard or pseudo-checkerboard pattern of blocks focuses attention (across different tasks) 23 uniformly across the image. Far apart pixels within an image are less correlated than neighboring pixels. Therefore, it is good to sample blocks that are spread out to encourage pixel diversity within images. 2.4.6 Compatibility with Existing Annotation Methods Block annotation is compatible with many annotation tools and innovations besides polygon boundary annotation. Point-clicks and Scribbles. Annotations such as point clicks or scribbles are faster to acquire than polygons, which leads to a larger and more varied dataset at the same cost. Combining this with blocks will further increase annotation variety due to the diversity that come from annotating a few blocks in many images over annotating fewer number of images fully. Additionally, [9, 11] show that the most cost-effective method for semantic segmentation is a combination of densely annotated images and a large number of point clicks. The densely annotated images can be replaced by polygon block annotations since they also contain class boundary supervision for the segmentation network. Superpixels. Superpixel annotations enable workers to mark a group of visually-related pixels at once [17]. This can reduce the annotation time for background regions and objects with complex boundaries. Superpixel annota- tion can be easily deployed to our block annotation setting. Polygon Boundary Sharing. Boundary sharing reuses existing boundaries so that workers do not need to trace each boundary twice [186]. This approach can be easily deployed in our block annotation setting. 24 Curves. Bezier tools allow workers to quickly annotate curves [186]. It can be easily deployed in our block annotation setting but it may be less effective on long curves since each part of the curve must be fit separately. Interactive Segmentation. Recent advances in interactive segmentation (e.g., [1, 110, 3]) utilize neural networks to convert sparse human inputs into high quality segments. For novel domains without large-scale training data, block- annotated images can act as cost-efficient seed data to train these models. Once trained, these methods can be applied directly to each block, although further analysis should be conducted to explore the efficiency of such an approach due to block boundaries splitting semantic regions. 2.5 Segmentation Performance How well do block annotations serve as training data for semantic segmentation? In section 2.5.1, the experimental setup is summarized. In section 2.5.2, we evaluate the effectiveness of block annotations for semantic segmentation. In section 2.5.3, we compare block annotation with existing weakly supervised segmentation methods. 2.5.1 Experimental Setup Pixel Budget. We vary the “pixel budget” in our experiments to explore seg- mentation performance across a range available annotated pixels. “Pixel budget” refers to the % of pixels annotated across the training dataset, which can be 25 controlled by varying the number of annotated images, the number annotated blocks per image, and the size of blocks per image. Our block sizes are fixed in our experiments. Block Size. We divide images into a 10-by-10 grid for our experiments. Block Selection. We experiment with two block selection strategies: (a) Checkerboard annotation and (b) Pseudo-checkerboard annotation. Checker- board annotation means that every other block in a variable number of images are annotated. Pseudo-checkerboard annotation means that every N blocks are anno- # pixels in dataset tated in every image, where N is . For example, with a pixel budget pixel budget equivalent to 25% of the dataset, every fourth block is annotated for the entire dataset. At pixel budget 50%, checkerboard and pseudo-checkerboard are identical. For the remainder of the chapter, “Block-X%” refers to pseudo-checkerboard annotation in which X% of the blocks per image are annotated. Sementation Model. We use DeepLabv3+[24] initialized with the official pre- trained checkpoint (pretrained on ImageNet [33] + MSCOCO [99] + Pascal VOC [39]). The network is trained for a fixed number of epochs. See appendix for additional details. Datasets. Cityscapes is a dataset with ground truth annotations for 19 classes with 2975 training images and 500 validation images. ADE20K contains ground truth annotations for 150 classes with 20210 training images and 2000 validation images. These datasets are chosen for their high quality dense ground truth annotations and for their differences in number of images / classes and types of 26 scenes represented. The block annotations are synthetically generated from the existing annotations. 2.5.2 Evaluation Blocks vs Full Image. How does block annotation compare to full-image an- notation for semantic segmentation? We plot the mIOU achieved when trained on a set of annotations against pixel budget in figure 2.7. For both Cityscapes and ADE20K, block annotation significantly outperforms full-image annotation. The performance gap widens as the pixel budget is de- creased – at pixel budget 12%, the reduction in error from full annotation to block annotation is 13% on Cityscapes (10% for ADE20K). For any pixel bud- get, pseudo-checkerboard block annotation annotates fewer pixels per image which means more images are annotated. Therefore, our results indicate that the quantity of annotated images is more valuable than the quantity of annota- tions per image. The pseudo-checkerboard block selection pattern consistently outperforms the checkerboard block selection pattern and full annotation. Number of Blocks vs Optimal Performance. How many blocks need to be annotated for segmentation performance to approach the performance achieved by training on full-image annotations for the entire dataset? In table 2.4, we show results when the network is trained on the full dataset compared to pseudo- checkerboard blocks. Remarkably, we find that checkerboard blocks with 50% pixel budget allow the network to achieve similar performance to the full dataset with 100% pixel budget, 27 Figure 2.7: Semantic segmentation performance. Training images are annotated with different pixel budgets. Pseudo-checkerboard block annotation outperforms checkerboard and full annotation. Optimal (Full) Block-50% Block-12% Cityscapes 77.7% 77.7% 74.6% ADE20K 37.4% 37.2% 36.1% Table 2.4: Semantic segmentation performance when trained on all images. Training with block annotations uses fewer annotated pixels than full annotation but achieves equivalent performance. indicating that at least 50% of the pixels in Cityscapes and ADE20K are redundant for learning semantic segmentation. Furthermore, with only 12% of the pixels in the dataset annotated, relative error in segmentation performance is within 12%/2% of the optimal for Cityscapes/ADE20K. These results suggest that fewer 28 than 50% of the blocks in an image need to be annotated for training semantic segmentation, reducing the cost of annotation reported in section 2.4. Block Locations vs Performance. How does the location of blocks sampled within an image affect semantic segmentation performance? In this experiment, we train a model with either (a) blocks sampled in a checkerboard pattern or (b) uniformly randomly sampled from the same grid. In both cases, 50% of the blocks are labelled in each image. The network trained on randomly sampled blocks achieves 77.1% mIOU while the network trained on checkerboard blocks achieves 77.7% mIOU. The increase in performance in checkerboard annotations over randomly sampled blocks is due to pixel diversity (pixels far apart in an image are less correlated than neighboring pixels). This is similar to the effect observed in the first ex- periment which shows that pixel diversity due to image diversity increases performance. These observations are aligned with our expectations in section 2.4.5. In this experiment, we used all images from Cityscapes and created: (a) random block annotations (sample 50% of blocks for each image) and (b) checker- board block annotations. To ensure that results are not due to sampling bias, we create split (a) three times (i.e. sample 50% of blocks per image three times) and average the results over the three splits. 2.5.3 Weakly Supervised Segmentation Comparison Block annotation can be considered a form of weakly supervised annotation where a small number of pixels in an image are labelled. Representative works in this area include [95, 9, 126, 125, 30]. Table 3 of [95] is replicated here (table 2.5) for 29 reference, and extended with our results. All existing results show performance with a VGG-16 based model. We train a MobileNet based model which has been shown to achieve similar performance to VGG-16 (71.8% vs 71.5% Top-1 accuracy on ImageNet) while requiring fewer computational resources [65, 141]. Our fully-supervised implementation pretrained on ImageNet achieves 69.6% mIOU on Pascal VOC 2012 [39]; in comparison, the reference DeepLab-VGG16 model achieves 68.7% mIOU [23] and the re-implementation in [95] achieves 68.5% mIOU. Method Annotations mIOU (%) MIL-FCN [126] Image-level 25.1 WSSL [125] Image-level 38.2 point sup. [9] Point 46.1 ScribbleSup [95] Point 51.6 WSSL [125] Box 60.6 BoxSup [30] Box 62.0 ScribbleSup [95] Scribble 63.1 Ours: Block-1% Pixel-level Block 61.2 Ours: Block-5% Pixel-level Block 67.6 Ours: Block-12% Pixel-level Block 68.4 Full Supervision Pixel-level Image 69.6 Table 2.5: Weakly-supervised segmentation performance. Evaluated on Pascal VOC 2012 validation set. Original table from [95]. Blocks (N%) indicates N% of image pixels (N pseudo-checkerboard blocks) are labelled. Cityscapes Ours: Block Coarse Full Supervision(7 min) (7 min [28]) (90 min [28]) mIOU (%) 72.1 68.8 77.7 Pascal Ours: Block Scribbles Full Supervision(25 sec) (25 sec [95]) (4 min [116]) mIOU (%) 67.2 63.1 [95] 69.6 Table 2.6: Weakly-supervised segmentation performance given equal annota- tion time. For time comparison of scribbles against other methods, please refer to [95]. 30 Performance Comparison. With only 1% of the pixels annotated, block anno- tation achieves comparable performance to existing weak supervision methods. Based on our results in section 2.4.2, the cost of annotation for 1% of pixels with blocks will be 100× less than the cost of full-image annotation. Increasing the budget to 5%-12% significantly increases performance. With 12% of pixels annotated with blocks, the segmentation performance (error) is within 98% (4%) of segmentation performance (error) with 100% of pixels annotated. Note that block annotations can be directly transformed into gold-standard fully dense annotations by simply gathering more block annotations within an image. This is not feasible with other annotations such as point clicks, scribbles, and bounding boxes. Furthermore, in section 2.6, we demonstrate a method to transform block annotations into dense annotations without any additional human effort. Equal Annotation Time Comparison. Given equal annotation time, block an- notation significantly outperforms coarse and scribble annotations by ∼3-4% mIOU (table 2.6). On Pascal, 97% of full-supervision mIOU is achieved with 1/10 annotation time. We convert annotation time to number of annotated blocks as follows. Block annotation may use up to 2.2× the time of full-image annotation. Given an image divided into 100 blocks, an annotation time of T leads to T 0.022F blocks annotated, where F is the full-image annotation time. 31 2.6 Block-Inpainting Annotations Although block annotations are useful for learning semantic segmentation, the full structure of images is required for many applications. Understanding the spatial context or affordance relationships [17, 58] between classes relies on un- derstanding the role of each pixel in an image. Shape-based retrieval, object counting [87], or co-occurrence relationships [115] also depend on a global un- derstanding of the image. The naive approach to recover pixel-level labels is to use automatic segmentation to predict labels. However, this does not leverage existing annotations to improve the quality of predicted labels. In section 2.6.1, we propose a method to inpaint block-annotated images by using annotated blocks as context. In section 2.6.2, we examine the quality of these inpainted annotations. 2.6.1 Block-Inpainting Model The goal of the block-inpainting model is to inpaint labels for unannotated blocks given the labels for annotated blocks in an image. For full implementation details, please refer to the appendix. Architecture. The block-inpainting model is based on DeepLabv3+. The input layer is modified so that the RGB image, I ∈ Rh×w×3, is concatenated with multichannel “hint” (ala [189]) of 1-0 class labels W ∈ Rh×w×K where K is the number of classes. At inference time, the hint contains known labels for the annotated blocks of an image which serve as context for the inpainting task. Hidden layers are augmented with dropout which will be used to control quality 32 by estimating epistemic uncertainty [44, 45]. Estimating Uncertainty. Inpainting fills all missing regions without consid- ering the trade off between quantity and quality. Existing datasets have high- quality annotations for 92-94% of pixels [192, 17]. Therefore, we modify our network to produce uncertainty estimates which allow us to explicitly control this trade off. The uncertainty of predictions is correlated with incorrect pre- dictions [86, 81]. Uncertainty is computed by activating dropout at inference time. The predictions are averaged over the g trials giving us U ∈ Rh×w, a matrix of uncertainty estimates per image. We take the sample standard deviation corresponding to the predicted class for each pixel to be the uncertainty. For each pixel (i, j), the mean softmax vector over g trials is: ∑g p(i,j)(y|I,W ) µ(i,j) = t=1 (2.2) g where p(y|I,W ) ∈ RK is the softmax output of the network. The corresponding uncertainty vector is: √√√√ √∑g (p(i,j)(y|I,W )− µ(i,j))2 U′(i,j) = t=1 (2.3) g − 1 Thus, the uncertainty for each pixel (i, j) is: U (i,j) = U ′(i,j), where (i,j)m m = arg maxµk (2.4) k Training. Block annotations serve both as hints and targets. This means that no additional data (or human annotation effort) is required to train the block- inpainting model. For our experiments, we use (synthetically generated) Block- 50% annotations. For each image, half of annotated blocks are randomly selected 33 online at training time to be hints. All of the annotated blocks are used as targets. This encourages the network to “copy-paste” hints in the final output while leveraging the hints as context to inpaint labels for regions where hints are not provided. 2.6.2 Evaluation Quality of Inpainted Labels. How good are inpainted labels? We compare labels produced by the block-inpainting network with low U (i,j) against the known human labels in Cityscapes and ADE20K. The block-inpainting model produces labels whose human-agreement is competitive with that achieved by human annotators. We inpaint Block-50% annotations in this experiment. At a relative uncertainty threshold of 0.2 on Cityscapes (0.4 on ADE20K), over 94% of the pixels are labelled – recall that existing datasets have over 92-94% [192, 17] pixels annotated. For Cityscapes, the mean pixel agreement is 99.8% and the class-balanced error rate is 3.1%, while for ADE20K, the mean pixel agreement is 98.7% and the class-balanced error rate is 28%. Recall that ADE20K features significantly more semantic class categories than Cityscapes. These machine- human label agreements are extremely good. Previous work show that human- human label agreement across annotators is 66.8% to 73.6% while annotator self-agreement is 82.4% to 97.0% [192, 17]. Human annotators fail to agree in non- trivial fashion – [192] shows that annotator self-agreement fails in three ways: variations in complex boundaries (32%), incorrect naming of ambiguous classes (34%), and failure to segment small objects (34%). In figure 2.8, a visualization of labels generated by the block-inpainting model is shown. The number of pixel disagreements decreases with a higher uncertainty threshold. 34 (a) Full human labels (b) Original image (c) Inpainted labels (all) (d) Label agreement (white) (e) Inpainted labels (<20% relative uncer- (f) Label agreement (white) tainty) Figure 2.8: Block-inpainted labels. Example of human labels vs human Block- 50% + inpainted labels. Void labels are masked out. Block Inpainting vs Automatic Segmentation. Consider a scenario in which a small number of pixels in a dataset are annotated, and the remainder are auto- matically labelled to produce dense annotations. Why should block inpainting be used instead of automatic segmentation? Full pixel-level labels produced by block inpainting are superior to automatic segmentation. On Cityscapes, automatic segmentation achieves 78% validation mIOU while block inpainting Block-50% annotations achieves 92% validation mIOU. With Block-12% annotations, au- tomatic segmentation achieves 75% validation mIOU while block inpainting achieves 82% validation mIOU. 35 Figure 2.9: Block-Inpainting Model uncertainty versus human pixel-wise agree- ment for inpainted labels. Curves for different pixel budgets shown for compari- son. 2.6.3 Ablations Effect of Uncertainty Threshold vs Pixel Agreement. How does the uncer- tainty threshold affect pixel agreement for block-inpainting? In figure 2.9, we show the mean pixel agreement with human labels for varying thresholds. Lower uncertainty threshold for rejection results in higher pixel agreement. The pixel agreement with lower pixel budgets are shown for comparison. The pixel budget is the number of block-annotated pixels in the dataset with which the block- inpainting model is trained. All experiments use checkerboard block hints. Effect of Uncertainty Threshold vs Pixel Coverage. How does the uncertainty threshold affect pixel coverage for block-inpainting? In figure 2.10, we show the mean pixel coverage for varying uncertainty thresholds as a fraction of maximum uncertainty. Lower uncertainty threshold for rejection results in lower pixel coverage. The pixel coverage with lower pixel budgets are shown for comparison. The pixel budget is the number of human block-annotated pixels in the dataset with which the block-inpainting model is trained. All experiments use checkerboard block hints. 36 Figure 2.10: Block-Inpainting Model uncertainty versus pixel coverage for hu- man checkerboard + automatic labels. x-axis truncated at 0.05 on left. Curves for different pixel budgets shown for comparison. Block Selection vs Block-Inpainting Quality. How does the checkerboard pattern compare to other block selection strategies as hints to the block-inpainting model? Intuitively, it is easier to infer labels for pixels that are close to pixels with known labels than for pixels that are further away. Consider a scenario in which every other pixel in an image is annotated. Reasonably good labels for the unannotated pixels can be inferred with a simple nearest-neighbors algorithm. In practice, it is impossible to precisely annotate single pixels in an image. However, we can approximate the same properties of labelling every other pixel by labelling every other block instead (i.e., a checkerboard pattern). In table 2.7, we show the block-inpainting model mIOU when different types of hints are given. The rightmost column (“every other pixel”) is not feasible to collect in practice. Checkerboard annotations outperform random block annotations even though the network is trained to expect random block hints. Providing only boundary annotations within each block (i.e. annotating pixels within 10 pixels of each boundary in each block) allows the network to achieve nearly the same performance as full block hints. This suggests that the most informative pixels for the block-inpainting model are those near a 37 None Random Random Checker Every Other(Semantic Boundaries) (Full) (Full) Pixel Rel. mIOU 0.77 0.90 0.92 0.95 1.0 Table 2.7: Block-inpainting with different types of hints. “Every other pixel” annotations represents an ideal sampling of pixels to annotate as hints – this sampling strategy is infeasible in practice. Relative performance of inpainting by utilizing hints with respect to “every other pixel”-hints is shown. Checkerboard sampling with fully annotated blocks outperform no hints, random blocks (with only semantic boundaries annotated within blocks), and random blocks (with fully annotated blocks). boundary. 2.7 Discussion and Future Work An incredibly large number of images exist in the world – from social media to car dashcam reels to security camera footage to other publicy shared photogra- phy collections. It is important to efficiently annotate such unlabeled images with useful labels to train a strong computer vision model. In this chapter, we have introduced a new annotation method for efficiently assigning semantic segmenta- tion labels to the vast number of unlabeled images at our disposal. We proposed block annotation as a crowdworker-friendly replacement for traditional full- image annotation. For training semantic segmentation models, Block-12% offers strong performance at 1/8th of the monetary cost. Block-5% offers competitive weakly-supervised performance at equal annotation time to existing methods. For optimal semantic segmentation performance, or to recover global structure with inpainting, Block-50% can be utilized; as we found in our experiments, not every pixel in each image needs to be annotated for maximal performance to be achieved. 38 There are many directions for future work. Our crowdworker subimage an- notation tasks are similar to full-image annotation tasks, so it may be possible to improve the gains with more exploration and development of boundary marking algorithms similar to those designed for traditional full-image annotation. We have explored some block patterns and further exploration may reveal even bet- ter trade-offs between annotation quality, cost and image variety. Beyond block subimages, it can useful to focus on alternative shapes based on image content; however, non-rectangular annotation areas may pose additional challenges for workers who are familiar with conventional image annotation tasks. Another interesting direction is acquiring instance-level annotations by merging segments across block boundaries. Finally, active learning can be used to select blocks of rare classes, and workers can be assigned blocks so that annotation difficulty matches worker skill. Notes and Acknowledgements. I appreciate the efforts of the anonymous MTurk workers who participated in our user studies. 39 CHAPTER 3 ANALYZING AND LEARNING FROM DEPICTIONS OF MATERIALS IN PAINTINGS 3.1 Overview Computer vision models are often both trained on and applied to natural images. In this chapter, we examine the relationship between such visual recognition systems and the rich information available in paintings. Paintings are an inter- esting form of visual information – although they may depict objects and other meaningful content, paintings are not necessarily photorealistic and thus repre- sent a different image distribution than the distribution of natural photographs. Despite this, humans can still recognize the objects and materials found in paint- ings. Furthermore, paintings can encode invariances of the human visual system due to their ability to convey realism without necessarily conforming to physical reality [19, 107], and it may be beneficial to train our models on paintings to learn similar human-like invariances. Existing work has focused primarily on understanding objects in paintings; here, we explore depictions of materials found in paintings. Material perception is interesting as it relies on a complex combination of different cues: shape, glossi- ness, texture, color, and more [43]. Our experiments are based on a large-scale an- notated database of material depictions in paintings (Materials in Paintings [162]; https://materialsinpaintings.tudelft.nl), which was the result of a long-term collaboration with colleagues at TU Delft. First, we find that visual recognition systems designed for natural images can work surprisingly well on paintings. In particular, we find that interactive segmentation tools can be 40 used to cleanly annotate polygonal segments within paintings, a task which is time consuming to undertake by hand. We also find that FasterRCNN, a model which has been designed for object recognition in natural scenes, can be quickly repurposed for detection of materials in paintings. Second, we show that learn- ing from paintings can be beneficial for neural networks that are intended to be used on natural images. We find that training on paintings instead of natural images can improve the quality of learned features and we further find that a large number of paintings can be a valuable source of test data for evaluating domain adaptation algorithms. The work in this chapter was published in the International Workshop on Fine Art Pattern Extraction and Recognition, ICPR 2020 as “Insights From A Large-Scale Database of Material Depictions In Paintings” [98], and the Materials In Paintings dataset was published in PLOS One 2021 as “Materials In Paintings (MIP): An interdisciplinary dataset for perception, art history, and computer vision” [162], with Mitchell van Zuijlen, Maarten Wijntjes, Sylvia Pont, and Kavita Bala. 3.2 Introduction Deep learning has enabled the development of high performing recognition systems across a variety of image-based tasks [47, 54, 183]. These systems are often trained on natural photographs with applications in real world recognition like self-driving. Furthermore, applying recognition systems to large collections of images can also reveal cultural trends or give us insight into the visual patterns in the world (e.g. [113, 106, 100]). Human-created images, such as paintings, are 41 particularly interesting to analyze from this perspective. Artistic depictions can reveal insights into culturally relevant ideas throughout time, as well as insights into human visual perception through the realism depicted by skilled artists. Whereas most computer vision systems focusing on digital art history are concerned with object recognition (e.g., [29]), it is the depiction of space and materials that visually characterized the course of art history. The depiction of space has had considerable attention in scientific literature [122, 174, 73, 129] while recently the depiction of materials has gained scientific interest [13, 175, 130, 176]. Therefore, it is interesting to investigate the interplay between deep learning systems designed for natural image analysis and the rich visual information found in paintings, especially with respect to artistic depictions of materials. The remainder of this chapter is organized into three parts. In Section 3.3, we briefly describe the dataset that subsequent experiments are based on. In Section 3.4, we explore how deep learning systems that have primarily been developed for use on natural photographs can be used to analyze paintings. Specifically, we explore (a) segmentation and (b) detection of materials in paintings. Recognition of materials in paintings can be useful for digital art history as well as general public interest. In Section 3.5, we explore how paintings can be a useful source of data from which better recognition systems can be built. Specifically, we investigate (c) the generalizability and interpretability of classifiers trained on paintings, and we investigate (d) the role that a large-scale painting dataset can play in evaluating visual recognition models. 42 3.3 The Materials in Paintings (MIP) Dataset All experiments in this chapter utilize data from the Materials in Paintings dataset (https://materialsinpaintings.tudelft.nl), a large-scale annotated dataset of physical material depictions in paintings. This dataset was the result of a longterm collaboration with our collaborators at TU Delft: Mitchell van Zuijlen, Sylvia Pont, and Maarten Wijntjes. Extensive details and analysis of this database are available in a separate manuscript [162]. For context and complete- ness, we summarize a few relevant details here. The dataset consists of 19K high resolution paintings downloaded from the online collections of international art galleries, which span over 500 years of art history. The galleries with corre- sponding number of paintings are: The Rijksmuseum (4,672), The Metropolitan Museum of Art (3,222), Nationalmuseum (3,077), Cleveland Museum of Art (2,217), National Gallery of Art (2,132), Museo Nacional del Prado (2,032), The Art Institute of Chicago (936), Mauritshuis (638), and J. Paul Getty Museum (399). The distribution of paintings by year is shown in Fig. 3.1. The dataset includes crowdsourced extreme click [123] bounding box annotations over 15 material categories, which are further delineated into 50 finegrained categories. Fig. 3.2 shows a few examples of the annotated bounding boxes available in the dataset. Figure 3.1: Year Distribution of Paintings in Dataset. Each bin equals 20 years. There are peaks in the paintings in the 1700s and 1900s. The former corresponds to the European golden ages; it is less clear what explains the latter peak. 43 Figure 3.2: Examples of Annotated Bounding Boxes. Left to Right: Liquid, Fabric, Ceramic, Metal, and Food. 3.4 Using Computer Vision to Analyze Paintings Research in computer recognition systems have focused primarily on natural images. For example, semantic segmentation benchmarks (of objects, ‘stuff’, or materials) [80, 99, 40, 17, 10, 11] emphasize parsing in-the-wild photos, with applications in robotics, self-driving, and so forth. However, the analyses of paintings can also benefit from the use of visual recognition systems. Paintings can encode both cultural and perceptual biases, and being able to analyze paint- ings at scale can be useful for a variety of scientific disciplines including digital art history and human visual perception. In this section, we explore the effectiveness of interactive segmentation meth- ods (which can be used to select regions of interest in photographs for the purpose of image editing or data annotation) when applied to paintings. We also explore how well an object bounding box detector can be finetuned to detect materials depicted in unlabeled paintings, which could be used for content-based retrieval. 44 3.4.1 Extracting Polygon Segments with Interactive Segmenta- tion Polygon segmentation masks are useful for reasoning about boundary rela- tionships between different semantic regions of an image, as well as the shape of the regions themselves. However, annotating segmentations is expensive and many modern datasets rely on expensive manual annotation methods [10, 99, 192, 17, 28]. Recent work has focused on more cost effective annotation methods (e.g. [96, 110, 12, 101]). The use of interactive segmentation methods that transform sparse user inputs into polygon masks can ease annotation diffi- culty. For paintings, it is unclear whether these methods (especially deep learning methods trained on natural images) would perform well. Semantic boundaries in paintings likely have a different, and more varying structure than in photos. Paintings can have ambiguous or fuzzy boundaries between objects or materials [114] which can potentially be problematic for color-based methods. This can be due to variations in artistic style which can emphasize different aspects of depictions – for example, Van Gogh uses lines and edges to create texture, but such edges could potentially appear as boundaries to a segmentation model. In this experiment, we study the difficulty of segmenting paintings and whether innovations are necessary for existing methods to perform well. Experimental Setup We experiment with GrabCut [139] (an image-based approach) and DEXTR [110] (a modern deep learning approach). We evaluated the performance of these methods against 4.5k high-quality human annotated segmentations from [163]. 45 The inputs to these methods are generated from the extreme points of the regions we are interested in. We use a variant of the GrabCut initialization proposed in [123], as well as a rectangular initialization for reference. For DEXTR, we consider models pretrained on popular object datasets [99, 40] as a starting point. Results We found that both GrabCut and DEXTR perform quite well on paintings. Sur- prisingly, DEXTR transfers quite well to materials in paintings despite being trained only with natural photographs of objects. The performance of DEXTR can be further improved by finetuning on COCO with a smaller learning rate (10% of original learning rate for 1 epoch). Finetuning DEXTR on Grabcut segments or iteratively finetuning with output of DEXTR does not seem to yield further improvements. The performance is summarized in Table 3.1, and samples are visualized in Fig. 3.3. 3.4.2 Detecting Materials in Unlabeled Paintings To allow the public to view and interact with art collections, museums and galleries provide extensive online functionality to search and navigate through the collections. Currently, to our knowledge, no online collections allows online visitors to query the collection for depicted materials within painting, which can be of interest to the public. Furthermore, depiction of materials plays a crucial role in characterizing art history. Detecting materials within novel paintings will be particularly beneficial to digital art historians who study materials such as stone [7, 36] or skin [14, 85]. Having access to specific materials can also digital 46 Segmentation mean IOU (%) Grabcut Grabcut DEXTR DEXTR DEXTR Rectangle Extr Pascal-SBD COCO Finetuned 44.1 72.4 74.3 76.4 78.4 DEXTR Finetuned IOU By Class (%) Animal Ceramic Fabric Flora Food 76.9 86.8 79.1 77.0 87.5 Gem Glass Ground Liquid Metal 74.4 83.2 69.6 73.0 75.5 Paper Skin Sky Stone Wood 86.1 78.9 78.5 81.7 67.4 Table 3.1: Segmentation Performance. Grabcut Extr is based on [123] with small modifications: (a) minimum cost boundary is computed with the negative log probability of a pixel belonging to an edge; (b) in addition to clamping the morphological skeleton, the extreme points centroid and extreme points are clamped; (c) GC is computed directly on the RGB image. DEXTR [110] is pretrained on Pascal-SBD and COCO. Note that Pascal-SBD and COCO are natural image datasets of objects, but DEXTR transfers surprisingly well across both visual domain (paintings vs. photos) and annotation categories (materials vs. objects). art historians to compare these depictions directly with respect to painting style or technique. We experiment with automatic bounding box detection to ease access to material depictions in unlabeled collections. Experimental Setup We train a FasterRCNN [138] bounding box detector to localize and label material boxes with on 90% of annotated paintings in the dataset, and evaluate on the remaining 10%. Default COCO hyperparameters from [178] are used. Given the non-spatially-exhaustive nature of the annotations, many detected bounding boxes will not be matched against labeled ground truth boxes. However, the dataset is exhaustively annotated at an image level, and therefore, we report image-level accuracies. This can be interpreted as the accuracy of the model in 47 Figure 3.3: Extreme Click Segmentations. Left to right: Original Image, Ground Truth Segment, Grabcut Extr Segment, DEXTR COCO Segment. Both Grabcut and DEXTR use extreme points as input. For evaluation, the extreme points are generated synthetically from the ground truth segments. In practice, ex- treme clicks can be crowdsourced. Bottom-right corner shows the IOU for each segmentation. tagging each image with the types of materials present. The validity of each localized box can be further quantified through a user study, but we did not perform this study at this time. Results Table 3.2 shows the performance. We found that the FasterRCNN model is able to accurately detect materials in paintings by finetuning on the annotated bounding boxes directly without any changes to the network architecture or 48 training hyperparameters. It is certainly promising to see that an algorithm designed for object localization in natural images can be readily applied to material localization in paintings. A qualitative sample of detected bounding boxes is given in Fig. 3.4. To improve the spatial-specificity of the detected materials, it can be interesting to train an instance detector like MaskRCNN on segments extracted using methods discussed in the previous section. It would also be useful to combine material recognition with conventional object-based detection to extract complementary forms of information that improve the ability for users to filter data by their specific needs. Class Accuracy (%) (Mean = 83.3%) Animal Ceramic Fabric Flora Food 85.6 92.7 66.0 85.0 94.9 Gem Glass Ground Liquid Metal 88.4 91.3 86.5 86.4 70.7 Paper Skin Sky Stone Wood 92.4 70.2 89.4 74.8 74.9 Table 3.2: Image-level Detection Accuracy. Bounding boxes are detected with FasterRCNN trained on paintings. Because the dataset is not exhaustively an- notated spatially, image-level accuracy is reported instead of box precision and recall. Overall, images are tagged with the correct materials with high accuracy. Liquid Fabric Wood Stone Paper Figure 3.4: Detected materials in Unlabeled Paintings. Automatically detecting materials can be useful for content retrieval and for filtering online galleries by viewer interests. 49 3.5 Using Paintings to Build Better Recognition Systems In recent work for machine perception systems, art has been used in various ways. Models that learn to convert photographs into painting-like or sketch-like images have been studied extensively for their application as a tool for digital artists [71]. Recent work has shown that such neural style transfer algorithms can also produce images that are useful for training robust neural networks [49]. Artworks have also been used directly to evaluate the robustness of neural networks under “domain shifts” in which a model trained to recognize objects from photographs are shown artistic depictions of such objects instead [91, 127]. We use the MIP dataset of material depictions in paintings to explore two directions. First, we hypothesize that the perceptually focused depictions of artists can allow neural networks to learn better cues for classification. We find that learning from paintings can improve the interpretability of the cues it uses for its predictions. In a second experiment, we investigate the utility of the MIP dataset as a benchmark for computer vision models under domain shifts for material classification; this dataset also features more samples than is available than is typical in existing domain adaptation benchmark datasets. We find that existing domain adaptation algorithms can fail to behave as expected in this setting. 50 3.5.1 Learning Robust Cues for Finegrained Fabric Classifica- tion The task of distinguishing between images of different semantic content is a standard recognition task for computer vision systems. Increasing attention is being given to “fine-grained” classification where a model is tasked with distin- guishing images of the same broad category (e.g. distinguishing different species of birds or different types of flora [172, 164, 161]). Fine-grained classification is particularly challenging for deep learning systems. Such a task depends on rec- ognizing specific attributes for each finegrained class; in comparison, classifiers can perform well on coarse-grained classification by relying on context alone. We hypothesize that the painted depictions of materials can be beneficial for this task. Since some artistic depictions focus on salient cues for perception through perceptual shortcuts [108, 19, 35], it is possible that a network trained on such artwork is able to learn a more robust feature representation by focusing on these cues. Experimental Setup We experiment with the task classifying cotton/wool versus silk/satin. The latter can be recognized through local cues such as highlights on the cloth; such cues are carefully placed by artists in paintings. To understand whether artistic depictions of fabric allow a neural network to learn better features for classification, we train a model with either photographs or paintings. High resolution photographs of cotton/wool and silk/satin fabric and clothing (dresses, shirts) are downloaded and manually filtered from publicly available photos licensed under the Creative 51 Commons from Flickr. In total, we downloaded roughly 1K photos. We sample cotton/wool and silk/satin samples from our dataset to form a corresponding dataset of 1K paintings. Generalizability of Classifiers. Does training with paintings improve the gener- alizability of classifiers? To test cross-domain generalization, we test the classifier on types of images that it has not seen before. A classifier that has learned more robust features will perform better on this task than one that has learned to clas- sify images based on more spurious correlations. We test the trained classifiers on both photographs and paintings. Interpretability of Classifier Cues. Are the cues used by each classifier inter- pretable to humans? We produce evidence heatmaps with GradCAM [142] from the feature maps in the network before the fully connected classification layer. We extract high resolution feature maps from images of size 1024 × 1024 (for a feature map of size 32 × 32). The heatmaps produced by GradCAM show which regions of an image the classifier uses as evidence for a specific class. If a classifier has learned a good representation, the evidence that it uses should be more interpretable for humans. For both models, we compute heatmaps for test images corresponding to their ground truth label. We conduct a user study on Amazon Mechanical Turk to find which heatmaps are preferred by humans. Users are shown images with regions corresponding to heatmap values that are above 1.5 standard deviations above the mean. Fig 3.5 illustrates an example. Our user study resulted in responses from 85 participants, 57 of which were analyzed after quality control. For quality control, we only kept results from participants who spent over 1 second on average per task item. 52 Figure 3.5: Classifier Cues. Left to Right: Original Image, Masked Image (Paint- ing Classifier), and Masked Image (Photo Classifier). The unmasked regions represent evidence used by the classifiers for predicting “silk/satin” in this particular image. Results We find that the classifier trained with paintings exhibits better cross-domain generalization, and uses cues that humans prefer over the photo classifier. This suggests that paintings can improve the robustness of classifiers for this task of fabric classification. Generalizability of Classifiers. In Table 3.3, the performance of the two classifiers are summarized. We find that both classifiers perform similarly well on the domain they are trained on. However, when the classifiers are tested on cross- domain data, we find that the painting-trained classifier performs better than the photo-trained classifier. This suggests that the classifier trained on paintings has learned a more generalizable feature representation for this task. Interpretability of Classifier Cues. Overall, we find that the classifier trained on paintings uses evidence that is better aligned with evidence preferred by humans (Table 3.4 and Fig. 3.6). Due to domain shifts when applying classifiers to out- of-domain images, we would expect the cues selected by the painting classifier to be preferable on paintings, and the cues selected by the photo classifier to be 53 preferable on photos. Interestingly, this does not hold for photos of satin/silk (column 2 of Table 3.4) – we find that users equally prefer the evidence selected by the painting classifier to the evidence selected by the photo classifier. This suggests that either (a) the painting classifier has learned the “correct” human- interpretable cues for recognizing satin/silk, or (b) that the photo classifier has learned to classify satin/silk based on some spurious contextual signals. We asked users to elucidate their reasoning when choosing which set of cues they preferred. In general, users noted that they preferred the network which picks out regions containing the target class. Therefore, it seems that the network trained on paintings has learned better to distinguish fabric through the actual presence of such fabrics in the image over other contextual signals. Taken together, our results provide evidence that a classifier trained on paint- ings can be more robust than a classifier trained on photographs. It would be interesting to explore this further. A limitation of this study is the relatively small number of data samples, and very limited number of material types (two: cot- ton/wool and silk/satin) that we explored. Are there other materials or objects which deep neural networks can learn to recognize better from paintings than photographs? Photo→ Photo Painting→ Painting MEAN F1 Score 79.6% 80.5% Photo→ Painting Painting→ Photo MEAN F1 Score 49.5% 57.8% Table 3.3: Classifier Generalization. Classifiers are trained to distinguish cot- ton/wool from silk/satin. One classifier is trained on photographs and another classifier is trained on paintings. Both classifiers perform similarly well on im- ages of the same type they were trained on, but the classifier trained on paintings performs better on photographs than vice versa. This suggests that the features learned from paintings are more generalizable for this task on this set of data. 54 Cotton/Wool Silk/Satin Cotton/Wool Silk/Satin Photos Photos Paintings Paintings MEAN Photo Classif. Preferred 64.7 ± 3.5% 48.9 ± 3.1% 26.8 ± 2.5% 39.1 ± 2.1% 44.9 ± 1.9% Painting Classif. Preferred 35.3 ± 3.5% 51.1 ± 3.1% 73.2 ± 2.5% 60.9 ± 2.1% 55.1 ± 1.9% Table 3.4: Human Agreement with Classifier Cues. On average, humans prefer the cues used by the painting-trained classifier to make its predictions over the cues used by the photo-trained classifier. Interestingly, the human judgements also indicate that the painting-trained classifier uses cues that are just as good to the cues used by the photo-trained classifier for silk/satin photos despite never seeing a silk/satin photo during training (column 2). A pictorial representation of the results is given in Fig. 3.6. 100 80 60 40 20 wool/ silk/ wool/ silk/ cotton sattin cotton sattin 0 Photos Paintings Figure 3.6: Human Agreement with Classifier Cues. Pictorial representation of user study results from Table 3.4. The y-axis represents how often humans prefer the cues from a classifier trained on the same domain as the test images. It is clear that humans prefer the painting classifier for paintings more than they prefer the photo classifier for photos. Interestingly, the painting and photo classifiers are equally preferred for silk/satin photos despite the painting classifier never seeing a photo during training (bar 2). 3.5.2 Benchmarking Unsupervised Domain Adaptation In unsupervised domain adaptation (UDA), models are trained on a ‘source’ dataset with annotated labels as well as an unlabeled ‘target’ dataset. The goal is to train a model which performs well on unseen target dataset samples. Existing 55 Cues from own domain preferred domain adaptation benchmark datasets for classification focus primarily on object recognition and tend to be limited in number of data samples, with most class categories containing on the order of 1000 samples or fewer (for example, refer to Table 1 of [127]). In contrast, the dataset we use here has the unique properties of (a) focusing primarily on material classification and (b) containing on the order of 10-30K for 9 of the 15 annotated classes (e.g. fabric, wood), with the remainder in the range of 2K-5K (e.g. ground). This positions this data as a valuable addition for benchmarking for UDA algorithms. Experimental Setup For this study, we focus on a family of domain adaptation algorithms which aim to explicitly minimize feature discrepancy across the source and target do- mains. Existing work has shown that class-conditional UDA in which labels are estimated for target domain samples during training can be better than class- agnostic UDA where adaptation is performed without using any estimated label information at all. We choose CDD [72] and MMD [103, 159] as representative methods for class-conditional and class-agnostic discrepancy minimization. CDD estimates class labels for the target domain via clustering; for details, refer to the original paper. All methods are trained with default settings from publicly available source code for CDD, which includes the use of domain batch normal- ization [21]. We selected 10 material categories: ceramic, fabric, foliage, glass, liquid, metal, paper, skin, stone, and wood. For our painting dataset, we sam- pled as-class-balanced-as-possible from these classes to form a dataset with 10K samples and a dataset with 60K samples. A corresponding photograph dataset is constructed from Opensurfaces/MINC/COCO with 10K and 60K samples as 56 well. Results We find that the studied domain adaptation algorithms can indeed behave differently than would be expected from results on existing benchmark datasets. This is could due to more data being available or a more difficult domain shift than conventional adaptation benchmark datasets. Effect of Dataset Size. Results are summarized in Table 3.5. With the conven- tional 1K samples per class, we confirm domain adaptation yields gains over source-only as found on existing benchmark datasets. In contrast to results on existing benchmarks however, we find that class-conditional adaptation does not necessarily outperform class-agnostic adaptation. We hypothesize this occurs due to failures in target label estimation for the class-conditional case – we dis- cuss this further below. Next, with 6K samples per class (which is 6×more data samples per class than conventional UDA benchmarks), we find that source-only (i.e., no adaptation) performs very competitively. In fact, source-only strictly out- performs adaptation for Painting to Photo transfer in this data regime! This result suggests that domain adaptation is useful in lower data regimes, but source-only is a competitive alternative when more data is available. Whether this is due to the classification task itself, quirks specific to this dataset, or something else is an interesting direction to explore. We leave a deeper exploration of the source of negative transfer when MMD or CDD are applied to this dataset as future work. Effect of Class Label Estimation. As found above, class-conditional adaptation can underperform class-agnostic adaptation despite utilizing more information. 57 As class-conditional adaptation depends on estimated target labels, large domain shifts that hamper label estimation can harm adaptation. To confirm this, we consider two experiments: CDD with intraclass discrepancy minimization only (instead of both intraclass minimization and interclass maximization), and CDD with ground truth labels (i.e., perfect label estimation). Results are in Table 3.6. In both cases, we see performance improves. In the case where perfect label estimation is assumed, then CDD does outperform intraCDD and MMD as found on existing datasets. Therefore, estimating class labels for domain adaptation is useful in practice, but only if the labels are estimated sufficiently well. ResNet18 1K imgs/class per domain Photo→ Painting Painting→ Photo Source-Only 35.2% (—) 46.9% (—) MMD 46.1% (+10.9%) 56.4% (+9.5%) CDD 40.5% (+5.3%) 57.4% (+10.5%) ResNet18 6K imgs/class per domain Photo→ Painting Painting→ Photo Source-Only 38.7% (—) 53.6% (—) MMD 43.5% (+4.8%) 51.5% (–2.1%) CDD 35.6% (–3.1%) 49.4% (–4.2%) Table 3.5: Effect of Dataset Size. UDA from photo (source) to painting (target) and painting (source) to photo (target). Source-only refers to a reference baseline where no adaptation is used. The gap between source-only and UDA decreases as data samples increases from 1K images per class to 6K images per class. Furthermore, in contrast to behavior found on existing benchmark datasets, the class-conditional method of CDD does not necessarily outperform the class- agnostic counterpart MMD. 3.6 Discussion and Future Work In this chapter, we explored how modern deep learning tools developed for natural images can be used to analyze paintings, and in turn, how paintings can 58 ResNet18 1K imgs/class per domain Photo→ Painting Painting→ Photo Source-Only 35.2% (—) 46.9% (—) MMD 46.1% (+10.9%) 56.4% (+9.5%) IntraCDD 44.4% (+9.2%) 58.5% (+11.6%) CDD 40.5% (+5.3%) 57.4% (+10.5%) IntraCDD w/ GT labels 57.6% (+22.4%) 64.4% (+17.5%) CDD w/ GT labels 61.6% (+26.4%) 72.5% (+25.6%) Table 3.6: Effect of Class Label Estimation. Reducing the reliance class label estimation improves class-conditional UDA when label estimation for target data is poor. MMD does not require class label estimation, and so its performance is relatively good here. Due to poor label estimation, we find that IntraCDD (which considers only intraclass discrepancy) outperforms CDD (which considers both intraclass and interclass discrepancy) as IntraCDD relies less on accurately estimated class labels. (green) Assuming perfect class label estimation using ground truth (GT) labels, CDD recovers performance gains over intraCDD and MMD. be used to improve deep learning systems in a series of experiments. Paintings represent an interesting distribution of images. Despite not necessarily being photorealistic, paintings still convey meaningful imagery that evoke similar feelings as natural photos. Paintings are crafted by humans for humans, and understanding paintings can deepen our understanding of cultures and human perception. As most computer vision systems are built for natural images, labeled datasets for paintings do not exist at the same scale as those for photos. Our results in this chapter demonstrate that interactive annotation tools can work surprisingly well on paintings, suggesting that these tools can be used for easing the construction of labeled painting datasets. We also showed that labeled paintings can be used to finetune object detectors into material detectors. Automated methods for retrieving depictions of specific materials can be very valuable for digital art historians, who have studied specific styles or techniques with respect to depicted 59 content in paintings. These experiments represent a step towards understanding how to create better tools for managing painting data. More annotations and automated content analysis of paintings can prove to be valuable for many disciplines, including those beyond computer vision and deep learning. On the other hand, deep learning systems may benefit from learning from paintings as well. As paintings emphasize useful perceptual cues so that humans can understand the depictions in paintings, training a computer vision model on paintings may to allow models to learn such cues for recognition. Our experiment with glossy versus rough fabric classification showed that models trained on paintings were indeed able to focus on more interpretable cues. While this study was limited to a small specific classification task (wool/cotton versus silk/satin), it would be interesting to explore further whether models trained on paintings can learn robust representations for more general settings. We explore this direction further in the next chapter. Finally, extensive existing efforts have been made in creating algorithms for adapting models to new image distributions. We found that two existing unsupervised domain adaptation algorithms can struggle to perform well when used to adapt a model trained for material recognition from photos to paintings or vice versa. By evaluating models and algorithms in their ability to enable generalization to new domains during testing, deep learning systems and training algorithms can be implicitly guided towards more robust design paradigms. Notes and Acknowledgements. I thank Mitchell for his efforts in leading the collection of the Materials in Paintings dataset, upon which this work was per- formed. 60 CHAPTER 4 LEARNING ROBUST NATURAL IMAGE RECOGNITION FROM PAINTINGS 4.1 Overview A common method to improve a model’s robustness to distribution shifts is through data augmentation. These are image transformations that are applied to existing training images that preserve the semantic content of each image. Data augmentations encourage models to learn desired invariances, such as invariance to horizontal flipping or small changes in color. Beyond simple geometric or photometric transformations, recent work has shown that style transfer can be used as a form of data augmentation to encourage invariance to textures [49]. Style transfer is a technique which aims to create painting-like images from photographs [71]; often, these methods rely on operations that manifest as changes in texture throughout the original photograph. However, a stylized photograph is not quite the same as an artist-created painting. Artists depict perceptually meaningful cues in paintings so that humans can recognize salient components in scenes, an emphasis which is not enforced in style transfer. As found in the previous chapter, learning from paintings may allow models to better grasp important cues for generalizable recognition. Therefore, in this chapter, we study how style transfer and paintings differ in their impact on model robustness. First, we investigate the role of paintings as style images for stylization-based data augmentation. Arbitrary style transfer algorithms transform a photo by transferring the style from some source style im- age; conventionally, this style image is selected from a set of paintings. However, 61 we find that style transfer functions well even without paintings as style images, suggesting that the power of style transfer as data augmentation is orthogonal to the style found in actual paintings. Second, we show that learning from real paintings as a form of perceptually-grounded implicit data augmentation can im- prove model robustness. That is, each painting can be considered an augmented version of some photograph of a scene, even though the photograph and/or scene may not necessarily exist. Finally, we investigate the invariances learned from stylization and from paintings, and show that models learn different invari- ances from these differing forms of data. Our results provide insights into how stylization improves model robustness, and provide evidence that artist-created paintings can be a valuable source of data for improving model robustness to naturally-occuring distribution shifts. The work in this chapter was published in CVPR 2021 as “What Can Style Transfer and Paintings Do For Model Robustess?” [97] with Mitchell van Zuijlen, Sylvia Pont, Maarten Wijntjes, and Kavita Bala. 4.2 Introduction Model robustness can be defined as the capability of a model to generalize to unseen image distributions. These can be the result of real-world effects, like weather and camera noise [60], adversarial noise [102], or distribution shifts due to differences in environments in which the images are captured. The performance of standard recognition models can degrade drastically in these settings, but robust models are critical for applications such as self-driving or medical diagnostics. 62 Content Images Learned Style Transfer Content Noise Invariance TrainImages Weather Invariance Style Images Content Images Arbitrary Style Transfer Neural Network Style Viewpoint and Lighting Images  InvarianceBlur Invariance Arbitrary Style Transfer Digital Invariance Figure 4.1: What invariances are learned from real and fake paintings? Left: Natural photographs (black), paintings (magenta), and stylized photographs (olive/red/blue) from the Materials dataset (Section 4.4.2), Right: Relative ro- bustness to various types of transformations for models trained with different sets of images with respect to a model trained on only natural photos. Stylization algorithms can transform photographs into painting-like images, but it is not clear that models will learn the same invariances from these images. This chapter explores a series of hypotheses to understand the different ways in which style transfer and paintings improve model robustness. A common strategy is to improve generalization through data augmentation [187, 34, 61, 102]. Conventional data augmentation applies transformations to encourage invariance to heuristic rules (e.g., flipping for invariance to image mirroring). Recent work has found that image stylization can encourage models to learn invariance to texture [49]. While style transfer has focused on visual fidelity [71], we argue that current style transfer models do not yet fully capture the essence of artistic paintings. For example, a family of style transfer algorithms act by manipulating feature distributions to create a stylized photo which holisti- cally mimics a painting [93] – in effect, mid-level textures are manipulated in the stylized photo. However, paintings are more than a style filter applied to a photo. An artist can choose lighting, contours, and scene context to convey realism in important scene regions while foregoing perceptual details less important areas. This artistic manipulation can affect our perceptual understanding of the scene. In this chapter, we explore a series of hypotheses to understand how style 63 transfer and paintings impact model robustness. Fig. 4.1 illustrates that various types of images can differently affect model robustness. First, we examine how style images play a role in stylization-based data augmentation in Section 4.5. Second, we investigate the role of paintings as a form of training data, and contrast it to other artforms such as sketches in Section 4.6. Finally, we probe models to empirically understand their learned invariances, and discuss how style transfer and artistic paintings can contribute to robust natural image recognition models in Section 4.7. Our contributions are: • We demonstrate that arbitrary style transfer can be used as effective data augmentation even without painting style images. We attribute their effec- tiveness to the diversity of style images rather than artistic style. • We argue that paintings can be considered a form of perceptual data aug- mentation, and demonstrate that it can improve model robustness. We contrast paintings with other forms of art such as sketches. • We explore the invariances learned from arbitrary style transfer, learned artistic style transfer, and paintings. We find that models do not learn the same invariances from stylized photos and paintings, and show that the learned invariances are complementary. 4.3 Related Work and Background Model Robustness. Recent work in robustness for CNNs has focused on both adversarial robustness [18] as well as robustness to real-world transformations [60, 50]. This view of model robustness is human-centric, where the settings considered are those where the human visual system has been shown to be 64 robust (e.g., [144, 50, 49]), rather than enforcing model robustness under arbitrary settings. A related line of work is in domain generalization, where the task is to generalize to unseen domains, (e.g., [51, 112, 92, 91]), by learning a shared representation on a set of seen domains. While a common justification for domain generalization is model robustness, domain generalization is subtly different. Domain generalization algorithms assume the target domain is unspecified, and do not rely on domain-specific signals at inference time. However, robust natural image recognition can benefit from learning from natural images directly. Data Augmentation. Data augmentations are transformations applied to im- ages to enforce useful model invariances. Beyond basic transformations like flipping, recent work in data augmentation has focused on more complex aug- mentations such as image occlusion [34], class-mixing [187], and compositions of transformations [61]. Data-driven augmentations such as adversarial or styliza- tion transformations [170, 102, 68] can also be used to model nuanced invariances. Style Transfer. Style transfer aims to transform photos into painting-like im- ages by transferring artistic styles. While increasing attention has been given to arbitrary style transfer (e.g., [68, 150, 146, 155, 167]) which aims to efficiently transfer unseen styles, artist-specific style transfer models (e.g., [140, 77]) are typically able to better capture nuances from a collections of images. Beyond its role as a tool for artistic creation, stylization has also been used as a form of data augmentation to enforce invariances to textures [49], as well as regularization for tasks such as human re-identification [70]. 65 4.4 Preliminaries 4.4.1 Evaluating Robustness We evaluate robustness to common image corruptions and distribution shifts from the training distribution. These settings serve as a proxy for real-world robustness. Furthermore, the behavior of models on these scenarios gives us insight into the invariances learned – for example, a model which is robust to noise has likely learned to be more invariant to (i.e., to rely little on) high- frequency signals in an image. All experiments use an ImageNet-pretrained ResNet18 architecture, and results are averaged over three independent runs. For complete training details and experiments with alternative architectures, please refer to the appendix. Common Image Corruptions. Common image corruptions are inspired by transformations that can be encountered in real-world settings [60]. There are 15 corruptions which span 4 broad categories (noise, blur, weather, and digital) with 5 severity levels per corruption. We use the released code to corrupt our test images. Figure 4.2 illustrates these corruptions. For each corruption, we compute the mean accuracy over each severity, and then compute the mean over each set of broad corruption categories C. Given a model Θ, the mean corruption accuracy is: 66 1 ∑ AccMean(Θ) = AccC(Θ) (4.1) 4 C ∑ ∑51 where AccC(Θ) = Acc(Θ,Dcorr,s) 5nC corr∈C s=1 Dcorr,s denotes the test dataset of images transformed by corruption corr with severity s. Small Distribution Shifts. Out-of-distribution photographs will be used to evaluate robustness to small domain shifts not unlike the domain shifts that mod- els must overcome when they are used in different real world environments. For the PACS dataset, we use a subset of the YFCC100M dataset [158] as the out-of- distribution test set. This subset is curated by downloading 100 images per class and then manually filtering to remove irrelevant retrievals down to 50 images per class. This test set is released for reproducibility. For the Materials dataset, we use the Flickr Material Database (FMD) [144] as the out-of-distribution test set. Figure 4.2: Image Corruptions. Top-Left to Bottom-Right: Noise(×3), Blur(×4), Weather(×4), Digital(×4). 67 4.4.2 Datasets We select datasets which contain both photographs and paintings, and con- duct experiments across two recognition tasks (object classification and material classification). Object Classification. We use the PACS dataset [91] which consists of 10K images across 7 categories and 4 domains (photographs, paintings, cartoons, and sketches). Material Classification. We construct a dataset from existing large-scale pho- tograph datasets [10, 11, 17], and a large-scale painting dataset with material annotations [162]. We will refer to this dataset as ‘Materials’. This dataset consists of 120K images across 10 categories and 2 domains (photographs and paintings). See appendix for details. [49] found that stylization-based augmentation can reduce bias towards textures, but material recognition relies on texture under- standing [2]. As such, it is interesting to explore whether stylization can improve robustness for this task. 4.4.3 Notation Some common notation used throughout is given here. Let Dn be a set of natural photographs and Dp be a set of paintings. For each image x, its class label is denoted by yx. Finally, let l(ŷ, y) denote the cross entropy loss. 68 4.5 Style Transfer as Data Augmentation Style transfer aims to transform the style of an image into the style of another set of images [71]. There is evidence [49] that training on stylized images [68] can improve object recognition on ImageNet by encouraging networks to focus more on shape than texture. In this view, we can consider style transfer as a form of data augmentation. Style transfer is often applied with painting style images from datasets such as Wikiart [157, 177]. In its role as a tool to mimic artistic creation, this is certainly appropriate. However, in its role as a form of data augmentation, it is not strictly necessary for the style images to be paintings. Indeed, arbitrary stylization methods can be applied to any pair of content and style images (hence ‘arbitrary’). Although work such as [49] utilize style transfer in the conventional manner with painting styles, it’s important to ask whether models can learn robust invariances from photo style images alone. To answer this question in a general way, we experiment with three repre- sentative deep-learning based arbitrary style transfer methods. Each of these methods act in deep feature space, but follow a different paradigm: AdaIN [68] transfers style by matching the mean and standard deviation of features, ET- Net [150] iteratively refines a stylized image by computing residual error maps, and TPFR [155] transfers style by recombining features in the content image to match those of the style image. We explore the following: • Hypothesis H1. Painting styles are necessary for stylization-based aug- mentation to improve robustness. • Hypothesis H2. Style image diversity is important. 69 4.5.1 Are Painting Style Images Necessary? We experiment with: (a) a network trained with photos plus photos stylized by paintings and (b) a network trained with photos plus photos stylized by other photos. We will refer to (b) as “intradomain stylization” as photos are being stylized by other photos from within the same domain. For reference, we also consider (c) a network trained with photos alone (no stylization). Specifically, let φ(x, xs) be an arbitrary stylization algorithm which stylizes content image x with style image xs. For a network Θ, the objectives are given by: [1( )] (a) minEx,xs∼Dn,Dp [ (l(Θ(x), yx) + l(Θ(φ(x, xs)), yx)Θ 21 )] (b) minEx,xs∼[Dn,Dn l(Θ](x), yx) + l(Θ(φ(x, xs)), yx)Θ 2 (c) minEx∼Dn l(Θ(x), yx) (4.2) Θ In practice, we approximate the objectives by sampling xs once for each x instead of minimizing over all independent combinations of x and xs. The results are shown in Fig. 4.3. Across both PACS and Materials, we find that intradomain stylization significantly improves robustness over the photo-only baseline. With a large dataset (Materials), we find that intradomain stylization can meet or even exceed the performance of conventional painting- based stylization. Thus, in contrast to common practice, stylization-based data augmentation does not need painting style images. This finding is also sup- ported by recent work which shows that online feature moment matching across different training images is an effective form of data augmentation [90] (which we can frame as roughly equivalent to intradomain stylization with AdaIN), 70 and work which shows stylization with images from non-painting domains (including intradomain stylization) can be useful for domain generalization [149]. We have shown explicitly here that intradomain stylization can replace painting stylization for robust natural image recognition when enough data is available. Answer to H1: Intradomain stylization can improve network robustness to an extent that is comparable to painting stylization when there is sufficient data – that is, paintings do not play a unique role when arbitrary style transfer is used as data augmentation. Num Data Samples Figure 4.3: Stylization: Painting vs Photo Styles. Left: PACS, Right: Materials. In general, intradomain stylization (red/green/yellow) improves robustness over no stylization (blue). Further, when sufficient data is available (Materials), intradomain stylization (dashed lines) results in similar robustness gains to conventional painting stylization (solid lines). This means that paintings are not uniquely responsible for robustness gains from stylization. 4.5.2 The Role of Style Diversity The finding that intradomain stylization can be comparable to painting styliza- tion leads to the hypothesis that it is the diversity in image statistics between style and content images that plays a key role. For example, consider AdaIN – the extent to which images are transformed by stylization depends on the magnitude 71 of the difference in feature distribution moments between the content image and the style image. This is why intradomain stylization is comparable to painting stylization on a large dataset like Materials. We test this hypothesis by restricting the style photo for intradomain styliza- tion to be drawn from images that share the same class label as the content image. With this restriction, the style images are likely to be more similar to the content image given that they share similar semantic content. Let Dyn be the subset of natural photographs with class label y. Then, the objective is given by: [ [1( )]] minEx∼D yxn Ex ∼D l(Θ(x), yx) + l(Θ(φ(x, xs)), yx) (4.3)s Θ n 2 In general, we find that this restriction does indeed reduce the effectiveness of intradomain stylization (Fig. 4.4). As an exception, TPFR does not appear to rely heavily on the choice of style images. This can be explained by the adversarial loss used in TPFR – the decoder is trained explicitly to fool a style discriminator that discriminates between stylized images and real paintings during training. Therefore, it is possible that the decoder is encoding painting- like style signals regardless of the style image used. This also suggests that a style transfer algorithm which explicitly transfers painting styles can be useful instead of relying on a diverse style dataset during training (we explore this in Section 4.7). In general, biases in stylization models can contribute to improved robustness independently of style images. Answer to H2: Access to style images which are diverse with respect to content images is key for stylization-based augmentation. Against conventional wisdom, style images need not contain statistics that manifest as visible textures or artistic style per se. 72 As long as each style image is sufficiently different from its corresponding content image, it will suffice. “Sufficiently different” means “depicting different semantic content” in our analysis here. Interestingly, we found that style differences measured by the Gram matrix distance between a stylized image and its original counterpart do not correlate with robustness (see appendix) – further analysis is left for future work. Num Data Samples Figure 4.4: Stylization: Unrestricted vs Intraclass Styles. Left: PACS, Right: Materials. Across both datasets, restricting style images to the class as content images (dashed lines) results in smaller robustness gains compared to unre- stricted stylization (solid lines). This reduction in robustness is explained by the reduction in diversity between content images and style images. 4.6 Paintings as Perceptual Data Augmentation In Section 4.5, we found that stylization as data augmentation works well as long as the set of style images are diverse. This diversity does not necessarily depend on the image statistics found specifically in paintings. If sufficiently diverse mid-level statistics is found by stylization with photos, then perhaps photos can fulfill the role of paintings entirely. Instead, we argue that paintings are more than just a set of mid-level style 73 features overlaid on top of a photograph. Our key insight is that perceptually realistic paintings can be considered a form of ‘perceptual’ data augmentation. Unconstrained by physical reality, artists are free to depict varying level of perceptual realism [19]. Paintings are perceptually realistic in regions where the artist has deemed viewer attention should be focused. For example, a painting of a giraffe might include perceptually relevant details on the giraffe itself while the background is depicted in an less realistc and more abstract manner. In a collection of paintings, important cues for objects or materials of interest are depicted frequently in a perceptually sound manner while unimportant details are abstracted away. Even so, the domain shift between paintings and photos can be problematic, and it is likely that models trained on paintings will fail to perform well on photos if domain shift is not accounted for. Furthermore, many of the arguments made for paintings above can also apply to other artforms, and it is interesting to consider alternatives. We explore the following: • Hypothesis H3. (a) Learning from paintings improves natural image ro- bustness after accounting for domain shift, and (b) this improvement is greater than that found from photos alone. • Hypothesis H4. Other artforms can encode similar invariances to paint- ings. 74 4.6.1 Learning Robust Natural Image Recognition From Paint- ings A classifier trained directly on both photos and paintings is required to learn boundaries that satisfy both of these domains. Consequently, the accuracy on photographs can suffer. Since our goal is to train a robust model for natural image classification, we alleviate this by considering two alternatives: (a) a shared feature extractor with multiple domain-specific classifiers (multitask learning) or (b) a photo-only classifier that is finetuned after shared feature learning. For reference, we also consider the default option of training (c) a joint classifier on both photos and paintings. Specifically, let Θf be a feature extractor (i.e., ResNet18 without the final fully connected layer). Let η be a linear classifier (i.e., a fully connected layer). Then the objective for (a) is given by: [1( min Exn,xp∼Dn,Dp l((ηn ◦Θf )(xn), yx2 nΘf ,ηn,ηp ))]+ l((ηp ◦Θf )(xp), yxp) (4.4) For (b), two objectives are optimized sequentially: [1( (i) min Exn,xp∼Dn,Dp l((ηn ◦Θf )(xn), yx2 n))]+Θf ,ηn [ l((ηn ◦Θf )(xp]), yxp) (ii) minExn∼Dn l((ηn ◦Θf )(xn), yxn) (4.5) ηn For (c), the objective is simply Eq. 4.5(i). In all cases, the model defined by 75 (ηn ◦Θf ) is used at inference time. Both options (a) and (b) allow paintings to be used for feature learning while keeping the inference classifier specific to photos. Results are summarized in Fig. 4.5. Despite domain differences between photos and paintings, the default classifier (c) has improved robustness over a classifier that is trained on photos alone. A finetuned classifier (b) does not yield much improvement over the default option (c), while domain-specific classifiers (a) do yield significant improvement. This suggests that paintings are useful for feature learning since they can guide the feature extractor towards perceptually relevant features, but constraining the feature space to jointly separate photos and paintings across different classes can restrict the breadth of learned features. The clean accuracy of a joint classifier (finetuned or not) suffers since it can no longer rely on some photo-specific features for classification. We will use domain-specific classifiers in remaining experiments unless otherwise specified.1 Answer to H3a: Surprisingly, we find that paintings can improve model robustness out-of-the-box without accounting for domain shift. However, accounting for domain shift with domain-specific classifiers increases both clean accuracy and robustness signifi- cantly. To control for robustness gains from photos, we assume a 1:1 cost for pho- tos:paintings with a fixed annotation budget. Fig. 4.6 shows that it is beneficial to allocate up to 50% of any annotation budget for paintings with respect to model robustness. Answer to H3b: Using paintings is cost-effective – annotating a combination of photos and paintings results in higher robustness over photos alone for any fixed budget. 1We experimented with domain-specific classifiers in the context of stylization, but found they did not improve robustness over a joint classifier. 76 Num Paintings (Plus 10K Photos) Figure 4.5: Learning from Paintings. Left: Clean Accuracy, Right: Corruption Accuracy. Domain-specific classifiers (green) result in the highest robustness while also improving clean accuracy. “LR normalized” refers to fixed effective learning rates to account for additional gradients from the extra classifier head. Even without accounting for domain shifts, training with paintings improves robustness (red/yellow). Results are on Materials. Total Data Samples Figure 4.6: Trade-off Between Photos and Paintings. Left: PACS, Right: Mate- rials. For a fixed annotation budget, learning from both photos and paintings (25%/50% paintings) results in higher robustness than photos alone (0% paint- ings), with a bit over 1 of the total number of data samples required to be 3 annotated to match the maximal robustness achieved by only photos. 4.6.2 Paintings vs. Other Visual Artforms Many artforms are created with an artistic emphasis on perceptually important cues. For example, a line sketch is an abstraction which focuses on salient con- tours to depict recognizable objects. While sketches are quite good at abstracting away unimportant signals, they also abstracts away many realistic cues in favor 77 of a sparse line-based representation. In the following experiment, we consider models trained on photographs with different visual artforms. Table 4.1 summarizes results across four datasets. We find that robustness can be harmed by sparse visual representations like PACS line sketches or Domain- Net quickdraw. However, DomainNet sketches, which include more realistic shading and detail, do improve robustness. This is aligned with our expectation that the inclusion of perceptually relevant cues is important for feature learning. VisDA renderings are untextured and shaded with a single directional light source and ambient lighting. Similar to line sketches, we find that these minimal renderings reduce model robustness. Answer to H4: Our results position paintings as a unique artform for improving model robustness due to their fine balance between perceptual realism and abstraction. 4.7 Do Stylized Images and Paintings Induce Similar Invari- ances? As shown in Sections 4.5 and 4.6, both stylized images and paintings can im- prove model robustness. We argued that paintings are a form of perceptual data augmentation in which artists manipulate perceptual cues to emphasize salient regions of scenes. However, it remains unclear whether models are indeed learning perceptual invariances from paintings – it is possible that the robust- ness gains from paintings arise purely through their mid-level image statistics and textures instead. If paintings are improving robustness through different mechanisms than stylized photos, we can expect different behavior from models 78 Training Data (# Samples) Mean Corruption Acc (%) Materials Photo (30K) 54.73±0.25 Photo + Painting (15K + 15K) 56.31±0.27 (+) PACS Photo (1500) 76.16±0.34 Photo + Painting (750 + 750) 79.41±0.55 (+) Photo + Cartoon (750 + 750) 75.38±0.36 (−) Photo + Sketch (750 + 750) 73.85±0.39 (−) DomainNet [127] Photo (120K) 36.59±0.12 Photo + Painting (90K + 30K) 39.00±0.14 (+) Photo + Sketch (90K + 30K) 37.57±0.22 (+) Photo + Clipart (90K + 30K) 37.00±0.07 (+) Photo + Quickdraw (90K + 30K) 35.87±0.20 (−) Photo + Infograph (90K + 30K) 34.60±0.18 (−) VisDA [128] Photo (30K) 65.97±0.33 Photo + Rendering (15K + 15K) 63.90±0.21 (−) Table 4.1: Robustness from Different Artforms. Paintings improve model ro- bustness while more abstract artforms can reduce robustness. (+)/(−) indicate whether an artform improves/reduces model robustness. ± indicates standard deviation over 3 runs. trained on stylized photos and paintings. To investigate how stylized photos and paintings act on model robustness, we empirically probe models to understand their learned invariances. We explore the following: • Hypothesis H5. Models trained on stylized photos and paintings learn different invariances to (a) common image corruptions and (b) viewpoint and lighting shifts, and so (c) models can learn complementary invariances by training on both paintings and stylized photos. • Hypothesis H6. Stylization injects high-frequency signals that improve model robustness. 79 4.7.1 Probing Learned Invariances To explore the relative invariances learned by different models, we consider the behavior of models on various types of common image corruptions. We also consider behavior on out of distribution images – in general, these images have a different distribution of viewing angles, viewing scales, and lighting than the original training photos. We experiment with models trained on paintings and AdaIN-stylized photos. In addition to arbitrary style transfer, it is natural to consider learned artistic style transfer. We experiment with SACL [140], which transfers the style of various artists independently with separately trained models. We stylize each photo with a random artist to parallel the real painting datasets which include multiple artists and styles. Behavior with respect to common corruptions is summarized in Table 4.2. Stylization and paintings both consistently improve robustness to each form of common corruption. On average, SACL outperforms both AdaIN and paintings, giving credence to an argument that stylization methods with strong biases (i.e., learned styles) may be more practical than real paintings or arbitrary stylization methods that depend on a diverse style set (c.f. Section 4.5.2). Observe that the relative performance of paintings fluctuates between datasets – paintings outperform AdaIN on noise and digital on Materials but underperform AdaIN on PACS. As discussed earlier, a collection of paintings encodes perceptual invariances. Since these invariances are not agreed upon a priori for every painting, it follows that a large set of paintings is required to adequately capture implicitly encoded perceptual invariances. Finally, all methods are similarly invariant to weather and digital transformations. This can be explained by their mid-level statistics. Weather transformations such as snow, fog, and frost are effectively 80 overlaid textures on an image while digital transformations such as pixelate and elastic transform resemble the fuzzy boundaries found in both types of images. Answer to H5a: Both stylization and paintings improve robustness to various image corruptions. However, learned stylization strictly outperforms paintings, suggesting that invariances from learned style transfer supersedes those from paintings with respect to common corruptions. Method Noise Blur Weather Digital Materials (30K Samples/Domain) Photo-Only 43.70±0.65 58.76±0.14 55.25±0.33 61.20±0.69 Photo + AdaIN 47.33±0.22 65.09±0.21 61.78±0.18 61.41±0.16 Photo + SACL 61.87±0.16 64.36±0.20 57.49±0.24 66.55±0.17 Photo + Painting 49.82±0.56 61.03±0.13 56.69±0.10 64.15±0.14 PACS (1.5K Samples/Domain) Photo-Only 62.64±1.48 72.75±0.04 83.24±0.22 86.33±0.14 Photo + AdaIN 70.17±1.70 81.18±0.20 88.37±0.23 89.32±0.19 Photo + SACL 85.98±0.56 84.61±0.15 89.73±0.33 88.74±0.48 Photo + Painting 68.83±0.83 75.80±0.95 86.88±0.66 87.07±0.14 Table 4.2: Per-Corruption Accuracy. (blue) SACL generally outperforms both AdaIN and paintings, particularly on noise. (red) Paintings can outperform AdaIN on some corruptions with a large dataset (Materials), but underperform when fewer images are available (PACS). See main text for discussion. ± indicates standard deviation over 3 runs. Performance with respect to out-of-distribution images is summarized in Fig. 4.7. In striking contrast to the robustness against image corruption results above, stylization consistently harms robustness. The reduced performance of stylization can be explained by model overfitting to view- or lighting-specific signals in the original photo dataset, as the signals in common between a clean photo and its stylized counterpart are seen twice as often by the network during training. On the other hand, paintings are not simply a transformed photograph, and thus do not suffer from this problem. A straightforward explanation of the robustness found through paintings is in the differences in viewpoints and lighting depicted 81 compared to photos due to circumstance (that is, the paintings simply depict more diverse scenes than the photos). However, paintings are constrained by cultural norms and artistic conventions [153, 107], so it is unlikely that artistic paintings contain a more diverse set of viewpoints than in-the-wild photos. Instead, we argue it is the emphasis on depicting regions of interest with recognizable characteristics while de-emphasizing details in the background that is helping networks to learn better viewpoint invariance from paintings. The model is better able to learn to focus on the objects or materials themselves over background context. Answer to H5b: For viewpoint and lighting transformations found in out-of- distribution images, using stylization consistently hurts performance while using paint- ings consistently improves performance. Num Samples Per Domain Figure 4.7: Out-of-Distribution Accuracy. Left: PACS, Right: Materials. Train- ing with paintings (red) improves robustness to out-of-distribution photos while training with stylized photos (purple/yellow) hurts robustness. Paintings can im- prove invariance to viewpoints and lighting by encouraging models to focus on objects / materials of interest over background context. Stylization encourages overfitting, an effect which can be exacerbated with more training samples. Since the behavior of models trained on stylized photos and paintings are indeed different, we explore whether models trained on both sources of data learn complementary invariances, or if the differences result in conflicting behavior. Our results in Table 4.3 suggests the former. 82 Answer to H5c: Training with both paintings and stylized photos improves robustness in a complementary manner. Method MEAN Corr. OOD Materials (30K Samples/Domain) Photo-Only 48.03±0.21 54.73±0.25 41.33±0.62 Photo + SACL 48.56±0.45 62.67±0.03 34.54±0.91 Photo + Painting 50.92±0.22 57.92±0.09 43.92±0.47 Photo + SACL + Painting 51.49±0.69 61.47±0.50 41.50±1.38 PACS (1.5K Samples/Domain) Photo-Only 79.37±0.17 76.16±0.34 82.57±0.00 Photo + SACL 82.35±0.37 87.27±0.10 77.43±0.84 Photo + Painting 82.54±0.59 79.65±0.49 85.43±0.70 Photo + SACL + Painting 85.42±0.18 87.31±0.30 83.52±0.27 Table 4.3: Learning from Stylization and Paintings. Training with both stylized images and paintings improves average robustness to image corruptions and out-of-distribution photos, indicating that the invariances learned from these images are complementary. ± indicates standard deviation over 3 runs. 4.7.2 The Role of High Frequency Signals We have focused our intuitions about the source of invariances learned from stylization and paintings through the visible structure of these images. Existing work has shown that CNNs can learn to extract features from high frequency signals in images [166, 102]. It is also well-known that deconvolutional decoders, such as those used in stylization models, can introduce artifacts in images [120]. It is difficult to form intuitions about these signals, but we can measure whether they play a significant role in improving model robustness. We apply an ideal circular low-pass filter to zero out high-frequency com- ponents. Given an image I , the filtered frequency components of the image are: 83 Xfiltered = F(I) C (4.6) where Cij = 1r<τ (r(i, j)) F denotes the discrete 2D Fourier transform, 1 denotes the indicator function, and τ is the radius of the low-pass filter. We set τ = 60 in our experiments. Fig. 4.8 illustrates images before and after filtering at image resolution 224 × 224. Note that the filtered images are perceptually identical to the original images at a glance. Therefore, we can train models on the filtered images to measure the impact of the visually negligible high frequency signals which were filtered out. Table 4.4 summarizes the results. With filtered images, robustness against noise drops significantly for models trained on photos stylized with SACL. This means visible high frequency textures (such as the brush strokes in a Monet stylized photo) are not enough to explain robustness against noise. This effect of invisible high-frequency signals on noise is similar to evidence that learning from adversarial perturbations improves robustness to high frequency corruptions [185]. On the other hand, the effect of high frequency signals on the noise robustness of paintings is much smaller. Answer to H6: For learned style transfer, it is the presence of invisible high frequency signals that are doing the heavy lifting against noise. In contrast, paintings are primarily improving invariance towards noise through visible human-perceivable signals. 84 Figure 4.8: Reducing High-Frequency Signals. Top: Original Image, Bottom: Low Frequency Image. Columns 1 and 3 are stylized photos; columns 2 and 4 are artist-created paintings. Reducing the magnitude of sufficiently high frequency components from images does not alter perceptual quality of images. At a glance, the top and bottom images are perceived to be identical. Method Noise Blur Weather Digital OOD Materials (30K Samples/Domain) Photo-Only 43.70±0.65 58.76±0.14 55.25±0.33 61.20±0.69 41.33±0.62 Photo + SACL 61.87±0.16 64.36±0.20 57.49±0.24 66.55±0.17 34.54±0.91 Photo + Painting 49.82±0.56 61.03±0.13 56.69±0.10 64.15±0.14 43.92±0.47 Photo+SACL (LF) 45.82±1.36 64.24±0.39 57.06±0.13 66.37±0.29 36.92±1.15 Photo+Painting (LF) 44.95±0.66 60.87±0.29 56.82±0.23 63.69±0.46 41.21±0.56 PACS (1.5K Samples/Domain) Photo-Only 62.64±1.48 72.75±0.04 83.24±0.22 86.33±0.14 82.57±0.00 Photo + SACL 85.98±0.56 84.61±0.15 89.73±0.33 88.74±0.48 77.43±0.84 Photo + Painting 68.83±0.83 75.80±0.95 86.88±0.66 87.07±0.14 85.43±0.70 Photo+SACL (LF) 77.55±2.60 85.4±0.11 88.93±0.22 88.53±0.15 77.43±0.47 Photo+Painting (LF) 71.16±1.31 75.97±0.71 86.82±0.37 87.35±0.36 83.71±0.40 Table 4.4: Robustness without High Frequency Signals. “LF” denotes filtered low frequency images. Photos are always unfiltered. Filtering invisible high frequency components mainly impacts noise robustness. (blue) Filtering stylized photos significantly reduces noise robustness while (red) filtering paintings has a relatively smaller effect. ± indicates standard deviations over 3 runs. 4.8 Discussion and Future Work In this chapter, we performed an extensive exploration of style transfer and artistic paintings for model robustness. We found that style transfer is able to 85 improve model robustness without painting style images at all (H1). Instead, stylization relies on a combination of diversity between style-content image pairs and learned biases to improve model robustness (H2). This suggests that although style transfer can improve model robustness, it does so independently from paintings (or the styles found within paintings). This is contrary to con- ventional wisdom that asserts style transfer improves model robustness through its ability to mimic the textures and styles found in real paintings. Next, we proposed the direct use of paintings as a form of implicit perceptual data aug- mentation. We confirmed our hypothesis that learning from real paintings can improve robustness, and saw greater gains by accounting for the domain shift between paintings and photos (H3). When considered against other artforms such as sketches or cartoons, we found that the fine balance of abstraction and realism in paintings allowed it to uniquely enhance model robustness while other artforms do not directly lead to improved robustness (H4). Finally, we found that models learn different invariances from paintings and stylized photos, and that robustness can be improved by training on both forms of data (H5,H6). Both forms of images allowed models to improve robustness to common corruptions in images. However, models trained on stylized photos found reduced robustness to viewpoint, illumination, or other context distribution shifts while paintings improved robustness to such shifts. From a practical standpoint, our results suggest that learned stylization meth- ods should be considered over arbitrary style transfer methods in data aug- mentation pipelines. Our results also suggest that training with paintings is a straightforward way to improve model robustness, and should be used if they are available. 86 There are interesting research directions for future exploration. Work has been done to improve the controls available in style transfer or image editing models [27, 169, 48]. It would be interesting to apply these controls in a perceptually- grounded manner when style transfer is applied to mimic the artistic process. In this chapter, we have found that artforms like sketches are unable to improve model robustness. It would be interesting to explore how coarser abstractions found in art can be leveraged for model robustness, perhaps by encouraging models to learn a hierarchy of invariances. 87 CHAPTER 5 UNCERTAINTY-AWARE PLANNING WITH SEMANTIC SCENE UNDERSTANDING 5.1 Overview Computer vision models are often used within larger systems, where their per- ception of the world feeds into other components. For autonomous navigation, visual reasoning is useful for understanding safe terrain and obstacles in the environment. This understanding can allow the autonomous agent to plan safe paths and perform their tasks. However, different environments will inevitably introduce distribution shift, and even the best computer vision models will make incorrect predictions at times. The previous chapters have explored the question of how to reduce the impact of distribution shift from training to testing. Now, we will explore the important question of how to account for model failures when they arise during deployment, and how a system can still leverage uncertain model predictions. In this chapter, we propose a pipeline for planning in unstructured environ- ments. This is a challenging task that relies on perception, scene reconstruction, and reasoning about various uncertainties to be successful. To perceive the envi- ronment, we use a semantic segmentation model trained to segment terrain and obstacles in scenes. Given a view of the environment, the model allows the agent to reason about potentially safe or unsafe paths through predicted terrain. The model is augmented with dropout to compute uncertainties so that the agent can account for potentially incorrect model predictions. With sparse next-best-view measurements, the agent is able to find new measurements that decreases the 88 uncertainty in the predicted semantics of the environment. Overall, our algo- rithmic pipeline consists of: a deep Bayesian neural network which segments surfaces with uncertainty estimates; a flexible point cloud scene representation; a next-best-view planner which minimizes the uncertainty of scene semantics using sparse visual measurements; and a hypothesis-based path planner that proposes multiple kinematically feasible paths with evolving safety confidences given next-best-view measurements. Our pipeline iteratively decreases semantic uncertainty along planned paths, filtering out unsafe paths with high confidence. We show that our framework plans safe paths in real-world environments where existing path planners typically fail. The work in this chapter was published in ICRA 2020 as “DeepSemanticH- PPC: Hypothesis-based Planning over Uncertain Semantic Point Clouds” [55] with Yutao Han, Jacopo Banfi, Kavita Bala, and Mark Campbell. Yutao, Jacopo, and I contributed equally to this work. 5.2 Introduction Path planning for complex outdoor environments is challenging due to the un- structured nature of environments that do not fall neatly into discretized space. Moreover, different terrain surface types can be difficult to detect with traditional sensing modalities. In indoor environments, a grid space representation with lidar sensors is sufficient [152, 69, 83]. Outdoor environments exhibit complex geometries and surface types, which are difficult–if not impossible–to differ- entiate using just lidar data. Therefore, a more flexible scene representation, surface classification using computer vision techniques, and reasoning about 89 scene uncertainties are necessary. Previous work for outdoor planning has focused on classifying terrain and surface roughness using SVM classifiers [165, 104], neural networks [22, 32], and various other computer vision techniques [109, 42]. While these techniques can differentiate between simple terrain types, they do not model the inherent uncertainties and ambiguities in complex scenes which makes it difficult to differentiate between terrain types (e.g., an offroad robot driving through a patch of grass with small rocks). Many current outdoor planning approaches still rely on grid maps which do not model the complex geometry of an outdoor scene (e.g., a field with irregular bumps and rocks) [121, 41]. Recent work models outdoor maps for planning with a point cloud [79], which is more flexible and suitable for unstructured scenes; however, [79] uses traditional lidar sensing which cannot differentiate between different surface types as broadly as camera- based computer vision. In this chapter, we present DeepSemanticHPPC (Deep Semantic Hypothesis- based Planner over Point Clouds), a novel algorithmic pipeline for planning over uncertain semantic point clouds, which leverages a Bayesian neural network (BNN) [45, 74] to extract principled estimates of segmentation uncertainty. This allows our framework to reason about ambiguous terrain as well as robustly handle false positive detections by taking additional measurements to reduce semantic uncertainty in the scene. However, each measurement is costly due to the computationally expensive nature of Bayesian neural networks operating on a robotic platform with limited computing power. Our planner hence employs next-best-view (NBV) techniques [15, 62, 31] to optimize for new measurements. DeepSemanticHPPC includes: 90 Start: Initial Capture 2D View of Select Next Best Position Scene View Po ene s + ten f sc c sa tial ge o m anti ue n fet s i pa Ima ene taint y cer e t c r t s h S ce ai + n ntie Semantic u s segmentation + Scene semantics uncertainty Update 3D Scene + uncertaintyDeep BNN Semantics and Multi-Hypothesis Uncertainty Planner ne BV Sce ry ore Net o m safe geo m N OR budge t ted wi th c Given: 3D Scene deteEnd: Select Safest path ertain ty Point Cloud low un c Path to Navigate Figure 5.1: The DeepSemanticHPPC pipeline. (1) Given an initial view and scene geometry, a multi-hypothesis graph of possible paths is generated. (2) The uncertainty in the scene is iteratively reduced by selecting next-best-views and path costs are updated. (3) This iterative uncertainty-reduction stage is terminated early if a safe path is confirmed or all considered paths are confirmed as unsafe. (4) Finally, a path is selected. • the employment of a deep Bayesian neural network [45, 74] to obtain surface and obstacle semantics with uncertainty estimates for unstructured outdoor environments; • a flexible point cloud scene representation; • a next-best-view planner which minimizes the uncertainty of terrain se- mantics using sparse visual measurements; • a hypothesis-based path planner (extending [79]) that proposes multiple kinematically feasible paths with evolving safety confidences given the NBV measurements. Experimental results with real environments show that our pipeline plans safe paths in real-world environments where existing path planners typically fail. Fig. 5.1 illustrates DeepSemanticHPPC. In the first stage, a multi-hypothesis 91 planner generates multiple hypotheses of possible safe paths given a scene belief. In the second stage, a NBV function calculates NBV poses and associated re- wards. These poses and rewards are input to a NBV selection block which selects the best feasible NBV. A BNN extracts semantic segmentations and associated uncertainties from the NBV measurement, which are used to generate a new scene belief. The new scene belief reduces hypothesis uncertainty, and the second stage is repeated for a set number of iterations. Finally, a safe path (hypothesis) with high confidence is selected; if all paths are classified as unsafe, then no path is selected. The algorithm terminates once a path is confirmed safe or all paths are confirmed unsafe. 5.3 Related Work and Background RRT-Based Non-Holonomic Planning over a Point Cloud We build upon an existing rapidly-exploring random tree (RRT) [82] based planner for finding kinematically feasible trajectories over non-planar point cloud environments [79]. 6D robot poses are expressed by transformation matrices belonging to the Special Euclidean Group SE(3). A matrix TMR specifies the position and orientation of a robot-fixed coordinate frame R expressed in a given reference map frame M. [79] considers the following planning problem: given start and goal poses TMS, TMG, and a point cloudM = {mi}with mi ∈ R3, compute a connecting trajectory π : R>0 → SE(3). The trajectory has to satisfy a number of constraints up to a given degree of approximation: contact with the terrain surface, static traversability (e.g. bounded roll and pitch angles), and kinematic constraints – including bounded continuous curvature. Trajectories are represented as piecewise continuous functions in the 6D space of robot poses, and are specified by a sequence of nodes 92 π̂ = [N k], where each N k is a tuple (TMRk , τ k,wk, κk). Here, TMRk is a 6D pose attached to the terrain surface, τ k ∈ [0, 1] is the associated static traversability value, wk is a parameter vector specifying a short planar trajectory segment connecting TMRk to the next pose in the sequence, and κ k is the curvature at the beginning of the trajectory segment. wk specifies a trajectory segment as a cubic curvature polynomial [117] evolving along the planar patch defined by the xy plane of the coordinate frame R attached to TMRk . The end point of such a trajectory segment gives the subsequent pose TMRk+1 through a projection on the terrain surface via f : (M,TMR) 7→ TMR; f queries M for the K nearest- neighbors of the end point, which can be thought of as the points the robot will lie on at TMRk+1 (K depends on the size of the robot and point cloud density). We use φ(N k+1) to denote such points. Leveraging the above trajectory representation, [79] proposes to define a small set of motion primitives (short trajectory segments) and use them to grow two RRTs, one from the start pose and one from the goal pose, and iteratively try to connect them. Each new pose is associated with a node N k, which is accepted in the tree only if τ k > 0. [79] also proposes a technique to derive a better trajectory (in terms of smoothness and distance) starting from an initial one. This second optimization stage is not explicitly considered in this work, because it is easily generalized and applied to all “safe” trajectories according to our method. Segmentation with Bayesian Neural Networks Although segmentation net- works for 3D point clouds exist [132, 133, 194, 131, 171], 3D data repositories are focused on object recognition and part segmentation (e.g. [20]) or only contain a small number of scenes [53]. In contrast, existing large-scale image segmen- tation datasets [76, 17, 192, 11] contain varied surfaces and obstacles in diverse 93 real-world outdoor scenes. Therefore, we leverage a state-of-the-art image seg- mentation network [24] (Section 5.5.1), and update the point cloud environment from per-pixel image segmentations (Section 5.5.2). Furthermore, we augment the network to estimate output uncertainty, allowing uncertainty in surface pre- dictions to be embedded in the point cloud. Uncertainty in surface type is used to guide path-safety evaluation. At inference time, forward passes with active dropout layers can be inter- preted as an approximation of the posterior distribution of model weights of a neural network [45, 74]. The uncertainty of predictions are computed by taking the sample standard deviation across multiple forward passes. For each pixel (i, j)X of image X , the mean softmax vector over T forward passes is: T 1 ∑ p(i,j)X (i,j) = s Xt (y|X) (5.1)T t=1 where s(i,j)X (y|X) ∈ RC is the softmax output of the network. The corresponding uncertainty vector on p(i,j)X is: √ √√√√∑T ( (i,j) )X (i,j) 2s Xt (y|X)− p σ(i,j)X = t=1 (5.2) T − 1 In our framework, image X corresponds to a view with known camera parameters. p(i,j)X and σ(i,j)X are combined with existing measurements for each point m ∈M that maps to pixel (i, j)X (section 5.5.2). 94 (a) (b) (c) Figure 5.2: An example point cloud. (a) Image view of a portion of the environ- ment. (b) Point cloud colored with the most likely class predicted from image (a) (bright green: “grass”; dark green: “tree”; purple: “sidewalk”; dark grey: “road”; light grey: no information available). All the classes except “tree” belong to the set S. The region around the tree is actually mulch/woodchips, which should be classified as “dirt” (belonging to U ). (c) Point cloud colored to show safe (white), unsafe (black), and unclear regionsR (random colors). 5.4 Approach Overview The planner presented in Section 5.3 does not leverage critical semantic infor- mation about terrain types. In this work, we assume an initial point cloud is given in the formM = {ei}, where each element e is a tuple (mi,pii ,σi). Here, mi ∈ R3 as before; pi = (pi1, pi2, . . . , piC) is a vector specifying the probabilities that the point belongs to each one of the C possible semantic classes (gravel, water, etc.); σi = (σi1, σi2, . . . , σiC) is a vector specifiying the uncertainties of p i, as discussed in Section 5.3. Points are initialized with uniform semantic probabili- ties and maximum uncertainties. Updating the semantic point cloud is discussed in Section 5.5.2. Fig. 5.2(b) shows a pointcloud obtained in a real (ambiguous) environment, where each point mi is associated with the color corresponding to the class label j whose pij is maximum. We assume the semantic classes have been partitioned into two sets: the safe set S (e.g. gravel, grass) and the unsa∑fe set U (e.g. water, sno∑w). For each point mi, the points are defined as pi i iS = j∈S pj , pU = 1 − piS = j∈U pij , and 95 √∑ √∑ σi = min( i2 i2j∈S σj , j∈U σj ). Each point is then classified as: • safe if piS − wσσi ≥ θs; • unsafe if piU − w σiσ ≥ θu; • unclear otherwise. Intuitively, this implies that points are safe/unsafe given high probability (piS , piU ) and low uncertainty (σ i), and unclear otherwise. wσ, θs, θu are defined by the mission planner (with 1− θs < θu). We useMsafe,Munsafe,Munclear to denote the partition ofM obtained from the above classification. Note that a point is labeled safe (unsafe) even with large uncertainty on pi (piS U ), provided there is small uncertainty on piU (p i S). This is captured by the min in the definition of σ i. For example, the network may be uncertain between gravel and grass (both safe), but it is sure that the point is neither water nor snow (both unsafe). Consider now a trajectory π̂ = [N k], and recall that φ(N k) denotes the set of points on which the robot lies when at pose TMRk . Depending on the semantic information initially available, it might be very difficult –if not impossible– to immediately find a trajectory whose node points φ(N k) all belong to Msafe. DeepSemanticHPPC works in two stages: 1) Compute a set of candidate paths traversing different unclear regions, and 2) Reduce the uncertainty of such paths by taking new views in the proximity of the robot’s starting position of the most promising path. We relax the path planning problem in a natural way – instead of reaching a specific goal pose, we require the robot to reach a goal pose region G defined 96 around TMG. Then, the points inMunclear are organized into a set R of unclear regions. To build the set R, we use the following two-stage clustering process: first, DBSCAN [38] performs a large-scale clustering of the points inMunclear, obtaining a set of large unclear regions R̂. Then, the points of each r̂ ∈ R̂ are further partitioned according to their most likely class (treating the points not associated with any prediction as belonging to a special class); DBSCAN is called again on each partition. Fig. 5.2(c) shows the result of this process on our example with θs = 0.9, θu = 0.3, wσ = 3. The remaining task is to compute a set of candidate paths from TMS to G. Section 5.5.3 presents a variant of a standard RRT algorithm which constructs multiple hypothesis for the safest path, traversing different unclear regions. These paths are stored in the directed graph G = (V,A), where each v ∈ V is associated with a potential trajectory node N v and each a ∈ A represents the existence of a short trajectory segment connecting two poses. The cost for each n∑ode is based on how far it is from satisfying our safety constraint: p̄v = 1 | N v | i∈φ(N v) min(1,max(0, θ i i s− pS +wσσ )) for for each v ∈ V . However, if theφ( ) node region sufficiently intersects with an unsafe region, the cost is infinite; and if the node lies entirely in a safe region, the cost is zero. Summarizing, c(v) is defined as:  0 if |Msafe ∩ φ(N v)| = |φ(N v)| c(v) = ∞ if |M ∩ φ(N v unsafe )| ≥ φ (5.3) vp̄v otherwise, where φv is a user-defined threshold. The first condition can be relaxed for the 97 nodes in close proximity of the starting pose. Although a vertex with infinite cost is never obtained when the candidate paths are initially computed, its cost might tend to infinity when additional views are taken during the NBV stage (Section 5.5.4). A predefined number of NBV iterations are run. At each iteration, the m most promising (shortest) paths according to the above cost function are computed by a k-shortest paths algorithm (we use Yen’s [184]). The associated vertices v such that 0 < c(v) < ∞ are then considered in the function that computes the best additional view. The value of m is also decided by the mission planner: m = 1 corresponds to an aggressive setting, while m > 1 is preferred given a large temporal budget for taking additional views. 5.5 Technical Details 5.5.1 Predicting Semantic Labels We curate a segmentation dataset for outdoor navigation in unstructured envi- ronments from the existing large-scale COCO panoptic dataset [76]. First, images of unstructured outdoor scenes are selected using a Places365 [191] classifier. An image is kept if (a) the classifier’s top one (highest) prediction is an unstruc- tured outdoor category with > 50% probability, or (b) two or more of the top five predictions are unstructured outdoor categories. Next, the 133 categories in COCO panoptic are merged: (a) all outdoor terrains (e.g. grass, dirt, snow, pavement) are retained; (b) obstacles are merged into four categories: fixed ob- stacles (e.g. buildings), moving human-made obstacles (e.g. vehicles), humans, and animals; (c) all indoor categories are removed. Our final dataset consists 98 of 34K training images with 22 categories. Our navigation segmentation cate- gories (from COCO panoptic), and filtered list of COCO panoptic images are at: https://deepsemantichppc.github.io For our network architecture, we use DeepLabv3+[24] with Xception65 [25] backbone augmented with dropout in the middle and exit flow blocks for se- mantic segmentations. At inference time, 50 forward passes are used to predict semantics and uncertainties (Eqs. (5.1)-(5.2)). 5.5.2 Associating Semantic Labels to a Point Cloud Given a viewpoint with known pose and camera intrinsics, image segmentation probabilities and uncertainties are mapped to the point cloud. To map pixel (i, j)X : 1. Estimate depth map DX for view X. 2. Each pixel is backprojected to a single point corresponding to the center of the projected pixel. This approximation does not hold for pixels with very large depths, but typically produces good results. Backprojected pixel point m̃(i,j)X is computed as: m̃(i,j)X (i,j) = PX [DX I3×3|[0 0 1] T ]TK−1X [i j 1] T (5.4) where KX , PX are intrinsic and pose matrices. 3. Each backprojected point is merged with its nearest neighbor mnn in the point cloudMwithin threshold distance R. If a backprojected point does not have a neighbor within the threshold, the point is discarded. 99 4. p(i,j)X and σ(i,j)X are merged with existing measurements for point mnn. The combined measurement is the best linear unbiased estimator under the simplifying assumption that the per-class predictions are independent. Given a set of K measurements {(p(k),σ(k))}, the combined measurement (p̃, σ̃) for class c is: ( ∑ √√K √√∑ )K1p̃ = w(k)p(k)c c , σ̃c = (w(k))2(σ(k))2Z k=1 k=1 ∑ (k)(σ )−2 ∑where w(k) cc = , Z s.t. p̃c′ = 1 (5.5)(k) (σ )−2c′ c′∈C c′∈C 5.5.3 RRT-Based Multi-hypothesis Planner The algorithm to compute G = (V,A) starts by building an initial RRT from a root vertex vs associated with the projected start pose TMS via a predefined set of motion primitives. Instead of building two RRTs as is done in [79], we build a single RRT from the start pose and bias the sampling to the point that is closest to the projected ideal goal pose. Sampling is performed on the points inM not lying in the forbidden points set F , which is initialized as F ←Munsafe. Once the first path π̂ = [N v] is found, the algorithm examines which regions in R are traversed by the vertices of π̂ by checking their intersections with each set of points φ(N v). Except for regions containing points belonging to the K-nearest neighbors of TMS and TMG, all regions traversed by π̂ are placed into the removal candidates set C. A heuristic is then used to decide which region(s) should be removed from subsequent planning stages. In this work, we simply remove the largest region ĉ, 100 and F is updated as F ← F ∪ ĉ. When ĉ is removed, the RRT vertices lying on it, including those of π̂, are removed from the RRT. The algorithm then proceeds to optimize and find a new path to the goal, by continuously expanding the RRT component containing vs. This time, however, the algorithm also tries to connect (by computing ad hoc trajectory segments) new vertices to those that were disconnected from the RRT due to the removal of ĉ. When a new path is found, the process repeats. If at any iteration, the algorithm finds a path only contained in the regions of TMS and TMG, it can be restarted with a different random seed. Graphs produced by different random seeds are merged at the end. Fig. 5.3 shows an example multi-hypothesis graph computed on the example of Section 5.4 (Fig. 5.2). If any path computed on the multi-hypothesis graph G = (V,A) has zero cost, the robot starts to follow that path since all the underlying points lie inMsafe. Otherwise, the robot enters the NBV stage described below. Figure 5.3: An example graph G = (V,A), with poses. Vertices with c(v) = 0 are in green, while vertices with 0 < c(v) ≤ 1 are in red (the darker, the closer to 1). The blue, green, and red axes indicate the pose of the robot. 101 5.5.4 Next-Best-View (NBV) Planning A set of n viable NBV poses Tviable = {TMV1 , . . . ,TMVn} is computed by growing a RRT starting from TMS. The candidate poses Tviable are a subsample of the RRT vertices. All points lying withinMunclear are treated as unsafe, and sampling is performed on the safe points lying within a given radius r from TMS. The robot should not travel too far to take a new view; otherwise it is more appropriate to follow the most promising path of G = (V,A). A reward J(TMV ) is calculated for each candidate pose TMV . The NBV posej j TNBV is selected by picking the pose with the highest value of J that also allows a safe path back to the current pose from which all the paths can be followed. J(TMV ) = βdD + βγγ + βvisNvis + βQQ, (5.6)j where D is a distance metric, γ is a viewing angle metric, Nvis is the number of visible vertices from TMV , and Q is the average information gain over visiblej vertices from TMV . The weight βd, βγ, βvis, βQ sum to 1 and D, γ, Nj vis, and Q are normalized. Distance and change in viewing angle are used based on the assumption that closeness and view diversity will reduce vertex uncertainty. Nvis puts more weight on candidate poses which have higher chances of reducing the uncertainty of multiple segments of the multipath graph G = (V,A). Q represents the expected reduction in uncertainty in the graph vertices given TMV . Each of these components are defined as follows.j Begin by defining the set of vertices v ∈ VNBV, where VNBV is the set of unclear vertices belonging to the m most promising paths (see the end of Section 5.4). For each candidate pose TMV ∈ Tviable, only vertices visible from TMV arej j 102 considered in calculating the reward. Visible vertices are defined as vertices which occupy greater than a predefined number of pixels in the image plane rendering of the point cloud from TMV . Visible vertices are added to the setj v ∈ Vvis,j , where Vvis,j ∈ VNBV. To calculate D, the distances from TMV to v ∈ Vj vis,j are normalized. Since a lower distance should correspond to a higher reward, we subtract the normalized distances from 1. To calculate γ: the set of negative cosine distances of the angle between TMS and TMV to v ∈ Vj vis,j are used. Nvis is the size of Vvis,j . The information gain metric Q(v) is calculated for each v ∈ VNBV. Q(v) represents the expected reduction in uncertainty for each v ∈ VNBV, and is a function of the visibility and uncertainty of the points lying in φ(N v). The visibility I(v,TMV ) of a vertex v ∈ VNBV, given a candidate pose Tj MV ,j is the pixel coverage of φ(N v) in the rendered image plane of TMV . Per-pointj bounding squares of size equal to half the point cloud resolution are used to compute occlusions and pixel coverage for surface points. The number of pixels that φ(N v) occupies in the rendering is the predicted visibility I(v,TMV ) of v atj TMV .j The uncertainty σ(v) of a vertex v ∈ VNBV is the average sum of the uncertain- ties of the points in φ(N v): C 1 ∑ ∑ σ(v) = σj |φ(N v)| i (5.7) i∈φ(N v) j=1 The information gain metric Q(v) of vertex v ∈ VNBV can be formally written as 103 Q(v) = αiI(v, TMV ) + α σ(v), (5.8)j σ where αI , ασ are weights that sum to 1 and I(v,TMV ) and σ(v) are normalized.j 5.6 Validation 5.6.1 Validation Scenes and Overview (a) (b) Figure 5.4: (a) Image from Cass Park in Ithaca. There are multiple different terrains in the scene including grass, mud, and water. The point cloud is anno- tated with safe (blue) and unsafe (red) regions. (b) Image from Mann Library in Cornell with similarly annotated safe/unsafe regions. For real world validation, we collect data of two different scenes using the ZED stereo camera from Stereolabs. One scene is at Cass Park in Ithaca (Fig. 5.4(a)) and the other scene is next to the Mann Library in Cornell University (Fig. 5.4(b)) and These scenes are selected due to their varying (but common) terrain types. The scenes are representative of common unstructured outdoor environments. The ZED camera API is used to extract depth maps and generate point cloud reconstructions of the scenes. For these scenes, we heuristically select a set of candidate NBV poses instead of growing a RRT from TMS, and assume a path between these poses and the start 104 pose exists. This is because we did not implement the algorithm onboard a robot for real-time NBV selection, and we could not feasibly sample all possible NBVs from the test scenes prior to our experiments. Candidate NBV poses are selected to be (a) near the start pose and (b) oriented in the general direction of the goal while covering a wide range of the scene. Our method (and baseline methods) choose NBVs from the set of candidate NBV poses. We did also implement our full pipeline in the AirSim simulator [143] and confirmed that the RRT-based NBV selection module works as intended. Nevertheless, our path uncertainty and safety evaluations will be focused on the real world scenes as the simulated scenes are far more simplistic than the real world. 5.6.2 Multipath Planner Evaluation The multipath planner is evaluated on the simulated Africa dataset from Airsim. To evaluate the ability of the multipath planner to plan a diverse set of paths, we compare the number of paths planned to the number of iterations the RRT in the multipath planner runs for. The number of iterations is 5000 per trial. 300 trials were run with uniformly sampled starting and goal poses within a ten meter radius of a predefined start and goal anchor locations. Due to the exclusion of the forbidden regions F after every new path is found, Table 5.1 demonstrates the ability of the multipath planner to plan a spatially diverse set of paths. A boxplot illustrating the median, quartiles, and outliers is shown in 5.5. RRT Iterations 1000 2000 3000 4000 5000 # Paths (Mean) 0.65 1.42 2.39 3.82 5.14 # Paths (Std) 0.77 1.49 2.02 2.46 2.97 Table 5.1: Mean and standard deviation of the number of paths found over 300 trials. 105 Figure 5.5: Boxplots of the number of paths found at different number of itera- tions of the RRT over 300 trials are shown. The outliers are shown as black circles. 5.6.3 NBV evaluation To evaluate the performance of the NBV function, we examine the change in uncertainty of the path vertices as the number of NBVs increases. The complete NBV reward function (Eq. 5.6) is compared against: (a) random selection, (b) geometry-only reward, and (c) uncertainty-only reward. For the geometry-only NBV reward, we set {βQ} to zero, and for the uncertainty-only NBV reward, we set {βd, βγ, βvis} to zero. The change in uncertainties summed across all the classes and points for each path vertex averaged over 500 trials is shown in Fig. 5.6. In our experiments, the full NBV reward weights are set as follows: 106 {βd = 0.4, βγ = 0.05, βvis = 0.25, βQ = 0.3, αI = 0.5, ασ = 0.5} for the Mann Library scene, and {βd = 0.15, βγ = 0.05, βvis = 0.2, βQ = 0.6, αI = 0.3, ασ = 0.7} for the Cass Park scene. A higher weight is assigned to uncertainty terms for the Cass scene because the boundaries between surface types (e.g. water, mud, grass) are more ambiguous than in the Mann scene. For both scenes, the full NBV reward function consistently achieves the lowest uncertainty with 2 or more NBVs (Fig. 5.6). This illustrates the importance of both geometry and uncertainty terms in the reward function. In the Mann scene, the baseline reward functions converge to a higher uncertainty than the full reward function. Due to the small size of the scene and the nature of all candidate NBVs being oriented towards the goal pose, random selection performs quite well. It initially outperforms uncertainty-only which does not take into account point visibility or viewpoint diversity, while random sampling implicitly selects diverse views. In the Cass scene, the baseline reward functions converge to a higher uncertainty than the full reward function, except for uncertainty-only which converges to the same point as the full reward function. The overall uncertainty in the BNN predictions are much higher at Cass park, so heavily weighting the uncertainty in the reward function performs well. In the Mann scene, geometry-only outperforms uncertainty-only, whereas the opposite result holds for the Cass scene. Mann has less inherent ambiguity so geometry terms are more important, while Cass has more ambiguous regions so uncertainty terms are more important. 107 Figure 5.6: Change in uncertainty of path vertices (y-axis) as the number of NBV measurements increase (x-axis). 5.6.4 Path Safety Evaluation To evaluate the real world application of DeepSemanticHPPC, we study the safety of selected paths. We annotate a point cloud (Fig. 5.4 (a)/(b)right) with MeshLab [26] into ground truth safe and unsafe regions. For Mann library mulch (dirt) is labeled as unsafe, and for Cass Park mud (dirt) and water are labeled as unsafe. Any path which contains a vertex that overlaps with the unsafe region with over Nunsafe points is unsafe. We set Nunsafe = 4. Two baselines are considered: (a) B1: planning without semantic information (based on [79]) and (b) B2: planning with semantic information from a single initial view without taking any NBV measurements to reduce path uncertainty. We also study the performance of DeepSemanticHPPC as the number of NBVs increase. Table 5.2 and Figure 5.7 shows the path safety results for the Mann Library and Cass Park scenes over 500 trials. Since both safe and unsafe terrain surfaces can be geometrically similar, baseline B1 cannot reliably avoid unsafe semantic regions. Baseline B2 performs significantly better than B1 because of the inclu- sion of semantic surface types in the planner. However, because the semantic segmentations can be incorrect, especially in regions with high uncertainty, this 108 Mann B1 B2 1N 2N 3N 4N 5N Safe % 0 23.4 81.6 79.8 81.4 83.6 86.0 Unsafe % 100 76.6 18.4 15.4 11.2 6.6 3.8 CS % N/A N/A 0 0 4.0 18.0 28.0 CN % N/A N/A 0 4.8 7.4 9.8 10.2 Cass B1 B2 1N 2N 3N 4N 5N Safe % 13.2 33.4 54.0 57.4 57.4 59.0 59.2 Unsafe % 86.8 66.6 46.0 41.6 39.8 37.4 36.6 CS % N/A N/A 0 0 0 0 0 CN % N/A N/A 0 1.0 2.8 3.6 4.2 Table 5.2: 500 trials of path safety evaluation. The columns are the path planning methods used: B1 is the planner based on [79], B2 includes semantic reasoning without any next-best-views (NBVs), and XN is DeepSemanticHPPC (ours) with X NBVs. The rows are the metrics: Safe is the number of trials where the final selected path is safe, Unsafe is the number of trials where the final selected path is unsafe (lower is better), CS is the number of trials where a safe path is confirmed with sufficiently high confidence prior to selection, CN is the number of trials where all multipaths are confirmed as unsafe (so no paths are selected). Mann Library Cass Park Figure 5.7: 500 trials of path safety evaluation. B1 is the planner based on [79], B2 includes semantic reasoning without any next-best-views (NBVs), and XN is DeepSemanticHPPC (ours) with X NBVs. Green represents trials where a safe path is selected; Red represents trials where an unsafe path is selected; and Blue represents trials where all paths are determined to be unsafe (so no paths are selected). More green and blue is better. planner still plans over unsafe terrain. Our DeepSemanticHPPC framework is significantly better than the two base- 109 lines. NBVs allow the planner to discard unsafe paths as semantic uncertainties decrease. With just one NBV, the percentage of safe paths taken increases drasti- cally from 23.4% to 81.6% (Mann) and increases from 33.4% to 54.0% (Cass). As NBVs increase, the percentage of safe paths selected generally increases while the percentage of unsafe paths selected decreases. With 5 NBVs, 86% (59.2%) of paths selected for Mann (Cass) are safe, and only 3.8% (36.6%) of paths selected for Mann (Cass) are unsafe. With 5 NBVs, uncertainty is sufficiently reduced so that in 10.2% (4.2%) of trials, all multipaths are confirmed to be unsafe for Mann (Cass) and no path is selected. The complexity of the Cass scene is reflected in these results. 5.7 Discussion and Future Work In this chapter, we presented DeepSemanticHPPC, a novel framework for plan- ning in unstructured outdoor environments while accounting for uncertain terrain types. Our framework plans multiple feasible paths to a goal location and attempts to select a safely navigable path. For unstructured outdoor environ- ments, existing planners have focused on assessing navigability via geometric constraints such as terrain steepness. Yet, some terrain types, such as mud or wa- ter, may be unsafe for traversal for robots and cannot be easily detected through environment geometry alone. In this work, we utilized a semantic segmentation model to perceive terrain and obstacles in the environment. However, the per- ception model may fail to accurately parse regions of the scene due to occlusions, unsuitable lighting, or unfamiliar features different from what was encountered during training. Therefore, we proposed to account for model uncertainties during planning, as highly uncertain regions may be correlated with incorrect 110 semantic predictions. To refine ambiguities, the framework focuses on selecting new views of the world to measure that reduce uncertainties along potential paths to the goal. Our real world experimental results showed that DeepSe- manticHPPC reduces semantic uncertainty in planned paths and increases the safety of paths planned in environments with unsafe terrains. In particular, this framework greatly outperforms planners that focus only on terrain geome- try, or planners that do not attempt to take intelligent measurements to reduce perception uncertainty along planned paths. In the future, it would be interesting to implement DeepSemanticHPPC on a robot for real-time navigation experiments. In this work, we have assumed a point cloud map of the environment already exists to separate reasoning about semantics from noisy geometry. However, real-time exploration of an unknown environment requires online mapping and reconstruction as well. It would be interesting to explore the ability to build the point cloud online and incorporate scene geometry uncertainties into this pipeline. Notes and Acknowledgements. We thank Eric Wu and Hadi AlZayer for in- sightful discussions and assistance with preliminary software implementation. 111 CHAPTER 6 CONCLUSION For both scientists working at the cutting edge of research and consumers en- joying everyday technology, computer vision is playing an increasingly larger role in modern society. Given the diverse set of environments in which systems that rely on visual perception are deployed, it is pertinent to design systems that are able to perform well in different settings. In this dissertation, we explored several problems related to building such robust visual perception systems for real-world deployment. In the first part of the dissertation, we explored efficient data annotation as a means of creating large-scale datasets that better represent the image distribu- tions found in the real world. A salient challenge for robust deep learning in the real-world is in overcoming distribution shifts between the training data and the deployment environment. Large-scale data annotation is time-consuming and expensive, with high quality labels typically produced by skilled expert workers. We designed a method that is friendly to lower-skilled crowdworkers, while also producing annotations that can effectively supervise computer vision models. In the next two parts of the dissertation, we explored paintings as an alter- native image distribution from natural photographs for deep learning systems to be applied to and trained on. Paintings are created by humans for humans – as such, they represent an interesting collection of images to be analyzed, con- taining insights into both human culture and visual perception. We further explored the role of paintings as a source of data for learning better models intended for deployment in real world settings. As paintings are semantically meaningful without necessarily being photorealistic, they implicitly represent 112 semantics-preserving image transformations that are applied to some set of nat- ural photographs. Learning from images altered by these transformations can encourage computer vision models to be invariant to such transformations. We studied the robustness of models trained on paintings and images transformed by style transfer (which are handcrafted or learned photo-to-painting transfor- mations); we showed that real paintings induce different invariances in visual perception models than style transfer. In the last part of the dissertation, we explored how noisy visual perception can be used as a building block in a larger system to guide an autonomous agent for path planning and safe navigation. Reasoning about model uncertainty and capturing new measurements of the environment to reduce uncertainty allows our proposed framework to plan safe paths in environments with initially unknown terrain types, despite potentially incorrect model predictions. The problem of building robust perception systems for real-world environ- ments is extremely challenging, and there are many fruitful directions to explore. As the amount of visual data in the world grows everyday, it is important to continue exploring methods to better take advantage of these images and videos. Which data samples can bring our training data distribution closer to the test dis- tribution? Active learning and rare example mining can be used to find the most meaningful data samples to label. Better semi-automated annotation tools can ease the workload of workers who are tasked with high quality data annotation. Learning directly from unlabeled data without any human annotations through self-supervisory signals found in images or videos is a promising area of research; combining insights from unsupervised learning with supervised learning can lead to better semi-supervised frameworks. It is also useful to consider a variety 113 of image distributions during training, and to consider benchmarking models against a wider set of test distributions when designing new computer vision models. Domain generalization research has focused on both building algorithms that can learn from multiple distributions, as well as constructing meaningful benchmarks that better capture the wide variety of real world environments that perception models can be applied in. Robust features can be learned through various data augmentations or new model architectures with different inductive biases. Finally, I believe that continual learning, online characterizations of model uncertainties, and anomaly detection are important research directions to follow – models will inevitably face new, uncertain situations, and it is valuable for them to have the ability to adapt to new information and to provide accurate feedback about their confidence in new situations. 114 APPENDIX A APPENDIX: EFFICIENT IMAGE ANNOTATION FOR SEMANTIC SEGMENTATION This is an appendix for Chapter 2. In Sections A.1 and A.2, we detail the parameters used in our experiments to reproduce our results. In Section A.3, we include additional visualizations of crowdsourced annotations and inpainted labels. A.1 Deeplabv3+ and Mobilenetv2 In our experiments, we use the open-source implementation of Deeplabv3+ found at https://github.com/tensorflow/models/tree/master/research/ deeplab. For our comparison against weakly supervised methods, we use the open-source implementation of Mobilenetv2 found at https://github.com/ tensorflow/models/tree/master/research/slim/nets/mobilenet. A.1.1 Architecture For Deeplabv3+, we use Xception65 [25] as the backbone. ASPP atrous rates are set to 6,12,18 for output stride 16 and 12,24,36 for output stride 8 as in [24]. Training uses output stride 16 while evaluation uses output stride 8. Mobilenetv2 does not use ASPP or decoder. 115 A.1.2 Training Procedure Training hyperparameters are based on [24]. All hyperparameters use settings in [24] unless otherwise noted here. Batch normalization parameters are not finetuned due to low batch size (GPU memory constraints). Batch normaliza- tion parameters are initialized and frozen with the pretrained checkpoint. All Deeplabv3+ experiments initialize the network with the official pretrained model on ImageNet + MSCOCO + Pascal VOC and are trained for 100K iterations. Ex- periments with Mobilenetv2 initialize the network with the official Mobilenetv2 (width 1.0, input resolution 224) ImageNet model. We use polynomial learning rate decay as in [24]. During training time, we ignore the loss for pixels whose predicted class has a softmax probability greater than 0.95. This can be consid- ered a form of hard negative mining where samples (pixels) with high confidence are ignored. Our preliminary experiments suggest that this appears to improve rate of convergence. In any experiments that use fewer than 100% of the images in the training dataset, the images are shuffled and the first N images are selected. The shuffling order is determined once and fixed across experiments. Cityscapes. Deeplabv3+. Learning rate 0.0005. Batch size 2 with input crop size 769× 769. ADE20K. Deeplabv3+. Learning rate 0.001. Batch size 4 with input crop size 513× 513. Pascal VOC. Mobilenetv2. Learning rate 0.001. Batch size 16 with input crop size 513× 513. Batch norm parameters are finetuned. 116 A.1.3 Evaluation Procedure After training for a fixed 100K iterations (no early stopping), models are tested on the validation set. 100K iterations is chosen based on settings in [24] which uses 90K iterations for Cityscapes. For consistency, 100K iterations is also used for ADE20K and Pascal VOC. Evaluation is performed at single scale without flipping on full resolution images. For completeness, we report here the valida- tion mIOU when the network is trained out-of-the-box with the full training data using the outlined procedure. Cityscapes. Validation images are padded to 1025× 2049. Validation mIOU is 77.7%. ADE20K. Validation images are padded to 513 × 513. Validation mIOU is 37.39%. Pascal VOC. Validation images are padded to 513× 513. Validation mIOU is 69.6%. A.2 Block-Inpainting Model The block-inpainting model is based on Deeplabv3+ with some architectural and training modifications. 117 A.2.1 Architectural Modifications The input to the block-inpainting model is a tensor of shape h × w × (3 + K) where K is the number of classes in the dataset (see main text). The weights of the first layer must be expanded to accommodate the K additional channels. These weights are initialized from a unit normal distribution. For each dataset, the weights are initialized once and then fixed for every experiment. To compute uncertainty, dropout is added to the middle and exit flow blocks of the Xception65 backbone. The dropout keep probability is 0.8 as in [86]. No dropout is added to the decoder as preliminary experiments suggest that this will degrade performance. A.2.2 Training Details During training time, the block-inpainting model is trained with randomly drawn block hints from the set of block-annotated images (see main text). The block-inpainting model is trained for 100K iterations following section A.1. The entire set of block-annotated images are used as training targets. A.2.3 Inference Details At inference time, the block-inpainting model keeps dropout activated to estimate uncertainty [45]. The block-inpainting model outputs are averaged over 100 forward passes (100 is found to be sufficient in [44]) to form the final prediction. The uncertainty is computed by taking the sample variance of the softmax 118 probabilities for the predicted class (see main text). A.3 Additional Visualizations We show samples of crowdsourced annotations and block-inpainted labels. A.3.1 Crowdsourced Annotations (SUNCG/CGIntrinsics) Figure A.1 shows a sample of five annotated images. This figure is best viewed in color on screen with high zoom. See main text for annotation details. For each image, segments from block annotation and segments from full annotation are shown. Synthetic labels are assigned using majority ground truth voting. Note that assigning synthetic labels in this way will cause detailed crowdsourced segmentations to be lost. See main text for estimate of cost to assign labels. For regions without segments, “void” label is assigned (color is black). For comparison, the dataset ground truth is shown in the final row. With block annotation, notice that workers segment small regions (e.g. the stool in image 1, row 2; the chair back in image 2, row 2; and the faucets in image 4, row 2) and oversegment regions (e.g. the cushions on the couch in image 3, row 1). With full annotation, notice that workers miss large regions, perhaps due to fatigue (e.g. the right window in image 3, row 4). 119 A.3.2 Crowdsourced Annotations (Cityscapes) Figure A.2 shows a sample of three block-annotated images. This figure is best viewed in color on screen with high zoom. Regions that are not block-annotated are masked out in this visualization. At annotation time, the worker sees the entire image for context (see main text). See main text for annotation details. For easier comparison against Cityscapes, crowdsourced segments are col- ored. Colors are random because class labels have not been assigned to crowd- sourced segments (see main text for estimate of cost to assign labels). Crowd- sourced segments with synthetic class labels are also included. Synthetic labels are assigned by taking the majority class label for the expert-labelled pixels in the crowdsourced segment. Note that assigning synthetic labels in this way will cause detailed crowdsourced segmentations to be lost. These visualizations are included to provide a better sense of the crowdsourced segments across the entire image rather than within individual blocks. Cityscapes segments are colored by class with “void” labels masked out. In the main text, we compare the number of segments to expert full-image annotation. To compare the number of segments, we compute the number of label connected-components in expert annotations. Blocks with more than 50% void expert labels are ignored for a fair comparison. A.3.3 Block-Inpainted Labels We show set of samples of automatically assigned labels by the block-inpainting model (trained and tested with Block-50%) in figures A.3, A.4. For compari- 120 Figure A.1: SUNCG/CGIntrinsics Annotation Samples. Top to bottom: (Row 1) Crowdsourced blocks (boundaries). (Row 2) Crowdsourced blocks (synthetic labels). (Row 3) Crowdsourced full (boundaries). (Row 4) Crowdsourced full (synthetic labels). (Row 5) Ground truth. NOTE: Synthetic labels are the majority ground truth label for pixels in each segment. This means finely segmented crowdsourced segments (such as cushions on couches) will be lost in visualiza- tion. White dotted boxes highlight examples where block annotation qualitatively outperforms full annotation. son, we show the human expert labels and the agreement between the block- inpainting model labels and the human labels (agreement is in white). With uncertainty threshold 0.2 on Cityscapes and 0.4 on ADE20K, over 94% of the pixels in the images are labelled by the block-inpainting model. 121 Figure A.2: Cityscapes Annotation Samples. Top to bottom: (Row 1) Crowd- sourced (boundaries). (Row 2) Crowdsourced (randomly colored). (Row 3) Crowdsourced (synthetic labels). (Row 4) Expert Cityscapes. NOTE: Synthetic labels are the majority expert label for pixels in each segment. This means finely segmented crowdsourced segments (such as sky between leaves) will be lost in visualization. 122 Figure A.3: Block-Inpainting Cityscapes Samples. Top to bottom: (Row 1) Orig- inal image. (Row 2) Human labels. (Row 3) Inpainted labels (all). (Row 4) Agreement (row 3 vs row 2). (Row 5) Inpainted labels (<20% relative uncer- tainty). (Row 6) Agreement (row 5 vs row 2). Void labels and rejected inpainted labels are masked out. 123 Figure A.4: Block-Inpainting ADE20K Samples. Top to bottom: (Row 1) Original image. (Row 2) Human labels. (Row 3) Inpainted labels (all). (Row 4) Agreement (row 3 vs row 2). (Row 5) Inpainted labels (<40% relative uncertainty). (Row 6) Agreement (row 5 vs row 2). Void labels and rejected inpainted labels are masked out. 124 APPENDIX B APPENDIX: LEARNING ROBUST NATURAL IMAGE RECOGNITION FROM PAINTINGS This is an appendix for Chapter 4. This appendix provides additional details to enable reproducibility, and additional visualizations and results to complement the main findings of the chapter. In Section B.1, we specify the creation of the Materials dataset. In Section B.2, we detail the experimental setup for the classifi- cation robustness experiments. In Section B.3, we describe the implementation and parameters used for style transfer, and we show visualizations of these methods in Section B.4. In Section B.5, we extend the discussion in Section 4.5 of the main text by analyzing the effect of stylization strength versus robustness. In Section B.7, we visualize the power spectra of stylized images and compare them to natural images. Finally, in Section B.8, we frame model robustness as domain generalization and discuss how domain-invariance can affect model robustness. B.1 Materials Dataset Details In the main text, we have briefly described the two primary datasets on which we focused our experiments. PACS [91] is a standard benchmark dataset while Materials is a novel dataset of photographs and paintings that was created by sampling image patches from existing datasets with material annotations. In this section, we give additional information on the creation of Materials. This dataset is released for reproducibility at https://github.com/hubertsgithub/ style_painting_robustness. 125 Natural photographs. We acquired image patches from OpenSurfaces[10], COCO stuff [17], and MINC-2500 [11]. To create image patches for image classifi- cation from segmentation annotations, we constructed bounding boxes around segments, and cropped out these bounding boxes to form image patches. We con- structed square bounding boxes with side length equal to 150% of the minimum side length of tight bounding box around the segment. Non-tight bounding boxes are used since it is important to include some context for the patch. We also sampled from MINC-2500 which already contains annotated image patches that do not require additional processing. Image crops that extend beyond the boundary of the full image are padded to square with ImageNet mean padding, and all final images patches are resized to 224×224. We sampled from Open- Surfaces and MINC first, before sampling from COCO if necessary. We created subsets of data of up to 60K photos, and each subset was created to be as-class- balanced-as-possible. For illustration, we provide per-class counts for two such subsets of data in Table B.1. Natural-10K Count Natural-60K Count Ceramic 1000 Ceramic 3132** Fabric 1000 Fabric 8006 Foliage 1000 Foliage 8006 Glass 1000 Glass 7216** Liquid 1000 Liquid 7174** Metal 1000 Metal 7204** Paper 1000 Paper 3258** Skin 1000 Skin 2276** Stone 1000 Stone 5716** Wood 1000 Wood 8006 Table B.1: Training datasets are sampled to be as class-balanced as possible. ** indicates that all training samples of that category are included in the training set, and no further samples exist. Natural-10K is a subset of Natural-60K. The test set contains 200 samples of each category. 126 Paintings. We sample paintings across the same material categories as above from [162]. We only sample patches that are at least 128×128 pixels in area to avoid very low-resolution annotations. The image patches are padded and resized in the same manner as above, and data is also sampled to be as-class- balanced-as-possible. B.2 Classification Parameters For all classification experiments, we use the following setup. • Network architecture: ResNet18, ImageNet pretrained. • Training hyperparameters: 30 epochs with initial learning rate (LR) 1e-3, LR reduced to 1e-4 at epoch 24. The LR of the classification layer is increased by 10×. • Optimizer: SGD with 0.9 momentum. • Training data augmentation: horizontal flipping, random scaling, color jitter, and ImageNet normalization. • For experiments that train a model on both photos and stylized photos, all photos are stylized exactly once offline and included in the training set as an independent image from the original photo. • Evaluation accuracies are averaged over 3 independent runs for each ex- periment. Our results do not appear sensitive to the choice of training hyperparame- ters. Therefore, we train all networks with this configuration and evaluate the final model after training. Our experiments suggest that this training schedule 127 is sufficient for convergence without overfitting across all datasets we experi- mented with. Increasing training epochs to 100 or more does not improve results. Increasing number of training epochs is required if starting from random initial- ization, but ImageNet pretraining is standard practice so we do not extensively experiment with random initialization. B.3 Style Transfer Parameters For all applications of style transfer used in this work, we use pretrained models from publicly available implementations. The sources are provided here: • AdaIN [68]: https://github.com/bethgelab/stylize-datasets • ETNet [150]: https://github.com/zhijieW94/ETNet • TPFR [155]: https://github.com/nnaisense/conditional-style-transfer • SACL [140]: https://github.com/CompVis/adaptive-style-transfer Our initial experiments showed that applying style transfer at 224×224 res- olution yielded visually poor results (except for AdaIN). Therefore, we apply style transfer at a higher resolution and downsample the final result to 224×224. For AdaIN, ETNet, and SACL, we apply style transfer at 768×768 resolution. For TPFR, we apply style transfer at 512×512 instead of 768×768 due to GPU memory constraints. All other hyperparameters are set to the default settings found in the implementations for each respective method. 128 B.4 Visualizations of Stylized Photos We show examples of images stylized by various style transfer methods on PACS (Fig. B.3) and Materials (Fig. B.4). The visualizations also include examples of intradomain stylization in which images are stylized by photos instead of by paintings. Notice that intradomain stylization yields stylizations that are, in general, visually similar to stylizations with painting style images. Overall, stylizations across all methods are holistically similar to natural paintings. B.5 Style Distance vs Robustness In Section 4.5, we found that arbitrary stylization with style images that share the same semantic content as the content image (“intraclass stylization”) results in lower gains in robustness. Since images with similar semantic content may be more visually similar, this suggests that intraclass stylization will lead to less stylized images, i.e. weaker augmentation. To verify this, we measured style differences via the Gram matrix distance between stylized images and their original counterparts. Table B.2 summarizes differences on PACS. While intraclass stylization does result in smaller differences in style for each method, the Gram matrix distance across methods is not necessarily correlated with gains in robustness. For example, ETNet produces the largest style differences overall, but previous results in Fig. 4.3 show that AdaIN improves robustness more than ETNet on PACS. As such, the strength of stylization alone is not indicative of the downstream robustness learned by models trained on these images. 129 Method Painting Intradomain Intradomain(Intraclass) AdaIN 1.58±0.93 1.28±0.79 1.16±0.85 ETNet 2.33±1.09 2.13±1.04 1.81±1.03 TPFR 1.52±0.90 1.38±0.87 1.27±0.91 Table B.2: Style (Gram Matrix) Distance. Gram matrices computed from Ima- geNet pretrained ResNet18 features on PACS. Mean distance between (image, stylized image) pairs is reported. ↑ distance implies ↑ style difference. ± de- notes standard deviation across 1.5K pairs. B.6 Biases of Stylization Algorithms As found in Section 4.5 of the main text, the choice of style images can affect the robustness gains from various arbitrary stylization methods. Beyond the choice of style images, the biases of style transfer models can have an impact on the final stylizations, and thus the robustness of models trained on these images. Therefore, we explore: • Hypothesis H1A: Stylization biases from different methods affect model robustness differently. To visualize the artifacts produced by style transfer, we stylize an image with itself as a style image in Fig. B.1. With a perfect decoder, this transformation should be the identity transformation. However, imperfections during encoding or decoding produce visible biases in the output image. For example, all of the models induce blurring, ETNet has prominent checkerboard artifacts, and TPFR induces a clear shift in color distribution. The result is that content images will be inevitably “stylized” by at least some fixed extent regardless of the style image with different biases from different stylization models. Table B.3 shows that training models on self-stylized images changes its robustness from training on 130 photos alone. Corruptions like blur and digital generally benefit from stylization biases while noise appears dependent on both the classification task (objects vs materials) and style transfer method. Answer to H1A: The biases encoded in different style transfer algorithms contributes to changes in model robustness, with different effects dependent on both style transfer algorithm and downstream classification task. Method Noise Blur Weather Digital Materials (30K Samples/Domain) Photo-Only 43.71 58.76 55.25 61.20 Photo + AdaIN 39.98 60.75 56.72 61.38 Photo + ETNet 45.61 59.92 56.44 66.96 Photo + TPFR 47.53 64.00 55.07 69.02 PACS (1.5K Samples/Domain) Photo-Only 62.64 72.75 83.24 86.33 Photo + AdaIN 58.69 74.69 85.59 87.92 Photo + ETNet 55.80 71.16 82.41 87.42 Photo + TPFR 60.85 76.56 85.10 87.58 Table B.3: Effect of Stylization Biases. Per-corruption accuracy for models trained on photos plus photos stylized by themselves. Self-stylization reveals stylization biases in arbitrary style transfer models. Notice that the robustness of models differs between different style transfer methods when self-stylization is applied. Figure B.1: Arbitrary Stylization Biases. Left to Right: Original image, Image stylized by itself using AdaIN, ETNet, and TPFR. Ideally, an image stylized by itself should not change. Notice that style transfer introduces artifacts, shifts in color, and other biases. 131 B.7 Power Spectra of Different Image Types In Section 4.7 of the main text, we found that SACL improves robustness against noise with imperceptible high frequency signals in the stylized images. Here we show the power spectra of stylized images and compare them to the spectra for natural photos and natural paintings. The radial power spectrum for an image is computed as: power(r) = ||X ||2r where Xr = √mean ||Xij|| i2+j2∈R(r) Xij are the frequency components given by the 2D discrete Fourier transform. Since (i, j) are discrete, the radial frequency component Xr is computed as an average over ||Xij|| for (i, j) that fall in a bin R(r). In Fig. B.2, we visualize the mean radial power spectra for natural photos, natural paintings, and SACL- stylized photos. We observe that stylized photos contain higher magnitude high-frequency components relative to natural photos and natural paintings. As noted in Section 4.7 of the main text, reducing the magnitude of sufficiently high-frequency components does not affect the perceptual quality of images. B.8 Domain-Invariant Feature Learning Our results from Section 4.6 of the main text provide evidence that models can learn more robust feature representations from the addition of paintings to a dataset of photographs. We can take this further by explicitly enforcing similar (or 132 Figure B.2: Power Spectrum of Images. Left: PACS, Right: Materials. The plots depict the mean power spectrum for different sets of images. Photos stylized by SACL have larger magnitude high frequency components than natural photos or natural paintings. domain-invariant) feature representations across photos and paintings. Domain- invariance is a common approach to the problem of domain generalization, where models are trained on multiple domains with the goal of generalizing to unseen domains, e.g. [51, 112]. In our setting, we can consider images with common corruptions to be the set of unseen domains. Perfect domain-invariant feature extraction can be harmful if it prevents useful features in photos from being extracted due to an underrepresentation of such features in paintings. Since the target task is recognition of photos, losing robust photo-specific signals can be detrimental. Therefore, we explore the following: • Hypothesis H2A: Explicitly learning domain-invariance from paintings and photos may negatively impact model robustness. We use an adversarial domain discriminator to learn domain invariant fea- tures [46, 112]. In Table B.4, we find that explicitly learning domain invariant features from paintings results in lower robustness than unrestricted feature learning with paintings. However, learning domain-invariant features does still 133 improve robustness over the photo-only baseline. Existing work in domain gen- eralization has shown that domain-invariance is an effective method for learning to recognize images from unseen domains, e.g. [112]. Our finding here suggests that in the special case of domain generalization to corrupted versions of natural photographs, it is advantageous to retain photo-specific features for recognition. This is consistent with our hypothesis and discussion above – an underrepresen- tation of any particular photo-specific features in paintings can result in such features being ignored entirely when domain-invariance is enforced, even if such features are useful for robust recognition. Answer to H2A: Explicitly learning domain-invariant features from paintings neg- atively impacts model robustness with respect to unrestricted feature learning with paintings. However, domain-invariant features do still improve robustness relative to photos only. Method MEAN Noise Blur Weather Digital Materials (30K Samples/Domain) Photo-Only 54.73 43.71 58.76 55.25 61.20 Photo + Painting 57.92 49.82 61.03 56.68 64.15 Photo + Painting (DA) 55.99 46.97 59.60 54.51 62.90 PACS (1.5K Samples/Domain) Photo-Only 76.16 62.64 72.75 83.24 86.33 Photo + Painting 78.99 68.04 74.72 86.26 86.92 Photo + Painting (DA) 77.44 68.86 72.59 84.09 84.23 Table B.4: Effect of Domain-Invariant Features. “DA” refers to feature learning with an adversarial domain discriminator loss [46]. Learning domain-invariant features (red) reduces robustness relative to unrestricted feature learning from paintings (blue), but still improves robustness over photo-only. 134 B.9 Additional Architectures We expect our findings to hold across architectures and datasets. As a sanity check, we have extended Table 4.2 with two additional architectures. The results (Table B.5) follow similar trends to those found in Table 4.2. For example, SACL outperforms both AdaIN and Paintings on Noise. 135 Resnet-18 Method Noise Blur Weather Digital Materials (30K Samples/Domain) Photo-Only 43.70±0.65 58.76±0.14 55.25±0.33 61.20±0.69 Photo + AdaIN 47.33±0.22 65.09±0.21 61.78±0.18 61.41±0.16 Photo + SACL 61.87±0.16 64.36±0.20 57.49±0.24 66.55±0.17 Photo + Painting 49.82±0.56 61.03±0.13 56.69±0.10 64.15±0.14 PACS (1.5K Samples/Domain) Photo-Only 62.64±1.48 72.75±0.04 83.24±0.22 86.33±0.14 Photo + AdaIN 70.17±1.70 81.18±0.20 88.37±0.23 89.32±0.19 Photo + SACL 85.98±0.56 84.61±0.15 89.73±0.33 88.74±0.48 Photo + Painting 68.83±0.83 75.80±0.95 86.88±0.66 87.07±0.14 WideResnet-50-2 Method Noise Blur Weather Digital Materials (30K Samples/Domain) Photo + AdaIN 57.80±1.79 73.77±0.11 67.75±0.51 66.96±0.06 Photo + SACL 69.39±0.72 70.00±0.34 64.00±0.54 73.05±0.30 Photo + Painting 60.72±0.83 68.09±0.49 61.15±0.23 70.98±0.24 PACS (1.5K Samples/Domain) Photo + AdaIN 82.05±1.33 86.89±0.64 93.98±0.15 94.39±0.30 Photo + SACL 93.79±1.35 89.64±0.36 95.19±0.17 93.63±0.11 Photo + Painting 83.92±1.81 85.38±0.27 94.19±0.08 92.63±0.24 Densenet-121 Method Noise Blur Weather Digital Materials (30K Samples/Domain) Photo + AdaIN 54.32±0.23 71.08±0.24 67.31±0.37 66.47±0.13 Photo + SACL 67.22±0.16 68.89±0.16 63.08±0.33 71.87±0.62 Photo + Painting 54.83±1.20 68.21±0.38 61.29±0.39 70.66±0.13 PACS (1.5K Samples/Domain) Photo + AdaIN 76.96±4.12 85.79±0.50 94.96±0.13 92.34±0.19 Photo + SACL 91.33±0.28 88.92±0.37 94.18±0.49 94.12±0.54 Photo + Painting 76.65±2.22 83.22±0.19 94.00±0.62 91.72±0.14 Table B.5: Per-Corruption Accuracy (Additional Architectures). Trends across different architectures are generally consistent. For example, SACL (blue) greatly outperforms AdaIN and paintings (red) for noise robustness. ± indicates stan- dard deviation over 3 runs. 136 AdaIN AdaIN (Intradomain) ETNet ETNet (Intradomain) Figure B.3: Stylized Photos (PACS) (1/2). Intradomain refers to stylization with photos as style images instead of paintings as style images. SACL is a learned style transfer method that is applied with different models pretrained to transfer the style of different artists. (Continued on next page) 137 TPFR TPFR (Intradomain) SACL Figure B.3: Stylized Photos (PACS) (2/2). Intradomain refers to stylization with photos as style images instead of paintings as style images. SACL is a learned style transfer method that is applied with different models pretrained to transfer the style of different artists. 138 AdaIN AdaIN (Intradomain) ETNet ETNet (Intradomain) Figure B.4: Stylized Photos (Materials) (1/2). Intradomain refers to stylization with photos as style images instead of paintings as style images. SACL is a learned style transfer method that is applied with different models pretrained to transfer the style of different artists. (Continued on next page) 139 TPFR TPFR (Intradomain) SACL Figure B.4: Stylized Photos (Materials) (2/2). Intradomain refers to stylization with photos as style images instead of paintings as style images. SACL is a learned style transfer method that is applied with different models pretrained to transfer the style of different artists. 140 APPENDIX C APPENDIX: UNCERTAINTY-AWARE PLANNING WITH SEMANTIC SCENE UNDERSTANDING This is an appendix for Chapter 5 which provides additional details for the semantic segmentation model used in the work. In Section C.1, we specify the construction of the outdoor semantic segmentation dataset. In Section C.2, we provide architecture, training, and inference details. C.1 Outdoor Navigation Segmentation Dataset As mentioned in the main text, we curate a segmentation dataset for outdoor navigation in unstructured environments from the existing large-scale COCO panoptic dataset [76]. Images of unstructured outdoor scenes are selected using a Places365 [191] classifier. An image is kept if (a) the classifier’s top one (highest) prediction is an unstructured outdoor category with > 50% probability, or (b) two or more of the top five predictions are unstructured outdoor categories. Second, the 133 categories in COCO panoptic are merged: (a) all outdoor terrains (e.g. grass, dirt, snow, pavement) are retained; (b) obstacles are merged into four categories: fixed obstacles (e.g. buildings), moving human-made obstacles (e.g. vehicles), humans, and animals; (c) all indoor categories are removed. Our final dataset consists of 34K training images with 22 categories. Our navigation seg- mentation categories (from COCO panoptic), and filtered list of COCO panoptic images are at: https://deepsemantichppc.github.io/segmentation_ dataset/README.html. The 22 semantic categories (plus one VOID/IGNORED class) are as follows: 141 rock-merged, gravel, platform, railroad, river, road, sand, sea, snow, pavement-merged, mountain-merged, grass-merged, dirt-merged, water-other, flower, tree-merged, playingfield, person, obstacles-fixed, obstacles-dynamic-manmade, obstacles-dynamic-animal, background, VOID The obstacles-*, background, and VOID classes are formed by merging several categories in COCO panoptic: OBSTACLES_FIXED (MERGED): stairs wall-brick wall-stone wall-tile wall-wood fire hydrant stop sign parking meter bench chair couch potted plant bed dining table toilet cardboard counter 142 curtain door-stuff house shelf fence-merged table-merged building-other-merged wall-other-merged net cabinet-merged OBSTACLES_DYNAMIC_MANMADE (MERGED): skateboard surfboard bicycle car airplane bus train truck boat skis snowboard motorcycle OBSTACLES_DYNAMIC_ANIMAL (MERGED): bird cat 143 dog horse sheep cow elephant bear zebra giraffe BACKGROUND (MERGED): sky-other-merged VOID (MERGED): VOID teddy bear kite oven keyboard floor-other-merged cup tv backpack orange laptop blanket vase spoon banner 144 towel baseball glove frisbee paper-merged window-blind clock microwave ceiling-merged knife tent umbrella toothbrush sink floor-wood baseball bat sports ball roof pillow apple bridge tennis racket scissors traffic light banana fork donut 145 suitcase cake wine glass carrot mouse hair drier food-other-merged rug-merged toaster bowl book tie pizza sandwich fruit handbag cell phone broccoli refrigerator mirror-stuff window-other remote hot dog light bottle 146 C.2 Network Architecture, Training, and Inference For our network architecture, we use DeepLabv3+[24] with Xception65 [25] back- bone augmented with dropout in the middle and exit flow blocks for semantic segmentations. For details, refer to [24, 25]. ASPP atrous rates are set to 6,12,18 for output stride 16 and 12,24,36 for output stride 8 as in [24]. Training uses output stride 16 while inference uses output stride 8. Training Details. We initialize from a network pretrained on Ima- geNet [33] + MSCOCO [99] + Pascal VOC [39]. This pretrained model is available at https://github.com/tensorflow/models/blob/master/ research/deeplab/g3doc/model_zoo.md. The model is trained for 160000 iterations using SGD with momentum with batchsize 4 on our dataset. Initial learning rate 0.01, momentum 0.9, and polynomial learning rate decay with power 0.9 are used. Images are resized to 513x513. Inference. At inference time, 50 forward passes are used to predict semantics and uncertainties (Eqs. (5.1)-(5.2)). 147 BIBLIOGRAPHY [1] David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive annotation of segmentation datasets with polygon-rnn++. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 859–868, 2018. 14, 25 [2] Edward H Adelson. On seeing stuff: the perception of materials by humans and machines. In Human vision and electronic imaging VI, volume 4299, pages 1–12. International Society for Optics and Photonics, 2001. 68 [3] Eirikur Agustsson, Jasper RR Uijlings, and Vittorio Ferrari. Interactive full image segmentation by considering all regions jointly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11622–11631, 2019. 13, 14, 25 [4] Jiwoon Ahn and Suha Kwak. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. arXiv preprint arXiv:1803.10464, 2018. 14 [5] Yağız Aksoy, Tae-Hyun Oh, Sylvain Paris, Marc Pollefeys, and Wojciech Matusik. Semantic soft segmentation. ACM Trans. Graph. (Proc. SIG- GRAPH), 37(4):72:1–72:13, 2018. 14 [6] Mykhaylo Andriluka, Jasper RR Uijlings, and Vittorio Ferrari. Fluid anno- tation: a human-machine collaboration interface for full image annotation. arXiv preprint arXiv:1806.07527, 2018. 14 [7] Isabella Augart, Maurice Sa, and Iris Wenderholm. Steinformen. De Gruyter, Berlin, Boston, 2018. 46 [8] Xue Bai and Guillermo Sapiro. Geodesic matting: A framework for fast interactive image and video segmentation and matting. International journal of computer vision, 82(2):113–132, 2009. 13 [9] Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What’s the point: Semantic segmentation with point supervision. In European Conference on Computer Vision (ECCV), pages 549–565. Springer, 2016. 14, 24, 29, 30 [10] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. OpenSurfaces:A 148 Richly Annotated Catalog of Surface Appearance. ACM Transactions on Graphics, 32(4):1, 2013. xiv, 11, 13, 15, 18, 19, 20, 22, 44, 45, 68, 126 [11] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. Material recog- nition in the wild with the Materials in Context database. Computer Vision and Pattern Recognition (CVPR), 2015. 14, 24, 44, 68, 93, 126 [12] Rodrigo Benenson, Stefan Popov, and Vittorio Ferrari. Large-scale interac- tive object segmentation with human annotators. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11700–11709, 2019. 45 [13] Willem Beurs. ”De groote waereld in’t kleen geschildert, of Schilderagtig tafereel van’s weerelds schilderyen. Kortelijk vervat in ses boeken: Verklarende de hooftver- wen, haare verscheide mengelinge in oly, en der zelver gebruik...”. By Johannes en Gillis Janssonius van Waesberge, 1692. 42 [14] Marjolijn Bol and Ann-Sophie Lehmann. Painting skin and water: towards a material iconography of translucent motifs in early netherlandish paint- ing. In Rogier van der Weyden in context: papers presented at the Seventeenth Symposium for the Study of Underdrawing and Technology in Painting held in Leuven, 22-24 October, pages 215–228. Peeters, 2012. 46 [15] R. Border, J. D. Gammell, and P. Newman. Surface edge explorer (SEE): Planning next best views directly from 3d observations. In Proc. ICRA, pages 6116–6123, 2018. 90 [16] Ali Sharifi Boroujerdi, Maryam Khanian, and Michael Breuß. Deep inter- active region segmentation and captioning. In Signal-Image Technology & Internet-Based Systems (SITIS), 2017 13th International Conference on, pages 103–110. IEEE, 2017. 13 [17] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 11, 13, 20, 24, 32, 33, 34, 44, 45, 68, 93, 126 [18] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019. 64 [19] Patrick Cavanagh. The artist as neuroscientist. Nature, 434(7031):301–307, 2005. 6, 40, 51, 74 149 [20] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015. 93 [21] Woong-Gi Chang, Tackgeun You, Seonguk Seo, Suha Kwak, and Bohyung Han. Domain-specific batch normalization for unsupervised domain adap- tation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7354–7362, 2019. 56 [22] R. Omar Chavez-Garcia, Jérôme Guzzi, Luca Maria Gambardella, and Alessandro Giusti. Learning ground traversability from simulations. IEEE RAL, 3:1695–1702, 2017. 90 [23] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018. 30 [24] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018. 26, 93, 99, 115, 116, 117, 147 [25] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint, pages 1610–02357, 2017. 99, 115, 147 [26] Paolo Cignoni, Marco Callieri, Massimiliano Corsini, Matteo Dellepiane, Fabio Ganovelli, and Guido Ranzuglia. Meshlab: an open-source mesh processing tool. In Eurographics Italian chapter conference, volume 1, pages 129–136, 2008. 108 [27] Edo Collins, Raja Bala, Bob Price, and Sabine Susstrunk. Editing in style: Uncovering the local semantics of gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5771–5780, 2020. 87 [28] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Com- puter Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016. x, 5, 11, 13, 20, 22, 30, 45 150 [29] Elliot J. Crowley and Andrew Zisserman. The state of the art: Object retrieval in paintings using discriminative regions. In British Machine Vision Conference, 2014. 42 [30] Jifeng Dai, Kaiming He, and Jian Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1635–1643, 2015. 29, 30 [31] Jonathan Daudelin and Mark Campbell. An adaptable, probabilistic, next best view algorithm for reconstruction of unknown 3d objects. IEEE RAL, 2(3):1540–1547, 2017. 90 [32] Jeffrey A. Delmerico, Alessandro Giusti, Elias Mueggler, Luca Maria Gam- bardella, and Davide Scaramuzza. ”on-the-spot training” for terrain classi- fication in autonomous air-ground collaborative teams. In Proc. ISER, 2016. 90 [33] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Ima- geNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009. 26, 147 [34] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017. 63, 65 [35] Francesca Di Cicco, Maarten W.A. Wijntjes, and Sylvia C. Pont. Under- standing gloss perception through the lens of art: Combining perception, image analysis, and painting recipes of 17th century painted grapes. Journal of Vision, 19(3):1–15, 2019. 51 [36] RV Dietrich. Rocks depicted in painting & sculpture. Rocks & Minerals, 65(3):224–236, 1990. 46 [37] Zeyad Emam, Andrew Kondrich, Sasha Harrison, Felix Lau, Yushi Wang, Aerin Kim, and Elliot Branson. On the state of data in computer vision: Human annotations remain indispensable for developing deep learning models. arXiv preprint arXiv:2108.00114, 2021. 4 [38] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. KDD, pages 226–231, 1996. 97 151 [39] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, June 2010. 26, 30, 147 [40] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010. 44, 46 [41] Dave Ferguson and Anthony Stentz. Using interpolation to improve path planning: The field D* algorithm. J Field Robot, 23:79–101, 2006. 90 [42] Paul Filitchkin. Visual Terrain Classification For Legged Robots. PhD thesis, University of California, Santa Barbara, 2011. 90 [43] Roland W Fleming. Material Perception. Annual Reviews, 3:365–88, 2017. 40 [44] Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural net- works with bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015. 32, 118 [45] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016. 32, 90, 94, 118 [46] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pages 1180–1189. PMLR, 2015. xiii, 133, 134 [47] Alberto Garcia-Garcia, Sergio Orts-Escolano, Sergiu Oprea, Victor Villena- Martinez, Pablo Martinez-Gonzalez, and Jose Garcia-Rodriguez. A survey on deep learning techniques for image and video semantic segmentation. Applied Soft Computing, 70:41–65, 2018. 41 [48] Leon A Gatys, Alexander S Ecker, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. Controlling perceptual factors in neural style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3985–3993, 2017. 87 [49] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased 152 towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018. 50, 61, 63, 65, 68, 69 [50] Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Schütt, Matthias Bethge, and Felix A Wichmann. Generalisation in humans and deep neural networks. In Advances in neural information processing systems, pages 7538– 7550, 2018. 64, 65 [51] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain general- ization. arXiv preprint arXiv:2007.01434, 2020. 65, 133 [52] Saurabh Gupta, Pablo Arbelaez, and Jitendra Malik. Perceptual organiza- tion and recognition of indoor scenes from rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 564–571, 2013. 19 [53] Timo Hackel, N. Savinov, L. Ladicky, J. Wegner, K. Schindler, and M. Polle- feys. SEMANTIC3D.NET: A new large-scale point cloud classification benchmark. ISPRS Annals, 4(1):91–98, 2016. 93 [54] Abdul Mueed Hafiz and Ghulam Mohiuddin Bhat. A survey on instance segmentation: state of the art. International Journal of Multimedia Information Retrieval, pages 1–19, 2020. 41 [55] Yutao Han, Hubert Lin, Jacopo Banfi, Kavita Bala, and Mark Campbell. Deepsemantichppc: Hypothesis-based planning over uncertain semantic point clouds. ICRA, 2020. 7, 89 [56] Kotaro Hara, Abigail Adams, Kristy Milland, Saiph Savage, Chris Callison- Burch, and Jeffrey P Bigham. A data-driven analysis of workers’ earnings on amazon mechanical turk. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 449. ACM, 2018. 22 [57] Kotaro Hara, Abigail Adams, Kristy Milland, Saiph Savage, Benjamin V Hanrahan, Jeffrey P Bigham, and Chris Callison-Burch. Worker demo- graphics and earnings on amazon mechanical turk: An exploratory anal- ysis. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, page LBW1217. ACM, 2019. 23 [58] Mohammed Hassanin, Salman Khan, and Murat Tahtali. Visual affordance and function understanding: A survey. arXiv preprint arXiv:1807.06775, 2018. 11, 32 153 [59] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classifica- tion. In International Conference on Computer Vision (ICCV), pages 1026–1034, 2015. 2 [60] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019. 62, 64, 66 [61] Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781, 2019. 63, 65 [62] Benjamin Hepp, Debadeepta Dey, Sudipta N. Sinha, Ashish Kapoor, Neel Joshi, and Otmar Hilliges. Learn-to-score: Efficient 3d scene exploration by predicting view utility. In Proc. ECCV, pages 437–452, 2018. 90 [63] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. 3 [64] Seunghoon Hong, Donghun Yeo, Suha Kwak, Honglak Lee, and Bohyung Han. Weakly supervised semantic segmentation using web-crawled videos. arXiv preprint arXiv:1701.00352, 2017. 14 [65] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Wei- jun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mo- bilenets: Efficient convolutional neural networks for mobile vision applica- tions. arXiv preprint arXiv:1704.04861, 2017. 30 [66] Ronghang Hu, Piotr Dollár, Kaiming He, Trevor Darrell, and Ross Girshick. Learning to segment every thing. Cornell University arXiv Institution: Ithaca, NY, USA, 2017. 14 [67] Eric Huang, Haoqi Zhang, David C Parkes, Krzysztof Z Gajos, and Yiling Chen. Toward automatic task design: a progress report. In Proceedings of the ACM SIGKDD workshop on human computation, pages 77–85. ACM, 2010. 17, 23 [68] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017. 65, 69, 128 154 [69] Alexander Ivanov and Mark E. Campbell. Uncertainty constrained robotic exploration: An integrated exploration planner. IEEE T Contr Syst T, 27:146–160, 2016. 89 [70] Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen, and Li Zhang. Style normalization and restitution for generalizable person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3143–3152, 2020. 65 [71] Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Yizhou Yu, and Mingli Song. Neural style transfer: A review. IEEE transactions on visualiza- tion and computer graphics, 2019. 50, 61, 63, 69 [72] Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G Hauptmann. Con- trastive adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4893–4902, 2019. 56 [73] Martin Kemp et al. The science of art: Optical themes in western art from brunelleschi to seurat. 1990. 42 [74] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Proc. NIPS, pages 5574–5584, 2017. 90, 94 [75] Anna Khoreva, Rodrigo Benenson, Jan Hendrik Hosang, Matthias Hein, and Bernt Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, volume 1, page 3, 2017. 14 [76] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In Proc. CVPR, pages 9404–9413, 2019. 93, 98, 141 [77] Dmytro Kotovenko, Artsiom Sanakoyeu, Sabine Lang, and Bjorn Ommer. Content and style disentanglement for artistic style transfer. In Proceedings of the IEEE International Conference on Computer Vision, pages 4422–4431, 2019. 65 [78] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classi- fication with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012. 1 155 [79] Philipp Krüsi, Paul Furgale, Michael Bosse, and Roland Siegwart. Driving on point clouds: Motion planning, trajectory optimization, and terrain assessment in generic nonplanar environments. J Field Rob, 34(5):940–984, 2017. xii, xvii, 90, 91, 92, 93, 100, 108, 109 [80] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982, 2018. 44 [81] Nikolay Laptev, Jason Yosinski, Li Erran Li, and Slawek Smyl. Time-series extreme event forecasting with neural networks at uber. In International Conference on Machine Learning, number 34, pages 1–5, 2017. 33 [82] Steven M. LaValle. Rapidly-exploring random trees: A new tool for path planning. 1998. 92 [83] Steven M. LaValle. Planning Algorithms. Cambridge University Press, New York, NY, USA, 2006. 89 [84] Hoang Le, Long Mai, Brian Price, Scott Cohen, Hailin Jin, and Feng Liu. Interactive boundary prediction for object selection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 18–33, 2018. 13, 14 [85] Ann-Sophie Lehmann. Fleshing out the body: The’colours of the naked’in workshop practice and art theory, 1400-1600. Nederlands Kunsthistorisch Jaarboek, 59:86, 2008. 46 [86] Christian Leibig, Vaneeda Allken, Murat Seçkin Ayhan, Philipp Berens, and Siegfried Wahl. Leveraging uncertainty information from deep neural networks for disease detection. Scientific reports, 7(1):17816, 2017. 33, 118 [87] Victor Lempitsky and Andrew Zisserman. Learning to count objects in images. In Advances in neural information processing systems, pages 1324– 1332, 2010. 32 [88] Anat Levin, Dani Lischinski, and Yair Weiss. A closed-form solution to natural image matting. IEEE transactions on pattern analysis and machine intelligence, 30(2):228–242, 2008. 13 [89] Anat Levin, Alex Rav-Acha, and Dani Lischinski. Spectral matting. IEEE 156 transactions on pattern analysis and machine intelligence, 30(10):1699–1712, 2008. 13 [90] Boyi Li, Felix Wu, Ser-Nam Lim, Serge Belongie, and Kilian Q Wein- berger. On feature normalization and data augmentation. arXiv preprint arXiv:2002.11102, 2020. 70 [91] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE in- ternational conference on computer vision, pages 5542–5550, 2017. 50, 65, 68, 125 [92] Da Li, Jianshu Zhang, Yongxin Yang, Cong Liu, Yi-Zhe Song, and Tim- othy M Hospedales. Episodic training for domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1446–1455, 2019. 65 [93] Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Demystifying neural style transfer. arXiv preprint arXiv:1701.01036, 2017. 63 [94] Zhengqi Li and Noah Snavely. Cgintrinsics: Better intrinsic image decom- position through physically-based rendering. In Proceedings of the European Conference on Computer Vision (ECCV), pages 371–387, 2018. 19 [95] Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3159–3167, 2016. x, 14, 29, 30 [96] Hubert Lin, Paul Upchurch, and Kavita Bala. Block annotation: Better image annotation with sub-image decomposition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5290–5300, 2019. 5, 10, 45 [97] Hubert Lin, Mitchell van Zuijlen, Sylvia C Pont, Maarten WA Wijntjes, and Kavita Bala. What can style transfer and paintings do for model robustness? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11028–11037, 2021. 6, 7, 62 [98] Hubert Lin, Mitchell Van Zuijlen, Maarten WA Wijntjes, Sylvia C Pont, and Kavita Bala. Insights from a large-scale database of material depictions in paintings. In International Conference on Pattern Recognition, pages 531–545. Springer, 2021. 5, 6, 41 157 [99] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014. 11, 13, 22, 26, 44, 45, 46, 147 [100] Zhiqiu Lin, Jin Sun, Abe Davis, and Noah Snavely. Visual chirality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, pages 12295–12303, 2020. 41 [101] Huan Ling, Jun Gao, Amlan Kar, Wenzheng Chen, and Sanja Fidler. Fast interactive object annotation with curve-gcn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5257–5266, 2019. 45 [102] Aishan Liu, Xianglong Liu, Chongzhi Zhang, Hang Yu, Qiang Liu, and Junfeng He. Training robust deep neural networks via adversarial noise propagation. arXiv preprint arXiv:1909.09034, 2019. 62, 63, 65, 83 [103] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learn- ing transferable features with deep adaptation networks. In International conference on machine learning, pages 97–105. PMLR, 2015. 56 [104] Khairul Azmi Mahadhir, Shing Chiang Tan, Cheng Yee Low, Roman Du- mitrescu, Adam Tan Mohd Amin, and Ahmed Jaffar. Terrain classification for track-driven agricultural robots. In Proc. SYSINT, pages 775 – 782, 2014. 90 [105] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Pro- ceedings of the European conference on computer vision (ECCV), pages 181–196, 2018. 3 [106] Utkarsh Mall, Kevin Matzen, Bharath Hariharan, Noah Snavely, and Kavita Bala. Geostyle: Discovering fashion trends and events. In Proceedings of the IEEE International Conference on Computer Vision, pages 411–420, 2019. 41 [107] Pascal Mamassian. Ambiguities and conventions in the perception of visual art. Vision Research, 48(20):2143–2153, 2008. 6, 40, 82 [108] Pascal Mamassian. Ambiguities and conventions in the perception of visual art. Vision Research, 48(20):2143–2153, 2008. 51 158 [109] R. Manduchi, A. Castano, A. Talukder, and L. Matthies. Obstacle detection and terrain classification for autonomous off-road navigation. Auton Robot, 18(1):81–102, 2005. 90 [110] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc Van Gool. Deep extreme cut: From extreme points to object segmentation. arXiv preprint arXiv:1711.09081, 2017. x, 13, 25, 45, 47 [111] Pat Marion, Peter R Florence, Lucas Manuelli, and Russ Tedrake. A pipeline for generating ground truth labels for real rgbd data of cluttered scenes. arXiv preprint arXiv:1707.04796, 2017. 13 [112] Toshihiko Matsuura and Tatsuya Harada. Domain generalization using a mixture of multiple latent domains. In AAAI, pages 11749–11756, 2020. 65, 133, 134 [113] Kevin Matzen, Kavita Bala, and Noah Snavely. Streetstyle: Explor- ing world-wide clothing styles from millions of photos. arXiv preprint arXiv:1706.01869, 2017. 41 [114] M.H. van Eikema Hommes. The contours in the paintings of the oranjezaal, huis ten bosch. 2005. 45 [115] Branislav Mičušĺı́k and Jana Košecká. Semantic segmentation of street scenes by superpixel co-occurrence and 3d geometry. In 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pages 625–632. IEEE, 2009. 32 [116] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 891–898, 2014. 11, 20, 30 [117] B. Nagy and A. Kelly. Trajectory generation for car-like robots using cubic curvature polynomials. In Proc. FSR, 2001. 93 [118] Gerhard Neuhold, Tobias Ollmann, S Rota Bulo, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In International Conference on Computer Vision (ICCV), pages 22–29, 2017. 13 [119] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter 159 Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In ICCV, pages 5000–5009, 2017. 11, 22 [120] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 1(10):e3, 2016. 83 [121] Michael W. Otte, Scott G. Richardson, Jane Mulligan, and Gregory Z. Grudic. Path planning in image space for autonomous robot navigation in unstructured environments. J Field Robot, 26:212–240, 2009. 90 [122] Erwin Panofsky. Perspective as symbolic form. Princeton University Press, 1927/2020. 42 [123] Dim P. Papadopoulos, Jasper R. R. Uijlings, Frank Keller, and Vittorio Ferrari. Extreme clicking for efficient object annotation. International Journal of Computer Vision, 2017. x, 43, 46, 47 [124] Dim P Papadopoulos, Jasper RR Uijlings, Frank Keller, and Vittorio Ferrari. Extreme clicking for efficient object annotation. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 4940–4949. IEEE, 2017. 13 [125] George Papandreou, Liang-Chieh Chen, Kevin P Murphy, and Alan L Yuille. Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1742–1750, 2015. 14, 29, 30 [126] Deepak Pathak, Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional multi-class multiple instance learning. arXiv preprint arXiv:1412.7144, 2014. 29, 30 [127] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1406–1415, 2019. 50, 56, 79 [128] Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. Visda: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017. 79 [129] Maurice Henri Pirenne. Optics, painting & photography. Cambridge Univer- sity Press. 42 160 [130] Carol Pottasch. Frans van mieriss painting technique as one of the possible sources for willem beurss treatise on painting. Art & Perception, pages 1 – 17, 13 Jun. 2020. 42 [131] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In Proc. CVPR, pages 918–927, 2018. 93 [132] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmenta- tion. In Proc. CVPR, pages 77–85, 2017. 93 [133] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proc. NIPS, pages 5099–5108, 2017. 93 [134] Xuebin Qin, Shida He, Zichen Zhang, Masood Dehghan, and Martin Jager- sand. Bylabel: A boundary based semi-automatic image annotation tool. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1804–1813. IEEE, 2018. 14 [135] Alexander J Quinn and Benjamin B Bederson. Human computation: a survey and taxonomy of a growing field. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 1403–1412. ACM, 2011. 20 [136] Stephan Rabanser, Stephan Günnemann, and Zachary C Lipton. Failing loudly: An empirical study of methods for detecting dataset shift. arXiv preprint arXiv:1810.11953, 2018. 3 [137] Tal Remez, Jonathan Huang, and Matthew Brown. Learning to segment via cut-and-paste. CoRR, abs/1803.06414, 2018. 14 [138] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015. 47 [139] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. Grabcut: Inter- active foreground extraction using iterated graph cuts. In ACM transactions on graphics (TOG), volume 23, pages 309–314. ACM, 2004. 13, 45 [140] Artsiom Sanakoyeu, Dmytro Kotovenko, Sabine Lang, and Bjorn Ommer. 161 A style-aware content loss for real-time hd style transfer. In Proceedings of the European Conference on Computer Vision (ECCV), pages 698–714, 2018. 65, 80, 128 [141] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottle- necks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018. 30 [142] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017. 52 [143] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, 2017. 105 [144] Lavanya Sharan, Ce Liu, Ruth Rosenholtz, and Edward H. Adelson. Recog- nizing materials using perceptually inspired features. International Journal of Computer Vision, 103(3):348–371, 2013. 65, 67 [145] Tong Shen, Guosheng Lin, Chunhua Shen, and Ian Reid. Bootstrapping the performance of webly supervised semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1363–1371, 2018. 14 [146] Lu Sheng, Ziyi Lin, Jing Shao, and Xiaogang Wang. Avatar-net: Multi-scale zero-shot style transfer by feature decoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8242–8250, 2018. 65 [147] Zhiyuan Shi, Yongxin Yang, Timothy M Hospedales, and Tao Xiang. Weakly-supervised image annotation and segmentation with objects and attributes. IEEE transactions on pattern analysis and machine intelligence, 39(12):2525–2538, 2017. 14 [148] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1–48, 2019. 4 [149] Nathan Somavarapu, Chih-Yao Ma, and Zsolt Kira. Frustratingly simple domain generalization via image stylization. arXiv preprint arXiv:2006.11207, 2020. 71 162 [150] Chunjin Song, Zhijie Wu, Yang Zhou, Minglun Gong, and Hui Huang. Etnet: Error transition network for arbitrary style transfer. arXiv preprint arXiv:1910.12056, 2019. 65, 69, 128 [151] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. Proceedings of 30th IEEE Conference on Computer Vision and Pattern Recognition, 2017. 19 [152] Gregory J. Stein, Christopher Bradley, and Nicholas Roy. Learning over subgoals for efficient navigation of structured, unknown environments. In Proc. CORL, pages 213–222, 2018. 89 [153] David Summers. Conventions in the history of art. New literary history, 13(1):103–125, 1981. 82 [154] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pages 843–852, 2017. 3 [155] Jan Svoboda, Asha Anoosheh, Christian Osendorfer, and Jonathan Masci. Two-stage peer-regularized feature recombination for arbitrary image style transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13816–13825, 2020. 65, 69, 128 [156] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. International Conference on Learning Representations (ICLR), 2014. 2 [157] Wei Ren Tan, Chee Seng Chan, Hernán E Aguirre, and Kiyoshi Tanaka. Artgan: Artwork synthesis with conditional categorical gans. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3760–3764. IEEE, 2017. 69 [158] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 67 [159] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014. 56 163 [160] Paul Upchurch, Daniel Sedra, Andrew Mullen, Haym Hirsh, and Kavita Bala. Interactive consensus agreement games for labeling images. In AAAI Conference on Human Computation and Crowdsourcing (HCOMP), October 2016. 13 [161] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018. 51 [162] Mitchell JP Van Zuijlen, Hubert Lin, Kavita Bala, Sylvia C Pont, and Maarten WA Wijntjes. Materials in paintings (mip): An interdisci- plinary dataset for perception, art history, and computer vision. Plos one, 16(8):e0255109, 2021. 5, 6, 40, 41, 43, 68, 127 [163] Mitchell JP van Zuijlen, Sylvia C Pont, and Maarten WA Wijntjes. Painterly depiction of material properties. Journal of vision, 20(7):7–7, 2020. 45 [164] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011. 51 [165] Krzysztof Walas. Terrain classification and negotiation with a walking robot. J Intell Robot Syst, 78:401–423, 2015. 90 [166] Haohan Wang, Xindi Wu, Zeyi Huang, and Eric P Xing. High-frequency component helps explain the generalization of convolutional neural net- works. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8684–8694, 2020. 83 [167] Huan Wang, Yijun Li, Yuehai Wang, Haoji Hu, and Ming-Hsuan Yang. Collaborative distillation for ultra-resolution universal style transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, pages 1860–1869, 2020. 65 [168] Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Wenjun Zeng, and Tao Qin. Generalizing to unseen domains: A survey on domain generalization. arXiv preprint arXiv:2103.03097, 2021. 2 [169] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 87 164 [170] Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. In ICML, volume 1, page 2, 2019. 65 [171] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018. 93 [172] Xiu-Shen Wei, Jianxin Wu, and Quan Cui. Deep learning for fine-grained image analysis: A survey. arXiv preprint arXiv:1907.03069, 2019. 51 [173] Peter Welinder, Steve Branson, Pietro Perona, and Serge J Belongie. The multidimensional wisdom of crowds. In Advances in neural information processing systems, pages 2424–2432, 2010. 13 [174] John White. The birth and rebirth of pictorial space. Cambridge, MA, 1957. 42 [175] ”Lisa Wiersma”. ”colouring material depiction in flemish and dutch baroque art theory”. ”Art & Perception”, pages ”1 – 23”, ”22 Apr. 2020”. 42 [176] M. W. A. Wijntjes, C. Spoiala, and H. de Ridder. Thurstonian scaling and the perception of painterly translucency. Art & Perception, pages 1 – 24, 04 Sep. 2020. 42 [177] WikiArt. Wikiart: Visual art encyclopedia. In wikiart.org. 69 [178] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2, 2019. 47 [179] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas Huang. Deep grabcut for object selection. arXiv preprint arXiv:1707.00243, 2017. 13 [180] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas S Huang. Deep interactive object selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 373–381, 2016. 13 [181] Ning Xu, Brian L Price, Scott Cohen, and Thomas S Huang. Deep image matting. In CVPR, volume 2, page 4, 2017. 13 [182] Jingkang Yang, Kaiyang Zhou, Yixuan Li, and Ziwei Liu. Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334, 2021. 3 165 [183] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven CH Hoi. Deep learning for person re-identification: A survey and outlook. arXiv preprint arXiv:2001.04193, 2020. 41 [184] J. Yen. Finding the k shortest loopless paths in a network. Manage Sci, 17(11):712–716, 1971. 98 [185] Dong Yin, Raphael Gontijo Lopes, Jon Shlens, Ekin Dogus Cubuk, and Justin Gilmer. A fourier perspective on model robustness in computer vision. Advances in Neural Information Processing Systems, 32:13276–13286, 2019. 84 [186] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018. 11, 13, 24, 25 [187] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pages 6023–6032, 2019. 63, 65 [188] Lishi Zhang, Chenghan Fu, and Jia Li. Collaborative annotation of seman- tic objects in images with multi-granularity supervisions. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 474–482. ACM, 2018. 14 [189] Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S Lin, Tianhe Yu, and Alexei A Efros. Real-time user-guided image colorization with learned deep priors. ACM Transactions on Graphics (TOG), 9(4), 2017. 32 [190] Yinda Zhang, Shuran Song, Ersin Yumer, Manolis Savva, Joon-Young Lee, Hailin Jin, and Thomas Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 19 [191] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. 98, 141 [192] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of 166 the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 4. IEEE, 2017. 5, 11, 33, 34, 45, 93 [193] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. arXiv preprint arXiv:2103.02503, 2021. 2 [194] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proc. CVPR, pages 4490–4499, 2018. 93 [195] Aleksandar Zlateski, Ronnachai Jaroensri, Prafull Sharma, and Frédo Du- rand. On the importance of label quality for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1479–1487, 2018. 14 167