LEARNING GEOMETRY, APPEARANCE AND MOTION IN THE WILD

A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by Zhengqi Li
May 2021

© 2021 Zhengqi Li
ALL RIGHTS RESERVED

LEARNING GEOMETRY, APPEARANCE AND MOTION IN THE WILD
Zhengqi Li, Ph.D.
Cornell University 2021

Physics-based computer vision can be formulated as the inverse process of a graphics rendering engine: we seek to take RGB images and recover the intrinsic properties of a scene, including geometry, material, illumination, and object motion. Computer vision as inverse graphics plays an important role in numerous real-world applications such as virtual reality, in which recovered scene intrinsics can be used to render images at novel viewpoints with plausible lighting. However, most previous techniques either require a multi-camera setup or assume that the underlying scene is static, i.e., that its appearance and geometry do not change over time. In contrast, the photos we see on the Internet constitute only a single-view observation of each scene, and the videos often involve dynamics due to a variety of time-varying factors such as illumination changes and object motion.

In this thesis, I address these problems in in-the-wild scenarios by leveraging a compelling source of data: the massive quantities of unlabeled photos and videos that people take and upload to the Internet every day. I demonstrate how to use such massive but noisy visual data to capture scene geometry, appearance, lighting, and motion from a single RGB image or from videos of dynamic scenes, which in turn enables the synthesis of photo-realistic novel views in both space and time.

BIOGRAPHICAL SKETCH

Zhengqi Li is a CS Ph.D. candidate at Cornell Tech, Cornell University, where he is advised by Prof. Noah Snavely. He will become a research scientist at Google Research starting in Summer 2021. He received a Bachelor of Computer Engineering with High Distinction from the University of Minnesota, Twin Cities, where he was advised by Prof. Stergios Roumeliotis and was a research assistant at the MARS Lab and Google Project Tango. He was also a member of the Robotic Sensor Networks (RSN) Lab, where he worked closely with Prof. Volkan Isler on agricultural robotics and vision. His research interests span 3D and 4D computer vision, inverse graphics, and novel view synthesis for images and videos in the wild. He is a recipient of a CVPR 2019 Best Paper Honorable Mention, a 2020 Google Ph.D. Fellowship, a 2020 Adobe Research Fellowship, and a 2016 CRA Outstanding Undergraduate Researcher Honorable Mention.

This thesis is dedicated to my parents and all my other family members.

ACKNOWLEDGEMENTS

First and foremost, I would like to give my highest appreciation and gratitude to my adviser, Professor Noah Snavely, for his invaluable support, inspiration, and advice throughout my five-year Ph.D. journey. Noah is the best advisor and the nicest person I have ever known, and he always gave me freedom and support in both research and life. It is Noah's guidance that has helped me grow over these five years into an independent researcher. I would also like to thank my thesis committee members, Professor Serge Belongie and Professor Mor Naaman, who provided feedback and suggestions on my thesis and dissertation defense.
I also want to thank my collaborators and colleagues at Cornell Tech and Cornell University, including Kai Zhang, Hadar Averbuch-Elor, Qianqian Wang, Jin Sun, Wenqi Xian, and Abe Davis. They are great to work with, and I always learned a lot from my discussions with them. In addition, I was fortunate to have had the chance to work with many great researchers outside Cornell University on exciting research projects. I would like to give special thanks to Tali Dekel, Forrester Cole, Richard Tucker, Ce Liu, and William T. Freeman of the VisCam team at Google Research, Fernando De La Torre at Facebook Reality Lab, and Oliver Wang and Simon Niklaus at Adobe Research. They all offered me great advice on research projects during my three summer internships. Last but not least, I dedicate my dissertation to my parents and my other family members, since my Ph.D. journey would not have been possible without their countless love and support.

TABLE OF CONTENTS

Biographical Sketch
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

1  Introduction

2  Learning Single View Depth Prediction from Internet Photos
   2.1  Introduction
   2.2  Related work
   2.3  The MegaDepth Dataset
        2.3.1  Photo calibration and reconstruction
        2.3.2  Depth map refinement
        2.3.3  Depth enhancement via semantic segmentation
        2.3.4  Creating a dataset
   2.4  Depth estimation network
        2.4.1  Network architecture
        2.4.2  Loss function
   2.5  Evaluation
        2.5.1  Evaluation and ablation study on MD test set
        2.5.2  Generalization to other datasets
   2.6  Discussion

3  Learning the Depths of Moving People by Watching Frozen People
   3.1  Introduction
   3.2  Related Work
   3.3  The MannequinChallenge Dataset
   3.4  Depth Prediction Model
        3.4.1  Depth from motion parallax
        3.4.2  Depth confidence
        3.4.3  Keypoints
        3.4.4  Losses
   3.5  Results
        3.5.1  Evaluation on the MC test set
        3.5.2  Evaluation on the TUM RGBD dataset
        3.5.3  Internet videos of dynamic scenes
   3.6  Discussion

4  Learning Intrinsic Image Decomposition from Watching the World
   4.1  Introduction
   4.2  Related work
   4.3  Overview and network architecture
   4.4  Dataset
   4.5  Approach
        4.5.1  Image reconstruction loss
        4.5.2  Reflectance consistency
        4.5.3  Dense spatio-temporal reflectance smoothness
        4.5.4  Multi-scale shading smoothness
        4.5.5  All-pairs weighted least squares (APWLS)
   4.6  Evaluation
        4.6.1  Evaluation on IIW
        4.6.2  Evaluation on SAW
        4.6.3  Qualitative results on IIW and SAW
        4.6.4  Evaluation on MIT intrinsic images
   4.7  Discussion

5  Learning Better Intrinsic Images through Physically-Based Rendering
   5.1  Introduction
   5.2  Related work
   5.3  CGINTRINSICS Dataset
   5.4  Learning Cross-Dataset Intrinsics
        5.4.1  Supervised losses
        5.4.2  Smoothness losses
        5.4.3  Reconstruction loss
        5.4.4  Network architecture
   5.5  Evaluation
        5.5.1  Evaluation on IIW
        5.5.2  Evaluation on SAW
        5.5.3  Evaluation on MIT intrinsic images
   5.6  Discussion

6  Crowdsampling the Plenoptic Function
   6.1  Introduction
   6.2  Related Work
   6.3  Approach
        6.3.1  Collecting Crowdsampled Data
        6.3.2  The DeepMPI Scene Representation
        6.3.3  Stage 1: Optimizing DeepMPI Color and α Planes
        6.3.4  Stage 2: Learning How Appearance Changes with Time
   6.4  Experiments
   6.5  Discussion

7  Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes
   7.1  Introduction
   7.2  Related Work
   7.3  Approach
        7.3.1  Neural scene flow fields for dynamic scenes
        7.3.2  Optimization
        7.3.3  Integrating a static scene representation
        7.3.4  Space-time view synthesis
   7.4  Experiments
        7.4.1  Baselines and error metrics
        7.4.2  Quantitative evaluation
        7.4.3  Qualitative evaluation
   7.5  Discussion

8  Ethics in Data-driven Computer Vision
   8.1  Introduction
   8.2  Privacy and Security
        8.2.1  Security
        8.2.2  Privacy
   8.3  Fairness and Bias
   8.4  Interpretability and Transparency
   8.5  Other Aspects
        8.5.1  Policy and Regulation
        8.5.2  Employment and HCI
        8.5.3  Control and Surveillance
        8.5.4  Sustainability
   8.6  Discussion

9  Conclusion

A  Chapter 2 Appendix
   A.1  Depth Map Refinement and Enhancement
        A.1.1  Modified MVS algorithm
        A.1.2  Foreground and background classes
        A.1.3  Automatic ordinal depth labeling
   A.2  SfM Disagreement Rate (SDR)

B  Chapter 3 Appendix
   B.1  Derivations of depth from motion parallax
   B.2  Derivation of error metrics
   B.3  Network Architecture

C  Chapter 4 Appendix
   C.1  Hyperparameters Setting
   C.2  All-Pairs Weighted Least Squares (APWLS)
   C.3  Additional details for SAW evaluation metrics

D  Chapter 5 Appendix
   D.1  Additional details for training losses
        D.1.1  Ordinal term for CGINTRINSICS
        D.1.2  Additional hyperparameter settings

E  Chapter 6 Appendix
   E.1  Priors on the Plenoptic Function
        E.1.1  Constant Visibility and Light Field Gradients
        E.1.2  Common Light Sources, Material Properties, and Normals
   E.2  Scene Statistics
   E.3  Losses
        E.3.1  Losses optimizing DeepMPI color and α planes
        E.3.2  Training Losses
   E.4  Network Architecture
   E.5  Training and Implementation
   E.6  Visual Illustrations
   E.7  User Study

F  Chapter 7 Appendix
   F.1  Scene Flow Regularization Details
   F.2  Data Driven Prior Details
   F.3  Space-Time Interpolation Visualization
   F.4  Volume Rendering Equation Approximation
   F.5  Network Architecture
   F.6  Implementation Details

Bibliography

LIST OF TABLES

2.1  Results on the MD test set (places unseen during training) for several network architectures. For VGG∗ we use the same loss and network architecture as in [84] for comparison to [84]. Lower is better.
2.2  Results on MD test set (places unseen during training) for different loss configurations. Lower is better.
2.3  Results on three different test sets with and without our depth refinement methods. Raw MD indicates raw depth data; Clean MD indicates depth data using our refinement methods. Lower is better for all error measures.
2.4  Results on Make3D for various training datasets and methods. The first column indicates the training dataset. Errors for "Ours" are averaged over four models trained/validated on MD. Lower is better for all metrics.
2.5  Results on the KITTI test set for various training datasets and approaches. Columns are as in Table 2.4.
2.6  Results on the DIW test set for various training datasets and approaches. Columns are as in Table 2.4.
3.1  Quantitative comparisons on the MC test set. Different input configurations of our model: (I) single image; (II) optical flow masked in the human region (F), confidence, and human mask; (III) masked input depth and human mask; (IV) additional confidence; and in (V), we also input human keypoints. The last row indicates the error for the depth estimated from motion parallax between two frames in all image regions (human and non-human); this serves as an oracle and can only be measured if the entire scene is static. Lower is better for all metrics.
3.2  Results on the TUM RGBD dataset. Different si-RMSE metrics as well as standard RMSE and relative error (Rel) are reported. We evaluate our models (light gray background) under different input configurations, as described in Table 3.1. Raw depth indicates the model is trained using raw MVS depth predictions as supervision, without our depth cleaning method. A dataset denoted as "-" indicates that the method is not learning-based. Lower is better for all error metrics.
4.1  Results on the IIW test set. Lower is better for the Weighted Human Disagreement Rate (WHDR). The second column indicates the training data each learning-based method uses; "-" indicates the method is optimization-based.
∗ indicates WHDR is evaluated based on CNN classifier outputs for pairs of pixels rather than full decompositions.
4.2  Results on the SAW test set. Higher is better for AP%. The second column is described in Table 4.1. Note that none of the methods use annotations from SAW.
4.3  Results on MIT intrinsics. For all error metrics, lower is better. ST = Sintel dataset and SN = ShapeNet dataset. The second column shows the dataset used for training. GT indicates whether the method uses ground truth for training.
5.1  Comparisons of existing intrinsic image datasets with our CGINTRINSICS dataset. PB indicates physically-based rendering and non-PB indicates non-physically-based rendering.
5.2  Numerical results on the IIW test set. Lower is better for WHDR. The "Training set" column specifies the training data used by each learning-based method; "-" indicates an optimization-based method. IIW(O) indicates original IIW annotations and IIW(A) indicates augmented IIW comparisons. "All" indicates CGI+IIW(A)+SAW. † indicates the network was validated on CGI while others were validated on IIW. ∗ indicates CNN predictions are post-processed with a guided filter [254].
5.3  Quantitative results on the SAW test set. Higher is better for AP%. The second column is described in Table 5.2. The third and fourth columns show performance on the unweighted SAW benchmark and our more challenging gradient-weighted benchmark, respectively.
5.4  Quantitative results on the MIT intrinsics test set. For all error metrics, lower is better. The second column shows the dataset used for training. ⋆ indicates models fine-tuned on MIT.
6.1  Quantitative comparisons on our test set. Lower is better for l1 and LPIPS, and higher is better for PSNR. l1 errors are scaled by 10 for ease of presentation.
7.1  Quantitative evaluation of novel view synthesis on the Dynamic Scenes dataset. MV indicates whether the approach makes use of multi-view information.
7.2  Quantitative evaluation of novel view and time synthesis. See Sec. 7.4.2 for a description of the baselines.
7.3  Ablation study on the Dynamic Scenes dataset. See Sec. 7.4.2 for detailed descriptions of each of the ablations.
E.1  Scene statistics. We include (1) the total number of images, (2) the field of view (FoV) of the reference DeepMPI, and (3) the depth of the near and far MPI planes. The first five scenes are used for evaluation in Chapter 6.
E.2  User study: share of votes on Q1.
E.3  User study: share of votes on Q2.
E.4  User study: share of votes on Q3.

LIST OF FIGURES

1.1  Learning geometry, appearance and motion in the wild. Massive numbers of photos and videos are uploaded by people every day (left). My work in this thesis shows how to leverage such visual data to learn scene geometry, appearance, lighting, and motion; based on that, we further demonstrate how to synthesize novel views in both space and time (right).
1.2  Inverse graphics as an ill-posed problem.
From a single image (a), a likely explanation of the underlying scene is shown in (b), but this image could also be a painting (c), a sculpture (d), or an effect created by a set of lights (e). Figure adapted from Adelson and Pentland [5] and Jon Barron [20].
2.1  Comparison between MVS depth maps with and without our proposed refinement/cleaning methods. The raw MVS depth maps (middle) exhibit depth bleeding (top) or incorrect depth on people (bottom). Our methods (right) can correct or remove such outlier depths.
2.2  Examples of automatic ordinal labeling. Blue mask: foreground (F_ord) derived from semantic segmentation. Red mask: background (B_ord) derived from reconstructed depth.
2.3  Effect of the L_grad term. L_grad encourages predictions to match the ground truth depth gradient.
2.4  Effect of the L_ord term. L_ord tends to correct ordinal depth relations for hard-to-reconstruct objects such as the person in the first row and the tree in the second row.
2.5  Depth predictions on the MD test set. (Blue = near, red = far.) For visualization, we mask out the detected sky region. (a) Input photo. (b) Ground truth COLMAP depth map (GT). (c) VGG∗ prediction using the loss and network of [84]. (d) Depth prediction from a ResNet [191]. (e) Depth prediction from an hourglass (HG) network [63].
2.6  Depth predictions on Make3D. The last four columns show results from the best models trained on non-Make3D datasets (the final column is our result).
2.7  Depth predictions on KITTI. (Blue = near, red = far.) None of the models were trained on KITTI data. From left to right: (a) input image, (b) ground truth (GT), (c) model trained on DIW [63], (d) model trained on Make3D [191], (e) ours, trained on MD.
2.8  Depth predictions on the DIW test set. (Blue = near, red = far.) Captions are as described in Figure 2.7. None of the models were trained on DIW data.
3.1  Our model predicts dense depth when both an ordinary camera and people in the scene are freely moving (right). We train our model on our new MannequinChallenge dataset: a collection of Internet videos of people imitating mannequins, i.e., freezing in diverse, natural poses, while a camera tours the scene (left). Because the people are stationary, geometric constraints hold; this allows us to use multi-view stereo to estimate depth, which serves as supervision during training. In all figures, we use inverse depth maps for visualization purposes, and refer to them as depth maps.
3.2  Traditional stereo vs. our setup. Left: a person is observed at the same time instant from two different views. The 3D position of points can be computed using triangulation. Right: when both the camera and the objects in the scene are moving, triangulation is no longer possible since the epipolar constraint does not apply.
3.3  Sample images from Mannequin Challenge videos. Each image is a frame from a video sequence in which the camera is moving but the humans are all static. The videos span a variety of natural scenes, poses, and configurations of people.
3.4  Effect of depth cleaning. (a-b) Raw MVS depth maps, D_MVS, may contain errors and outliers, especially in untextured regions (see regions circled in yellow). (c) Our depth cleaning method effectively filters out such erroneous depth values.
3.5  Sample frames from clips removed during filtering. (a) Videos captured with fisheye cameras; (b) videos with synthetic backgrounds; (c) sequences with truly moving objects (pairs of frames shown in each column).
3.6  System overview. Our model takes as input the RGB frame, a human segmentation mask, masked depth from motion parallax (via optical flow and SfM pose), and an associated confidence map. We ask the network to use these inputs to predict depths that match the ground truth MVS depth.
3.7  System inputs and training data. The input to our network consists of: (a) an RGB image, (b) a human mask, (c) a masked depth map computed from motion parallax w.r.t. a selected source image, and (d) a masked confidence map. Low-confidence regions (dark circles) in the first two rows indicate the vicinity of the camera epipole, where depth from parallax is unreliable and removed. The network is trained to regress to MVS depth (e).
3.8  Examples of keypoint images. The top row shows examples of input images and the bottom row shows the corresponding detected human keypoint images, where different colors indicate different joints. We apply morphological dilation to the keypoint maps to make each keypoint location more visible.
3.9  Qualitative results on the MC test set. From top to bottom: reference images and their corresponding MVS depth (pseudo ground truth); our depth predictions using our single-view model (third row) and our two-frame model (fourth row). The additional network inputs give improved performance in both human and non-human regions.
3.10  Qualitative comparisons on the TUM RGBD dataset. (a) Reference images, (b) ground truth sensor depth, (c) results of the single-view depth prediction method DORN [94], (d) results of the two-frame motion stereo method DeMoN [354], (e-f) depth predictions from our single-view and two-frame models, respectively.
3.11  Comparisons on Internet video clips with moving cameras and people. From left to right: (a) reference input image, (b) results of DORN [94], (c) results of Chen et al. [63], (d) results of DeMoN [354], (e) results of our full method.
3.12  Depth-based visual effects. Using our predicted depth maps, we can apply depth-aware visual effects to (a) input images; we show (b) defocus, (c) object insertion, and (d) anaglyph effects.
3.13  Depth-based image inpainting. We use depth prediction and camera poses to warp the pixels of nearby frames for image inpainting and people removal. The top row shows original images and the bottom row shows inpainted images.
3.14  Failure cases. From left to right: (a) input RGB image, (b) depth predicted by our single-view method, (c) depth predicted by our proposed full method.
Our proposed full method can fail for reasons including (1) a failure to generalize to complex human poses (first three rows), or (2) non-human movers such as animals, cars, and shadows (last three rows). In some of these cases, our single-view method can outperform our full two-view method, because added complexities can sometimes arise in the presence of multiple views.
4.1  To train, our method learns from unlabeled videos with a fixed viewpoint but varying illumination (top). At test time (bottom), our network produces an intrinsic image decomposition (R, S) from a single image I.
4.2  System overview and network. During training, our network input is an image sequence I, and the outputs are reflectance images R and shading images S for the sequence. Each block in the network depicts a convolutional/deconvolutional layer. E is an encoder, and D_R and D_S are decoders for the reflectance and shading images. For the innermost feature maps, we have one side output c representing the illumination color. E is an energy function measuring the cost of the decomposition.
4.3  Examples of challenging images in our dataset. The first two images depict colorful illumination. The last two images show strong sunlight/shadows.
4.4  Failure cases for intrinsic image estimation algorithms. We applied a state-of-the-art multi-image intrinsic image decomposition algorithm [128] to our dataset. This method fails to produce decomposition results suitable for training, due to strong assumptions that hold primarily for outdoor/laboratory scenes.
4.5  Effect of v_med in the shading smoothness term. (White = large weight, black = small weight.) Adding the extra v_med can help capture smoothness in textured regions, such as the pillows in the first row and the floor in the second row. The last column shows the final smoothness weight v_pq.
4.6  Qualitative comparisons for intrinsic image decomposition on the IIW/SAW test sets. Our network predictions achieve results comparable to state-of-the-art intrinsic image decomposition algorithms (Bell et al. [26] and Zhou et al. [417]).
4.7  Qualitative comparisons on the MIT intrinsics test set. Odd-numbered rows show predicted reflectance; even-numbered rows show predicted shading. (a) Input image, (b) ground truth (GT), (c) SIRFS [19], (d) Direct Intrinsics (DI) [252], (e) Shi et al. [313], (f) our method.
5.1  Overview and network architecture. Our work integrates physically-based rendered images from our CGINTRINSICS dataset and reflectance/shading annotations from IIW and SAW in order to train a better intrinsic decomposition network.
5.2  Visualization of ground truth from our CGINTRINSICS dataset. Top row: rendered RGB images. Middle: ground truth reflectance. Bottom: ground truth shading. Note that light sources are masked out when creating the ground truth decomposition.
5.3  Visual comparisons between our CGI and the original SUNCG dataset. Top row: images from SUNCG/PBRS. Bottom row: images from our CGI dataset. The images in our dataset have higher SNR and are more realistic.
5.4  Examples of predictions with and without IIW training data. Adding real IIW data can qualitatively improve reflectance and shading predictions.
Note, for instance, how the quilt highlighted in the first row has a more uniform reflectance after incorporating IIW data, and similarly for the floor highlighted in the second row.
5.5  Examples of predictions with and without SAW training data. Adding SAW training data can qualitatively improve reflectance and shading predictions. Note the pictures/TV highlighted in the decompositions in the first row, and the improved assignment of texture to the reflectance channel for the paintings and sofa in the second row.
5.6  Precision-recall (PR) curves for shading images on the SAW test set. Left: PR curves generated using the unweighted SAW error metric of [205]. Right: curves generated using our more challenging gradient-weighted metric.
5.7  Qualitative comparisons on the IIW/SAW test sets. Our predictions show significant improvements compared to state-of-the-art algorithms (Bell et al. [26] and Zhou et al. [417]). In particular, our predicted shading channels include significantly less surface texture in several challenging settings.
5.8  Qualitative comparisons on the MIT intrinsics test set. Odd rows: reflectance predictions. Even rows: shading predictions. ⋆ indicates predictions fine-tuned on MIT.
6.1  Crowdsampled plenoptic slices. Given a large number of tourist photos taken at different times of day, our system learns to construct a continuous set of light fields and to synthesize novel views capturing all-times-of-day scene appearance.
6.2  Registered photo collections. Example SfM reconstructions of clusters of Internet photos sharing similar viewpoints, labeled as red dots.
6.3  Renderings of base color and alpha. From left to right: (a) original photos at target viewpoint c_k, (b) our estimated base color at c_k, (c) pseudo-depth computed from the RGBα MPI at c_k using our two-phase approach, (d) pseudo-depth from the baseline. For depth maps, red = close and blue = far.
6.4  Learning framework. Our method builds a reference DeepMPI D^r, consisting of base color, alpha, and latent feature components organized into planar layers. A rendering network G takes a DeepMPI projected to a target viewpoint c_k, and predicts corresponding RGB color layers. The appearance of these layers is modulated by an appearance vector z_s produced by encoder E. The over operation O is applied to the resulting RGBα MPI to render a view. We jointly train the encoder E, rendering network G, and latent features F^r in the DeepMPI by comparing a rendered view with an original exemplar image I_k = I_s. During inference, given an exemplar photo I_s, we can synthesize novel views close to the reference viewpoint, while also preserving the exemplar's appearance.
6.5  Comparisons of images reconstructed with different configurations of our method. The images rendered by our full approach (e) are more similar to the ground truth images (a) than those of other configurations. In particular, the images rendered by the models without AdaIN (b) or the DeepMPI (c) are less realistic, and the model that does not feed the deep buffer Φ^r_s to the encoder (d) fails to capture accurate scene appearance, as indicated in the highlighted regions.
6.6  Appearance transfer comparison. From left to right: (a) exemplar images used to extract appearance vectors, (b) predictions from MUNIT [143], (c) predictions from NRW [238], (d) predictions from our method. Compared to the baselines, our rendered images are more photo-realistic and are more faithful to the appearance of the exemplar images. Please zoom in to the highlighted regions for better visual comparison.
6.7  Appearance interpolation. The left- and rightmost exemplar images indicate the start and end appearance. Intermediate images are generated by linearly interpolating the latent vectors of the two images. Odd rows show interpolation results from NRW [238], and even rows from our method. Moving shadows are indicated in the highlighted regions.
6.8  4D photos. We demonstrate an application of creating 4D photos by performing spatial-temporal interpolation in which both camera viewpoint and scene illumination change simultaneously.
6.9  Limitations. Some failure cases include: (a) input photo collections that do not span the full range of desired viewpoints, or (b) intrinsic limitations of MPIs leading to poor extrapolation to large camera motions. In addition, as shown in (c) (an exemplar image with a strong shadow) and (d) (the resulting rendering), our method can fail to model strong cast shadows produced by occluders outside the reference field of view.
7.1  Scene flow field warping. To render a frame at time i, we perform volume tracing along each ray r_i with the RGBσ at time i, giving us the pixel color Ĉ_i(r_i) (left). To warp the scene from time j to i, we offset each step along r_i using the scene flow f_i→j and volume trace with the associated color and opacity (c_j, σ_j) (right).
7.2  Scene flow disocclusion ambiguity. In this 2D orthographic example, a single blue object translates to the right by one pixel from the frame at time i to the frame at time j. Here, the correct scene flow at the point labeled a, i.e., f_i→j(a), points one unit to the right; however, for the scene flow f_i→j(c) (and similarly f_j→i(a)), there can be multiple answers. If f_i→j(c) = 0, the scene flow would incorrectly point to the foreground in the next frame, and if f_i→j(c) = 1, the scene flow would point to the freespace location d in the next frame.
7.3  Qualitative ablations. Results of our full method with different loss components removed. The odd rows show zoomed-in rendered color and the even rows show the corresponding pseudo-depth. Removing each component reduces the overall quality in different ways.
7.4  Dynamic and static components. Our method learns static and dynamic components within the combined representation. Note that the person is almost still in the second example.
7.5  Static scene representation ablation. Adding a static scene representation yields higher-fidelity renderings, especially in static regions (a,c), when compared to the purely dynamic model (b).
7.6  Novel time synthesis. Rendering images by interpolating the time index (top) yields blending artifacts compared to our scene-flow-based rendering (bottom).
7.7  Qualitative comparisons on the Dynamic Scenes dataset.
Compared with prior methods, our rendered images more closely match the ground truth, and include fewer artifacts, as shown in the highlighted regions.
7.8  Qualitative comparisons on monocular video clips. When compared to the baselines, our approach more correctly synthesizes hidden content in disocclusions (shown in the last three rows), and at locations with complex scene structure, such as the fence in the first row.
7.9  Limitations. Our method is unable to extrapolate content unseen in the training views (a), and has difficulty recovering high-frequency details if a video involves extreme object motion (b,c).
8.1  Facial expression manipulation. Computer vision has enabled control of facial expressions in arbitrary videos by using image and depth information. This has raised significant security concerns regarding misuse in fake news or propaganda. Figure adapted from Thies et al. [349].
8.2  Privacy-preserving image synthesis from 3D reconstruction. From left to right: original photo, an image synthesized from a standard SfM reconstruction [304] using a technique from Pittaluga et al. [278], and an image synthesized from a privacy-preserving SfM 3D reconstruction [105], which excludes sensitive visual information such as humans. Figure adapted from Geppert et al. [333].
8.3  Privacy-preserving 3D representation. Instead of using points as a 3D representation, the use of randomized 3D lines to enable privacy-preserving localization has been proposed. Figure adapted from Speciale et al. [333].
8.4  Unfairness in data-driven computer vision. State-of-the-art facial recognition systems all reveal gender and ethnicity bias in their predictions. These algorithms perform much better on light-skinned males than on dark-skinned females. Figure adapted from Buolamwini et al. [48].
8.5  Dataset bias in depth prediction. Qualitative comparison of a state-of-the-art depth prediction model [281] and the depth prediction models proposed in this thesis, which were trained on MegaDepth [207] and MannequinChallenge [203], using images from the Microsoft COCO dataset [210]. Figure adapted from Ranftl et al. [281].
8.6  Uncertainty modeling in depth prediction. Uncertainty modeling can help identify confident regions during depth prediction, making the model robust to noise and potential attacks. From left to right: input image, ground-truth depth, depth prediction, estimated aleatoric uncertainty, and estimated epistemic uncertainty. Figure adapted from Kendall et al. [173].
8.7  Uncertainty modeling in novel view synthesis. By modeling transient and sensitive objects in Internet photos as aleatoric uncertainty, we obtain novel view synthesis results with better rendering quality and privacy-preserving properties. From left to right: rendered static component, rendered transient component, composite rendered image, original photo, and estimated aleatoric uncertainty. Figure adapted from Martin-Brualla et al. [230].
A.1  Additional example comparisons between MVS depth maps with and without our proposed refinement/cleaning methods.
Column (b) (before filtering): the plinth of the statue in the first row and the "Statue of Liberty" in the second row both show the depth bleeding effect. Column (c) (after filtering): our refinement method corrects or removes such depth values.
A.2  Additional examples of automatic ordinal labeling. Blue mask: foreground (F_ord) derived from semantic segmentation. Red mask: background (B_ord) derived from reconstructed depth.
A.3  Examples of sampled SfM points. Red circles indicate sampled SfM points, with the radius indicating the estimated depth derived from SfM; small radius = small (close) depth, large radius = large (far) depth.
B.1  Network Architecture. Each block with a different color (id) in (a) indicates a convolutional layer. The block labeled H indicates a 3×3 convolutional layer, and all other blocks are implemented as a variant of an Inception module [344], as shown in (b). Parameters for each type of layer are shown in (c). We use bilinear interpolation to upsample features in the network. Figures modified from Chen et al. [63].
E.1  Visual examples of reference base color images. These are over-composited from the base color planes of the reference DeepMPI.
E.2  Visual illustration of the reference mean RGB PSV. Different images in each row indicate different depth planes of the plane sweep volume (PSV). The mean RGB images at different depth planes have different in-focus regions.
E.3  Visual examples of rectified RGB images. The reference rectified images are geometrically stable and globally aligned up to disocclusion.
F.1  Space-time view synthesis. We propose a 3D splatting-based approach to perform space-time interpolation at a specified target viewpoint (shown as a green camera) at an intermediate time i + δ_i. Specifically, we sweep a plane over every ray r emitted from the specified target viewpoint, from front to back. At each sampled step t along the ray, we query the color and density information (c, α), and the scene flows at times i and i + 1. We then displace the 3D points along the ray by the scaled scene flows δ_i f_i→i+1 and (1 − δ_i) f_i→i−1, respectively (left). The displaced 3D points are then splatted from times i and i + 1 onto a (c, α) accumulation buffer at the target viewpoint, and the splats are blended with linear weights 1 − δ_i and δ_i (middle). The final rendered view is obtained by volume rendering the accumulation buffer (right).
F.2  Network architecture of the static (time-invariant) scene representation. Modified from the original NeRF architecture diagram. We predict an extra blending weight field v from intermediate features, along with opacity σ.
F.3  Network architecture of the dynamic (time-variant) scene representation. Modified from the original NeRF architecture diagram. We encode and input time indices i into the MLP, and predict time-dependent scene flow fields F_i and disocclusion weight fields W_i from the intermediate features, along with opacity σ_i.

CHAPTER 1

INTRODUCTION

We live in a complex 4D world, where we rely on our eyes to perceive color, light, and dynamics around us.
For example, when we drive on a highway, we can determine how far away the car in front of us is and how fast it is traveling, while also being able to distinguish whether a dark region is due to shadow or to painted texture. Although these tasks are simple for humans, it is very challenging for a machine to understand such physical properties from images. Therefore, a natural question comes to mind: how can we build a machine capable of modeling a scene from cameras, and of inferring the physical properties of that scene in space and across time?

One way to formulate this problem is as physics-based computer vision, or inverse graphics. As illustrated in Figure 1.1, inverse graphics can be treated as the inverse process of a traditional graphics engine, with the goal of recovering scene geometry, materials, illumination, and motion from one or more images [280]. Inverse graphics techniques are useful in practice, and play an important role in numerous real-world applications. For instance, if we can accurately recover the space-time structure of a scene, then we can enable immersive virtual reality (VR) and augmented reality (AR) experiences. In fact, Google's ARCore, Microsoft's Hololens, and Facebook's Oculus rely on accurate estimates of geometry, appearance, and illumination from their vision algorithms to render graphics from the correct viewpoint with plausible lighting. Inverse graphics also enables many image and video editing functionalities such as object insertion and 3D photography, which have been adopted by products such as Adobe Photoshop and After Effects.

In the past, inverse graphics problems have often been tackled by relying on multiple synchronized cameras or extra sensors [255, 219, 303, 46]. But such setups prevent these methods from scaling to the images and videos we see and take in our everyday lives.

Figure 1.1: Learning geometry, appearance and motion in the wild. Massive numbers of photos and videos are uploaded by people every day (left). My work in this thesis shows how to leverage such visual data to learn scene geometry, appearance, lighting, and motion; based on that, we further demonstrate how to synthesize novel views in both space and time (right).

At the same time, solving inverse graphics from images and videos in the wild is usually a highly ill-posed problem. For example, in the case of single-image inference shown in Figure 1.2, there can be many different physical explanations for a single image observation. However, many of these explanations are very unlikely in the real world. Hence, deep learning is a promising approach for this task, since it can potentially learn useful priors over likely scenes by training on large amounts of data.

This raises another important question: how do we get and use data for learning geometry, appearance, and motion in the wild? Unlike with object recognition datasets such as ImageNet [77] and COCO [210], it is much more difficult to collect accurate human-labeled data via crowdsourcing for inverse graphics problems, since it is hard for humans to accurately label quantities such as depth, object motion, and illumination at scale. My work instead leverages a compelling source of data for solving this problem: massive numbers of unlabeled photos and videos people upload to the Internet every day (Figure 1.1, left). On one hand, the physical properties of our world are implicitly encoded in these photos and videos.
On the other hand, Internet data is extremely massive, unstructured, and uncalibrated, meaning the photos and videos were taken by unknown cameras, at unknown locations, from unknown viewpoints, and at unknown times. By leveraging classical physically-based tools from vision as well as graphics, and combining them with the power of machine learning, my work has made key advances on this problem. In particular, in this thesis, I address three main challenges in capturing scene geometry, appearance, and dynamics from images and videos in the wild.

Figure 1.2: Inverse graphics as an ill-posed problem. From a single image (a), a likely explanation of the underlying scene is shown in (b), but this image could also be a painting (c), a sculpture (d), or an effect created by a set of lights (e). Figure adapted from Adelson and Pentland [5] and Jon Barron [20].

Learning depth in the wild. Estimating the 3D geometry of scenes in the wild is one of the most important tasks for visual reasoning. Although there has been incredible progress in using multi-view geometry for 3D reconstruction of objects, buildings, and even whole cities, classical multi-view methods have key limitations that give rise to exciting research challenges. One key limitation of classical 3D methods is that they require multiple images as input. In contrast, depth estimation from single images is a long-standing, ill-posed problem. While single-view depth perception is a hot topic in computer vision, datasets for this problem are very limited, and typically focus on single domains, such as indoor or street scenes. Machine learning models trained on such data do not generalize to other kinds of photos, such as those in one's personal photo collections. Therefore, in Chapter 2, I explore the use of multi-view Internet photo collections, a virtually unlimited data source, to generate a large number of image/depth pairs as 3D supervision for training a better single-view depth prediction model.

Another key assumption of classical methods is that scenes are static. However, in practice, moving objects, especially people, are very common in many scenarios, such as augmented reality (AR) and video processing. Thus, in Chapter 3, I present a new approach to the challenging problem of dense depth prediction for dynamic scenes with moving people. We take a data-driven approach and learn to predict the depth of moving people from a surprising data source: thousands of YouTube videos of people imitating mannequins. Because the people featured in such videos are stationary, geometric constraints hold, and training data can be generated using classical multi-view reconstruction methods. Further, I propose a novel approach that uses motion parallax cues available in the videos to improve depth prediction. I demonstrate that our technique can be used to create a variety of visual effects such as video defocus, image inpainting, and object insertion for AR applications.1

1 https://www.youtube.com/watch?v=fj_fK74y5_0

Learning material and illumination in the wild. Recovering the material and illumination of a scene, often in the form of intrinsic image decomposition (IID), is another key topic in inverse graphics, involving factorizing an input image into a product of two other images with a physical interpretation: a reflectance image, R, containing the material color, and a shading image, S, that modulates the scene appearance via illumination.
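Concretely, the decomposition assumes the standard Lambertian image formation model, under which each pixel p factors as a product; taking logarithms, as is common in practice, turns the product into a sum (a minimal statement of the model, not the full formulation used in Chapter 4):

\[
I(p) = R(p)\, S(p), \qquad \log I(p) = \log R(p) + \log S(p).
\]

The ill-posedness is immediate from this equation: any per-pixel rescaling of R can be absorbed into S while reproducing the same image I, so additional priors or supervision are needed to pin down a unique decomposition.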
While the community has seen significant progress in IID, it remains an extremely challenging problem due to its ill-posedness. Hence, as with depth prediction, the use of learning is an appealing proposition. Unfortunately, unlike with depth prediction, where we can use depth sensors or multi-view geometry based approaches, there is no device or reliable multi-view algorithm for collecting accurate ground truth for intrinsic images. Thus, in Chapter 4, my work explores a weaker, but readily available, source of training data for IID, Internet time-lapse videos, and leverages consistency between the frames of a given sequence as a supervision signal. Based on this idea, I introduce BigTime, a dataset of unlabeled time-lapse image sequences collected from the Internet. While the sequences in BigTime do not provide ground truth intrinsic images, I can instead exploit rich indirect consistency cues during training by specifying that the model should predict decompositions where the scene appearance itself is consistent across a sequence, and any variation is due to changes in illumination. This work is the first of its kind to show the power of unsupervised learning from Internet time-lapse video for material and illumination estimation in the wild [206].

In Chapter 5, I further explore the use of synthetic data generated by physics-based rendering engines for learning a better intrinsic images model. I introduce CGIntrinsics, a large-scale synthetic dataset with full ground-truth decompositions. We find that a decomposition network trained solely on our data outperforms the state of the art, demonstrating the surprising effectiveness of synthetic data for the intrinsic images task.

Reconstructing the Plenoptic Function in the wild. The intrinsic physical properties estimated by inverse graphics techniques can further enable image-based rendering, the problem of synthesizing images at novel viewpoints from a set of 2D images. Most previous work focuses on rendering novel views in 3D, often described as light fields [199]. These methods usually require synchronized camera arrays, or the assumption that the captured scene is completely static, with time-invariant appearance and geometry. However, if we want to capture the entire world and then render it in mixed reality, it would be impossible to do so by putting such camera arrays in every corner of the world. More importantly, we know our world is dynamic: people, cars, and animals move freely through scenes, and photos and videos can be taken in different seasons or at different times of day. So another question arises: how can we synthesize photo-realistic novel views in both space and time?

In the literature, such space-time view synthesis can be characterized with a function called the plenoptic function [4]. The plenoptic function is a hypothetical function representing everything we can ever see, describing the light reaching an observer at any point in space and time. In other words, we can define this function as the light passing through the camera center at every 3D location, at every possible viewing angle, and at any given time. In Chapters 6 and 7, I demonstrate how to use unlabeled Internet visual data to reconstruct the plenoptic function in the wild.
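In its classic form due to Adelson and Bergen [4], the plenoptic function can be written as a seven-dimensional map from viewing conditions to observed radiance (a standard formulation; in practice the wavelength dimension is often replaced by discrete RGB channels):

\[
P = P(x, y, z, \theta, \phi, \lambda, t),
\]

where (x, y, z) is the position of the observer, (θ, φ) is the viewing direction, λ is the wavelength of light, and t is the time of observation.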
First, to enable modeling of scene appearance and illumination changes over time, in Chapter 6 I present a new approach to novel view synthesis under time-varying illumination from Internet photo collections, without requiring temporal registration. Our approach builds on the recent multi-plane image (MPI) format for representing local light fields under fixed viewing conditions. I introduce a DeepMPI representation, motivated by observations on the sparsity structure of the plenoptic function, that allows for real-time synthesis of photorealistic views that are continuous both in space and across changes in lighting. My method can synthesize the same compelling parallax and view-dependent effects as previous methods, while simultaneously interpolating along changes in reflectance and illumination over time.

Another important temporal factor that leads to appearance change over time is scene dynamics. Thus, in Chapter 7, I present a new method for novel view and time synthesis of complex dynamic scenes, requiring only a monocular video with known camera poses as input. To do this, I introduce Neural Scene Flow Fields, a new representation that models a dynamic scene as a time-variant continuous function of appearance, geometry, and 3D scene motion. This representation is optimized through a neural network to fit the observed input views. I show that it can handle challenging dynamic scenes, including thin structures, view-dependent effects, and natural degrees of motion.

In Chapter 8, I discuss important ethical concerns in computer vision and deep learning, and present concrete case studies related to the topics in this thesis. I then conclude in Chapter 9 with a discussion of future research directions.

CHAPTER 2

LEARNING SINGLE VIEW DEPTH PREDICTION FROM INTERNET PHOTOS

2.1 Introduction

Predicting 3D shape from a single image is an important capability of visual reasoning, with applications in robotics, graphics, and other vision tasks such as intrinsic images. While single-view depth estimation is a challenging, underconstrained problem, deep learning methods have recently driven significant progress. Such methods thrive when trained with large amounts of data. Unfortunately, fully general training data in the form of (RGB image, depth map) pairs is difficult to collect. Commodity RGB-D sensors such as Kinect have been widely used for this purpose [322], but are limited to indoor use. Laser scanners have enabled important datasets such as Make3D [302] and KITTI [237], but such devices are cumbersome to operate (in the case of industrial scanners) or produce sparse depth maps (in the case of LIDAR). Moreover, both Make3D and KITTI were collected in specific scenarios (a university campus, and atop a car, respectively). Training data can also be generated through crowdsourcing, but this approach has so far been limited to gathering sparse ordinal relationships or surface normals [108, 63, 64].

In this chapter, we explore the use of a nearly unlimited source of data for this problem: images from the Internet taken from overlapping viewpoints, from which structure-from-motion (SfM) and multi-view stereo (MVS) methods can automatically produce dense depth. Such images have been widely used in research on large-scale 3D reconstruction [328, 111, 8, 92]. We propose to use the outputs of these systems as the inputs to machine learning methods for single-view depth prediction.
By using large amounts of diverse training data from photos taken around the world, we seek to learn to predict depth with high accuracy and generalizability. Based on this idea, we introduce MegaDepth (MD), a large-scale depth dataset generated from Internet photo collections. To our knowledge, ours is the first use of Internet SfM+MVS data for single-view depth prediction. Our main contribution is the MD dataset itself. In addition, in creating MD, we found that care must be taken in preparing a dataset from noisy MVS data, and so we also propose new methods for processing raw MVS output, along with a corresponding new loss function for training models with this data. Notably, because MVS tends not to reconstruct dynamic objects (people, cars, etc.), we augment our dataset with ordinal depth relationships automatically derived from semantic segmentation, and train with a joint loss that includes an ordinal term. In our experiments, we show that by training on MD, we can learn a model that works well not only on images of new scenes, but that also generalizes well to completely different datasets, including Make3D, KITTI, and DIW—achieving better generalization than prior datasets.

2.2 Related work

Single-view depth prediction. A variety of methods have been proposed for single-view depth prediction, most recently by utilizing machine learning [139, 301]. A standard approach is to collect RGB images with ground truth depth, and then train a model (e.g., a CNN) to predict depth from RGB [85, 213, 214, 298, 13, 191]. Most such methods are trained on a few standard datasets, such as NYU [321, 322], Make3D [302], and KITTI [104], which are captured using RGB-D sensors (such as Kinect) or laser scanning. Such scanning methods have important limitations, as discussed in the introduction. Recently, Novotny et al. [263] trained a network on 3D models derived from SfM+MVS on videos to learn 3D shapes of single objects; however, their method is limited to images of objects, rather than scenes. Multiple views of a scene can also be used as an implicit source of training data for single-view depth prediction, by utilizing view synthesis as a supervisory signal [385, 101, 110, 415]. However, view synthesis is only a proxy for depth, and may not always yield high-quality learned depth. Ummenhofer et al. [355] trained from overlapping image pairs taken with a single camera, and learned to predict image matches, camera poses, and depth; however, their method requires two input images at test time.

Ordinal depth prediction. Another way to collect depth data for training is to ask people to manually annotate depth in images. While labeling absolute depth is challenging, people are good at specifying relative (ordinal) depth relationships (e.g., closer-than, further-than) [108]. Zoran et al. [428] used such relative depth judgments to predict ordinal relationships between points using CNNs. Chen et al. leveraged crowdsourcing of ordinal depth labels to create a large dataset called "Depth in the Wild" [63]. While useful for predicting depth ordering (and so we incorporate ordinal data automatically generated from our imagery), the Euclidean accuracy of depth learned solely from ordinal data is limited.

Depth estimation from Internet photos. Estimating geometry from Internet photo collections has been an active research area for a decade, with advances in both structure from motion [328, 8, 377, 304] and multi-view stereo [111, 95, 306]. These techniques generally operate on 10s to 1000s of images.
Using such methods, past work has used retrieval and SfM to build a 3D model seeded from a single image [305], or registered a photo to an existing 3D model to transfer depth [402]. However, this work requires either having a detailed 3D model of each location in advance, or building one at run-time. Instead, we use SfM+MVS to train a network that generalizes to novel locations and scenarios.

2.3 The MegaDepth Dataset

In this section, we describe how we construct our dataset. We first download Internet photos from Flickr for a set of well-photographed landmarks from the Landmarks10K dataset [201]. We then reconstruct each landmark in 3D using state-of-the-art SfM and MVS methods. This yields an SfM model as well as a dense depth map for each reconstructed image. However, these depth maps have significant noise and outliers, and training a deep network on this raw depth data will not yield a useful predictor. Therefore, we propose a series of processing steps that prepare these depth maps for use in learning, and additionally use semantic segmentation to automatically generate ordinal depth data.

2.3.1 Photo calibration and reconstruction

We build a 3D model from each photo collection using COLMAP, a state-of-the-art SfM system [304] (for reconstructing camera poses and sparse point clouds) and MVS system [306] (for generating dense depth maps). We use COLMAP because we found that it produces high-quality 3D models via its careful incremental SfM procedure, but other such systems could be used. COLMAP produces a depth map D for every reconstructed photo I (where some pixels of D can be empty if COLMAP was unable to recover a depth), as well as other outputs, such as camera parameters and sparse SfM points plus camera visibility.

Figure 2.1: Comparison between MVS depth maps with and without our proposed refinement/cleaning methods. (a) Input photo; (b) raw depth; (c) refined depth. The raw MVS depth maps (middle) exhibit depth bleeding (top) or incorrect depth on people (bottom). Our methods (right) can correct or remove such outlier depths.

2.3.2 Depth map refinement

The raw depth maps from COLMAP contain many outliers from a range of sources, including: (1) transient objects (people, cars, etc.) that appear in a single image but nonetheless are assigned (incorrect) depths, (2) noisy depth discontinuities, and (3) bleeding of background depths into foreground objects. Other MVS methods exhibit similar problems due to inherent ambiguities in stereo matching. Figure 2.1(b) shows two example depth maps produced by COLMAP that illustrate these issues. Such outliers have a highly negative effect on the depth prediction networks we seek to train. To address this problem, we propose two new depth refinement methods designed to generate high-quality training data.

First, we devise a modified MVS algorithm based on COLMAP, but more conservative in its depth estimates, based on the idea that we would prefer less training data over bad training data. COLMAP computes depth maps iteratively, at each stage trying to ensure geometric consistency between nearby depth maps. One adverse effect of this strategy is that background depths can tend to "eat away" at foreground objects, because one way to increase consistency between depth maps is to consistently predict the background depth (see Figure 2.1 (top)). To counter this effect, at each depth inference iteration in COLMAP, we compare the depth values at each pixel before and after the update and keep the smaller (closer) of the two.
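As a minimal sketch (assuming each iteration's depth maps are available as NumPy arrays, with zeros marking pixels that COLMAP left empty), this conservative update amounts to an element-wise minimum over pixels that are valid in both iterations:

```python
import numpy as np

def conservative_depth_update(depth_prev: np.ndarray,
                              depth_new: np.ndarray) -> np.ndarray:
    """Keep the closer (smaller) of the pre- and post-update depths.

    Zeros denote pixels with no reconstructed depth; a pixel valid in
    only one of the two maps keeps its single available value.
    """
    both_valid = (depth_prev > 0) & (depth_new > 0)
    out = np.where(depth_prev > 0, depth_prev, depth_new)  # fall back to whichever exists
    out[both_valid] = np.minimum(depth_prev[both_valid], depth_new[both_valid])
    return out
```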
We then apply a median filter to remove unstable depth values. We describe our modified MVS algorithm in detail in the Appendix.

Second, we utilize semantic segmentation to enhance and filter the depth maps, and to yield large amounts of ordinal depth comparisons as additional training data. The second row of Figure 2.1 shows an example depth map computed with our object-aware filtering. We now describe our use of semantic segmentation in detail.

2.3.3 Depth enhancement via semantic segmentation

Multi-view stereo methods can have problems with a number of object types, including transient objects such as people and cars, difficult-to-reconstruct objects such as poles and traffic signals, and sky regions. However, if we can understand the semantic layout of an image, then we can attempt to mitigate these issues, or at least identify problematic pixels. We have found that deep learning methods for semantic segmentation are starting to become reliable enough for this use [407]. We propose three new uses of semantic segmentation in the creation of our dataset. First, we use such segmentations to remove spurious MVS depths in foreground regions. Second, we use the segmentation as a criterion to categorize each photo as providing either Euclidean depth or ordinal depth data. Finally, we combine semantic information and MVS depth to automatically annotate ordinal depth relationships, which can be used to help training in regions that cannot be reconstructed by MVS.

Semantic filtering. To process a given photo I, we first run semantic segmentation using PSPNet [407], a recent segmentation method, trained on the MIT Scene Parsing dataset (consisting of 150 semantic categories) [412]. We then divide the pixels into three subsets by predicted semantic category:

1. Foreground objects, denoted F, corresponding to objects that often appear in the foreground of scenes, including static foreground objects (e.g., statues, fountains) and dynamic objects (e.g., people, cars).
2. Background objects, denoted B, including buildings, towers, mountains, etc. (See the Appendix for full details of the foreground/background classes.)
3. Sky, denoted S, which is treated as a special case in the depth filtering described below.

We use this semantic categorization of pixels in several ways, as sketched after this paragraph. As illustrated in Figure 2.1 (bottom), transient objects such as people can result in spurious depths. To remove these from each image I, we consider each connected component C of the foreground mask F. If < 50% of pixels in C have a reconstructed depth, we discard all depths from C. We use a threshold of 50%, rather than simply removing all foreground depths, because pixels on certain objects in F (such as sculptures) can indeed be accurately reconstructed (and we found that PSPNet can sometimes mistake sculptures and people for one another). This simple filtering of foreground depths yields large improvements in depth map quality. Additionally, we remove reconstructed depths that fall inside the sky region S, as such depths tend to be spurious.
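A minimal sketch of this foreground filtering, assuming a boolean foreground mask F and sky mask S from the segmentation, and a depth map with zeros at unreconstructed pixels (scipy's connected-component labeling stands in for whatever implementation is actually used):

```python
import numpy as np
from scipy import ndimage

def filter_depth_by_semantics(depth: np.ndarray,
                              foreground: np.ndarray,
                              sky: np.ndarray,
                              keep_ratio: float = 0.5) -> np.ndarray:
    """Drop MVS depths on mostly-unreconstructed foreground components
    and in the sky region; zeros mark removed/invalid depths."""
    out = depth.copy()
    labels, num = ndimage.label(foreground)   # connected components of F
    for c in range(1, num + 1):
        component = labels == c
        valid_frac = (out[component] > 0).mean()  # fraction with reconstructed depth
        if valid_frac < keep_ratio:
            out[component] = 0.0                  # discard all depths from C
    out[sky] = 0.0                                # sky depths tend to be spurious
    return out
```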
Euclidean vs. ordinal depth. For each 3D model we have thousands of reconstructed Internet photos, and ideally we would use as much of this depth data as possible for training. However, some depth maps are more reliable than others, due to factors such as the accuracy of the estimated camera pose or the presence of large occluders. Hence, we found that it is beneficial to limit training to a subset of highly reliable depth maps. We devise a simple but effective way to compute a subset of high-quality depth maps, by thresholding on the fraction of reconstructed pixels. In particular, if ≥ 30% of an image I (ignoring the sky region S) consists of valid depth values, then we keep that image as training data for learning Euclidean depth. This criterion prefers images without large transient foreground objects (e.g., "no selfies"). At the same time, such foreground-heavy images are extremely useful for another purpose: automatically generating training data for learning ordinal depth relationships.

Automatic ordinal depth labeling. As noted above, transient or difficult-to-reconstruct objects, such as people, cars, and street signs, are often missing from MVS reconstructions. Therefore, using Internet-derived data alone, we will lack ground truth depth for such objects, and will likely do a poor job of learning to reconstruct them. To address this issue, we propose a novel method of automatically extracting ordinal depth labels from our training images based on their estimated 3D geometry and semantic segmentation. Let us denote as O ("Ordinal") the subset of photos that do not satisfy the "no selfies" criterion described above. For each image I ∈ O, we compute two regions, F_ord ⊆ F (based on semantic information) and B_ord ⊆ B (based on 3D geometry information), such that all pixels in F_ord are likely closer to the camera than all pixels in B_ord. Briefly, F_ord consists of large connected components of F, and B_ord consists of large components of B that also contain valid depths in the last quartile of the full depth range for I (see the Appendix for full details). We found this simple approach works very well (> 95% accuracy in pairwise ordinal relationships), likely because natural photos tend to be composed in certain common ways. Several examples of our automatic ordinal depth labels are shown in Figure 2.2.

Figure 2.2: Examples of automatic ordinal labeling. Blue mask: foreground (F_ord) derived from semantic segmentation. Red mask: background (B_ord) derived from reconstructed depth.
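Sketched concretely (with hypothetical helper masks following the notation above, and the thresholds stated in the text), the per-image categorization and an ordinal pair sample might look like:

```python
import numpy as np

def categorize_image(depth: np.ndarray, sky: np.ndarray,
                     euclidean_thresh: float = 0.3) -> str:
    """Label an image as Euclidean training data if >= 30% of its
    non-sky pixels have valid (nonzero) MVS depth; otherwise ordinal."""
    frac_valid = (depth[~sky] > 0).mean()
    return "euclidean" if frac_valid >= euclidean_thresh else "ordinal"

def sample_ordinal_pair(f_ord: np.ndarray, b_ord: np.ndarray,
                        rng: np.random.Generator):
    """Pick one pixel from F_ord and one from B_ord; by construction
    the F_ord pixel is labeled as likely closer than the B_ord pixel."""
    fg = np.argwhere(f_ord)
    bg = np.argwhere(b_ord)
    i = tuple(fg[rng.integers(len(fg))])
    j = tuple(bg[rng.integers(len(bg))])
    return i, j  # ordinal label: depth(i) < depth(j)
```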
2.3.4 Creating a dataset

We use the approach above to densely reconstruct 200 3D models from landmarks around the world, representing about 150K reconstructed images. After our proposed filtering, we are left with 130K valid images. Of these 130K photos, around 100K images are used for Euclidean depth data, and the remaining 30K images are used to derive ordinal depth data. We also include images from [180] in our training set. Together, this data comprises the MegaDepth (MD) dataset, available at http://www.cs.cornell.edu/projects/megadepth/.

2.4 Depth estimation network

This section presents our end-to-end deep learning algorithm for predicting depth from a single photo.

2.4.1 Network architecture

We evaluated three networks used in prior work on single-view depth prediction: VGG [84], the "hourglass" network [63], and a ResNet architecture [191]. Of these, the hourglass network performed best, as described in Section 2.5.

2.4.2 Loss function

The 3D data produced by SfM+MVS is only up to an unknown scale factor, so we cannot compare predicted and ground truth depths directly. However, as noted by Eigen and Fergus [85], the ratios of pairs of depths are preserved under scaling (or, in the log-depth domain, the difference between pairs of log-depths). Therefore, we solve for a depth map in the log domain and train using a scale-invariant loss function L_si, which combines three terms:

\[ \mathcal{L}_{si} = \mathcal{L}_{\text{data}} + \alpha \mathcal{L}_{\text{grad}} + \beta \mathcal{L}_{\text{ord}}. \tag{2.1} \]

Scale-invariant data term. We adopt the loss of Eigen and Fergus [85], which computes the mean square error (MSE) of the difference between all pairs of log-depths in linear time. Suppose we have a predicted log-depth map L and a ground truth log-depth map L*, where L_i and L*_i denote corresponding individual log-depth values indexed by pixel position i. Writing R_i = L_i − L*_i, we define:

\[ \mathcal{L}_{\text{data}} = \frac{1}{n} \sum_{i=1}^{n} R_i^2 \;-\; \frac{1}{n^2} \Big( \sum_{i=1}^{n} R_i \Big)^{2} \tag{2.2} \]

where n is the number of valid depths in the ground truth depth map.

Multi-scale scale-invariant gradient matching term. To encourage smoother gradient changes and sharper depth discontinuities in the predicted depth map, we introduce a multi-scale scale-invariant gradient matching term L_grad, defined as an ℓ1 penalty on differences in log-depth gradients between the predicted and ground truth depth maps:

\[ \mathcal{L}_{\text{grad}} = \frac{1}{n} \sum_{k} \sum_{i} \left( \left| \nabla_x R_i^{k} \right| + \left| \nabla_y R_i^{k} \right| \right) \tag{2.3} \]

where R_i^k is the value of the log-depth difference map at position i and scale k. Because the loss is computed at multiple scales, L_grad captures depth gradients across large image distances. In our experiments, we use four scales. We illustrate the effect of L_grad in Figure 2.3.

Figure 2.3: Effect of the L_grad term. L_grad encourages predictions to match the ground truth depth gradients.

Ordinal depth loss. Inspired by Chen et al. [63], our ordinal depth loss term L_ord utilizes the automatic ordinal relations described in Section 2.3.3. During training, for each image in our ordinal set O, we pick a single pair of pixels (i, j), drawn from the foreground region F_ord and the background region B_ord. L_ord is designed to be robust to the small number of incorrectly ordered pairs:

\[ \mathcal{L}_{\text{ord}} = \begin{cases} \log\left(1 + \exp(P_{ij})\right) & \text{if } P_{ij} \le \tau \\ \log\left(1 + \exp\left(\sqrt{P_{ij}}\right)\right) + c & \text{if } P_{ij} > \tau \end{cases} \tag{2.4} \]

where P_ij = −r*_ij (L_i − L_j), and r*_ij is the automatically labeled ordinal depth relation between i and j (r*_ij = 1 if pixel i is further than j, and −1 otherwise). c is a constant set so that L_ord is continuous. L_ord encourages the depth difference of a pair of points to be large (and correctly ordered) if our automatic labeling method judged the pair to have a likely depth ordering. We illustrate the effect of L_ord in Figure 2.4. In our tests, we set τ = 0.25 based on cross-validation.

Figure 2.4: Effect of the L_ord term. L_ord tends to correct ordinal depth relations for hard-to-reconstruct objects, such as the person in the first row and the tree in the second row.
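A minimal PyTorch sketch of the data term (Eq. 2.2) in its linear-time form, assuming log-depth tensors and a boolean validity mask; this illustrates the formula rather than reproducing the exact training code:

```python
import torch

def scale_invariant_data_loss(log_pred: torch.Tensor,
                              log_gt: torch.Tensor,
                              valid: torch.Tensor) -> torch.Tensor:
    """Eq. 2.2: mean of squared log-depth residuals minus the squared
    mean residual, computed only over pixels with valid ground truth."""
    r = (log_pred - log_gt)[valid]   # R_i = L_i - L*_i over valid pixels
    n = r.numel()
    return (r ** 2).sum() / n - (r.sum() ** 2) / (n ** 2)
```

The gradient term (Eq. 2.3) applies the same residual R at several pyramid scales and penalizes |∇x R| + |∇y R| at each one.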
2.5 Evaluation

In this section, we evaluate our networks on a number of datasets and compare to several state-of-the-art depth prediction algorithms trained on a variety of training data. In our evaluation, we seek to answer several questions, including:

• How well does our model trained on MD generalize to new Internet photos from never-before-seen locations?
• How important is our depth map processing? What is the effect of the terms in our loss function?
• How well does our model trained on MD generalize to other types of images from other datasets?

The third question is perhaps the most interesting, because the promise of training on large amounts of diverse data is good generalization. Therefore, we run a set of experiments training on one dataset and testing on another, and show that our MD dataset gives the best generalization performance. We also show that our depth refinement strategies are essential for achieving good generalization, and that our proposed loss function—combining scale-invariant data terms with an ordinal depth loss—improves prediction performance both quantitatively and qualitatively.

Experimental setup. Out of the 200 reconstructed models in our MD dataset, we randomly select 46 to form a test set (locations not seen during training). For the remaining 154 models, we randomly split the images from each model into training and validation sets with a ratio of 96% to 4%. We set α = 0.5 and β = 0.1 using the MD validation set. We implement our networks in PyTorch [272], and train using Adam [177] for 20 epochs with batch size 32. For fair comparison, we train and validate our network using MD data for all experiments. Due to variance in the performance of cross-dataset testing, we train four models on MD and compute the average error (see the Appendix for the performance of each individual model).

2.5.1 Evaluation and ablation study on MD test set

In this subsection, we describe experiments where we train on our MD training set and test on the MD test set.

Error metrics. For numerical evaluation, we use two scale-invariant error measures (as with our loss function, we use scale-invariant measures due to the scale-free nature of SfM models). The first measure is the scale-invariant RMSE (si-RMSE) (Equation 2.2), which measures precise numerical depth accuracy. The second measure is based on the preservation of depth ordering. In particular, we use a measure similar to [428, 63] that we call the SfM Disagreement Rate (SDR). SDR is based on the rate of disagreement with ordinal depth relationships derived from estimated SfM points. We use sparse SfM points rather than dense MVS because we found that sparse SfM points capture some structures not reconstructed by MVS (e.g., complex objects such as lampposts). We define SDR(D, D*), the ordinal disagreement rate between the predicted (non-log) depth map D = exp(L) and ground-truth SfM depths D*, as:

\[ \text{SDR}(D, D^*) = \frac{1}{n} \sum_{i,j \in \mathcal{P}} \mathbb{1}\left( \text{ord}(D_i, D_j) \neq \text{ord}(D_i^*, D_j^*) \right) \tag{2.5} \]

where P is the set of pairs of pixels with available SfM depths to compare, n is the total number of pairwise comparisons, and ord(·, ·) is one of three depth relations (further-than, closer-than, and same-depth-as):

\[ \text{ord}(D_i, D_j) = \begin{cases} 1 & \text{if } D_i / D_j > 1 + \delta \\ -1 & \text{if } D_i / D_j < 1 - \delta \\ 0 & \text{if } 1 - \delta \le D_i / D_j \le 1 + \delta \end{cases} \tag{2.6} \]

We also define SDR= and SDR≠ as the disagreement rates over pairs with ord(D*_i, D*_j) = 0 and ord(D*_i, D*_j) ≠ 0, respectively. In our experiments, we set δ = 0.1 for tolerance to uncertainty in SfM points. For efficiency, we sample SfM points from the full set to compute this error term.
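A direct NumPy transcription of Eqs. 2.5–2.6 over a sampled set of SfM point pairs (a sketch; `pred` and `gt` are flattened depth arrays, and pairs are sampled rather than enumerated, as in the text):

```python
import numpy as np

def ord_rel(d_i: np.ndarray, d_j: np.ndarray, delta: float = 0.1) -> np.ndarray:
    """Eq. 2.6: +1 further-than, -1 closer-than, 0 same-depth-as."""
    ratio = d_i / d_j
    return np.where(ratio > 1 + delta, 1, np.where(ratio < 1 - delta, -1, 0))

def sdr(pred: np.ndarray, gt: np.ndarray, pairs: np.ndarray,
        delta: float = 0.1) -> float:
    """Eq. 2.5: fraction of sampled pairs whose predicted ordinal relation
    disagrees with the SfM-derived relation. `pairs` holds (i, j) index
    pairs into the flattened depth arrays."""
    i, j = pairs[:, 0], pairs[:, 1]
    disagree = ord_rel(pred[i], pred[j], delta) != ord_rel(gt[i], gt[j], delta)
    return float(disagree.mean())
```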
Effect of network and loss variants. We evaluate three popular network architectures for depth prediction on our MD test set: the VGG network used by Eigen et al. [84], an "hourglass" (HG) network [63], and ResNet [191]. To compare our loss function to that of Eigen et al. [84], we also test the same network and loss function as [84] trained on MD; [84] uses a VGG network with a scale-invariant loss plus a single-scale gradient matching term. Quantitative results are shown in Table 2.1 and qualitative comparisons in Figure 2.5.

Network         si-RMSE   SDR=%   SDR≠%   SDR%
VGG* [84]       0.116     31.28   28.63   29.78
VGG (full)      0.114     29.34   26.91   27.53
ResNet (full)   0.112     26.25   24.23   25.14
HG (full)       0.103     28.00   23.74   25.59

Table 2.1: Results on the MD test set (places unseen during training) for several network architectures. For VGG* we use the same loss and network architecture as in [84] for comparison to [84]. Lower is better.

We also evaluate variants of our method trained using only some of our loss terms: (1) a version with only the scale-invariant data term L_data (the same loss as in [85]), (2) a version that adds our multi-scale gradient matching loss L_grad, and (3) the full version including L_grad and the ordinal depth loss L_ord. Results are shown in Table 2.2.

Method             si-RMSE   SDR=%   SDR≠%   SDR%
L_data only        0.146     32.32   29.96   30.08
+ L_grad           0.111     25.17   27.32   26.11
+ L_grad + L_ord   0.103     28.00   23.74   25.59

Table 2.2: Results on the MD test set (places unseen during training) for different loss configurations. Lower is better.

As shown in Tables 2.1 and 2.2, the HG architecture achieves the best performance of the three architectures, and training with our full loss yields better performance than the other loss variants, including that of [84] (first row of Table 2.1). Note that adding L_ord significantly improves SDR≠, while slightly increasing SDR=. Figure 2.5 shows that our joint loss helps preserve the structure of the depth map and capture nearby objects such as people and buses.

Figure 2.5: Depth predictions on the MD test set. (Blue=near, red=far.) For visualization, we mask out the detected sky region. (a) Input photo. (b) Ground truth COLMAP depth map (GT). (c) VGG* prediction using the loss and network of [84]. (d) Depth prediction from a ResNet [191]. (e) Depth prediction from an hourglass (HG) network [63].

Finally, we experiment with training our network on MD with and without our proposed depth refinement methods, testing on three datasets: KITTI, Make3D, and DIW. The results, shown in Table 2.3, show that networks trained on raw MVS depth do not generalize well; our proposed refinements significantly boost prediction performance.

Test set   Error measure   Raw MD   Clean MD
Make3D     RMS             11.41    5.322
           Abs Rel         0.614    0.364
           log10           0.386    0.152
KITTI      RMS             12.15    6.621
           RMS(log)        0.582    0.369
           Abs Rel         0.433    0.307
           Sq Rel          3.927    2.546
DIW        WHDR%           31.32    24.55

Table 2.3: Results on three different test sets with and without our depth refinement methods. Raw MD indicates raw depth data; Clean MD indicates depth data produced using our refinement methods. Lower is better for all error measures.

2.5.2 Generalization to other datasets

A powerful application of our 3D-reconstruction-derived training data is generalization to outdoor images beyond landmark photos. To evaluate this capability, we train our model on MD and test on three standard benchmarks: Make3D [301], KITTI [104], and DIW [63]—without seeing training data from these datasets. Since our depth prediction is defined only up to a scale factor, for each dataset we align each prediction with the ground truth using a single scaling factor computed from the ratio between ground truth and predicted depths, as sketched below.
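A minimal sketch of this alignment step. The exact choice of ratio statistic is not specified above; the median ratio over valid pixels is assumed here, with a least-squares fit in log space being a similar alternative:

```python
import numpy as np

def align_scale(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Scale a prediction to the ground truth with a single factor,
    here the median ratio of ground-truth to predicted depth."""
    valid = gt > 0
    s = np.median(gt[valid] / pred[valid])
    return s * pred
```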
Make3D. To test on Make3D, we follow the protocol of prior work [214, 191], resizing all images to 345 × 460 and removing ground truth depths larger than 70m (since Make3D data is unreliable at large distances). We train our network only on MD using our full loss. Table 2.4 shows numerical results, including comparisons to several methods trained on both Make3D and non-Make3D data, and Figure 2.6 visualizes depth predictions from our model and several other non-Make3D-trained models. Our network trained on MD has the best performance among all non-Make3D-trained models, and even outperforms several models trained directly on Make3D. Finally, the last row of Table 2.4 shows that our model fine-tuned on Make3D achieves better performance than the state of the art.

Training set   Method                RMS     Abs Rel   log10
Make3D         Karsch et al. [170]   9.2     0.355     0.127
               Liu et al. [216]      9.49    0.335     0.137
               Liu et al. [213]      8.6     0.314     0.119
               Li et al. [200]       7.19    0.278     0.092
               Laina et al. [191]    4.45    0.176     0.072
               Xu et al. [386]       4.38    0.184     0.065
NYU            Eigen et al. [84]     6.96    0.427     0.180
               Liu et al. [213]      7.96    0.438     0.186
               Laina et al. [191]    7.99    0.466     0.195
KITTI          Zhou et al. [415]     10.47   0.383     0.478
               Godard et al. [110]   11.76   0.544     0.193
DIW            Chen et al. [63]      5.59    0.424     0.176
MD             Ours                  5.32    0.364     0.152
MD+Make3D      Ours                  4.26    0.176     0.069

Table 2.4: Results on Make3D for various training datasets and methods. The first column indicates the training dataset. Errors for "Ours" are averaged over four models trained/validated on MD. Lower is better for all metrics.

Figure 2.6: Depth predictions on Make3D. The last four columns show results from the best models trained on non-Make3D datasets (the final column is our result).

KITTI. Next, we evaluate our model on the KITTI test set based on the split of [85]. As with our Make3D experiments, we do not use images from KITTI during training. The KITTI dataset is very different from ours, consisting of driving sequences that include objects, such as sidewalks, cars, and people, that are difficult to reconstruct with SfM/MVS. Nevertheless, as shown in Table 2.5, our MD-trained network still outperforms approaches trained on non-KITTI datasets. In particular, our performance is similar to that of Zhou et al. [415] trained on the Cityscapes (CS) dataset, which, like KITTI, consists of driving image sequences; in contrast, our MD dataset contains much more diverse scenes. Finally, the last row of Table 2.5 shows that we can achieve state-of-the-art performance by fine-tuning our network on KITTI training data. Figure 2.7 shows visual comparisons between our results and models trained on other non-KITTI datasets. We achieve much better visual quality than the other non-KITTI-trained models, and our predictions reasonably capture nearby objects such as traffic signs, cars, and trees, thanks to our ordinal depth loss.

Figure 2.7: Depth predictions on KITTI. (Blue=near, red=far.) None of the models were trained on KITTI data. From left to right: (a) input image, (b) ground truth (GT), (c) model trained on DIW [63], (d) model trained on Make3D [191], (e) ours trained on MD.

Training set   Method                RMS     RMS(log)   Abs Rel   Sq Rel
KITTI          Liu et al. [214]      6.52    0.275      0.202     1.614
               Eigen et al. [85]     6.31    0.282      0.203     1.548
               Zhou et al. [415]     6.86    0.283      0.208     1.768
               Godard et al. [110]   5.93    0.247      0.148     1.334
Make3D         Laina et al. [191]    8.50    0.397      0.311     3.201
               Liu et al. [213]      11.88   0.416      0.365     7.591
NYU            Eigen et al. [84]     10.47   0.492      0.367     3.716
               Liu et al. [213]      10.19   0.446      0.321     3.118
               Laina et al. [191]    10.58   0.508      0.390     3.939
CS             Zhou et al. [415]     7.58    0.334      0.267     2.686
DIW            Chen et al. [63]      7.11    0.471      0.409     3.270
MD             Ours                  6.62    0.369      0.307     2.546
MD+KITTI       Ours                  5.90    0.241      0.141     1.328

Table 2.5: Results on the KITTI test set for various training datasets and approaches.
Columns are as in Table 2.4.

DIW. Finally, we test our network on the DIW dataset [63]. DIW consists of Internet photos with general scene structures; each image in DIW has a single pair of points with a human-labeled ordinal depth relationship. As with Make3D and KITTI, we do not use DIW data during training. For DIW, quality is computed via the Weighted Human Disagreement Rate (WHDR), which measures the frequency of disagreement between predicted depth maps and human annotations on a test set. Numerical results are shown in Table 2.6. Our MD-trained network again has the best performance among all non-DIW-trained models.

Training set   Method                WHDR%
DIW            Chen et al. [63]      22.14
KITTI          Zhou et al. [415]     31.24
               Godard et al. [110]   30.52
NYU            Eigen et al. [84]     25.70
               Laina et al. [191]    45.30
               Liu et al. [213]      28.27
Make3D         Laina et al. [191]    31.65
               Liu et al. [213]      29.58
MD             Ours                  24.55

Table 2.6: Results on the DIW test set for various training datasets and approaches. Columns are as in Table 2.4.

Figure 2.8 visualizes our predictions and those of other non-DIW-trained networks on DIW test images. Our predictions achieve visually better depth relationships, and our method works reasonably well even for challenging scenes such as offices and close-ups.

Figure 2.8: Depth predictions on the DIW test set. (Blue=near, red=far.) Column layout follows Figure 2.7. None of the models were trained on DIW data.

2.6 Discussion

In this chapter, we presented a new use for Internet-derived SfM+MVS data: generating large amounts of training data for single-view depth prediction. We demonstrated that this data can be used to predict state-of-the-art depth maps for locations never observed during training, and that the resulting models generalize very well to other datasets. However, our method also has a number of limitations. MVS methods still do not perfectly reconstruct even static scenes, particularly when there are oblique surfaces (e.g., ground), thin or complex objects (e.g., lampposts), and difficult materials (e.g., shiny glass). Our method does not predict metric depth; future work in SfM could use learning or semantic information to correctly scale scenes. Our dataset is currently biased towards outdoor landmarks, though by scaling to much larger input photo collections we will find more diverse scenes. Despite these limitations, our work points towards the Internet as an intriguing, useful source of data for geometric learning problems.

CHAPTER 3
LEARNING THE DEPTHS OF MOVING PEOPLE BY WATCHING FROZEN PEOPLE

3.1 Introduction

A hand-held camera capturing video of a dynamic scene is a common scenario. Recovering dense geometry in this case is a challenging task: moving objects violate the epipolar constraint commonly used in 3D vision (Figure 3.2), and are often treated as noise or outliers in existing structure-from-motion (SfM) and multi-view stereo (MVS) methods. Human depth perception, however, is not easily fooled by object motion—rather, we maintain a feasible interpretation of objects' geometry and depth ordering even when both the observer and the objects are moving, and even when the scene is observed with just one eye [140]. In this work, we take a step towards achieving this ability computationally. We focus on the task of predicting accurate, dense depth from ordinary videos where both the camera and people in the scene are naturally moving. We focus on humans for two reasons: i) in many application areas, such as augmented reality, humans constitute the salient objects in the scene, and ii) human motion is articulated and difficult to model.
By taking a data-driven approach, we avoid the need to explicitly impose assumptions on the shape or deformation of people, and instead learn these priors from data. Where do we get data to train such a method? Generating high-quality synthetic data where both the camera and the people in the scene are naturally moving is very challenging. One approach would be to record real scenes with an RGBD sensor (e.g., a Microsoft Kinect), but such data is typically limited to indoor environments and requires significant manual work to capture and process. In addition, if such a dataset is captured in the lab, a model trained on it may have difficulty generalizing to real scenes. It is also difficult to gather a diverse collection of people in diverse poses at scale.

Instead, we derive data from a surprising source: YouTube videos in which people imitate mannequins, i.e., freeze in elaborate, natural poses, while a hand-held camera tours the scene (Figure 3.3). These videos comprise our new MannequinChallenge (MC) dataset, which we have released for the research community [202]. Because the entire scene in such videos is stationary—including the people—we can accurately estimate camera poses and depth using modern SfM and MVS algorithms, and then use this derived 3D data as supervision for training a model to predict depth for moving scenes.

In particular, we design and train a deep neural network that takes an input RGB image, a mask indicating human regions, and an initial depth defined for the static environment (i.e., the non-human regions), and outputs a dense depth map over the entire image—both the environment and the people. Note that the initial environmental depth is computed using motion parallax between two video frames, providing the network with information not available from a single frame. Once trained, our model can handle natural videos with arbitrary camera and human motion. Figure 3.1 illustrates our approach.

Figure 3.1: Our model predicts dense depth when both an ordinary camera and people in the scene are freely moving (right). We train our model on our new MannequinChallenge dataset—a collection of Internet videos of people imitating mannequins, i.e., freezing in diverse, natural poses, while a camera tours the scene (left). Because people are stationary, geometric constraints hold; this allows us to use multi-view stereo to estimate depth, which serves as supervision during training. In all figures, we use inverse depth maps for visualization purposes, and refer to them as depth maps.

We demonstrate our method on a variety of real-world Internet videos shot with a hand-held camera and depicting complex human actions such as walking, running, and dancing. Our model predicts depth with higher accuracy than state-of-the-art monocular depth prediction and motion stereo methods. We further show how our predicted depth maps can be used to produce various 3D effects such as synthetic depth-of-field, depth-aware inpainting, and insertion of virtual objects into 3D scenes with correct occlusion.

In summary, our contributions are: i) a new source of data for depth prediction consisting of a large number of Internet videos in which the camera moves around people
“frozen” in natural poses, along with a methodology for generating accurate depth maps and camera poses; and ii) a deep-network-based model that makes use of motion parallax cues from video sequences, and that is designed and trained to predict dense depth maps in the challenging case of simultaneous camera motion and complex human motion.

3.2 Related Work

Learning-based depth prediction. Numerous algorithms, based on both supervised and unsupervised learning, have recently been proposed for predicting dense depth from a single RGB image [387, 191, 94, 85, 63, 207, 306, 109, 416, 398, 226, 362]. However, because these methods use a single RGB image, they ignore useful motion parallax cues present in video sequences. Some recent learning-based methods also consider multiple images for depth estimation, either assuming known camera poses [141, 395] or simultaneously predicting camera poses along with depth [354, 413]. However, these methods assume that the captured scenes are completely static; they are not designed to estimate depth for dynamic objects, which is the focus of our work.

Figure 3.2: Traditional stereo (static scene, stereo camera) vs. our setup (moving camera, moving people). Left: a person is observed at the same time instant from two different views, and the 3D position of points can be computed using triangulation. Right: when both the camera and the objects in the scene are moving, triangulation is no longer possible, since the epipolar constraint does not apply.

Depth estimation for dynamic scenes. Depth information captured from RGBD sensors or stereo cameras has been widely used for 3D modeling of dynamic scenes [255, 427, 396, 79, 150, 369, 159, 285, 22, 21]. However, only a few methods attempt to estimate depth from a monocular camera. Several methods have sought to reconstruct sparse geometry for dynamic scenes using either a single monocular camera [270, 411, 324] or multiple unsynchronized cameras [360]. Russell et al. [299] and Ranftl et al. [282] propose motion/object segmentation–based algorithms that decompose a dynamic scene into piecewise rigid parts before inferring depth ordering. However, these methods impose strong assumptions about object motion that can be violated by articulated human motion. More recently, Rematas et al. [284] predict depth for moving soccer players using synthetic training data from FIFA video games, but their method is limited to soccer players and cannot handle general people in the wild.

RGBD datasets for learning depth. There are a number of RGBD datasets of indoor scenes, captured using depth sensors [322, 56, 72, 382] or rendered from synthetic data [330]. However, none of these datasets provide depth supervision for moving people in natural environments. In particular, several action recognition methods use depth sensors to capture human actions [424, 319, 233, 256], but most of these use a static camera and provide only a limited number of indoor scenes. REFRESH [223] is a recent semi-synthetic scene flow dataset created by overlaying animated people on NYUv2 images. Here, too, the data is limited to interior scenes and consists of synthetic humans placed in unrealistic configurations with respect to their surroundings; the resulting trained models thus have limited ability to generalize to real scenarios.

Human shape and pose prediction. Recovery of a posed 3D human mesh from a single RGB image has attracted significant attention [193, 116, 167, 37, 273, 234].
Recent methods achieve impressive results on natural images spanning a variety of poses, and some can also model fine details such as hair and clothing [389, 120]. However, such approaches do not model geometric relations between the people and the static parts of the scene. Finally, many of these methods rely on correctly detecting human keypoints, requiring most of the body to be visible in each video frame.

3.3 The MannequinChallenge Dataset

The Mannequin Challenge [374] is a popular video trend in which people freeze in place—often in interesting poses—while the camera operator moves around the scene filming them. Thousands of such videos have been created and uploaded to YouTube since late 2016. These videos comprise our new MannequinChallenge (MC) Dataset [202], which spans a wide range of scenes, with people of different ages naturally posing in different group configurations (see Figure 3.3). To the extent that people succeed in staying still during the videos, we can assume the scenes are static and obtain accurate camera poses and depth information by processing them with SfM and MVS algorithms. However, recovering accurate geometry from such raw Internet videos is challenging, and requires careful filtering of noisy video clips and individual frames in each clip. After processing, we obtain around 2,000 candidate videos, from which we derive 4,690 sequences comprising a total of more than 170K valid image-depth pairs. We now describe in detail how we process the raw videos and derive our training data.

Figure 3.3: Sample images from Mannequin Challenge videos. Each image is a frame from a video sequence in which the camera is moving but the humans are all static. The videos span a variety of natural scenes, poses, and configurations of people.

Estimating camera poses. Following a similar approach to Zhou et al. [418], we use ORB-SLAM2 [251] to identify trackable sequences in each video and to estimate an initial camera pose for each frame. At this stage, we process a lower-resolution version of the video for efficiency, and set the field of view to 60 degrees (a typical value for modern cell-phone cameras). We then reprocess each sequence at a higher resolution using a visual SfM system [304], which refines the initial camera poses and intrinsic parameters. This method extracts and matches features across frames in the videos, then performs a global bundle adjustment optimization. Finally, sequences with non-smooth camera motion are removed using the technique of Zhou et al. [418], as we observe that such sequences often have erroneous camera poses.

Computing dense depth with MVS. Once the camera poses for each clip are estimated, we reconstruct each scene's dense geometry. In particular, we recover per-frame dense depth maps using COLMAP, a state-of-the-art MVS system [306]. Because our data consists of challenging Internet videos that exhibit camera motion blur, shadows, reflections, etc., the raw depth maps estimated by MVS are often too noisy for use in training a model. We address this issue with a careful depth cleaning procedure. We first filter outlier depths using the depth refinement method proposed by Li and Snavely [207]. We further remove erroneous depth values by considering the consistency between the MVS depth and the depth obtained from motion parallax between pairs of frames.
Specifically, for each frame, we compute a normalized error ∆(p) for every valid pixel p:

\[ \Delta(p) = \frac{\left| D_{\text{MVS}}(p) - D_{pp}(p) \right|}{D_{\text{MVS}}(p) + D_{pp}(p)} \tag{3.1} \]

where D_MVS is the depth map obtained by MVS and D_pp is the depth map computed from two-frame motion parallax (see Section 3.4.1). Depth values for which ∆(p) > δ are removed, where we empirically set δ = 0.2. Figure 3.4 shows examples of MVS depth maps before and after our proposed depth cleaning method; the regions circled in yellow illustrate that our method can effectively remove incorrect depth regions. Because these depth maps serve as supervision during training, this filtering has a significant impact on our model's performance, as shown in our experiments (Sec. 3.5.2). Figure 3.7 shows additional examples of our processed sequences with corresponding estimated MVS depths after cleaning.

Figure 3.4: Effect of depth cleaning. (a-b) Raw MVS depth maps, D_MVS, may contain errors and outliers, especially in untextured regions (see regions circled in yellow). (c) Our depth cleaning method effectively filters out such erroneous depth values.

Filtering clips. Several factors can make a video clip unsuitable for training. For example, people may "unfreeze" (start moving) at some point in the video, or the video may contain synthetic graphical elements in the background. Dynamic objects and synthetic backgrounds do not obey multi-view geometric constraints and hence are treated as outliers and filtered out by MVS, potentially leaving few valid pixels. Therefore, we remove frames where < 20% of pixels have valid MVS depth after our two-pass cleaning stage. Further, we remove frames where the estimated radial distortion coefficient |k1| > 0.1 (indicative of a fisheye camera), or where the estimated focal length is ≤ 0.6 or ≥ 1.2 (indicating that the camera parameters are likely inaccurate). We keep sequences that are at least 30 frames long, have an aspect ratio of 16:9, and have a width of ≥ 1600 pixels. Finally, we visually inspect the trajectories and point clouds of the remaining sequences and remove obviously incorrect reconstructions. Figure 3.5 shows examples of images filtered out of the raw Mannequin Challenge video clips by our data creation pipeline; these examples include images captured by fisheye cameras, as well as images with large regions of synthetic background or moving objects.

Figure 3.5: Sample frames from clips removed during filtering. (a) Videos captured with fisheye cameras; (b) videos with synthetic backgrounds; (c) sequences with truly moving objects (pairs of frames shown in each column).

After processing, we obtain 4,690 sequences with a total of more than 170K valid image-depth pairs. We split our MC dataset into training, validation, and testing sets with an 80:3:17 split over clips.
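A minimal NumPy sketch of the parallax-consistency test in Eq. 3.1, assuming both depth maps are arrays with zeros at invalid pixels:

```python
import numpy as np

def clean_mvs_depth(d_mvs: np.ndarray, d_pp: np.ndarray,
                    delta: float = 0.2) -> np.ndarray:
    """Remove MVS depths that disagree with two-frame parallax depth
    (Eq. 3.1); zeros mark invalid pixels in the inputs and output."""
    out = d_mvs.copy()
    valid = (d_mvs > 0) & (d_pp > 0)
    err = np.zeros_like(d_mvs)
    err[valid] = np.abs(d_mvs[valid] - d_pp[valid]) / (d_mvs[valid] + d_pp[valid])
    out[valid & (err > delta)] = 0.0
    return out
```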
3.4 Depth Prediction Model

We train our depth prediction model on our MannequinChallenge dataset in a supervised manner, i.e., by regressing to the depth generated by the SfM and MVS pipeline. A key question is how to structure the input to the network to allow training on frozen people but inference on moving people. One possible approach is to regress to depth from a single RGB image (RGB-to-depth), but this approach disregards geometric information about the static regions of the scene that is available by considering more than a single frame. To benefit from such information, we design a two-frame model that uses depth estimated from motion parallax for the static, non-human regions of the scene (Figure 3.6).

Figure 3.6: System overview. Our model takes as input the RGB frame, a human segmentation mask, masked depth from motion parallax (via optical flow and SfM pose), and an associated confidence map. We ask the network to use these inputs to predict depths that match the ground truth MVS depth.

The full input to our network (Figure 3.7) includes 1) a reference image Ir, 2) a binary mask M indicating human regions, 3) an initial depth map Dpp estimated from motion parallax, with human regions removed, 4) a confidence map C, and 5) an optional human keypoint map K. We assume known, accurate camera poses from SfM during both training and inference; in an online inference-time setting, accurate camera poses can also be obtained using visual-inertial odometry. Given these inputs, the network predicts a full depth map for the entire scene. To match the MVS depth values, the network must inpaint the depth in human regions, refine the depth in non-human regions from the estimated Dpp, and finally make the depth of the entire scene consistent.

Figure 3.7: System inputs and training data. The input to our network consists of: (a) an RGB image, (b) a human mask, (c) a masked depth map computed from motion parallax w.r.t. a selected source image, and (d) a masked confidence map. Low-confidence regions (dark circles) in the first two rows indicate the vicinity of the camera epipole, where depth from parallax is unreliable and removed. The network is trained to regress to MVS depth (e).

Our network architecture is a variant of the hourglass network proposed by Chen et al. [63]. Specifically, the network has a standard encoder-decoder U-Net structure, with matching input and output resolution, consisting of approximately 5M parameters. In addition, an Inception module variant [344] is used in each convolutional layer of the network. We replace nearest-neighbor upsampling layers with bilinear upsampling layers, which we found to produce sharper depth maps while slightly improving overall accuracy. We refer readers to the Appendix and to Chen et al. [63] for full details of our network architecture.
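As a minimal PyTorch sketch (names are illustrative, not the released code), the network can consume these inputs as a single channel-stacked tensor:

```python
import torch

def build_network_input(rgb: torch.Tensor,          # (3, H, W) reference image I_r
                        human_mask: torch.Tensor,   # (1, H, W) binary mask M
                        log_d_pp: torch.Tensor,     # (1, H, W) log parallax depth D_pp
                        confidence: torch.Tensor,   # (1, H, W) confidence map C
                        keypoints: torch.Tensor = None  # (1, H, W) optional map K
                        ) -> torch.Tensor:
    """Concatenate the model inputs along the channel dimension; the
    depth and confidence channels are zeroed inside human regions."""
    env = 1.0 - human_mask
    channels = [rgb, human_mask, log_d_pp * env, confidence * env]
    if keypoints is not None:
        channels.append(keypoints)
    return torch.cat(channels, dim=0)  # (6, H, W), or (7, H, W) with keypoints
```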
The following sections describe our model inputs and training losses in detail.

3.4.1 Depth from motion parallax

Motion parallax between two video frames provides an initial depth estimate for the static regions of the scene. We assume humans are dynamic while the rest of the scene is static. Specifically, for each reference frame Ir, we select another frame Is in the video and estimate an optical flow field from Ir to Is using FlowNet2.0 [146]. Given the estimated flow field and the relative camera poses between the two views, we then compute an initial depth map using the Plane-Plus-Parallax (P+P) representation [152, 379]. Note that P+P is typically used to estimate the relative structure of a scene with respect to a reference plane, either a plane in the scene or a virtual reference plane. In our case, we use it as a means to cancel out relative camera rotation, as described below.

Formally, suppose we have a relative camera pose relating Is and Ir, consisting of a 3D rotation R ∈ SO(3) and a 3D translation t ∈ R³, with shared intrinsics matrix K. Given an arbitrary planar surface Π, the geometric relation between a 2D image point p ∈ Ir and its corresponding point p′ ∈ Is (expressed in homogeneous coordinates) can be represented as a combination of a planar component and a residual parallax component:

\[ p = p_w + \mu, \tag{3.2} \]

where p_w is the 2D image point in Ir that results from warping p′ ∈ Is by a homography A, which aligns the plane Π between the two views, and µ is the remaining 2D parallax motion. We refer readers to the Appendix for a detailed definition of p_w and µ. One can show that when setting the reference plane Π to the plane at infinity, the expression in Eq. 3.2 can be written as:

\[ p = p_w + \frac{t_z\, p_w - K t}{D_{pp}(p)}, \tag{3.3} \]

where D_pp(p) is the depth value at p in the coordinate system of the reference view Ir, and t_z is the third component of the translation vector t. In addition, the homography A in this case is computed as A = K R K⁻¹. From Eq. 3.3, we can estimate the depth D_pp(p) as:

\[ D_{pp}(p) = \frac{\left\lVert t_z\, p_w - K t \right\rVert_2}{\left\lVert p - p_w \right\rVert_2}. \tag{3.4} \]

We found this computation to be more efficient and robust for dense depth estimation than standard triangulation methods, which are usually applied to sparse correspondences. See the Appendix for a detailed derivation of Eq. 3.4. In some cases, such as forward/backward relative camera motion, ‖p − p_w‖₂ will be close to zero in some image regions (i.e., near the camera epipole), resulting in ill-defined depth values. We detect and remove these image regions as described in Sec. 3.4.2.

Keyframe selection. Depth from motion parallax can be ill-posed if the 2D displacement between two views is small or well-approximated by a homography (e.g., in the case of pure camera rotation). To avoid such cases, we use a heuristic to select a reference frame Ir and a corresponding source keyframe Is. We want the two views to have significant overlap while having sufficient baseline (i.e., distance between camera centers). In particular, for each Ir, we find the index s of Is as:

\[ s = \arg\max_j \; d_{rj}\, o_{rj} \tag{3.5} \]

where d_rj is the L2 distance between the camera centers of Ir and a neighboring frame Ij. The term o_rj is the fraction of co-visible SfM features in Ir and Ij:

\[ o_{rj} = \frac{2\, \lvert V^r \cap V^j \rvert}{\lvert V^r \rvert + \lvert V^j \rvert}, \tag{3.6} \]

where V^j is the set of features visible in Ij. We discard pairs of frames for which o_rj < τ_o, i.e., the fraction of co-visible features should be larger than a threshold τ_o (we set τ_o = 0.6), and limit the maximum frame interval to 10. We found these view selection criteria to work well in our experiments.
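A minimal per-pixel sketch of Eqs. 3.3–3.4, assuming NumPy inputs in homogeneous pixel coordinates (a dense implementation would vectorize this over the whole flow field):

```python
import numpy as np

def depth_from_parallax(p: np.ndarray,    # (3,) pixel in I_r, homogeneous [x, y, 1]
                        p_w: np.ndarray,  # (3,) p' warped by A = K R K^-1, normalized
                        K: np.ndarray,    # (3, 3) shared intrinsics
                        t: np.ndarray) -> float:  # (3,) relative translation
    """Eq. 3.4: depth of p in the reference view via plane-plus-parallax
    with the reference plane at infinity."""
    t_z = t[2]
    numerator = np.linalg.norm(t_z * p_w - K @ t)
    denominator = np.linalg.norm(p - p_w)   # parallax magnitude
    if denominator < 1e-6:                  # near the epipole: ill-defined
        return 0.0
    return numerator / denominator
```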
3.4.2 Depth confidence

Our data consists of challenging Internet video clips with camera motion blur, shadows, low lighting, and reflections. In such cases, optical flow is often noisy [380], leading to uncertainty in the input depth map Dpp. We thus estimate, and feed to the network, a confidence map C. This map allows the network to rely more on the input depth in high-confidence regions, and potentially to improve its prediction in low-confidence regions. The confidence value at each pixel p in the non-human regions is defined as:

\[ C(p) = C_{lr}(p)\, C_{ep}(p)\, C_{pa}(p), \tag{3.7} \]

where the individual terms are defined as follows.

Flow consistency. The term C_lr measures "left-right" consistency between the forward and backward flow fields. Specifically, we denote the forward flow from Ir to Is as f_fwd, and the backward flow from Is to Ir as f_bwd. C_lr is then defined as:

\[ C_{lr}(p) = \max\left(0,\; 1 - r(p)^2 / \bar{\sigma}^2\right) \tag{3.8} \]

where r(p) = ‖f_fwd(p) + f_bwd(p′)‖₂ is the forward-backward optical flow warping error, and σ̄ is a tolerance parameter. For perfectly consistent forward and backward flows C_lr = 1, while C_lr = 0 when the error is greater than σ̄ pixels (we set σ̄ = 1px in our experiments).

Geometric consistency. The term C_ep measures how well the flow field complies with the epipolar constraint between the views [125]. C_ep gives low confidence to pixels where the flow field and the epipolar constraint disagree:

\[ C_{ep}(p) = \max\left(0,\; 1 - \left(\gamma(p)/\bar{\gamma}\right)^2\right) \tag{3.9} \]

where γ̄ controls the epipolar distance tolerance (we set γ̄ = 2px in our experiments), and the geometric epipolar distance γ(p) is defined as:

\[ \gamma(p) = \frac{\left| p'^{\top} F p \right|}{\sqrt{(Fp)_x^2 + (Fp)_y^2}} \tag{3.10} \]

where F = K⁻ᵀ [t]× R K⁻¹ is the fundamental matrix relating the two views, and (Fp)_x and (Fp)_y are the first and second elements of Fp, respectively.

Parallax confidence. The term C_pa assigns low confidence to pixels for which the parallax between the views is small [306]:

\[ C_{pa}(p) = 1 - \left( \frac{\min(\bar{\beta}, \beta(p)) - \bar{\beta}}{\bar{\beta}} \right)^2 \tag{3.11} \]

where

\[ \beta(p) = \cos^{-1}\left( \frac{v(p) \cdot v(p')}{\lVert v(p) \rVert_2\, \lVert v(p') \rVert_2} \right) \tag{3.12} \]

is the angle between the camera rays meeting at pixel p, and v(p) = K⁻¹p and v(p′) = K⁻¹p′ are viewing direction vectors at p in Ir and p′ in Is, respectively. β̄ is the angle tolerance (we use β̄ = 1° in our experiments). Figure 3.7(d) shows examples of computed confidence maps. Note that human regions, as well as regions for which the confidence C(p) < 0.25, are masked out.

3.4.3 Keypoints

We optionally use human keypoints as an additional input to the network, providing the network with explicit information about the poses of the people featured. In particular, we apply the Mask-RCNN [131] human keypoint detection algorithm to each frame. This algorithm detects, for each person, a set of keypoints at salient points such as joint locations. We encode these detections as an image for use as a network input by simply setting the image pixel value at each keypoint location to the corresponding keypoint index (normalized to lie between 0 and 1), and the rest of the pixels to zero. Figure 3.8 shows examples of human keypoints predicted by Mask-RCNN. Adding keypoints as an input can boost depth prediction performance for people, as shown in Tables 3.1 and 3.2.

Figure 3.8: Examples of keypoint images. The top row shows examples of input images and the bottom row shows the corresponding detected human keypoint images, where different colors indicate different joints. We apply morphological dilation to the keypoint maps to make each keypoint location more visible.
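A compact NumPy sketch of these confidence terms for a single pixel (Eqs. 3.8–3.12; the thresholds are the ones stated above, and the fundamental matrix F is assumed precomputed):

```python
import numpy as np

def confidence(p, p_prime, f_fwd_p, f_bwd_p_prime, F, K_inv,
               sigma=1.0, gamma_bar=2.0, beta_bar_deg=1.0) -> float:
    """Per-pixel confidence C(p) = C_lr * C_ep * C_pa (Eq. 3.7).
    p, p_prime: homogeneous pixel coordinates in I_r and I_s."""
    # Flow consistency (Eq. 3.8): forward-backward warping error.
    r = np.linalg.norm(f_fwd_p + f_bwd_p_prime)
    c_lr = max(0.0, 1.0 - (r / sigma) ** 2)

    # Geometric consistency (Eqs. 3.9-3.10): epipolar distance of the match.
    Fp = F @ p
    gamma = abs(p_prime @ Fp) / np.hypot(Fp[0], Fp[1])
    c_ep = max(0.0, 1.0 - (gamma / gamma_bar) ** 2)

    # Parallax confidence (Eqs. 3.11-3.12): angle between the camera rays.
    v, v_p = K_inv @ p, K_inv @ p_prime
    cos_b = v @ v_p / (np.linalg.norm(v) * np.linalg.norm(v_p))
    beta = np.arccos(np.clip(cos_b, -1.0, 1.0))
    beta_bar = np.deg2rad(beta_bar_deg)
    c_pa = 1.0 - ((min(beta_bar, beta) - beta_bar) / beta_bar) ** 2

    return c_lr * c_ep * c_pa
```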
3.4.4 Losses

We train our network to regress to the depth maps computed by our proposed data pipeline. Because the estimated depth values from SfM and MVS have an arbitrary scale, we use a scale-invariant depth regression loss; that is, our loss is computed on log-space depth values. Our loss function consists primarily of three terms:

\[ \mathcal{L}_{si} = \mathcal{L}_{\text{MSE}} + \alpha_1 \mathcal{L}_{\text{grad}} + \alpha_2 \left( \mathcal{L}_{sm1} + \mathcal{L}_{sm2} \right). \tag{3.13} \]

We compute our losses with respect to the reference image Ir; to simplify notation, we drop the superscript r in the loss equations.

Scale-invariant MSE. L_MSE denotes the scale-invariant mean square error (MSE) adopted from [85]. This loss term computes the squared, log-space difference in depth between two pixels in the prediction and the same two pixels in the ground truth, averaged over all pairs of valid pixels. That is, it penalizes differences in the depth ratio between any two pixels in the prediction and the ground truth. Further, this loss can be computed in time linear in the number of pixels, as derived in the Appendix:

\[ \mathcal{L}_{\text{MSE}} = \frac{1}{2N^2} \sum_{p \in I} \sum_{q \in I} \left( R(p) - R(q) \right)^2 \tag{3.14} \]
\[ = \frac{1}{N} \sum_{p \in I} R(p)^2 \;-\; \frac{1}{N^2} \Big( \sum_{p \in I} R(p) \Big)^2 \tag{3.15} \]

where R(p) = log D̂(p) − log D_gt(p), and D̂ and D_gt denote the predicted and ground truth depth, respectively.

Multi-scale gradient consistency term. To improve depth predictions, we use a multi-scale gradient consistency term that encourages smoother gradient changes and sharper depth discontinuities in the predicted depth images [207]:

\[ \mathcal{L}_{\text{grad}} = \sum_{s=0}^{S-1} \frac{1}{N_s} \sum_{p \in I_s} \left( \left| \nabla_x R_s(p) \right| + \left| \nabla_y R_s(p) \right| \right) \tag{3.16} \]

where the subscript s on R_s and I_s indicates that images are computed at scale s, and N_s denotes the number of valid pixels at scale s.

Multi-scale edge-aware smoothness terms. To encourage smooth interpolation of depth in texture-less regions where MVS fails to recover depth, we add smoothness terms at multiple scales based on first- and second-order image derivatives [362], with the smoothness weight at each pixel modulated by the corresponding image derivative magnitude:

\[ \mathcal{L}_{sm1} = \sum_{s=0}^{S-1} \frac{1}{N_s} \sum_{p \in I_s} \exp\left(-\left|\nabla I_s(p)\right|\right) \left| \nabla \log \hat{D}_s(p) \right| \tag{3.17} \]
\[ \mathcal{L}_{sm2} = \sum_{s=0}^{S-1} \frac{1}{N_s} \sum_{p \in I_s} \exp\left(-\left|\nabla^2 I_s(p)\right|\right) \left| \nabla^2 \log \hat{D}_s(p) \right| \tag{3.18} \]

For the L_grad, L_sm1, and L_sm2 terms, we create S = 5-scale image pyramids for both the predicted and ground truth depth images using nearest-neighbor down-sampling, since we find that, compared with bilinear interpolation, nearest-neighbor down-sampling leads to much sharper depth predictions.

Figure 3.9: Qualitative results on the MC test set. From top to bottom: reference images and their corresponding MVS depth (pseudo ground truth); our depth predictions using our single-view model (third row) and our two-frame model (fourth row). The additional network inputs give improved performance in both human and non-human regions.

3.5 Results

We test our method quantitatively and qualitatively and compare it with several state-of-the-art single-view and motion-based depth prediction algorithms. We show additional qualitative results on challenging Internet videos with complex human motion and natural camera motion, and demonstrate how our predicted depth maps can be used for several visual effects.

Implementation details. We use FlowNet2.0 [146] to estimate optical flow, since it handles large displacements well and preserves sharp motion discontinuities. We use Mask-RCNN [131] to generate human masks and human keypoints. The predicted masks sometimes have errors and miss small parts of people, so we apply a morphological dilation operation to the binary human masks to ensure that the masks are conservative and include all human regions. When keypoints are used, we normalize their values to lie between 0 and 1 before feeding them to the network. Our network predicts log depth at both the training and inference stages. During training, we randomly normalize the input log-depth before feeding it to the network by subtracting a value sampled from between the 40th and 60th percentiles of the valid input log Dpp; during inference, we normalize the input log-depth by subtracting the median of log Dpp. Additionally, during training, we randomly zero out the initial input depth and confidence (with probability 0.1) to address the situation where input depth is unavailable at inference time (e.g., the camera is nearly static, or the estimated optical flow is completely incorrect).
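A minimal sketch of this input-depth normalization and dropout augmentation (illustrative names; zeros mark invalid depths):

```python
import numpy as np

def normalize_input_log_depth(log_d_pp: np.ndarray, valid: np.ndarray,
                              training: bool, rng: np.random.Generator,
                              drop_prob: float = 0.1) -> np.ndarray:
    """Shift valid log-depths by a random 40th-60th percentile at training
    time (the median at inference), occasionally dropping the depth input."""
    if training and rng.random() < drop_prob:
        return np.zeros_like(log_d_pp)       # simulate missing parallax depth
    q = rng.uniform(40.0, 60.0) if training else 50.0
    shift = np.percentile(log_d_pp[valid], q)
    return np.where(valid, log_d_pp - shift, 0.0)
```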
When using human keypoints as input, we also use the depth from motion parallax Dpp with high confidence (Clr > 0, Cep > 0 and Cpa > 0.5) at these locations as ground truth if MVS depth DMVS is not available. In our experiments, we set the hyperparameters in our loss terms to α₁ = 0.5 and α₂ = 0.05 based on the validation set. We train our networks for 20 epochs from scratch using the Adam [177] optimizer with an initial learning rate of 0.0004, halving the learning rate every 8 epochs. During training, we downsample all images to a resolution of 532×299, use a mini-batch size of 16, and perform data augmentation through random flips and central crops, so that the input image resolution to the networks is 512×288.

Network inputs             si-full   si-env   si-hum   si-intra   si-inter
I.   I                     0.333     0.338    0.317    0.264      0.384
II.  IFCM                  0.330     0.349    0.312    0.260      0.381
III. IDppM                 0.255     0.229    0.264    0.243      0.285
IV.  IDppCM                0.232     0.188    0.237    0.221      0.268
V.   IDppCMK               0.227     0.189    0.230    0.212      0.263
Unmasked Dpp (oracle)      0.202     0.206    0.200    0.192      0.213

Table 3.1: Quantitative comparisons on the MC test set. Different input configurations of our model: (I) single image; (II) optical flow masked in the human region (F), confidence, and human mask; (III) masked input depth and human mask; (IV) additional confidence; in (V), we also input human keypoints. The last row indicates the error for the depth estimated from motion parallax between two frames in all image regions (human and non-human); this serves as an oracle and can only be measured if the entire scene is static. Lower is better for all metrics.

Error metrics. We measure error using the scale-invariant RMSE (si-RMSE), equivalent to √LMSE as described in Section 3.4.4. We evaluate si-RMSE on five different regions: 1) si-full measures the error between all pairs of pixels, giving the overall accuracy across the entire image; 2) si-env measures pairs of pixels in non-human regions E, providing the depth accuracy of the environment; and 3) si-hum measures pairs where at least one pixel lies in the human region H, providing depth accuracy for people. si-hum can further be divided into two error measures: 4) si-intra measures si-RMSE within H, i.e., human accuracy independent of the environment; and 5) si-inter measures si-RMSE between pixels in H and pixels in E, i.e., human accuracy w.r.t. the environment. We include derivations in the Appendix (a short sketch follows Table 3.2).

Methods                Dataset    two-view?  si-full  si-env  si-hum  si-intra  si-inter  RMSE   Rel
Russell et al. [299]   -          Yes        2.146    2.021   2.207   2.206     2.093     2.520  0.772
DeMoN [354]            RGBD+MVS   Yes        0.338    0.302   0.360   0.293     0.384     0.866  0.220
Chen et al. [63]       NYU+DIW    No         0.441    0.398   0.458   0.408     0.470     1.004  0.262
Laina et al. [191]     NYU        No         0.358    0.356   0.349   0.270     0.377     0.947  0.223
Xu et al. [387]        NYU        No         0.427    0.419   0.411   0.302     0.451     1.085  0.274
Fu et al. [94]         NYU        No         0.351    0.357   0.334   0.257     0.360     0.925  0.194
I                      MC         No         0.318    0.334   0.294   0.227     0.319     0.840  0.204
IFCM                   MC         Yes        0.316    0.330   0.302   0.228     0.323     0.843  0.206
IDppM                  MC         Yes        0.246    0.225   0.260   0.233     0.273     0.635  0.136
IDppCM (raw depth)     MC         Yes        0.272    0.238   0.293   0.258     0.282     0.688  0.147
IDppCM                 MC         Yes        0.232    0.203   0.252   0.224     0.262     0.570  0.129
IDppCMK                MC         Yes        0.221    0.195   0.238   0.215     0.247     0.541  0.125

Table 3.2: Results on the TUM RGBD dataset. Different si-RMSE metrics as well as the standard RMSE and relative error (Rel) are reported. We evaluate our models (the bottom six rows) under different input configurations, as described in Table 3.1. "Raw depth" indicates the model trained using raw MVS depth predictions as supervision, without our depth cleaning method. A dataset entry of '-' indicates that the method is not learning-based. Lower is better for all error metrics.
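The region-based si-RMSE metrics admit the same kind of closed-form expansion as Eq. 3.15. The sketch below evaluates them from the log-space residual R = log D̂ − log Dgt over valid pixels; the closed form for si-inter follows from expanding the all-pairs sum between the two sets, though the exact normalization used in the thesis is derived in the Appendix, so treat this as an assumed variant.

```python
import numpy as np

def si_rmse(R, idx):
    # All-pairs si-RMSE within one pixel set (sqrt of the Eq. 3.15 form);
    # used for si-full, si-env (E), and si-intra (H).
    r = R[idx]
    return np.sqrt(np.mean(r**2) - np.mean(r)**2)

def si_inter(R, idx_h, idx_e):
    # Average over all cross pairs (p in H, q in E) of (R(p) - R(q))^2
    # expands to the closed form below, again computable in linear time.
    rh, re = R[idx_h], R[idx_e]
    return np.sqrt(np.mean(rh**2) + np.mean(re**2)
                   - 2.0 * np.mean(rh) * np.mean(re))
```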
3.5.1 Evaluation on the MC test set

We evaluated our method on our MC test set, which consists of more than 29K images taken from 756 video clips. Processed MVS depth values DMVS obtained by our pipeline (see Section 3.3) are treated as ground truth. To quantify the importance of each component of the model's input, we compare the performance of several models, each trained on our MC dataset with a different input configuration. The two main configurations are: (i) a single-view model (the input is an RGB image) and (ii) our full two-frame model, where the input includes a reference image, an initial masked depth map Dpp, a confidence map C, and a human mask M. We also perform ablation studies by replacing the input depth with optical flow F, removing C from the input, and adding the human keypoint map K.

Quantitative evaluations are shown in Table 3.1. Comparing rows (I), (III) and (IV), it is clear that adding the initial depth of the environment as well as the confidence map significantly improves performance in both human and non-human regions. Adding human keypoint locations to the network input further improves performance. Note that if we input an optical flow field to the network instead of depth (II), the performance is only on par with the single-view method. The mapping from 2D optical flow to depth depends on the relative camera poses, which are not provided to the network; this result indicates that the network is unable to implicitly learn the relative poses and extract depth information.

Finally, we report the errors for full (unmasked) depth maps computed from motion parallax between two frames (last row of Table 3.1). Note that these depth maps can only be computed if the entire scene, including people, is static (thus, this baseline serves as an oracle and cannot be used at test time). As can be seen from the second column (si-env), our model yields a 20% improvement over this baseline in non-human regions, which suggests that our model refines the initial input depth (Dpp) rather than just copying it. In human regions, where our model has no input depth information, our performance is only 15% below that of depth from motion parallax (si-hum).

Figure 3.9 shows qualitative comparisons between our single-view model (I) and our full model (IDppCMK). Our full model's results are more accurate in both human regions (first column) and non-human regions (second column). In addition, the depth relationships between people and their surroundings are improved in all examples.

3.5.2 Evaluation on the TUM RGBD dataset

We also evaluate on a subset of the TUM RGBD dataset [339], which contains indoor scenes featuring people performing complex actions, captured from different camera poses. Sample images from this dataset are shown in Figure 3.10(a-b).

Figure 3.10: Qualitative comparisons on the TUM RGBD dataset. (a) Reference images, (b) ground truth sensor depth, (c) results of the single-view depth prediction method DORN [94], (d) results of the two-frame motion stereo method DeMoN [354], (e-f) depth predictions from our single-view and two-frame models, respectively.
To run our model, we first estimate camera poses using ORB-SLAM2, because we found that the ORB-SLAM2 estimates were better synchronized with the RGB images than the ground truth poses provided with the TUM dataset. In some cases, due to low image quality and motion blur, the estimated camera poses can be incorrect; we manually filter such failures by inspecting the camera trajectory and point cloud. In total, we obtain 11 valid image sequences with 1,815 images for evaluation. We downsample these images to 512×384 resolution in order to preserve their original aspect ratio (our model is fully convolutional and can therefore be applied to different image resolutions at test time).

Figure 3.11: Comparisons on Internet video clips with moving cameras and people. From left to right: (a) reference input image, (b) results of DORN [94], (c) results of Chen et al. [63], (d) results of DeMoN [354], (e) results of our full method.

We compare our depth predictions (using our MC-trained models) with several state-of-the-art monocular depth prediction methods trained on the indoor NYUv2 [191, 387, 94] and Depth in the Wild (DIW) [63] datasets, as well as with a recent two-frame stereo model, DeMoN [354], which assumes a static scene. We also compare with Video-Popup [299], which handles dynamic scenes. We use the same image pairs that were used for computing Dpp as inputs to DeMoN and Video-Popup.

Figure 3.12: Depth-based visual effects. Using our predicted depth maps, we can apply depth-aware visual effects to (a) input images; we show (b) defocus, (c) object insertion, and (d) anaglyph effects.

Figure 3.13: Depth-based image inpainting. We use depth predictions and camera poses to warp pixels from nearby frames for image inpainting and people removal. The top row shows original images and the bottom row shows inpainted images.

Quantitative comparisons are shown in Table 3.2, where we report five different scale-invariant error measures as well as the standard RMSE metric and relative error; the last two are computed by applying a single scaling factor that best aligns the predicted and ground-truth depths in the least-squares sense. Our single-view model already outperforms the other single-view models, demonstrating the benefit of the MC dataset for training. Note that Video-Popup [299] failed to produce meaningful results due to the challenging camera and object motion present in the data. Our full model, by making use of the initial (masked) depth map, significantly improves performance on all error measures. Consistent with our MC test set results, when we use optical flow as input (instead of the initial depth map), the performance is only slightly better than the single-view network. Finally, we show the importance of the depth cleaning methods we apply to the training data (see Eq. 3.1): the same model trained using the raw MVS depth estimates as supervision ("raw depth") suffers a drop of about 15% in performance.

Figure 3.10 shows a qualitative comparison between these different methods. Our models' depth predictions (Figure 3.10(e-f)) strongly resemble the ground truth and show a high level of detail, as well as sharp depth discontinuities.
This result is a notable improvement over competing methods, which often produce significant errors in both human regions (e.g., the legs in the second row of Figure 3.10) and non-human regions (e.g., the table and ceiling in the last two rows).

3.5.3 Internet videos of dynamic scenes

We tested our method on challenging Internet videos (downloaded from YouTube and Shutterstock) that involve simultaneous natural camera motion and human motion. Our SLAM/SfM pipeline was used to generate sequences ranging from 5 to 15 seconds with smooth and accurate camera trajectories, after which we apply our method to obtain the required network input buffers.

We qualitatively compare our full model (IDppCMK) with several recent learning-based depth prediction models: DORN [94], Chen et al. [63], and DeMoN [354]. For fair comparison, we use DORN with a model trained on NYUv2 for indoor videos and a model trained on KITTI for outdoor videos; for Chen et al. [63], we use the models trained on both NYUv2 and DIW. For all of our predictions, we use a single model trained from scratch on our MC dataset.

As illustrated in Figure 3.11, our depth predictions are significantly better than those of the baseline methods. In particular, DORN [94] generalizes poorly to Internet videos, and Chen et al. [63], which is mainly trained on Internet photos, is not able to capture accurate depth. DeMoN often produces incorrect depth, especially in human regions, as it is designed for static scenes. Our predicted depth maps capture accurate depth ordering both between people and other objects in the scene (e.g., between the people and buildings in the fourth row of Figure 3.11) and within human regions (such as the arms and legs of the people in the first three rows of Figure 3.11).

Depth-based visual effects. Our depth predictions can be used to apply a range of depth-based visual effects to video. Figure 3.12 shows depth-based defocus, insertion of synthetic 3D graphics, and stereo pairs displayed as anaglyph images. In Figure 3.13, we show an example of image inpainting that removes nearby humans using our predicted depths. The depth estimates are sufficiently stable over time to allow inpainting from frames elsewhere in the video. To use a frame for inpainting, we construct a triangle heightfield from the depth map, texture the heightfield with the video frame, and render the heightfield from the target frame using the relative camera transformation. Figure 3.13 shows the results of inpainting two street scenes: humans near the camera are removed using the human mask M, and holes are filled with colors from up to 200 frames later in the video. Some artifacts are visible in areas that the human mask misses, such as shadows on the ground.
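The geometric core of this warp is standard depth-based reprojection. The following NumPy sketch is illustrative only: it maps individual source pixels into the target view, whereas the thesis rasterizes a textured triangle heightfield, which handles occlusions and holes more gracefully.

```python
import numpy as np

def reproject(depth_src, K, T_src_to_tgt):
    """Map source pixels into a target view given depth and relative pose.

    depth_src: (H, W) depths; K: (3, 3) intrinsics; T_src_to_tgt: (4, 4).
    Returns (H, W, 2) target pixel coordinates and (H, W) target-frame depths.
    """
    H, W = depth_src.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).T  # (3, HW)
    pts = np.linalg.inv(K) @ pix * depth_src.reshape(-1)   # backproject to 3D
    pts = np.vstack([pts, np.ones((1, pts.shape[1]))])     # homogeneous
    pts_tgt = (T_src_to_tgt @ pts)[:3]                     # into target frame
    proj = K @ pts_tgt
    uv = (proj[:2] / np.clip(proj[2:], 1e-6, None)).T.reshape(H, W, 2)
    return uv, pts_tgt[2].reshape(H, W)
```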
3.6 Discussion

We have demonstrated the power of a learning-based approach for predicting dense depth in dynamic scenes where both a monocular camera and people are freely moving. We make a new source of data available for training: a large corpus of Mannequin Challenge videos from YouTube, in which the camera moves around while people are "frozen" in natural poses. We showed how to obtain reliable depth supervision from such noisy data, and demonstrated that by using the motion parallax cues available in a video sequence, our models can significantly improve over prior state-of-the-art methods.

Our approach has a number of limitations. First, we assume known and accurate camera poses, which can be difficult to compute if moving objects cover most of the scene or the camera motion is close to a pure rotation. Second, our model can fail to generalize to non-standard human poses, as shown in the first three rows of Fig. 3.14. Third, the depths predicted by our model may be inaccurate for non-human moving regions such as animals, cars, and shadows, as shown in the last three rows of Fig. 3.14. Finally, our approach uses just two views, rather than operating on an entire video sequence, which can lead to temporally inconsistent depth estimates and reconstructions across a video. Despite these limitations, we hope that our work can guide and enable further progress in dense reconstruction of dynamic scenes.

Figure 3.14: Failure cases (top rows: complex poses; bottom rows: non-human movers). From left to right: (a) input RGB image, (b) depth predicted by our single-view method, (c) depth predicted by our proposed full method. Our full method can fail due to (1) a failure to generalize to complex human poses (first three rows), or (2) non-human movers such as animals, cars, and shadows (last three rows). In some of these cases, our single-view method can outperform our full two-view method, because added complexities can sometimes arise in the presence of multiple views.

CHAPTER 4
LEARNING INTRINSIC IMAGE DECOMPOSITION FROM WATCHING THE WORLD

4.1 Introduction

Intrinsic image decomposition is the problem of factorizing an input image I into the product of a reflectance image and a shading image: I = R · S. While the vision community has seen significant advances in single-image intrinsic image decomposition, it remains a challenging, highly ill-posed problem, which makes the use of machine learning for this task an appealing prospect. Unfortunately, it is also difficult to gather direct ground truth training data. Previous work has collected ground truth via painting objects [115], synthetic renderings [51, 57], and manual annotation [26, 185], but each of these methods has significant limitations.

Inspired by how humans can learn by simply observing the world and formulating consistent explanations, we consider an alternative, readily available source of training data for learning intrinsic images: image sequences from the Internet for which the viewpoint is fixed but the illumination varies. Based on this idea, we introduce BIGTIME (BT), a large dataset of time-lapse image sequences. While the sequences in BT do not provide ground truth, they allow us to incorporate useful constraints during training by specifying that the model should predict outputs consistent with the sequence. Although we train on image sequences, our model applies to a single image at inference time, as illustrated in Figure 4.1.

Figure 4.1: To train, our method learns from unlabeled videos with fixed viewpoint but varying illumination (top). At test time (bottom), our network produces an intrinsic image decomposition (R, S) from a single image I.

Although a number of prior methods estimate intrinsic images from sequences, our concept is quite different: we train on sequences, but learn to infer decompositions from single views. In a sense, our method lies between optimization-based intrinsic images methods and machine learning approaches.
In particular, our training loss incorporates priors similar to those of optimization-based approaches, but within a feed-forward prediction framework. To fully utilize the information present in image sequences, we also introduce two new methods for computing losses over whole sequences, and show how to implement these losses efficiently inside a deep network. The first is an all-pairs weighted least squares loss that considers all pairs of images. The second is a dense, spatio-temporal smoothness loss that jointly considers all of the pixels in the entire sequence. While we use these losses for training intrinsic images, they could also be applied to other problems that involve image sequences, such as video segmentation.

In our evaluation, our method yields competitive or superior performance on two standard real-world benchmarks, IIW and SAW, even when trained on BT without access to annotations from those datasets. We further show improved results on the MIT intrinsic images dataset, even compared to learning methods that utilize fully supervised ground truth.

4.2 Related work

Intrinsic images through optimization. Intrinsic images has been studied for nearly fifty years, often within an optimization framework. Because the problem is ill-posed, additional priors must be applied. For instance, the seminal Retinex algorithm [192] assumes that large image gradients correspond to changes in reflectance, while smaller gradients are due to shading. Subsequently, many different priors have been proposed to guide the decomposition [310, 409, 296, 311, 99], and many new optimization tools, such as inference in dense CRFs, have been deployed [26]. Some recent approaches make use of surface normals from RGB-D cameras [61, 18, 158]. Surface normals can improve shading estimates, but such methods assume depth maps are available during optimization.

Intrinsic images from multiple observations. A number of methods, starting with Weiss [370], estimate intrinsic images from time-lapse sequences by assuming constant reflectance but varying shading over time [231, 342, 128, 188, 187]. Such an approach is similar to our training regime, although a crucial distinction is that once our model is trained, we can run it on a single image. These methods rely on priors derived from statistics of image sequences or lighting sources. We found that in practice these methods require a) a large number of input images and b) images taken outdoors or in controlled laboratory environments. In contrast, our method can learn from much shorter and less controlled sequences.

Intrinsic images via supervised learning. Barron and Malik [19] proposed a unified learning-based method that incorporates a number of complex priors on shape, albedo, and illumination. However, their method only applies to single objects and does not generalize well to real-world scenes. Recently, several approaches have used deep learning to predict albedo and shading via direct supervision. These methods train on the synthetic Sintel [176, 52], object-centric MIT [115], or synthetic ShapeNet [57, 157] datasets. However, Sintel and ShapeNet are highly synthetic datasets, and networks trained on them do not generalize well to real-world scenes. The MIT dataset consists of real images, but these images depict objects captured in the lab, not realistic scenes, and the dataset contains just 20 objects with ground truth.

Recently, two datasets have been created for real-world scenes.
Intrinsic Images in the Wild (IIW) [26] is a dataset of sparse, human-labeled relative reflectance judgments. Shading Annotations in the Wild (SAW) [185] similarly contains sparse shading annotations. Several methods [417, 428, 253, 185] train CNNs on sparse annotations from IIW/SAW and use the predictions as priors for intrinsic images. However, it is difficult to collect such annotations at scale, especially for shading relationships, which can be challenging to perceive. Further, these datasets are limited to sparse annotations. We propose an alternative form of training data that is much easier to capture and provides full-image constraints.

Figure 4.2: System overview and network. During training, our network input is an image sequence I, and the outputs are reflectance images R and shading images S for the sequence. Each block in the network depicts a convolutional/deconvolutional layer. E is an encoder, and DR and DS are decoders for the reflectance and shading images. For the innermost feature maps, we have one side output c representing the illumination color. E is an energy function measuring the cost of the decomposition.

4.3 Overview and network architecture

Our work makes two main contributions: a new dataset, BIGTIME, of image sequences for learning intrinsic images (Sec. 4.4), and a new approach to learning single-view intrinsic images from this data (Sec. 4.5). Because we train from image sequences, one learning approach would be to use existing sequence-based intrinsic images algorithms to produce approximate ground truth decompositions, and then use these algorithmic outputs as supervision. However, we found that for many image sequences, existing sequence-based algorithms perform poorly because their assumptions are not met, as discussed in Sec. 4.4. Hence, during training, our CNN directly takes an image sequence as input and processes it in a feed-forward fashion to produce reflectance and shading for each image in the sequence, as shown in Figure 4.2. Because the network processes each image independently, multiple images are not required at test time, i.e., we can use the network to produce a decomposition for a single image. During training, the input images interact through our novel loss function (Sec. 4.5), which evaluates the predicted decompositions jointly for the entire sequence.

For our network, we use a variant of the U-Net architecture [291, 153] (Figure 4.2). Our network has one encoder and two decoders, one for log-reflectance and one for log-shading, with skip connections for both decoders. Each layer of the encoder consists mainly of a 4×4 stride-2 convolutional layer followed by batch normalization [151] and a leaky ReLU [132]. Each layer of the two decoders is composed of a 4×4 deconvolutional layer followed by a ReLU. In addition to the decoders for reflectance and shading, the network predicts one side output from the innermost feature maps: a single RGB vector for each image corresponding to the predicted illumination color.
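A minimal PyTorch sketch of such a one-encoder, two-decoder U-Net follows. The network depth and channel widths are illustrative assumptions on our part; the text above fixes only the layer types (4×4 stride-2 convolutions with batch normalization and leaky ReLUs in the encoder, 4×4 deconvolutions with ReLUs in the decoders, skip connections, and the RGB side output c).

```python
import torch
import torch.nn as nn

class IntrinsicsUNet(nn.Module):
    """Sketch of the two-decoder U-Net variant; widths are assumptions."""

    def __init__(self, widths=(32, 64, 128, 256)):
        super().__init__()
        self.enc, cin = nn.ModuleList(), 3
        for w in widths:
            self.enc.append(nn.Sequential(
                nn.Conv2d(cin, w, 4, stride=2, padding=1),
                nn.BatchNorm2d(w), nn.LeakyReLU(0.2)))
            cin = w

        def make_decoder(cout):
            layers, c = nn.ModuleList(), widths[-1]
            for w in reversed(widths[:-1]):
                layers.append(nn.Sequential(
                    nn.ConvTranspose2d(c, w, 4, stride=2, padding=1),
                    nn.ReLU()))
                c = 2 * w                      # concatenated skip features
            layers.append(nn.ConvTranspose2d(c, cout, 4, stride=2, padding=1))
            return layers

        self.dec_R = make_decoder(3)           # log-reflectance (RGB)
        self.dec_S = make_decoder(1)           # log-shading (grayscale)
        self.color = nn.Linear(widths[-1], 3)  # illumination color c

    def _decode(self, dec, feats):
        x = feats[-1]
        for layer, skip in zip(dec[:-1], feats[-2::-1]):
            x = torch.cat([layer(x), skip], dim=1)
        return dec[-1](x)

    def forward(self, img):
        feats, x = [], img
        for e in self.enc:
            x = e(x)
            feats.append(x)
        c = self.color(x.mean(dim=(2, 3)))     # side output from innermost maps
        return self._decode(self.dec_R, feats), self._decode(self.dec_S, feats), c
```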
4.4 Dataset

To create the BIGTIME dataset, we collected videos and image sequences depicting both indoor and outdoor scenes under varying illumination. While many time-lapse datasets primarily capture outdoor scenes, we explicitly wanted representation from indoor scenes as well. Our indoor sequences were gathered from YouTube, Vimeo, Flickr, Shutterstock, and Boyadzhiev et al. [39], and our outdoor sequences were collected from the AMOS [155] and Time Hallucination [317] datasets. For each video, we masked out the sky as well as dynamic objects such as pets, people, and cars via automatic semantic segmentation [407] or manual annotation. We collected 145 sequences from indoor scenes and 50 from outdoor scenes, yielding a total of ∼6,500 training images.

Challenges with Internet videos. Most outdoor scenes in our dataset are from time-lapse sequences in which the sun moves evenly over time, and many existing algorithms for multi-image intrinsic image decomposition work well on such data. However, we found that indoor image sequences are much more challenging, because illumination changes in indoor scenes tend to be less even and less continuous than in outdoor scenes. In particular, we observed that:

1. most relevant video clips cover a short period of time and do not show large changes in light direction;
2. several video clips consist of a light turning on/off in a room, producing a limited number (<8) of valid images with different lighting conditions; and
3. the dynamic range of indoor scenes can be high, with strong sunlight or shadows leading to saturation/clipping that can break intrinsic image algorithms.

Figure 4.3: Examples of challenging images in our dataset. The first two images depict colorful illumination. The last two images show strong sunlight/shadows.

These properties make our dataset even more complex than the IIW and SAW datasets. Several difficult examples are shown in Fig. 4.3. We found that prior intrinsic image methods designed for image sequences often fail on our indoor videos, as their assumptions tend to hold only for outdoor or lab-captured sequences. Example failure cases are shown in Fig. 4.4. However, as we show in our evaluation, our approach is robust to such strong illumination conditions, and networks trained on BT generalize well to IIW and SAW.

Figure 4.4: Failure cases for intrinsic image estimation algorithms. We applied a state-of-the-art multi-image intrinsic image decomposition algorithm [128] to our dataset. This method fails to produce decomposition results suitable for training, due to strong assumptions that hold primarily for outdoor/laboratory scenes.

4.5 Approach

In this section, we describe our framework for learning reflectance and shading from Internet time-lapse video clips. During training, we formulate the problem as a continuous, densely connected conditional random field (dense CRF) and learn a deep neural network that directly predicts a decomposition from single views in a feed-forward fashion.

Image formation model. Let I denote an input image, and let R and S denote the predicted reflectance (albedo) and shading. Assuming an image of a Lambertian scene, we can write the image decomposition in the log domain as:

log I = log R + log S + N    (4.1)

where N models image noise as well as deviations from the Lambertian assumption. In our model, S is a single-channel (grayscale) image, while R is an RGB image. However, modeling S with a single channel assumes white light. In practice, the illumination color can vary across each input video (for instance, red illumination at sunset/sunrise). Hence, we also allow for colored light in our model:

log I = log R + log S + c + N    (4.2)

where c is a single RGB vector that is added at every pixel of the image.
For simplicity, we use Eq. 4.1 in the following sections; without loss of generality, we treat c as folded into the predicted shading.

Each training instance is a stack of m input images with n pixels taken from a fixed viewpoint under varying illumination. We denote such an image sequence by I = {I^i | i = 1...m}, and denote the corresponding predicted reflectances and shadings by R = {R^i | i = 1...m} and S = {S^i | i = 1...m}, respectively. Additionally, for each image I^i we have a binary mask M^i indicating which pixels are valid (which we use to exclude saturated pixels, sky, dynamic objects, etc.).

We wish to devise a method for learning single-view intrinsic image decomposition that leverages having multiple views during training. Hence, we propose to combine learning and estimation by encoding our priors into the training loss function. Essentially, we learn a feed-forward predictor for single-image intrinsic images, trained on image sequences with a loss that incorporates these priors, in particular priors that operate at the sequence level. This loss should also be differentiable and efficient to evaluate, considerations which guide our design below.

Energy/loss function. During training, we formulate the problem as a dense CRF over an image sequence I, where our goal is to maximize the posterior probability p(R, S | I) = (1/Z(I)) exp(−E(R, S, I)), where Z(I) is the partition function. Maximizing p(R, S | I) is equivalent to minimizing the energy function E(R, S, I). Because we use a feed-forward network to predict the decomposition, we also use this energy function as our training loss. We define E as:

E(R, S, I) = Lreconstruct + w1 Lconsistency + w2 Lrsmooth + w3 Lssmooth

We now describe each of these terms in detail.

4.5.1 Image reconstruction loss

Given an input sequence I, for each image I^i ∈ I we expect the predicted reflectance and shading for I^i to approximately reconstruct I^i via our image formation model. Moreover, since reflectance is constant over time, we should be able to use the reflectance R^j predicted for any image I^j ∈ I to reconstruct I^i when paired with S^i (and masked by the valid image regions indicated by the binary masks M^i and M^j). This yields a term involving all pairs of images:

Lreconstruct = Σ_{i=1}^{m} Σ_{j=1}^{m} ‖L^i ⊗ M^i ⊗ M^j ⊗ (log I^i − log R^j − log S^i)‖²_F    (4.3)

where ⊗ is the Hadamard product. Similar to [61], we weight our reconstruction loss by the input pixel luminance L^i = lum(I^i)^{1/8}, since dark pixels tend to be noisy and image differences in dark regions are magnified in log-space. We found that including such an all-pairs connected image reconstruction loss improves prediction results, perhaps because it creates more communication between predictions. A direct implementation of this loss takes time O(m²n). In Sec. 4.5.5 we introduce a computational trick that reduces this to O(mn) time, which is key to making training tractable.

4.5.2 Reflectance consistency

We also include a reflectance consistency loss that directly encodes the assumption that the predicted reflectances should be identical across the image sequence:

Lconsistency = Σ_{i=1}^{m} Σ_{j=1}^{m} ‖M^i ⊗ M^j ⊗ (log R^i − log R^j)‖²_F    (4.4)

As above, this can be computed directly in time O(m²n), but Sec. 4.5.5 shows how to reduce this to O(mn). A direct reference implementation of both all-pairs terms is sketched below.
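This sketch is for exposition only: it evaluates Eqs. 4.3 and 4.4 with the direct O(m²n) double loop, while Sec. 4.5.5 gives the equivalent linear-time computation used in practice. Tensor shapes are our assumption; masks and luminance weights are taken to be broadcast over the channel dimension.

```python
import torch

def all_pairs_losses(logI, logR, logS, M, L):
    """Direct O(m^2 n) reference for Eqs. 4.3 and 4.4.

    logI, logR, logS: (m, C, H, W) stacks; M (validity mask) and
    L (luminance weight) are (m, 1, H, W) and broadcast over channels.
    """
    m = logI.shape[0]
    rec = logI.new_zeros(())
    cons = logI.new_zeros(())
    for i in range(m):
        for j in range(m):
            w = M[i] * M[j]
            rec = rec + ((L[i] * w * (logI[i] - logR[j] - logS[i])) ** 2).sum()
            cons = cons + ((w * (logR[i] - logR[j])) ** 2).sum()
    return rec, cons
```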
4.5.3 Dense spatio-temporal reflectance smoothness

Our reflectance smoothness term Lrsmooth is based on the similarity of chromaticity and intensity between pixels. Because we see a sequence of images at training time, we can define a reflectance smoothness term that acts jointly on all of the images in each sequence at once, allowing us to express smoothness in a richer way. Accordingly, we introduce a novel spatio-temporal, densely connected reflectance smoothness term that considers the similarity of the predicted reflectance at each pixel in the sequence to all other pixels in the sequence. Our method is inspired by the bilateral-space stereo method of Barron et al. [17], but we show how to apply their single-image dense solver to an entire image sequence and how to implement it inside a deep network. We define our smoothness term as:

Lrsmooth = (1/2) Σ_{I^i, I^j} Σ_{p∈I^i} Σ_{q∈I^j} Ŵ_pq (log R^i_p − log R^j_q)²    (4.5)

where p and q indicate pixels in the image sequence, and Ŵ is a (bistochastic) weight matrix capturing the affinity between any two pixels p and q. Computing this equation directly is very expensive because it involves all pairs of pixels in the sequence, hence we need a more efficient approach.

First, note that if Ŵ is a bistochastic matrix, we can rewrite Eq. 4.5 in the following simplified matrix form:

Lrsmooth = rᵀ(I − Ŵ)r    (4.6)

where r is a stacked vector representation (of length mn) of all of the predicted log-reflectance images in the sequence: r = [r¹ r² ··· rᵐ]ᵀ, where r^i is a vector containing the values in log R^i. However, now we have a potentially dense affinity matrix Ŵ ∈ R^{mn×mn}. We can approximately evaluate this term much more efficiently if the pixel-wise affinities are Gaussian, i.e.,

W_pq = exp(−(f_p − f_q)ᵀ Σ⁻¹ (f_p − f_q))    (4.7)

where f_p and f_q are feature vectors for pixels p and q respectively, and Σ is a covariance matrix. We can approximately minimize Eq. 4.6 in bilateral space by factorizing the Gaussian affinity matrix as W ≈ SᵀB̄S, where B̄ = B₀B₁···B_d + B_dB_{d−1}···B₀ is a symmetric matrix constructed as a product of sparse matrices representing blur operations in bilateral space, d is the dimension of the feature vector f_p, and S is a sparse splat/slicing matrix that transforms between image space and bilateral space. Finally, let Ŵ = NWN be a bistochastic representation of W, where N is a diagonal matrix that bistochasticizes W [181]. This bilateral embedding allows us to write the loss in Eq. 4.6 as:

Lrsmooth ≈ rᵀ(I − NSᵀB̄SN)r    (4.8)

Note that Lrsmooth is differentiable, and N and S are both sparse matrices that can be computed efficiently. Our final form of Lrsmooth (Eq. 4.8) can be computed in time O((d+1)mn), rather than O(m²n²).

We define the feature vector used to compute the affinities in Eq. 4.7 as f_p = [x_p, y_p, I_p, c1, c2]ᵀ, where (x_p, y_p) is the spatial position of pixel p in the image, I_p is the intensity of p, and c1 = R/(R+G+B) and c2 = G/(R+G+B) are the first two elements of the L1 chromaticity of p.
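For intuition, the following brute-force sketch evaluates Eqs. 4.5 and 4.7 directly over a small set of pixels. It deliberately omits the bistochasticization and bilateral-space machinery of Eq. 4.8 that make the full term tractable, so W below is the raw Gaussian affinity rather than Ŵ.

```python
import torch

def dense_rsmooth_bruteforce(logR, feats, Sigma_inv):
    """Brute-force O(n^2) evaluation of Eqs. 4.5/4.7 for intuition only.

    logR: (n,) log-reflectance values drawn from the sequence;
    feats: (n, d) feature vectors f_p; Sigma_inv: (d, d).
    """
    d = feats[:, None, :] - feats[None, :, :]            # all pairs f_p - f_q
    W = torch.exp(-torch.einsum('pqi,ij,pqj->pq', d, Sigma_inv, d))
    diff = logR[:, None] - logR[None, :]
    return 0.5 * (W * diff ** 2).sum()
```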
4.5.4 Multi-scale shading smoothness

In addition to the reflectance smoothness term, our loss incorporates a shading smoothness term, Lssmooth, summed over each predicted shading image: Lssmooth = Σ_{i=1}^{m} Lssmooth(S^i), where Lssmooth(S^i) is defined as a weighted L2 term over neighboring pixels:

Lssmooth(S^i) = Σ_{p∈I^i} Σ_{q∈N(p)} v_pq (log S^i_p − log S^i_q)²    (4.9)

where N(p) denotes the 8-connected neighborhood around pixel p, and v_pq is a weight on each edge. Our insight is to leverage all of the input images to compute the weights for each individual image. We are inspired by Weiss [370], who derives a multi-image intrinsic images algorithm based on median image derivatives over the sequence. Essentially, we expect the median image derivative over the input sequence (in the log domain) to approximate the derivative of the reflectance image. If we denote J_pq = log I_p − log I_q (dropping the image index i for convenience), then this suggests a weight of the form:

v^med_pq = exp(−λ_med (J_pq − median{J_pq})²)    (4.10)

where median{J_pq} is the median value of J_pq over the image sequence, and λ_med is a parameter defining the strength of v^med_pq. This weight discourages shading smoothness where the gradient of a particular image is very different from the median (as would happen, e.g., at a shadow boundary).

We found that v^med_pq works well as a weight for texture-less regions (for instance, it captures the effect of a cast shadow on a flat wall well), but, due to noise present in dark image regions, it does not always capture the desired shading smoothness for textured surfaces. Figure 4.5 (bottom) illustrates such a case with a checkerboard pattern on the floor. To address this issue, we define an additional weight v̄^med_pq that is normalized by the median derivative:

v̄^med_pq = exp(−λ_med ((J_pq − median{J_pq}) / median{J_pq})²)    (4.11)

We combine these weights as follows:

v_pq = max{v^med_pq, v̄^med_pq} · (1 − median{W_pq})    (4.12)

This final shading smoothness weight is more robust to textured regions while still distinguishing shadow discontinuities. The last factor, (1 − median{W_pq}), reflects the belief that we should enforce stronger shading smoothness on reflectance edges such as textures, and weaker smoothness in regions of constant reflectance.

Figure 4.5: Effect of v̄^med_pq in the shading smoothness term (white = large weight, black = small weight). Adding the extra v̄^med_pq helps capture smoothness in textured regions such as the pillows in the first row and the floor in the second row. The last column shows the final smoothness weight v_pq.

Ideally, our shading smoothness term would be densely connected. However, the median operator is nonlinear and cannot be integrated into a pixel-wise densely connected term. Instead, to introduce longer-range shading constraints, we compute the shading smoothness term at multiple image scales, repeatedly downsizing each predicted shading image by a factor of two. We set the number of scales to 4, and each scale l is weighted by a factor of 1/l.

4.5.5 All-pairs weighted least squares (APWLS)

Direct implementations of the all-pairs image reconstruction and reflectance consistency terms from Sections 4.5.1 and 4.5.2 take O(m²n) time; this quadratic complexity would make training intractable for large m. Here, we propose a closed-form version of this all-pairs weighted least squares loss (APWLS) that is linear in m. While we apply this tool to our scenario, it can be used in other situations involving all-pairs computation on image sequences.

In general, suppose each image I^i is associated with two matrices P^i and Q^i and two prediction images X^i and Y^i. We can then write APWLS as (see Appendix for a detailed derivation):

APWLS = Σ_{i=1}^{m} Σ_{j=1}^{m} ‖P^i ⊗ Q^j ⊗ (X^i − Y^j)‖²_F    (4.13)
      = 1ᵀ(Σ_{Q²} ⊗ Σ_{P²X²} + Σ_{P²} ⊗ Σ_{Q²Y²} − 2Σ_{P²X} ⊗ Σ_{Q²Y})1    (4.14)

where Σ_Z denotes the sum over all images of the Hadamard product indicated in the subscript Z. Evaluating Eq. 4.13 requires time O(m²n), but rewritten as Eq. 4.14, just O(mn).
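A direct implementation of Eq. 4.14 is only a few lines; the sketch below assumes all inputs are stacked as (m, H, W) tensors, and with the substitutions described next it evaluates both Lreconstruct and Lconsistency.

```python
import torch

def apwls(P, Q, X, Y):
    """Closed-form APWLS (Eq. 4.14); equals the double loop of Eq. 4.13.

    P, Q, X, Y: (m, H, W) stacks. Runs in O(mn) rather than O(m^2 n).
    """
    sQ2   = (Q ** 2).sum(0)            # Sigma_{Q^2}
    sP2   = (P ** 2).sum(0)            # Sigma_{P^2}
    sP2X2 = (P ** 2 * X ** 2).sum(0)   # Sigma_{P^2 X^2}
    sQ2Y2 = (Q ** 2 * Y ** 2).sum(0)   # Sigma_{Q^2 Y^2}
    sP2X  = (P ** 2 * X).sum(0)        # Sigma_{P^2 X}
    sQ2Y  = (Q ** 2 * Y).sum(0)        # Sigma_{Q^2 Y}
    return (sQ2 * sP2X2 + sP2 * sQ2Y2 - 2 * sP2X * sQ2Y).sum()
```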
We use this derivation to implement our image reconstruction loss Lreconstruct (Eq. 4.3) by making the substitutions P^i = L^i ⊗ M^i, Q^j = M^j, X^i = log I^i − log S^i, and Y^j = log R^j, and our reflectance consistency loss Lconsistency (Eq. 4.4) by substituting P^i = M^i, Q^j = M^j, X^i = log R^i, and Y^j = log R^j.

4.6 Evaluation

In this section we evaluate our approach by training solely on our BIGTIME dataset and testing on two standard datasets, IIW and SAW. The performance of machine learning approaches can suffer from cross-dataset domain shift due to dataset bias; for example, we show that networks trained on Sintel, MIT, or ShapeNet do not generalize well to IIW and SAW. Our method, however, though not trained on IIW or SAW data, still produces competitive results on both datasets. We also evaluate on the MIT intrinsic images dataset [115], which has full ground truth. Rather than using this ground truth during training, we train the network on the image sequences provided by the MIT dataset.

Training details. We implement our method in PyTorch [272]. In total, we have 195 image sequences for training. We perform data augmentation via random rotations, flips, and crops. When feeding images into the network, we resize them to 256×384, 384×256, or 256×256, depending on the original aspect ratio. For all evaluations, we train the network from scratch using Adam [177].

4.6.1 Evaluation on IIW

To evaluate on the IIW dataset, we train our network on BT (without using IIW training data) and directly apply the trained model to the IIW test split provided by [253]. Numerical comparisons between our method and other optimization-based and learning-based approaches are shown in Table 4.1.

Method                              Training set   WHDR%
Retinex-Color [115]                 -              26.9
Garces et al. [99]                  -              24.8
Zhao et al. [409]                   -              23.8
Bell et al. [26]                    -              20.6
Narihira et al. [253]*              IIW            18.1*
Zhou et al. [417]*                  IIW            15.7*
Zhou et al. [417]                   IIW            19.9
DI [252]                            Sintel+MIT     37.3
Shi et al. [313]                    ShapeNet       59.4
Ours (w/ per-image Lreconstruct)    BT             25.9
Ours (w/ local Lrsmooth)            BT             27.4
Ours (w/ grayscale S)               BT             22.3
Ours (full method)                  BT             20.3

Table 4.1: Results on the IIW test set. Lower is better for the Weighted Human Disagreement Rate (WHDR). The second column indicates the training data each learning-based method uses; '-' indicates that the method is optimization-based. * indicates that WHDR is evaluated based on CNN classifier outputs for pairs of pixels rather than full decompositions.

Method                     Training set   AP%
Retinex-Color [115]        -              91.93
Garces et al. [99]         -              96.89
Zhao et al. [409]          -              97.11
Bell et al. [26]           -              97.37
Zhou et al. [417]          IIW            96.24
DI [252]                   Sintel+MIT     95.04
Shi et al. [313]           ShapeNet       86.30
Ours (w/ local Lssmooth)   BT             97.03
Ours (w/o Eq. 4.11)        BT             97.15
Ours (full method)         BT             97.90

Table 4.2: Results on the SAW test set. Higher is better for AP%. The second column is as described in Table 4.1. Note that none of the methods use annotations from SAW.

Our method is competitive with both optimization-based methods [26] and learning-based methods [417]. Note that the best WHDR values in the table (marked *) are achieved using CNN classifier outputs on pairs of pixels, rather than full image decompositions; in contrast, our results are based on full decompositions.

Figure 4.6: Qualitative comparisons for intrinsic image decomposition on the IIW/SAW test sets. From left to right: (a) image, (b-c) reflectance and shading of Bell et al. [26], (d-e) reflectance and shading of Zhou et al. [417], (f-g) our reflectance and shading.
Our network predictions achieve results comparable to state-of-the-art intrinsic image decomposition algorithms (Bell et al. [26] and Zhou et al. [417]).

Additionally, as we show in the next subsection, the best-performing method on IIW (Zhou et al. [417]), which primarily evaluates reflectance, falls behind on SAW, which evaluates shading, suggesting that their method tends to overfit to reflectance at the expense of shading accuracy. We also see that networks trained on Sintel, MIT, or ShapeNet perform poorly on IIW, likely due to dataset bias.

We also perform an ablation study on different configurations of our framework. First, we modify the image reconstruction loss to an alternate loss that considers each image independently, rather than considering all pairs of images in a sequence. Second, we evaluate a modified reflectance smoothness loss that uses local pairwise smoothness (between neighboring pixels) rather than our proposed dense spatio-temporal smoothness. Finally, we try using grayscale shading rather than our colored shading. The results, shown in the last four rows of Table 4.1, demonstrate that our full method significantly improves reflectance predictions on the IIW test set compared to these simpler configurations.

                                 MSE               LMSE              DSSIM
Method      Training set  GT    refl.    shading  refl.    shading  refl.    shading
SIRFS [19]  MIT           Yes   0.0147   0.0083   0.0416   0.0168   0.1238   0.0985
DI [252]    MIT+ST        Yes   0.0277   0.0154   0.0585   0.0295   0.1526   0.1328
Shi [313]   MIT+SN        Yes   0.0278   0.0126   0.0503   0.0240   0.1465   0.1200
Ours        MIT           No    0.0147   0.0135   0.0341   0.0253   0.1398   0.1266

Table 4.3: Results on MIT intrinsics. For all error metrics, lower is better. ST=Sintel dataset; SN=ShapeNet dataset. The second column shows the dataset used for training; GT indicates whether the method uses ground truth for training.

4.6.2 Evaluation on SAW

Next, we test our network on SAW [185], again training without using data from SAW. We also propose two improvements to the metric used to evaluate results on SAW.

First, the original SAW error metric is based on classifying a pixel p as having smooth/non-smooth shading based on the gradient magnitude of the predicted shading image, ‖∇S‖₂, normalized to the range [0, 1]. Instead, we measure the gradient magnitude in the log domain. We do this because of the scale ambiguity inherent to shading and reflectance, and because it is possible to have very bright values in the shading channel (e.g., due to strong sunlight); in such cases, if we normalize shading to [0, 1], most of the resulting values will be close to 0. In contrast, computing the gradient magnitude of the log shading, ‖∇ log S‖₂, achieves scale invariance, resulting in fairer comparisons for all methods. As in [185], we sweep a threshold τ to create a precision-recall (PR) curve that captures how well each method captures smooth and non-smooth shading.

Second, Kovacs et al. [185] apply a 10×10 maximum filter to the shading gradient magnitude image before computing PR curves, because many shadow boundary annotations are not precisely localized. However, this maximum filter can degrade performance in smooth shading regions. Instead, we use the max-filtered log-gradient-magnitude image when classifying non-smooth annotations, but the unfiltered log-gradient image when classifying smooth annotations (see Appendix for details).
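A sketch of this modified classification rule follows, assuming SciPy's maximum filter for the 10×10 window; the function name and the epsilon guard are our illustrative choices.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def shading_smoothness_labels(S, tau, eps=1e-6):
    """Classify pixels as smooth / non-smooth shading via ||grad log S||_2.

    Gradients are taken in the log domain for scale invariance; the 10x10
    maximum filter is applied only when scoring non-smooth annotations.
    """
    gy, gx = np.gradient(np.log(np.maximum(S, eps)))
    mag = np.sqrt(gx ** 2 + gy ** 2)
    nonsmooth = maximum_filter(mag, size=10) > tau  # for shadow boundaries
    smooth = mag <= tau                             # unfiltered for smooth labels
    return smooth, nonsmooth
```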
All methods, including our own, are trained without use of SAW data. Average precision (AP) scores are shown in Table 4.2 (please see the Appendix for full precision-recall curves). Our method has the best performance among all prior methods we tested, and our full loss outperforms variants with terms removed. In particular, our method outperforms the best optimization-based algorithm [26] on both IIW and SAW. On the other hand, Zhou et al. [417] tends to overfit to IIW, as their performance on SAW ranks below several other methods. Again, networks trained on Sintel, MIT, and ShapeNet data perform poorly on SAW.

4.6.3 Qualitative results on IIW and SAW

Figure 4.6 shows qualitative results from our method and two other state-of-the-art intrinsic image decomposition algorithms, Zhou et al. [417] and Bell et al. [26], on test images from IIW and SAW. Our results are visually comparable to these methods. One observation is that our shading predictions for dark pixels can be quite dark, leading to reduced contrast in the reflectance images; however, this loss of contrast does not hurt numerical performance. Additionally, like other CNN approaches [252, 313], the direct predictions from our network may not strictly satisfy I = R · S, since the two decoders predict R and S simultaneously at test time. As future work, it would be interesting to use our predictions as priors for optimization to address these issues.

4.6.4 Evaluation on MIT intrinsic images

The MIT intrinsic images dataset [115] contains 20 objects with ground truth reflectance and shading, as well as an associated image sequence taken under 11 different directional light sources. We use the same training-test split as Barron et al. [19], but instead of training our network on the ground truth provided by the MIT dataset, we train only on the provided image sequences using our learning approach. In this case, we configure our network to produce grayscale shading outputs, since the MIT dataset only contains grayscale shading ground truth images.

We compare our approach to several supervised learning methods, including SIRFS [19], Direct Intrinsics (DI) [252], and Shi et al. [313]. These prior methods all train using ground truth reflectance and shading images; additionally, DI [252] and Shi et al. [313] pretrain on Sintel [51] and ShapeNet [57], respectively. In contrast, we train our network from scratch and use only the provided image sequences during training. We adopt the same metrics as [313]: mean squared error (MSE), local mean squared error (LMSE), and the structural dissimilarity index (DSSIM).

Numerical results are shown in Table 4.3 and qualitative comparisons in Figure 4.7. Averaged over reflectance and shading, our results numerically outperform both prior CNN-based supervised learning methods [252, 313]. In particular, our albedo estimates are significantly better, while our shading estimates are comparable (slightly better than [252] and slightly worse than [313]). SIRFS has the best numerical results on the MIT dataset, but SIRFS's priors only apply to single objects, and the algorithm performs much more poorly on full images of real-world scenes [252, 313].

Figure 4.7: Qualitative comparisons on the MIT intrinsics test set. Odd-numbered rows show predicted reflectance; even-numbered rows show predicted shading. (a) Input image, (b) ground truth (GT), (c) SIRFS [19], (d) Direct Intrinsics (DI) [252], (e) Shi et al. [313], (f) our method.
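For reference, the MSE above follows the scale-invariant convention of the MIT benchmark: the prediction is aligned to the ground truth with a single least-squares scale factor before computing the error. The sketch below shows only this global-scale variant; LMSE applies the same idea per local window, and DSSIM adds structural terms, both of which are omitted here.

```python
import numpy as np

def scaled_mse(pred, gt, eps=1e-12):
    # The scale factor minimizing ||gt - c * pred||^2 is
    # c = <pred, gt> / <pred, pred>.
    c = np.sum(pred * gt) / max(np.sum(pred ** 2), eps)
    return np.mean((gt - c * pred) ** 2)
```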
4.7 Discussion

In this chapter, we presented a new method for learning intrinsic images, supervised not by ground truth decompositions, but instead by simply observing image sequences with varying illumination over time, and learning to produce decompositions that are consistent with these sequences. The trained model can then be run on single images, producing competitive results on several benchmarks. Our results illustrate the power of learning decompositions simply from watching large amounts of video. In the future, we plan to combine our approach with other kinds of annotations (IIW, SAW, etc.) to measure how well they perform when used together, and to use our outputs as inputs to optimization-based methods.

CHAPTER 5
LEARNING BETTER INTRINSIC IMAGES THROUGH PHYSICALLY-BASED RENDERING

5.1 Introduction

Intrinsic images is a classic vision problem involving decomposing an input image I into a product of reflectance (albedo) and shading images, R · S. Recent years have seen remarkable progress on this problem, but it remains challenging due to its ill-posedness. An attractive proposition has been to replace traditional hand-crafted priors with learned, CNN-based models. For such learning methods data is key, but collecting ground truth data for intrinsic images is extremely difficult, especially for images of real-world scenes.

One way to generate large amounts of training data for intrinsic images is to render synthetic scenes. However, existing synthetic datasets are limited to images of single objects [157, 313] (e.g., via ShapeNet [57]) or images of CG animation that utilize simplified, unrealistic illumination (e.g., via Sintel [52]). An alternative is to collect ground truth for real images using crowdsourcing, as in the Intrinsic Images in the Wild (IIW) and Shading Annotations in the Wild (SAW) datasets [26, 185]. However, the annotations in such datasets are sparse and difficult to collect accurately at scale.

Inspired by recent efforts to use synthetic images of scenes as training data for indoor and outdoor scene understanding [287, 292, 96, 286], we present the first large-scale scene-level intrinsic images dataset based on high-quality physically-based rendering, which we call CGINTRINSICS (CGI). CGI consists of over 20,000 images of indoor scenes, based on the SUNCG dataset [331]. Our aim with CGI is to help drive significant progress towards solving the intrinsic images problem for Internet photos of real-world scenes.

Figure 5.1: Overview and network architecture. Our work integrates physically-based rendered images from our CGINTRINSICS dataset and reflectance/shading annotations from IIW and SAW in order to train a better intrinsic decomposition network.

We find that high-quality physically-based rendering is essential for our task. While SUNCG provides physically-based scene renderings [406], our experiments show that the details of how images are rendered are of critical importance, and certain choices can lead to massive improvements in how well CNNs trained for intrinsic images on synthetic data generalize to real data.

We also propose a new partially supervised learning method for training a CNN to directly predict reflectance and shading, by combining ground truth from CGI and sparse annotations from IIW/SAW.
Through evaluations on IIW and SAW, we find that, surprisingly, decomposition networks trained solely on CGI can achieve state-of-the-art performance on both datasets. Combined training using both CGI and IIW/SAW leads to even better performance. Finally, we find that CGI generalizes better than existing datasets by evaluating on MIT Intrinsic Images, a very different, object-centric dataset.

5.2 Related work

Optimization-based methods. The classical approach to intrinsic images is to integrate various priors (smoothness, reflectance sparseness, etc.) into an optimization framework [192, 409, 296, 311, 99, 26]. However, for images of real-world scenes, such hand-crafted priors are difficult to design and are often violated. Several recent methods seek to improve decomposition quality by integrating surface normals or depths from RGB-D cameras [61, 18, 158] into the optimization process. However, these methods assume depth maps are available during optimization, preventing their use on a wide range of consumer photos.

Learning-based methods. Learning methods for intrinsic images have recently been explored as an alternative to models with hand-crafted priors, or as a way to set the parameters of such models automatically. Barron and Malik [19] learn parameters of a model that utilizes sophisticated priors on reflectance, shape, and illumination. This approach works on images of objects (such as those in the MIT dataset), but does not generalize to real-world scenes. More recently, CNN-based methods have been deployed, including work that regresses directly to the output decomposition based on various training datasets, such as Sintel [252, 176], MIT intrinsics, and ShapeNet [313, 157]. Shu et al. [320] also propose a CNN-based method specifically for the domain of facial images, where ground truth geometry can be obtained through model fitting. However, as we show in our evaluation, networks trained on such prior datasets perform poorly on images of real-world scenes.

Two recent datasets are based on images of real-world scenes: Intrinsic Images in the Wild (IIW) [26] and Shading Annotations in the Wild (SAW) [185] consist of sparse, crowd-sourced reflectance and shading annotations on real indoor images. Subsequently, several papers have trained CNN-based classifiers on these sparse annotations and used the classifier outputs as priors to guide decomposition [185, 417, 428, 253]. However, we find these annotations alone are insufficient for training a direct regression approach, likely because they are sparse and derived from just a few thousand images. Finally, very recent work has explored the use of time-lapse imagery as training data for intrinsic images [205], although this provides a very indirect source of supervision.

Synthetic datasets for real scenes. Synthetic data has recently been utilized to improve predictions on real-world images across a range of problems. For instance, [287, 286] created a large-scale dataset and benchmark based on video games for the purpose of autonomous driving, and [25, 38] use synthetic imagery to form small benchmarks for intrinsic images. SUNCG [406] is a recent, large-scale synthetic dataset for indoor scene understanding. However, many of the images in the PBRS database of physically-based renderings derived from SUNCG have low signal-to-noise ratio (SNR) and unrealistic sensor properties. We show that higher-quality renderings yield much better training data for intrinsic images.
5.3 CGINTRINSICS Dataset

To create our CGINTRINSICS (CGI) dataset, we started from the SUNCG dataset [331], which contains over 45,000 3D models of indoor scenes. We first considered the PBRS dataset of physically-based renderings of scenes from SUNCG [406]. For each scene, PBRS samples cameras from good viewpoints and uses the physically-based Mitsuba renderer [156] to generate realistic images under reasonably realistic lighting (including a mix of indoor and outdoor illumination sources), with global illumination. Using such an approach, we can also generate ground truth data for intrinsic images by rendering a standard RGB image I, asking the renderer to produce a reflectance map R from the same viewpoint, and dividing to get the shading image S = I/R. Examples of such ground truth decompositions are shown in Figure 5.2.

Figure 5.2: Visualization of ground truth from our CGINTRINSICS dataset. Top row: rendered RGB images. Middle: ground truth reflectance. Bottom: ground truth shading. Note that light sources are masked out when creating the ground truth decomposition.

Note that we automatically mask out light sources (including illumination from windows looking outside) when creating the decomposition, and do not consider those pixels when training the network. However, we found that the PBRS renderings are not ideal for training real-world intrinsic image decomposition networks. In fact, certain details of how images are rendered have a dramatic impact on learning performance:

Rendering quality. Mitsuba and other high-quality renderers support a range of rendering algorithms, including various flavors of path tracing that sample many light paths for each output pixel. In PBRS, the authors note that bidirectional path tracing works well but is very slow, and opt for Metropolis Light Transport (MLT) with a sample rate of 512 samples per pixel [406]. In contrast, for our purposes we found that bidirectional path tracing (BDPT) with a very large number of samples per pixel was the only algorithm that gave consistently good results for rendering SUNCG images. Comparisons between selected renderings from PBRS and our new CGI images are shown in Figure 5.3; note the significantly decreased noise in our renderings.

Figure 5.3: Visual comparisons between our CGI and the original SUNCG dataset. Top row: images from SUNCG/PBRS. Bottom row: images from our CGI dataset. The images in our dataset have higher SNR and are more realistic.

This extra quality comes at a cost. We find that BDPT with 8,192 samples per pixel yields acceptable quality for most images, but it increases the render time per image significantly, from a reported 31s [406] to approximately 30 minutes. (While high, this is still a fair ways off of the reported render times for animated films; for instance, each frame of Pixar's Monsters University took a reported 29 hours to render [347].) One reason for the need for large numbers of samples is that SUNCG scenes are often challenging from a rendering perspective: the illumination is often indirect, coming from open doorways or otherwise constrained by geometry. However, rendering is highly parallelizable, and over the course of about six months we rendered over ten thousand images on a cluster of about 10 machines.

Tone mapping from HDR to LDR. We found that another critical factor in image generation is how rendered images are tone mapped. Renderers like Mitsuba generally produce high dynamic range (HDR) outputs that encode raw, linear radiance estimates
Tone mapping from HDR to LDR. We found that another critical factor in image generation is how rendered images are tone mapped. Renderers like Mitsuba generally produce high dynamic range (HDR) outputs that encode raw, linear radiance estimates for each pixel. In contrast, real photos are usually low dynamic range (LDR). The process that takes an HDR input and produces an LDR output is called tone mapping, and in real cameras the analogous operations are the auto-exposure, gamma correction, etc., that yield a well-exposed, high-contrast photograph.

PBRS uses the tone mapping method of Reinhard et al. [283], which is inspired by photographers such as Ansel Adams, but which can produce images that are very different in character from those of consumer cameras. We find that a simpler tone mapping method produces more natural-looking results. Again, Figure 5.3 shows comparisons between PBRS renderings and our own. Note how color and illumination features, such as shadows, are better captured in our renderings (we noticed that shadows often disappear with the Reinhard tone mapper). In particular, to tone map a linear HDR radiance image $I_{HDR}$, we find the 90th percentile intensity value $r_{90}$, then compute the image $I_{LDR} = \alpha I_{HDR}^{\gamma}$, where $\gamma = 1/2.2$ is a standard gamma correction factor and $\alpha$ is computed such that $r_{90}$ maps to the value 0.8. The final image is then clipped to the range $[0, 1]$. This mapping ensures that at most 10% of the image pixels (and usually many fewer) are saturated after tone mapping, and tends to result in natural-looking LDR images.

Using the above rendering approach, we re-rendered ∼20,000 images from PBRS. We also integrated 152 realistic renderings from [38] into our dataset. Table 5.1 compares our CGI dataset to prior intrinsic image datasets.

Dataset               Size     Setting    Rendered/Real  Illumination       GT type
MPI Sintel [51]       890      Animation  non-PB         spatially-varying  full
MIT Intrinsics [115]  110      Object     Real           single global      full
ShapeNet [313]        2M+      Object     PB             single global      full
IIW [26]              5,230    Scene      Real           spatially-varying  sparse
SAW [185]             6,677    Scene      Real           spatially-varying  sparse
CGINTRINSICS          20,000+  Scene      PB             spatially-varying  full

Table 5.1: Comparison of existing intrinsic image datasets with our CGINTRINSICS dataset. PB indicates physically-based rendering and non-PB indicates non-physically-based rendering.

Sintel is a dataset created for an animated film and does not use physically-based rendering. Other datasets, such as ShapeNet and MIT, are object-centric, whereas CGI focuses on images of indoor scenes, which have more sophisticated structure and illumination (cast shadows, spatially-varying lighting, etc.). Compared to IIW and SAW, which include images of real scenes, CGI has full ground truth and is much more easily collected at scale.
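The tone-mapping operator described above is simple to implement. A minimal NumPy sketch, assuming an (H, W, 3) array of linear radiance (the function name and layout are our own):

```python
import numpy as np

def tone_map(hdr, gamma=1.0 / 2.2):
    """Map linear HDR radiance to LDR: gamma-correct, scale so that the
    90th-percentile intensity r90 maps to 0.8, then clip to [0, 1]."""
    r90 = np.percentile(hdr, 90.0)            # 90th-percentile linear intensity
    alpha = 0.8 / max(r90 ** gamma, 1e-8)     # chosen so r90 -> 0.8 after gamma
    ldr = alpha * np.power(np.maximum(hdr, 0.0), gamma)
    return np.clip(ldr, 0.0, 1.0)
```

By construction, at most 10% of pixels can exceed 1.0 before clipping, which is what keeps the results from looking washed out or overly saturated.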
5.4 Learning Cross-Dataset Intrinsics

In this section, we describe how we use CGINTRINSICS to jointly train an intrinsic decomposition network end-to-end, incorporating additional sparse annotations from IIW and SAW. Our full training loss considers training data from each dataset:

$L = L_{CGI} + \lambda_{IIW} L_{IIW} + \lambda_{SAW} L_{SAW}$,   (5.1)

where $L_{CGI}$, $L_{IIW}$, and $L_{SAW}$ are the losses we use for training from the CGI, IIW, and SAW datasets respectively. The most direct way to train would be to simply incorporate supervision from each dataset. In the case of CGI, this supervision consists of full ground truth. For IIW and SAW, this supervision takes the form of sparse annotations for each image, as illustrated in Figure 5.1. However, in addition to supervision, we found that incorporating smoothness priors into the loss also improves performance. Our full loss functions thus incorporate a number of terms:

$L_{CGI} = L_{sup} + \lambda_{ord} L_{ord} + \lambda_{rec} L_{reconstruct}$   (5.2)
$L_{IIW} = \lambda_{ord} L_{ord} + \lambda_{rs} L_{rsmooth} + \lambda_{ss} L_{ssmooth} + L_{reconstruct}$   (5.3)
$L_{SAW} = \lambda_{S/NS} L_{S/NS} + \lambda_{rs} L_{rsmooth} + \lambda_{ss} L_{ssmooth} + L_{reconstruct}$   (5.4)

We now describe each term in detail.

5.4.1 Supervised losses

CGIntrinsics-supervised loss. Since the images in our CGI dataset are equipped with a full ground truth decomposition, the learning problem for this dataset can be formulated as a direct regression problem from input image $I$ to output images $R$ and $S$. However, because the decomposition is only defined up to an unknown scale factor, we use a scale-invariant supervised loss, $L_{siMSE}$ (for "scale-invariant mean-squared-error"). In addition, we add a gradient-domain multi-scale matching term $L_{grad}$. For each training image in CGI, our supervised loss is defined as $L_{sup} = L_{siMSE} + L_{grad}$, where

$L_{siMSE} = \frac{1}{N} \sum_{i=1}^{N} \left[ (R_i - c_r R^*_i)^2 + (S_i - c_s S^*_i)^2 \right]$   (5.5)

$L_{grad} = \sum_{l=1}^{L} \frac{1}{N_l} \sum_{i=1}^{N_l} \left[ \left\| \nabla R_{l,i} - c_r \nabla R^*_{l,i} \right\|_1 + \left\| \nabla S_{l,i} - c_s \nabla S^*_{l,i} \right\|_1 \right].$   (5.6)

Here $R_{l,i}$ ($R^*_{l,i}$) and $S_{l,i}$ ($S^*_{l,i}$) denote the reflectance and shading prediction (resp. ground truth) at pixel $i$ and scale $l$ of an image pyramid. $N_l$ is the number of valid pixels at scale $l$ and $N = N_1$ is the number of valid pixels at the original image scale. The scale factors $c_r$ and $c_s$ are computed via least squares.

In addition to the scale-invariance of $L_{siMSE}$, another important aspect is that we compute the MSE in the linear intensity domain, as opposed to the all-pairs pixel comparisons in the log domain used in [252]. In the log domain, pairs of pixels with large absolute log-difference tend to dominate the loss. As we show in our evaluation, computing $L_{siMSE}$ in the linear domain significantly improves performance. Finally, the multi-scale gradient matching term $L_{grad}$ encourages decompositions to be piecewise smooth with sharp discontinuities.

Ordinal reflectance loss. IIW provides sparse ordinal reflectance judgments between pairs of points (e.g., "point i has brighter reflectance than point j"). We introduce a loss based on this ordinal supervision. For a given IIW training image and predicted reflectance $R$, we accumulate losses for each pair of annotated pixels $(i, j)$ in that image: $L_{ord}(R) = \sum_{(i,j)} e_{i,j}(R)$, where

$e_{i,j}(R) = \begin{cases} w_{i,j} (\log R_i - \log R_j)^2, & r_{i,j} = 0 \\ w_{i,j} \left( \max(0, m - \log R_i + \log R_j) \right)^2, & r_{i,j} = +1 \\ w_{i,j} \left( \max(0, m - \log R_j + \log R_i) \right)^2, & r_{i,j} = -1 \end{cases}$   (5.7)

and $r_{i,j}$ is the ordinal relation from IIW, indicating whether point $i$ is darker (-1), $j$ is darker (+1), or they have equal reflectance (0). $w_{i,j}$ is the confidence of the annotation, provided by IIW. Example predictions with and without IIW data are shown in Figure 5.4.

Figure 5.4: Examples of predictions with and without IIW training data (columns: image, CGI (R), CGI (S), CGI+IIW (R), CGI+IIW (S)). Adding real IIW data can qualitatively improve reflectance and shading predictions. Note for instance how the quilt highlighted in the first row has a more uniform reflectance after incorporating IIW data, and similarly for the floor highlighted in the second row.

We also found that adding a similar ordinal term derived from CGI data can improve reflectance predictions. For each image in CGI, we over-segment it using superpixel segmentation [2]. Then in each training iteration, we randomly choose one pixel from every segmented region, and for each pair of chosen pixels, we evaluate $L_{ord}$ as in Eq. 5.7, with $w_{i,j} = 1$ and the ordinal relation derived from the ground truth reflectance.
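The scale-invariant loss has a closed-form inner minimization: the scale factors in Eq. 5.5 are solved by least squares before computing the MSE. A minimal PyTorch sketch, with tensor shapes and function names as illustrative assumptions:

```python
import torch

def scale_invariant_mse(pred, gt, mask):
    """Scale-invariant MSE between a predicted and ground-truth map (Eq. 5.5,
    one channel). pred, gt, mask: (H, W) tensors; mask is 1 for valid pixels
    (e.g., light sources are masked out).
    """
    n = mask.sum().clamp(min=1.0)
    # argmin_c sum(mask * (pred - c * gt)^2)  =>  c = <pred, gt> / <gt, gt>
    c = (mask * pred * gt).sum() / (mask * gt * gt).sum().clamp(min=1e-8)
    return (mask * (pred - c * gt) ** 2).sum() / n

def l_simse(R, S, R_gt, S_gt, mask):
    # Independent scale factors c_r and c_s for reflectance and shading.
    return scale_invariant_mse(R, R_gt, mask) + scale_invariant_mse(S, S_gt, mask)
```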
SAW shading loss. The SAW dataset provides images containing annotations of smooth (S) shading regions and non-smooth (NS) shading points, as depicted in Figure 5.1. These annotations can be further divided into three types: regions of constant shading, shadow boundaries, and depth/normal discontinuities. We integrate all three types of annotations into our supervised SAW loss $L_{S/NS}$.

For each constant shading region (with $N_c$ pixels), we compute a loss $L_{constant\text{-}shading}$ encouraging the variance of the predicted shading in the region to be zero:

$L_{constant\text{-}shading} = \frac{1}{N_c} \sum_{i=1}^{N_c} (\log S_i)^2 - \left( \frac{1}{N_c} \sum_{i=1}^{N_c} \log S_i \right)^2.$   (5.8)

SAW also provides individual point annotations at cast shadow boundaries. As noted in [185], these points are not localized precisely on shadow boundaries, and so we apply a morphological dilation with a radius of 5 pixels to the set of marked points before using them in training. This results in shadow boundary regions. We find that most shadow boundary annotations lie in regions of constant reflectance, which implies that for all pairs of shading pixels within a small neighborhood, their log difference should be approximately equal to the log difference of the image intensity. This is equivalent to encouraging the variance of $\log S_i - \log I_i$ within this small region to be zero [85]. Hence, we define the loss for each shadow boundary region (with $N_{sd}$ pixels) as:

$L_{shadow} = \frac{1}{N_{sd}} \sum_{i=1}^{N_{sd}} (\log S_i - \log I_i)^2 - \left( \frac{1}{N_{sd}} \sum_{i=1}^{N_{sd}} (\log S_i - \log I_i) \right)^2.$   (5.9)

Finally, SAW provides depth/normal discontinuities, which are also usually shading discontinuities. However, since we cannot derive the actual shading change for such discontinuities, we simply mask out such regions in our shading smoothness term $L_{ssmooth}$ (Eq. 5.11), i.e., we do not penalize shading changes in such regions. As above, we first dilate these annotated regions before use in training. Example predictions before and after adding SAW data to our training are shown in Figure 5.5.

Figure 5.5: Examples of predictions with and without SAW training data (columns: image, CGI (R), CGI (S), CGI+SAW (R), CGI+SAW (S)). Adding SAW training data can qualitatively improve reflectance and shading predictions. Note the pictures/TV highlighted in the decompositions in the first row, and the improved assignment of texture to the reflectance channel for the paintings and sofa in the second row.

5.4.2 Smoothness losses

To further constrain the decompositions for real images in IIW/SAW, following classical intrinsic image algorithms we add reflectance smoothness ($L_{rsmooth}$) and shading smoothness ($L_{ssmooth}$) terms. For reflectance, we use a multi-scale $\ell_1$ smoothness term to encourage reflectance predictions to be piecewise constant:

$L_{rsmooth} = \sum_{l=1}^{L} \frac{1}{N_l} \sum_{i=1}^{N_l} \sum_{j \in \mathcal{N}(l,i)} v_{l,i,j} \, \left\| \log R_{l,i} - \log R_{l,j} \right\|_1$   (5.10)

where $\mathcal{N}(l, i)$ denotes the 8-connected neighborhood of the pixel at position $i$ and scale $l$. The reflectance weight is $v_{l,i,j} = \exp\left( -\frac{1}{2} (f_{l,i} - f_{l,j})^{\top} \Sigma^{-1} (f_{l,i} - f_{l,j}) \right)$, and the feature vector $f_{l,i}$ is defined as $[\, p_{l,i}, I_{l,i}, c^1_{l,i}, c^2_{l,i} \,]$, where $p_{l,i}$ and $I_{l,i}$ are the spatial position and image intensity respectively, and $c^1_{l,i}$ and $c^2_{l,i}$ are the first two elements of chromaticity. $\Sigma$ is a covariance matrix defining the distance between two feature vectors.
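Both SAW region losses above (Eqs. 5.8 and 5.9) are simply variance penalties over an annotated region. A minimal PyTorch sketch, with tensor conventions as assumptions:

```python
import torch

def constant_shading_loss(log_S, region_mask):
    """Eq. 5.8: encourage zero variance of predicted log shading within an
    annotated constant-shading region. log_S: (H, W); region_mask: (H, W)
    binary mask (assumed non-empty).
    """
    vals = log_S[region_mask.bool()]
    return (vals ** 2).mean() - vals.mean() ** 2  # variance of log S

def shadow_boundary_loss(log_S, log_I, region_mask):
    """Eq. 5.9: within a (dilated) shadow-boundary region of constant
    reflectance, log S - log I should be constant, so penalize its variance."""
    vals = (log_S - log_I)[region_mask.bool()]
    return (vals ** 2).mean() - vals.mean() ** 2
```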
We also include a densely-connected $\ell_2$ shading smoothness term, which can be evaluated in time linear in the number of pixels $N$ using bilateral embeddings [17, 205]:

$L_{ssmooth} = \frac{1}{2N} \sum_{i}^{N} \sum_{j}^{N} \hat{W}_{i,j} (\log S_i - \log S_j)^2 \approx \frac{1}{N} s^{\top} \left( I - N_b S_b^{\top} \bar{B}_b S_b N_b \right) s$   (5.11)

where $\hat{W}$ is a bistochastic weight matrix derived from $W$, with $W_{i,j} = \exp\left( -\frac{1}{2} \frac{\| p_i - p_j \|^2}{\sigma_p^2} \right)$. We refer readers to [17, 205] for a detailed derivation. As shown in our experiments, adding such smoothness terms for real data can yield better generalization.

5.4.3 Reconstruction loss

Finally, for each training image in each dataset, we add a loss expressing the constraint that the reflectance and shading should reconstruct the original image:

$L_{reconstruct} = \frac{1}{N} \sum_{i=1}^{N} (I_i - R_i S_i)^2.$   (5.12)

5.4.4 Network architecture

Our network architecture is illustrated in Figure 5.1. We use a variant of the "U-Net" architecture [205, 153]. Our network has one encoder and two decoders with skip connections; the two decoders output log reflectance and log shading, respectively. Each layer of the encoder mainly consists of a 4 × 4 stride-2 convolutional layer followed by batch normalization [151] and leaky ReLU [132]. For the two decoders, each layer is composed of a 4 × 4 deconvolutional layer followed by batch normalization and ReLU, and a 1 × 1 convolutional layer is appended to the final layer of each decoder.

5.5 Evaluation

We conduct experiments on two datasets of real-world scenes, IIW [26] and SAW [185] (using test data unseen during training), and compare our method with several state-of-the-art intrinsic images algorithms. Additionally, we evaluate the generalization of our CGI dataset by evaluating on the MIT Intrinsic Images benchmark [115].

Network training details. We implement our method in PyTorch [272]. For all three datasets, we perform data augmentation through random flips, resizing, and crops. For all evaluations, we train our network from scratch using the Adam [177] optimizer, with initial learning rate 0.0005 and mini-batch size 16. We refer readers to the supplementary material for the detailed hyperparameter settings.

5.5.1 Evaluation on IIW

We follow the train/test split for IIW provided by [253], also used in [417]. We also conduct several ablation studies using different loss configurations. Quantitative comparisons of Weighted Human Disagreement Rate (WHDR) between our method and other optimization- and learning-based methods are shown in Table 5.2.

Method                    Training set  WHDR
Retinex-Color [115]       -             26.9%
Garces et al. [99]        -             24.8%
Zhao et al. [409]         -             23.8%
Bell et al. [26]          -             20.6%
Zhou et al. [417]         IIW           19.9%
Bi et al. [32]            -             17.7%
Nestmeyer et al. [254]    IIW           19.5%
Nestmeyer et al. [254]*   IIW           17.7%
DI [252]                  Sintel        37.3%
Shi et al. [313]          ShapeNet      59.4%
Ours (log, L_siMSE)       CGI           22.7%
Ours (w/o L_grad)         CGI           19.7%
Ours (w/o L_ord)          CGI           19.9%
Ours (w/o L_rsmooth)      All           16.1%
Ours                      SUNCG         26.1%
Ours†                     CGI           18.4%
Ours                      CGI           17.8%
Ours                      CGI+IIW(O)    17.5%
Ours                      CGI+IIW(A)    16.2%
Ours                      All           15.5%
Ours*                     All           14.8%

Table 5.2: Numerical results on the IIW test set. Lower is better for WHDR. The "Training set" column specifies the training data used by each learning-based method; "-" indicates an optimization-based method. IIW(O) indicates original IIW annotations and IIW(A) indicates augmented IIW comparisons. "All" indicates CGI+IIW(A)+SAW. † indicates the network was validated on CGI; others were validated on IIW. * indicates CNN predictions post-processed with a guided filter [254].
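For reference, WHDR measures the confidence-weighted fraction of human ordinal judgments that a predicted reflectance disagrees with. A minimal sketch, assuming the standard IIW convention with a relative threshold δ = 0.10; the judgment encoding here is illustrative rather than IIW's exact release format:

```python
import numpy as np

def whdr(R, judgments, delta=0.10):
    """Weighted Human Disagreement Rate.

    R: (H, W) predicted reflectance, linear intensity.
    judgments: iterable of ((ix, iy), (jx, jy), darker, weight), where
      darker is 0 (roughly equal), 1 (point i darker), or 2 (point j darker).
    """
    total, wrong = 0.0, 0.0
    for (ix, iy), (jx, jy), darker, w in judgments:
        ri, rj = float(R[iy, ix]), float(R[jy, jx])
        if ri / max(rj, 1e-10) > 1.0 + delta:
            pred = 2          # i is brighter, so point j is darker
        elif rj / max(ri, 1e-10) > 1.0 + delta:
            pred = 1          # j is brighter, so point i is darker
        else:
            pred = 0          # roughly equal reflectance
        total += w
        wrong += w * (pred != darker)
    return wrong / max(total, 1e-10)
```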
Comparing direct CNN predictions, our CGI-trained model is significantly better than the best learning-based method [254], and similar to [32], even though [254] was trained directly on IIW. Additionally, running the post-processing from [254] on the results of the CGI-trained model achieves a further performance boost. Table 5.2 also shows that models trained on SUNCG (i.e., PBRS), Sintel, MIT Intrinsics, or ShapeNet generalize poorly to IIW, likely due to the lower quality of the training data (SUNCG/PBRS) or the larger domain gap with respect to images of real-world scenes, compared to CGI. The comparison to SUNCG suggests the key importance of our rendering decisions.

We also evaluate networks trained jointly using CGI and real imagery from IIW. As in [417], we augment the pairwise IIW judgments by globally exploiting their transitivity and symmetry. The right part of Table 5.2 demonstrates that including IIW training data leads to further improvements in performance, as does also including SAW training data. Table 5.2 also shows various ablations on variants of our method, such as evaluating losses in the log domain and removing terms from the loss functions. Finally, we test a network trained on only IIW/SAW data (and not CGI), or trained on CGI and fine-tuned on IIW/SAW. Although such a network achieves ∼19% WHDR, we find the decompositions qualitatively unsatisfactory: the sparsity of the training data causes these networks to produce degenerate decompositions, especially for shading images.

5.5.2 Evaluation on SAW

To evaluate our shading predictions, we test our models on the SAW [185] test set, utilizing the error metric introduced in [205]. We also propose a new, more challenging error metric for SAW evaluation. In particular, we found that many of the constant-shading regions annotated in SAW also have smooth image intensity (e.g., textureless walls), making their shading easy to predict.

Method                    Training set      AP% (unweighted)  AP% (challenge)
Retinex-Color [115]       -                 91.93             85.26
Garces et al. [99]        -                 96.89             92.39
Zhao et al. [409]         -                 97.11             89.72
Bell et al. [26]          -                 97.37             92.18
Zhou et al. [417]         IIW               96.24             86.34
Nestmeyer et al. [254]    IIW               97.26             89.94
Nestmeyer et al. [254]*   IIW               96.85             88.64
DI [252]                  Sintel+MIT        95.04             86.08
Shi et al. [313]          ShapeNet          86.62             81.30
Ours (log, L_siMSE)       CGI               97.73             93.03
Ours (w/o L_grad)         CGI               98.15             93.74
Ours (w/o L_ssmooth)      CGI+IIW(A)+SAW    98.60             94.87
Ours                      SUNCG             96.56             87.09
Ours†                     CGI               98.16             93.21
Ours                      CGI               98.39             94.05
Ours                      CGI+IIW(A)        98.56             94.69
Ours                      CGI+IIW(A)+SAW    99.11             97.93

Table 5.3: Quantitative results on the SAW test set. Higher is better for AP%. The second column is described in Table 5.2. The third and fourth columns show performance on the unweighted SAW benchmark and our more challenging gradient-weighted benchmark, respectively.

Figure 5.6: Precision-Recall (PR) curves for shading images on the SAW test set. Left: PR curves generated using the unweighted SAW error metric of [205]. Right: curves generated using our more challenging gradient-weighted metric. (Methods compared: CGI, CGI+IIW, CGI+IIW+SAW, ShapeNet [Shi et al. 2017], Sintel+MIT [Narihira et al. 2015], [Bell et al. 2014], Retinex-Color [Grosse et al. 2009], [Garces et al. 2012], [Zhao et al. 2012], [Zhou et al. 2015].)
Our proposed metric downweights such regions as follows. For each annotated region of constant shading, we compute the average image gradient magnitude over the region. During evaluation, when we add the pixels belonging to a region of constant shading into the confusion matrices, we multiply the number of pixels by this average gradient. This proposed metric leads to more distinguishable performance differences between methods, because regions with rich textures contribute more to the error than under the unweighted metric.

Figure 5.6 and Table 5.3 show precision-recall (PR) curves and average precision (AP) on the SAW test set with both the unweighted [205] and our proposed challenge error metrics. As with IIW, networks trained solely on our CGI data achieve state-of-the-art performance, even without using SAW training data. Adding real IIW data improves the AP in terms of both error metrics. Finally, the last column of Table 5.3 shows that integrating SAW training data significantly improves the performance of shading predictions, suggesting the effectiveness of our proposed losses for SAW sparse annotations.

Note that the previous state-of-the-art algorithms on IIW (e.g., Zhou et al. [417] and Nestmeyer et al. [254]) tend to overfit to reflectance, hurting the accuracy of shading predictions. This is especially evident in terms of our proposed challenge error metric. In contrast, our method achieves state-of-the-art results on both reflectance and shading predictions, in terms of all error metrics. Note also that models trained on the original SUNCG, Sintel, MIT intrinsics, or ShapeNet datasets perform poorly on the SAW test set, indicating the much improved generalization to real scenes of our CGI dataset.

Qualitative results on IIW/SAW. Figure 5.7 shows qualitative comparisons between our network trained on all three datasets and two other state-of-the-art intrinsic images algorithms (Bell et al. [26] and Zhou et al. [417]) on images from the IIW/SAW test sets. In general, our decompositions show significant improvements. In particular, our network is better at avoiding attributing surface texture to the shading channel (for instance, the checkerboard patterns evident in the first two rows, and the complex textures in the last four rows) while still predicting accurate reflectance (such as the mini-sofa in the images of the third row). In contrast, the other two methods often fail to handle such difficult settings. In particular, [417] tends to overfit to reflectance predictions, and their shading estimates strongly resemble the original image intensity. However, our method still makes mistakes, such as the non-uniform reflectance prediction for the chair in the fifth row, as well as residual textures and shadows in the shading and reflectance channels.

Figure 5.7: Qualitative comparisons on the IIW/SAW test sets (columns: image, Bell et al. (R), Bell et al. (S), Zhou et al. (R), Zhou et al. (S), Ours (R), Ours (S)). Our predictions show significant improvements compared to state-of-the-art algorithms (Bell et al. [26] and Zhou et al. [417]). In particular, our predicted shading channels include significantly less surface texture in several challenging settings.

                                     MSE              LMSE             DSSIM
Method             Training set      refl.   shading  refl.   shading  refl.   shading
SIRFS [19]         MIT               0.0147  0.0083   0.0416  0.0168   0.1238  0.0985
DI [252]           Sintel+MIT        0.0277  0.0154   0.0585  0.0295   0.1526  0.1328
Shi et al. [313]   ShapeNet          0.0468  0.0194   0.0752  0.0318   0.1825  0.1667
Shi et al. [313]*  ShapeNet+MIT      0.0278  0.0126   0.0503  0.0240   0.1465  0.1200
Ours               CGI               0.0221  0.0186   0.0349  0.0259   0.1739  0.1652
Ours*              CGI+MIT           0.0167  0.0127   0.0319  0.0211   0.1287  0.1376

Table 5.4: Quantitative results on the MIT intrinsics test set. For all error metrics, lower is better. The second column shows the dataset used for training. * indicates models fine-tuned on MIT.
Figure 5.8: Qualitative comparisons on the MIT intrinsics test set (columns: image, GT, SIRFS, DI, Shi et al., Shi et al.*, Ours, Ours*). Odd rows: reflectance predictions. Even rows: shading predictions. * marks predictions fine-tuned on MIT.

5.5.3 Evaluation on MIT intrinsic images

For the sake of completeness, we also test the ability of our CGI-trained networks to generalize to the MIT Intrinsic Images dataset [115]. In contrast to IIW/SAW, the MIT dataset contains 20 real objects with 11 different illumination conditions. We follow the same train/test split as Barron et al. [19], and, as in the work of Shi et al. [313], we directly apply our CGI-trained networks to the MIT test set; we additionally test fine-tuning them on the MIT training set. We compare our models with several state-of-the-art learning-based methods using the same error metrics as [313].

Table 5.4 shows quantitative comparisons and Figure 5.8 shows qualitative results. Both show that our CGI-trained model outperforms ShapeNet-trained networks qualitatively and quantitatively, even though ShapeNet, like MIT, consists of images of rendered objects, while our dataset contains images of scenes. Moreover, our CGI-pretrained model also performs better than networks pretrained on ShapeNet and Sintel. These results further demonstrate the improved generalization ability of our CGI dataset compared to existing datasets. Note that SIRFS still achieves the best results, but as described in [252, 313], it is designed specifically for single objects and generalizes poorly to real scenes.

5.6 Discussion

We presented a new synthetic dataset for learning intrinsic images, and an end-to-end learning approach that learns better intrinsic image decompositions by leveraging datasets with different types of labels. Our evaluations illustrate the surprising effectiveness of our synthetic dataset on Internet photos of real-world scenes. We find that the details of rendering matter, and hypothesize that improved physically-based rendering may benefit other vision tasks, such as normal prediction and semantic segmentation [406].

CHAPTER 6
CROWDSAMPLING THE PLENOPTIC FUNCTION

6.1 Introduction

There is a thought experiment that goes something like this: Imagine a 'camera' with no optics or image sensor of any kind. Rather, it consists only of a box equipped with GPS, a radio for Internet access, a button for 'taking pictures', and a screen for displaying those pictures. When a user presses its button, the box searches the Internet for photos tagged with its current location, and from these selects a best match to display on the screen. This thought experiment is perhaps best understood in the context of popular tourist attractions, of which one can often find countless images posted online (Figure 6.1, second row). When pointed at such an attraction, one can imagine our box producing images very similar to those of a real camera, forcing us to consider whether an image we capture ourselves is meaningfully different from a near-identical one captured by strangers.
For many, the ensuing philosophical debate hinges on whether an image reflects the scene as they remember it. After all, appearance is not generally constant over time, even under a fixed geometry and viewpoint; in outdoor settings, for example, weather changes, shadows move, and day turns to night, all resulting in appearance changes that can be observed from a single view of the scene. This poses an interesting challenge to the field of image-based rendering: can we use crowdsourced imagery to synthesize arbitrary views of a scene with viewing conditions that change over time?

Without changing viewing conditions, this challenge would reduce to the more familiar problem of reconstructing a 4D light field $L(u, v, x, y)$ that describes all light in our scene [199]. When we add time to our problem, it turns into a 5D reconstruction over what Adelson and Bergen [3] call the plenoptic function ([3] describes the plenoptic function as 7D, but it can be reduced to a 4D color light field supplemented by time by applying the later observations of [199]).

Figure 6.1: Crowdsampled plenoptic slices. Given a large number of tourist photos taken at different times of day, our system learns to construct a continuous set of light fields and to synthesize novel views capturing all-times-of-day scene appearance. (Rows: arrangement of captured images, reference view, and target light field over time; exemplar views from Internet photo collections; reconstructed reference light field under exemplar viewing conditions, i.e., plenoptic slices.)

In this paper, we propose a novel approach to neural image-based rendering from crowdsourced images that leverages the sparse structure of the plenoptic function to learn how scene appearance changes over space and time in an unsupervised manner. Our approach takes unstructured Internet photos spanning some range of time-varying appearance in a scene and learns how to reconstruct a plenoptic slice, a representation of the light field that respects temporal structure in the plenoptic function when interpolated over time, for each of the viewing conditions captured in our input data. By designing our model to preserve the structure of real plenoptic functions, we force it to learn time-varying phenomena like the motion of shadows according to sun position. This lets us, for example, recover plenoptic slices for images taken at different times of day (Figure 6.1, bottom row) and interpolate between them to observe how shadows move as the day progresses (best seen in our supplemental video). In effect, we learn a representation of the scene that can produce high-quality views from a continuum of viewpoints and viewing conditions that vary with time.

Our work makes three key contributions: first, a representation, called a DeepMPI, for neural rendering that extends prior work on multiplane images (MPIs) [418] to model viewing conditions that vary with time; second, a method for training DeepMPIs on sparse, unstructured crowdsampled data that is unregistered in time; and third, a dataset of crowdsampled images taken from Internet photo collections, along with details on how it was collected and registered. Compared with previous work, our approach inherits the advantages of recent methods based on MPIs [418, 335, 240, 69, 90], including the ability to produce high-quality novel views of complex scenes in real time, and the view consistency that arises from a 3D scene representation (in contrast to neural rendering approaches that decode a separate view for each desired viewpoint).
To these advantages we add the key ability to synthesize and interpolate continuous, photo-realistic, time-varying changes in appearance. We compare our approach both quantitatively and qualitatively to recent neural rendering methods, such as Neural Rerendering in the Wild [238], and show that our method produces superior results.

6.2 Related Work

The study of image-based rendering is motivated by a simple question: how do we use a finite set of images to reconstruct an infinite set of views? Different branches of research have explored this question from different angles and with different assumptions. Here we outline the space of approaches, highlighting work most closely related to our own.

Novel view synthesis. Novel view synthesis has traditionally been approached through either explicit estimation of scene geometry and color [133, 426, 58], or using coarser estimates of geometry to guide interpolation between captured views [47, 75, 329]. Light field rendering [199, 114, 55] pushes the latter strategy to an extreme by using dense structured sampling of the light field to make reconstruction guarantees independent of specific scene geometry. Subsequent works [198, 314, 73, 356, 274, 315] have leveraged observations on the structure of light fields to build on this approach. However, most IBR algorithms are designed to model static appearance, making them ill-suited for our problem.

Recently, deep learning techniques have been applied to this problem. Several works [348, 134] rely on global meshes to guide view synthesis. However, such methods rely heavily on the accuracy of 3D models, and often fail to model complex scene components such as translucent and thin objects. Other works predict appearance flow [419], depth probabilities [91, 391], or RGBD light fields [164, 336]. However, many of these methods independently synthesize appearance for each view, leading to inconsistent renderings across views.

Our approach builds on the use of multiplane images (MPIs) [418] for novel view synthesis. Several recent methods have shown that MPIs are an effective and learnable representation for light fields [335, 240, 69, 90]. We build on this representation by introducing the DeepMPI, which further captures viewing-condition-dependent appearance. We are also inspired by recent work that poses view synthesis as decoding features from a learned latent space [326, 219, 327, 348, 66, 89]. However, such work has been limited to synthetic environments or objects captured in controlled settings and is difficult to apply to crowdsampled images.

Appearance modeling. Several works have modeled the time-varying appearance of outdoor scenes using physically-motivated approaches [309, 129, 189] or by combining data-driven methods and dense geometry [100, 277, 400, 232]. Additionally, Martin-Brualla et al. [229, 228] reconstruct time-lapses of urban scenes from Internet photos. However, their method relies on timestamps, and models appearance changes at much coarser granularity (scene dynamics across years). The recent work of Meshry et al. [238] is probably closest to our own. They model appearance changes across varying times of day by learning an appearance embedding. However, their method relies heavily on dense multi-view stereo geometry, and tends to produce temporal artifacts under complex appearance changes. In contrast, our approach is capable of rendering a more continuous range of photo-realistic views across diverse appearances, without relying on dense input geometry.
Deep image synthesis. Our work is also related to the problems of image-to-image translation [62, 154, 367, 271], multi-modal image-to-image translation [422, 423, 143, 195], and style transfer [102, 353, 142, 312]. Recently, Generative Adversarial Networks (GANs) [112, 227, 117] have successfully produced photo-realistic imagery, enabling a variety of applications in deep image synthesis [381, 300, 169, 168, 194, 368]. However, there has been comparatively little investigation of 3D scene representations for deep image synthesis. Our method demonstrates the ability to learn a generative 3D scene representation and produce high-quality novel views of complex scenes.

6.3 Approach

Given a set $\mathcal{I} = \{I_1, I_2, ..., I_n\}$ of crowdsampled photos with corresponding camera viewpoints $\mathcal{C} = \{c^1, c^2, ..., c^n\}$ captured in a common scene, we formulate our problem as the reconstruction of plenoptic slices (local light fields parameterized by an appearance descriptor) around some reference view $r$, conditioned on each of the scene appearances captured in $\mathcal{I}$ (see Figure 6.1 for a geometric sketch of this setup). We present our approach in three parts: first, we describe how the input images $\mathcal{I}$ are collected and registered (Section 6.3.1); then we discuss our representation of the plenoptic function, which extends multiplane images (MPIs) to model appearance changes over time (Section 6.3.2); and finally we describe how to train this representation on our crowdsampled data (Sections 6.3.3 and 6.3.4).

Note on notation: Throughout the paper, we use superscripts to denote camera viewpoints and subscripts to denote image or voxel indices.

6.3.1 Collecting Crowdsampled Data

We selected a number of popular tourist sites and downloaded ∼50K photos from Flickr for each site. For each scene, we must then register these photos by solving for a camera pose and intrinsic parameters for each image. As running structure from motion (SfM) from scratch on such quantities of images is very expensive, we instead started with an existing SfM reconstruction of each site from the MegaDepth dataset [207], and performed camera relocalization to efficiently register each new image against the existing reconstruction [304].

For each landmark, we then identified a reference viewpoint $r$ to center our reconstruction, using a canonical view selection algorithm similar to that of Simon et al. [323] to find viewpoints with a high density of nearby views. We then select all images captured from within a sphere centered at $r$ for use in our method, randomly splitting the set gathered from each landmark into training and test data (see the sketch following Figure 6.2). We manually set the field of view of the reference viewpoint so that it has good coverage of the scene.

We found that the camera parameters estimated from relocalization are sometimes inaccurate, so we reapply global SfM and bundle adjustment to the smaller set of selected images near each scene's reference view to re-estimate these images' camera parameters. We used this data pipeline to gather and register photos for eight locations, and will release this data to the research community. Figure 6.2 shows final SfM reconstructions for three of these landmarks.

Figure 6.2: Registered photo collections ((a) Trevi Fountain, (b) The Pantheon, (c) Sacre Coeur). Example SfM reconstructions of clusters of Internet photos sharing similar viewpoints, labeled as red dots.
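The selection-and-split step described above is simple to express. A minimal NumPy sketch, where the sphere radius, random seed, and function name are our own illustrative choices (the 85:15 split ratio is given in Section 6.4):

```python
import numpy as np

def select_nearby_images(cam_centers, ref_center, radius, train_frac=0.85, seed=0):
    """Select images captured within a sphere around the reference view and
    split them into training and test sets.

    cam_centers: (N, 3) array of camera positions from the SfM reconstruction.
    ref_center:  (3,) position of the reference viewpoint r.
    """
    dists = np.linalg.norm(cam_centers - ref_center[None, :], axis=1)
    selected = np.flatnonzero(dists < radius)      # indices inside the sphere
    rng = np.random.default_rng(seed)
    rng.shuffle(selected)
    n_train = int(train_frac * len(selected))
    return selected[:n_train], selected[n_train:]  # train ids, test ids
```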
6.3.2 The DeepMPI Scene Representation

We base our representation on the multiplane image (MPI) format [346, 418], which represents light fields locally as a stack of fronto-parallel planar RGBα layers arranged at varying distances from the camera, akin to a stack of transparencies. Novel views are rendered from an MPI by warping the layers into a new view, then performing an over operation to composite the warped layers into a rendered image. Individual RGBα elements ("voxels") of an MPI are indexed by $(x, y)$ position and plane depth $d$.

While MPIs have been remarkably effective for reconstructing fixed light fields from sparse views [90], they do not encode any information about how viewing conditions may vary with time. Furthermore, even if we were given a regular MPI corresponding to the viewing conditions of each of our input images, directly interpolating between these MPIs would still fail to capture temporal structure in the plenoptic function. For example, interpolating between morning and afternoon MPIs would cause shadows cast by the sun to appear in duplicate when, in reality, a single shadow moved over time. This observation highlights the distinction between what we call a light field and what we call a plenoptic slice: we use the latter to describe a reparameterization of the light field that is better suited for interpolation over time.

Inspired by DeepVoxels [326], we introduce DeepMPIs to help learn this reparameterization. DeepMPIs augment standard RGBα MPIs by appending a learnable latent feature vector at each MPI voxel (see Figure 6.4). For a given scene, we position a DeepMPI at the reference viewpoint $r$, and denote this reference DeepMPI as $D^r = (B^r, \alpha^r, F^r)$. Each voxel of $D^r$ at spatial location and depth $p = (x, y, d)$ consists of a base RGB color $B^r_p$, an alpha weight $\alpha^r_p$, and a latent feature vector $F^r_p$. We set the number of DeepMPI depth planes to 64, with uniform sampling in disparity space, and we adopt the method of Zhou et al. [418] to set the depths of the near and far planes of the DeepMPI.

In our Appendix we relate the design of this representation and its training to priors on the sparse structure of the plenoptic function. At a high level, the α planes encode visibility information, which we expect to remain constant even as lighting and other viewing conditions change with time. The latent feature planes $F^r$ are trained to capture correlations between different viewing conditions that arise from, for example, limited variation in material properties and correlation among surface normals within the scene.

A plenoptic slice then consists of a DeepMPI and some exemplar image $I_k$. We can convert this to a standard RGBα MPI representing appearance under the specific conditions captured in $I_k$ by using a decoder that is trained jointly with our DeepMPI, which we describe in Section 6.3.4.

To compute a DeepMPI from a collection of registered images, we use a two-stage process: first, we estimate the base color and α planes (Section 6.3.3); then we optimize the latent features $F^r$ jointly with our neural rendering network (Section 6.3.4) to enable controllable, varying appearance.
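As a concrete picture of the representation, a DeepMPI can be stored as three tensors. A minimal PyTorch sketch; the spatial resolution and zero initializer are illustrative assumptions (the text specifies 64 planes, and Section 6.4 specifies 8-channel latent features):

```python
from dataclasses import dataclass
import torch

@dataclass
class DeepMPI:
    """Reference DeepMPI D^r = (B^r, alpha^r, F^r), with D = 64 fronto-
    parallel planes sampled uniformly in disparity."""
    base_color: torch.Tensor  # (D, 3, H, W)  base RGB planes B^r
    alpha: torch.Tensor       # (D, 1, H, W)  alpha planes, in [0, 1]
    features: torch.Tensor    # (D, 8, H, W)  learnable latent features F^r

def make_deepmpi(H, W, num_planes=64, feat_dim=8):
    # In practice B^r is initialized from the mean plane sweep volume and
    # alpha/features are optimized in stages (Secs. 6.3.3 and 6.3.4).
    return DeepMPI(
        base_color=torch.zeros(num_planes, 3, H, W),
        alpha=torch.full((num_planes, 1, H, W), 0.5),
        features=torch.zeros(num_planes, feat_dim, H, W),
    )
```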
6.3.3 Stage 1: Optimizing DeepMPI Color and α Planes

In the first stage of our method, we optimize the base color planes $B^r$ and alpha planes $\alpha^r$ in our DeepMPI as if it were a standard RGBα MPI. One simple approach would be to jointly optimize $B^r$ and $\alpha^r$ from scratch so as to minimize a reconstruction loss over all images (i.e., the difference between a known image and an MPI-predicted image from that viewpoint, averaged over all input images). However, as described in [90], such a method exhibits slow convergence and can be prone to local minima. In addition, compared to [90], our setting is more challenging because Internet photos exhibit diversity in camera parameters and viewing conditions.

Instead, we propose a simple yet effective approach to estimating $B^r$ and $\alpha^r$ given a set of posed input views. We start by creating a mean RGB plane sweep volume (PSV) at the reference viewpoint by reprojecting every image to the reference viewpoint via each depth plane, then averaging all reprojected images at each depth plane. We initialize the base color planes $B^r$ to this mean RGB PSV. Keeping these color planes fixed, we optimize the alpha planes $\alpha^r$ to minimize reconstruction losses over the training photos. Specifically, given a photo $I_k$ at viewpoint $c^k$, we project both $B^r$ and $\alpha^r$ to $c^k$, then apply the over operation from back to front to render a base color image $\hat{B}^k$:

$\hat{B}^k = \mathcal{O}\left( \mathcal{W}^k(B^r), \mathcal{W}^k(\alpha^r) \right)$,   (6.1)

where $\mathcal{O}$ is the over operation and $\mathcal{W}^k$ is the warping operation from the reference viewpoint $r$ to the target viewpoint $c^k$. We compare the rendered base color image $\hat{B}^k$ and $I_k$ using a reconstruction loss consisting of a pixel-wise $\ell_1$ loss and a multi-scale gradient consistency loss [207, 203]. We observe that the gradient consistency loss leads to higher rendering quality and faster convergence.

Since the mean RGB PSV cannot accurately model scene content that is occluded in the reference view, after optimizing $\alpha^r$ with fixed $B^r$, we unfreeze $B^r$ and jointly optimize $B^r$ and $\alpha^r$ using the reconstruction loss described above. We observe that this two-phase training method leads to more accurate estimates of $\alpha^r$ than the alternative of optimizing $B^r$ and $\alpha^r$ together from scratch. Figure 6.3 shows examples of input viewpoints and rendered base color images, as well as a comparison of pseudo-depths derived from alpha planes $\alpha^r$ computed by our two-phase training method and by the baseline. Once $B^r$ and $\alpha^r$ are estimated, they are fixed for the subsequent stage of training, described below.

Figure 6.3: Renderings of base color and alpha. From left to right: (a) original photos at target viewpoint $c^k$, (b) our estimated base color at $c^k$, (c) pseudo-depth computed from the RGBα MPI at $c^k$ using our two-phase approach, (d) pseudo-depth from the baseline. For depth maps, red = close and blue = far.

Figure 6.4: Learning framework. Our method builds a reference DeepMPI $D^r$, consisting of base color, alpha, and latent feature components organized into planar layers. A rendering network $G$ takes a DeepMPI projected to a target viewpoint $c^k$, and predicts corresponding RGB color layers. The appearance of these layers is modulated by an appearance vector $z_s$ produced by encoder $E$. The over operation $\mathcal{O}$ is applied to the resulting RGBα MPI to render a view. We jointly train the encoder $E$, rendering network $G$, and latent features $F^r$ in the DeepMPI by comparing a rendered view with an original exemplar image $I_k = I_s$. During inference, given an exemplar photo $I_s$, we can synthesize novel views close to the reference viewpoint, while also preserving the exemplar's appearance.
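The over operation $\mathcal{O}$ used throughout this chapter composites planes from back to front. A minimal PyTorch sketch, assuming plane 0 is nearest the camera:

```python
import torch

def over_composite(rgb, alpha):
    """Back-to-front 'over' compositing of MPI planes (the operator O in
    Eq. 6.1). rgb: (D, 3, H, W); alpha: (D, 1, H, W), values in [0, 1].
    """
    out = torch.zeros_like(rgb[0])
    for d in reversed(range(rgb.shape[0])):   # farthest plane first
        out = alpha[d] * rgb[d] + (1.0 - alpha[d]) * out
    return out  # (3, H, W) composited image
```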
6.3.4 Stage 2: Learning How Appearance Changes with Time

Our method's second stage optimizes the latent features $F^r$ in our DeepMPI, together with an appearance encoder $E$ and rendering network $G$, to capture and render time-varying appearance. Our learning framework is summarized in Figure 6.4.

Appearance encoder. To model appearance variation, we devise a method wherein an encoder $E$ learns to map an exemplar image $I_s$ and an auxiliary deep buffer $\Phi^r_s$ to a latent appearance vector $z_s$. Prior work, such as Meshry et al. [238], represents such variation by learning an appearance vector from the exemplar image and a deep buffer containing semantic and depth information. However, their deep buffer is aligned with the viewpoint of the exemplar image. This makes the encoding of exemplar data view-dependent when, under fixed conditions, the information it reflects (e.g., sun direction) should be largely view-independent. In contrast, we utilize a deep buffer aligned with the reference viewpoint. In particular, our encoder $E$ computes a latent appearance vector $z_s$:

$z_s = E(I_s, \Phi^r_s)$,   (6.2)

where $I_s$ is an exemplar image and $\Phi^r_s$ is a reference viewpoint-aligned deep buffer containing (1) a rectified RGB image over-composited from a PSV that reprojects exemplar $I_s$ to the reference viewpoint via the depth planes of the reference DeepMPI, (2) a flattened base color image over-composited from the base color layers $B^r$, and (3) a flattened latent feature map at the reference viewpoint over-composited from the DeepMPI features $F^r$. Such a deep buffer allows $E$ to learn complex appearance by aligning the illumination information in the exemplar image with the shared scene intrinsic properties encoded in the reference DeepMPI. Without such alignment, it is difficult for $E$ to consistently establish appearance correspondence across different viewpoints. Column (d) of Figure 6.5 shows examples of rendered images without the use of such a deep buffer; one can see that the deep buffer guides the model to capture complex illumination effects, such as the realistic shadows highlighted in the first row. Moreover, integrating the base color and latent feature map at the reference viewpoint into $\Phi^r_s$, and adding $I_s$ as an input to $E$, helps the model extrapolate appearance outside the field of view of the exemplar image, as shown in the last row of Figure 6.6.

Neural renderer. A plenoptic slice is now represented by the reference DeepMPI $D^r$ and an appearance vector $z_s$. Given these inputs, our neural renderer $G$ predicts the corresponding RGB color planes. We could either predict these RGB planes at the reference viewpoint, or after first warping the DeepMPI to the target viewpoint. We choose the latter because it simplifies efficient implementation, as noted below. Let $D^k$ denote the reference DeepMPI $D^r$ after warping into target viewpoint $c^k$. Given $D^k$ and $z_s$, $G$ predicts the RGB color planes $C^k_s$ of a standard RGBα MPI at target viewpoint $c^k$:

$C^k_s = G(D^k, z_s).$   (6.3)

In particular, $G$ takes in each layer of $D^k$ independently and predicts a corresponding RGB layer whose appearance is controlled by $z_s$. A rendered RGB image with the appearance of $I_s$ at viewpoint $c^k$ can then be obtained using the over operation with the precomputed alpha weights $\alpha^k$ in $D^k$:

$\hat{I}^k_s = \mathcal{O}(C^k_s, \alpha^k).$   (6.4)

As shown in Figure 6.4, during training we set the exemplar image $I_s = I_k$, i.e., we aim to reconstruct image $I_k$ at viewpoint $c^k$. At inference, $I_s$ is not necessarily $I_k$.

Our rendering network $G$ is a U-Net variant with an encoder-decoder architecture. Prior methods [238, 423] embed $z$ in the bottleneck or input of $G$. Instead, we use Adaptive Instance Normalization (AdaIN) layers [142] whose parameters are dynamically generated from $z$ via an MLP. AdaIN has been shown to be effective in capturing both global and spatially varying appearance of exemplar images. We find that AdaIN not only helps model natural scene appearance, but also stabilizes training. Column (b) of Figure 6.5 shows examples of our rendered images without AdaIN; one can see that the model using AdaIN preserves more faithful scene appearance, including the style and color of exemplar images.

In practice, feeding a full-resolution DeepMPI into $G$ and performing backpropagation is very memory intensive. Hence, during training, we operate on random 256 × 256 crops of training images, and only the necessary portion of $D^r$ is warped to $c^k$ and fed to $G$. At test time, any size input can be used.

Figure 6.5: Comparisons of images reconstructed with different configurations of our method (columns: (a) GT, (b) w/o AdaIN, (c) w/o $F^r$, (d) w/o $E(\Phi^r_s)$, (e) ours). The images rendered from our full approach (e) are more similar to the ground truth images (a) than other configurations. In particular, the images rendered from the models without AdaIN (b) or the DeepMPI (c) are less realistic, and the model that does not feed the deep buffer $\Phi^r_s$ to the encoder (d) fails to capture accurate scene appearance, as indicated in the highlighted regions.
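For concreteness, an AdaIN layer whose scale and shift are generated from the appearance vector $z$ might look as follows. The layer sizes and the (1 + gamma) parameterization are illustrative assumptions, not the exact renderer code:

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: normalize each feature channel, then
    apply a per-channel scale and shift predicted from z by a small MLP."""
    def __init__(self, num_channels, z_dim=16):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.mlp = nn.Linear(z_dim, 2 * num_channels)  # predicts gamma, beta

    def forward(self, x, z):
        # x: (B, C, H, W) features inside the renderer G; z: (B, z_dim)
        gamma, beta = self.mlp(z).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)      # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1.0 + gamma) * self.norm(x) + beta
```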
Prior methods [238, 423] embed z in the bottleneck or input of G. Instead, we use Adaptive Instance Normalization (AdaIN) layers [142] whose parameters are dynamically generated from z via an MLP. AdaIN has been shown to be effective in capturing both global and spatially varying appearance of exemplar images. We find that AdaIN not only helps model natural scene appearance, but also stabilizes training. Column (b) of Figure 6.5 shows examples of our rendered images without AdaIN; one can see the model using AdaIN preserves more faithful scene appearance including the style and color of exemplar images. In practice, feeding a full-resolution DeepMPI into G and performing back- propagation is very memory intensive. Hence, during training, we operate on random 256× 256 crops of training images, and only the necessary portion of Dr is warped to ck and fed to G. At test time, any size input can be used. 114 Our Rendered View (a) GT (b) w/o AdaIN (c) w/o F r (d) w/o E(Φrs) (e) Ours Figure 6.5: Comparisons of images reconstructed with different configurations of our method. The images rendered from our full approach (e) are more similar to the ground truth images (a) than other configurations. In particular, the images rendered from the models without AdaIN (b) or the DeepMPI (c) are less realistic, and the model that does not feed the deep buffer Φrs to the encoder (d) fails to capture accurate scene appearance, as indicated in the highlighted regions. Losses. To train G and E, we compute losses between output views and ground-truth exemplar views. Our training loss is composed of three terms: L = LVGG + wGANLGAN + wstyleLstyle, (6.5) where LVGG, LGAN, and Lstyle denote VGG perceptual loss, adversarial loss, and style loss. For LVGG, we adopt the formulation of [418, 62]; LGAN is computed from multi-scale discriminators [367] with an objective similar to LSGAN [227]. To further enforce that the appearance of rendered images matches that of exemplar images, our style loss Lstyle compares l1 differences between Gram matrices constructed from VGG features at different layers. We empirically observe Lstyle can guide our model to correctly capture the appearance of exemplar images, especially for rare photos such as those taken at sunset. 115 6.4 Experiments We conduct extensive experiments to validate our proposed approach on our Internet photo dataset. We first compare with two baseline methods both quantitatively and quali- tatively on the tasks of view synthesis, appearance transfer and appearance interpolation. We also present an ablation study to examine the impact of different configurations of our model. Finally, we perform a user study whose results demonstrate the quality of our synthesized novel views. Data and implementation. We evaluate our approach on five of our reconstructed scenes, which contain on average 2,064 images. For each scene, images are randomly split into training and test sets with a 85:15 ratio. We train a separate model for each scene. To mask out transient objects such as people and cars during training and evaluation, we adopt state-of-the-art semantic and instance segmentation algorithms [59, 131] to create binary object masks. We set the dimension of the latent appearance vector to z 16s ∈ R , and that of our latent DeepMPI features to F r ∈ R8p . We refer readers to the Appendix for scene statistics, network architectures, and other implementation details. Baselines. 
Baselines. We compare our approach to two state-of-the-art multi-modal image-to-image translation methods, adapted to our task: MUNIT [143] and Neural Rerendering in the Wild (NRW) [238]. To compare to MUNIT, we adapt their network $G$ to predict an RGB image at the target viewpoint from a corresponding base color input, and train with a bidirectional reconstruction loss. For NRW, both $E$ and $G$ take as input base color, per-frame depth derived from the DeepMPI, and semantic segmentation at the target viewpoint; $G$ then predicts a corresponding RGB image conditioned on the appearance vector extracted by $E$. We follow the same staged training strategy and use the same losses as in [238].

              Trevi Fountain     Sacre Coeur        The Pantheon       Top of the Rock    Piazza Navona
Method        l1    LPIPS PSNR   l1    LPIPS PSNR   l1    LPIPS PSNR   l1    LPIPS PSNR   l1    LPIPS PSNR
MUNIT [143]   0.768 2.62  20.1   0.740 2.08  20.2   0.560 1.51  21.4   0.876 3.68  18.2   0.984 2.80  17.4
NRW [238]     0.779 2.07  20.0   0.808 1.90  19.6   0.592 1.35  21.1   0.802 2.76  19.3   1.050 2.64  17.1
w/o 2-phase   0.651 1.68  21.0   0.695 1.61  20.8   0.515 1.12  21.9   0.694 2.19  20.4   1.010 2.52  17.4
w/o AdaIN     0.780 1.87  19.8   0.801 1.89  19.6   0.609 1.30  20.9   0.773 2.58  19.3   1.150 2.97  17.1
w/o F^r       0.712 1.74  20.5   0.737 1.78  20.2   0.556 1.25  21.5   0.720 2.47  19.9   1.045 2.62  17.0
w/o E(Φ^r_s)  0.670 1.70  20.9   0.715 1.66  20.5   0.549 1.16  21.5   0.703 2.24  20.0   1.017 2.52  17.2
Ours (full)   0.618 1.56  21.8   0.676 1.57  21.0   0.495 1.08  22.5   0.642 2.48  20.7   0.933 2.32  17.6

Table 6.1: Quantitative comparisons on our test set. Lower is better for l1 and LPIPS and higher is better for PSNR. l1 errors are scaled by 10 for ease of presentation.

Error metrics. Similar to [238], we report test-image reconstruction errors using three error metrics: $\ell_1$ error, peak signal-to-noise ratio (PSNR), and perceptual similarity (via LPIPS [405]). Prior work has found the LPIPS metric to be better correlated with human visual perception than other metrics.

Quantitative comparison. For fair comparison, we train and evaluate the baselines using the same data and hyperparameter settings as our method. Table 6.1 shows results of quantitative comparisons on our test set. Our proposed approach outperforms the two baseline methods by a large margin in terms of $\ell_1$ and PSNR, and is significantly better in terms of LPIPS, indicating that our method achieves higher rendering quality and realism.

Ablation study. We perform an ablation study to analyze the effect of individual system components. In particular, we replace four components with simpler configurations: (1) using a train-from-scratch baseline to estimate alpha, as described in Section 6.3.3 (w/o 2-phase); (2) including $z$ as an input to $G$ rather than using AdaIN (w/o AdaIN); (3) removing latent features from the DeepMPI (w/o $F^r$); and (4) encoding $z$ only from the exemplar image and not additionally from the deep buffer (w/o $E(\Phi^r_s)$). Quantitative results are reported in Table 6.1. Latent DeepMPI features, as well as the use of AdaIN in our neural renderer, yield significant improvements, and lead to better rendering quality for thin structures and attached shadows, as highlighted in Figure 6.5. Encoding the reference deep buffer also yields rendered images that better match the exemplar image.

Figure 6.6: Appearance transfer comparison. From left to right: (a) exemplar images used to extract appearance vectors, (b) predictions from MUNIT [143], (c) predictions from NRW [238], (d) predictions from our method.
Compared to the baselines, our rendered images are more photo-realistic and more faithful to the appearance of the exemplar images. Please zoom in to the highlighted regions for better visual comparisons.

Rendering with appearance transfer. Figure 6.6 shows qualitative comparisons between our method and the two baselines on our test set in terms of rendering quality and appearance transferability (i.e., how well the model can transfer the illumination and appearance of an exemplar image to a target viewpoint). We demonstrate compelling results in challenging cases such as sunset, which is a rare condition in the input photos. Compared to MUNIT, our rendered images are more realistic and exhibit fewer artifacts. For example, our rendered images successfully model specularities on glass windows, details on running water and droplets, cast shadows, and directional lighting effects, as shown in the highlighted regions in Figure 6.6. Our approach can also generate complex highlights and cast shadows from the sun. Compared with NRW, our rendered images are more faithful to the illumination in the exemplar image (e.g., for sunset appearance). Moreover, our approach can extrapolate appearance beyond the field of view of the exemplar image, as shown in the last row of Figure 6.6.

Appearance interpolation. A key advantage of our method is the ability to interpolate between plenoptic slices in the latent appearance space. We conduct qualitative comparisons between our approach and NRW on appearance interpolation. We choose two images to define the start and end appearance, and linearly interpolate their latent vectors to produce in-between appearances. Figure 6.7 shows a comparison of interpolation results. In the first two rows of the figure, we observe that our method can simulate the progression of surfaces exposed to sunlight as the sun moves, while NRW fails to produce this effect. In the last row, our approach recovers the gradual motion of shadows throughout the day, while shadows in the NRW results tend to fade less naturally during interpolation.

Figure 6.7: Appearance interpolation. The left- and rightmost exemplar images indicate start and end appearance. Intermediate images are generated by linearly interpolating the latent vectors from the two images. Odd rows show interpolation results from NRW [238], and even rows from our method. Moving shadows are indicated in the highlighted regions.

4D photos. Figure 6.8 shows an application of our method to generating animated 4D photos by animating the 3D viewpoint and simultaneously interpolating between latent appearance features. Our results achieve convincing changes across a variety of times of day and lighting conditions.

Figure 6.8: 4D photos. We demonstrate an application of creating 4D photos by performing spatio-temporal interpolation in which both camera viewpoint and scene illumination change simultaneously.

User study. We ran a user study using 24 random sets of videos with camera movements and synthesized images from 5 different scenes. Each video is a sequence of novel views generated by our method, NRW [238], or MUNIT [143]. To quantify the performance of appearance transfer, we also show comparisons of results generated from different exemplar images selected from our test set. We invited 46 participants and asked them to rank the results of the three approaches.
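Appearance interpolation reduces to a lerp in the latent appearance space followed by rendering. A minimal sketch reusing the over_composite helper from the Section 6.3.3 sketch; all interfaces here are illustrative assumptions rather than the exact training code:

```python
import torch

def interpolate_appearance(E, G, D_k, alpha_k, exemplars, buffers, steps=8):
    """Render a sequence that linearly interpolates latent appearance between
    two exemplar photos. E and G follow Eqs. 6.2-6.4; D_k and alpha_k are the
    DeepMPI (and its alpha planes) warped to the target viewpoint c^k.
    """
    z_a = E(exemplars[0], buffers[0])   # appearance of the start exemplar
    z_b = E(exemplars[1], buffers[1])   # appearance of the end exemplar
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        z_t = (1.0 - t) * z_a + t * z_b           # lerp in appearance space
        color_planes = G(D_k, z_t)                # Eq. 6.3: RGB planes at c^k
        frames.append(over_composite(color_planes, alpha_k))  # Eq. 6.4
    return frames
```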
88% of the time, participants responded that the videos produced by our system are the most temporally coherent. 82% of the time, they responded that the results from our method best reproduce the details of structure and illumination one would expect of a real-world scene. 77% of the time, they responded that the results from our method are the most faithful to the corresponding exemplar.

6.5 Discussion

Limitations. Our method inherits limitations from MPIs. For example, MPIs fail to generalize to viewpoints that are not well-sampled, or that are far from the reference view of the MPI (see Figure 6.9(a-b)). In addition, our model can sometimes fail to model cast shadows from occluders outside of the reference field of view, as shown in Figure 6.9(c) and (d). Despite these limitations, we believe our work represents a significant advance towards photo-realistic capture and rendering of the world from crowd photography.

Figure 6.9: Limitations ((a) insufficient views, (b) MPI limits, (c) exemplar for (d), (d) missing shadow). Some failure cases include: (a) input photo collections that do not span the full range of desired viewpoints, or (b) intrinsic limitations of MPIs leading to poor extrapolation to large camera motions. In addition, as shown in (c) (an exemplar image with a strong shadow) and (d) (the resulting rendering), our method can fail to model strong cast shadows produced by occluders outside the reference field of view.

Conclusion. We presented a method for synthesizing novel views of scenes under time-varying appearance from Internet photos. We proposed a new DeepMPI representation and a method for optimizing and decoding DeepMPIs conditioned on the viewing conditions present in different photos. Our method can synthesize plenoptic slices that can be interpolated to recover local regions of the full plenoptic function. In the future, we envision enabling even larger changes in viewpoint and illumination, including 4D walkthroughs of large-scale scenes.

CHAPTER 7
NEURAL SCENE FLOW FIELDS FOR SPACE-TIME VIEW SYNTHESIS OF DYNAMIC SCENES

7.1 Introduction

The topic of novel view synthesis has recently seen impressive progress due to the use of neural networks to learn representations that are well suited for view synthesis tasks. Most prior approaches in this domain assume that the scene is static, or that it is observed from multiple synchronized input views. However, these restrictions are violated by most videos shared on the Internet today, which frequently feature scenes with diverse dynamic content (e.g., humans, animals, vehicles) recorded by a single camera.

We present a new approach for novel view and time synthesis of dynamic scenes from monocular video input with known (or derivable) camera poses. This problem is highly ill-posed, since there can be multiple scene configurations that lead to the same observed image sequences. In addition, using multi-view constraints for moving objects is challenging, as doing so requires knowing the dense 3D motion of all scene points (i.e., the "scene flow"). In this work, we propose to represent a dynamic scene as a continuous function of both space and time, whose output consists of not only reflectance and density, but also 3D scene motion. Similar to prior work, we parameterize this function with a deep neural network (a multi-layer perceptron, MLP), and perform rendering using volume tracing [241].
We optimize the weights of this MLP using a scene flow fields warping loss that enforces that our scene representation is temporally consistent with the input views. Crucially, because we model dense scene flow fields in 3D, our function can represent the sharp motion discontinuities that arise when projecting the scene into image space, even with simple low-level 3D smoothness priors. Further, dense scene flow fields also enable us to interpolate along changes in both space and time. To the best of our knowledge, our approach is the first to achieve novel view and time synthesis of dynamic scenes captured with a monocular camera.

As the problem is highly underconstrained, we introduce several components that improve rendering quality over a baseline solution. Specifically, we introduce a disocclusion confidence measure to handle the inherent ambiguities of scene flow near 3D disocclusions. We also show how to use data-driven priors to avoid local minima during optimization, and describe how to effectively combine a static scene representation with a dynamic one, which lets us render views with higher quality by leveraging multi-view constraints in rigid regions.

In summary, our key contributions include: (1) a neural representation for space-time view synthesis of dynamic scenes that we call Neural Scene Flow Fields, which has the capacity to model 3D scene dynamics, and (2) a method for optimizing Neural Scene Flow Fields on monocular video by leveraging multi-view constraints in both rigid and non-rigid regions, allowing us to synthesize and interpolate both view and time simultaneously.

7.2 Related Work

Our approach is motivated by a large body of work in the areas of novel view synthesis, dynamic scene reconstruction, and video understanding.

Novel view synthesis. Many methods propose first building an explicit 3D scene geometry, such as point clouds or meshes, and rendering this geometry from novel views [47, 58, 75, 133, 179, 329]. Light field rendering methods, on the other hand, synthesize novel views using implicit soft geometry estimates derived from densely sampled images [55, 114, 199]. Numerous other works improve the rendering quality of light fields by exploiting their special structure [73, 274, 314, 356]. Yet another promising 3D representation is multiplane images (MPIs), which have been shown to model complex scene appearance [45, 46, 69, 90, 240, 335].

Recently, deep learning methods have shown promising results by learning a representation that is suited to novel view synthesis. Such methods have learned additional deep features that live on top of reconstructed meshes [134, 289, 348] or dense depth maps [91, 391]. Alternatively, pure voxel-based implicit scene representations have become popular due to their simplicity and CNN-friendly structure [66, 89, 219, 326, 327, 348]. Our method is based on a recent variant of these approaches that represents a scene as a neural radiance field (NeRF) [241], which models the appearance and geometry of a scene implicitly by a continuous function, represented with an MLP. While the above methods have shown impressive view synthesis results, they all assume a static scene with fixed appearance over time, and hence cannot model temporal changes or dynamic scenes.

Another class of methods synthesizes novel views from a single RGB image. These methods typically work by predicting depth maps [203, 262], sometimes with additional learned features [375], or a layered scene representation [316, 351] to fill in content in disocclusions.
While such methods, if trained on appropriate data, can be applied to dynamic scenes, they operate only on a per-frame (instantaneous) basis, and cannot leverage repeated observations across multiple views or be used to synthesize novel times.

Novel time synthesis. Most approaches for interpolating between video frames work in 2D image space, by directly predicting kernels that blend two images [259, 260, 261], or by modeling optical flow and warping frames or features [15, 161, 257, 258]. More recently, Lu et al. [221] show retiming effects of people by using a layered representation. These approaches generate high-quality frame interpolation results, but operate in 2D and cannot be used to synthesize novel views in space.

Space-time view synthesis. There are two main reasons that scenes change appearance across time. The first is illumination change; prior approaches have proposed to render novel views of a single object with plausible relighting [31, 33, 34], or to model time-varying appearance from Internet photo collections [208, 230, 238]. However, these methods operate on static scenes and treat moving objects as outliers. Second, appearance change can happen due to 3D scene motion. Most prior work in this domain [14, 28, 338, 425] requires multi-view, time-synchronized videos as input, and has limited ability to model complicated scene geometry. Most closely related to ours, Yoon et al. [399] propose to combine single-view depth and depth from multi-view stereo to render novel views by performing explicit depth-based 3D warping. However, this method has several drawbacks: it relies on human-annotated foreground masks, requires cumbersome preprocessing and pretraining, and tends to produce artifacts in disocclusions. Instead, we show that our model can be trained end-to-end, produces much more realistic results, and is able to represent complicated scene structure and view-dependent effects along with natural degrees of motion.

Dynamic scene reconstruction. Most successful non-rigid reconstruction systems either require RGBD data as input [40, 79, 150, 255, 396, 427], or can only reconstruct sparse geometry [270, 360, 324, 411]. A few prior monocular methods propose using strong hand-crafted priors to decompose dynamic scenes into piecewise-rigid parts [186, 282, 299]. The recent work of Luo et al. [222] estimates temporally consistent depth maps of scenes with small object motion by optimizing the weights of a single-image depth prediction network, but we show that this approach fails to model large and complex 3D motions. Additional work has aimed to predict per-pixel scene flow of dynamic scenes from either monocular or RGBD sequences [43, 145, 160, 223, 242, 340].

7.3 Approach

We build upon prior work for static scenes [241], to which we add the notion of time, and estimate 3D motion by explicitly modeling forward and backward scene flow as dense 3D vector fields. In this section, we first describe this time-variant (dynamic) scene representation (Sec. 7.3.1) and the method for effectively optimizing it on the input views (Sec. 7.3.2). We then discuss how to improve rendering quality by adding an explicit time-invariant (static) scene representation, optimized jointly with the dynamic one by combining both during rendering (Sec. 7.3.3). Finally, we describe how to achieve space-time interpolation of dynamic scenes through our trained representation (Sec. 7.3.4).

Background: static scene rendering.
Neural Radiance Fields (NeRFs) [241] represent a static scene as a radiance field defined over a bounded 3D volume. This radiance field, denoted F_Θ, is defined by a set of parameters Θ that are optimized to reconstruct the input views. In NeRF, F_Θ is a multi-layer perceptron (MLP) that takes as input a position (x) and viewing direction (d), and produces as output a volumetric density (σ) and RGB color (c):

(\mathbf{c}, \sigma) = F_\Theta(\mathbf{x}, \mathbf{d})    (7.1)

To render the color of an image pixel, NeRF approximates a volume rendering integral. Let r be the camera ray emitted from the center of projection through a pixel on the image plane. The expected color Ĉ of that pixel is then given by:

\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt, \quad \text{where } T(t) = \exp\left( -\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds \right).    (7.2)

Intuitively, T(t) corresponds to the accumulated transparency along that ray. The loss is then the difference between the reconstructed color Ĉ and the ground truth color C corresponding to the pixel that the ray r originated from:

\mathcal{L}_{\text{static}} = \sum_{\mathbf{r}} \| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \|_2^2.    (7.3)

7.3.1 Neural scene flow fields for dynamic scenes

To capture scene dynamics, we extend the static scenario described in Eq. 7.1 by including time in the domain and explicitly modeling 3D motion as dense scene flow fields. For a given 3D point x and time i, the model predicts not just reflectance and opacity, but also forward and backward 3D scene flow F_i = (f_{i→i+1}, f_{i→i−1}), which denote 3D offset vectors that point to the position of x at times i + 1 and i − 1 respectively. Note that we make the simplifying assumption that movement between observed time instances is linear. To handle disocclusions in 3D space, we also predict disocclusion weights W_i = (w_{i→i+1}, w_{i→i−1}) (described in Sec. 7.3.2). Our dynamic model is thus defined as:

(\mathbf{c}_i, \sigma_i, \mathcal{F}_i, \mathcal{W}_i) = F_\Theta^{dy}(\mathbf{x}, \mathbf{d}, i).    (7.4)

Figure 7.1: Scene flow fields warping. To render a frame at time i, we perform volume tracing along a ray r_i with the RGBσ at time i, giving us the pixel color Ĉ_i(r_i) (left). To warp the scene from time j to i, we offset each step along r_i using the scene flow f_{i→j} and volume trace with the associated color and opacity (c_j, σ_j) (right).

Note that for convenience, we use the subscript i to indicate a value at a specific time i.

7.3.2 Optimization

Temporal photometric consistency. The key new loss we introduce enforces that the scene at time i should be consistent with the scene at neighboring times j ∈ N(i), when accounting for motion that occurs due to 3D scene flow. To do this, we volume render the scene at time i from 1) the perspective of the camera at time i and 2) with the scene warped from j to i, so as to undo any motion that occurred between i and j. As shown in Fig. 7.1 (right), we achieve this by warping each 3D sampled point location x_i along a ray r_i during volume tracing using the predicted scene flow fields F_i to look up the RGB color c_j and opacity σ_j from the neighboring time j. This yields a rendered image, denoted Ĉ_{j→i}, of the scene at time j with both camera and scene motion warped to time i:

\hat{C}_{j \to i}(\mathbf{r}_i) = \int_{t_n}^{t_f} T_j(t)\, \sigma_j(\mathbf{r}_{i \to j}(t))\, \mathbf{c}_j(\mathbf{r}_{i \to j}(t), \mathbf{d}_i)\, dt, \quad \text{where } \mathbf{r}_{i \to j}(t) = \mathbf{r}_i(t) + \mathbf{f}_{i \to j}(\mathbf{r}_i(t)).    (7.5)

We minimize the mean squared error (MSE) between each warped rendered view and the ground truth view:

\mathcal{L}_{\text{pho}} = \sum_{\mathbf{r}_i} \sum_{j \in \mathcal{N}(i)} \| \hat{C}_{j \to i}(\mathbf{r}_i) - C_i(\mathbf{r}_i) \|_2^2.    (7.6)
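As a concrete illustration, the following is a minimal PyTorch sketch of how the warped rendering of Eq. 7.5 and the loss of Eq. 7.6 can be discretized under the standard NeRF quadrature. The MLP interface follows the earlier sketch and is an assumption for illustration; sample counts, shapes, and the absence of hierarchical sampling are all simplifications of the actual pipeline.

```python
import torch

def volume_render(rgb, sigma, t_vals):
    """Discretized Eq. 7.2: rgb (N, S, 3), sigma (N, S), t_vals (N, S) sample depths."""
    delta = t_vals[:, 1:] - t_vals[:, :-1]                      # step sizes along each ray
    delta = torch.cat([delta, 1e10 * torch.ones_like(delta[:, :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * delta)                     # per-sample opacity
    trans = torch.cumprod(                                      # exclusive cumprod: T(t)
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1
    )[:, :-1]
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(dim=1)                # expected color per ray

def warped_photometric_loss(mlp, pts_i, dirs, t_vals, i, j, gt_rgb_i):
    """Render \\hat{C}_{j->i} (Eq. 7.5) and compare to the frame at time i (Eq. 7.6)."""
    ti = i * torch.ones_like(pts_i[..., :1])
    tj = j * torch.ones_like(pts_i[..., :1])
    # Scene flow f_{i->j} predicted at the samples x_i along rays of frame i.
    _, _, f_fwd, f_bwd = mlp(pts_i, dirs, ti)
    pts_j = pts_i + (f_fwd if j > i else f_bwd)                 # r_{i->j}(t) = r_i(t) + f_{i->j}
    rgb_j, sigma_j, _, _ = mlp(pts_j, dirs, tj)                 # look up (c_j, sigma_j) at time j
    c_warped = volume_render(rgb_j, sigma_j, t_vals)
    return ((c_warped - gt_rgb_i) ** 2).sum(-1).mean()          # MSE against the input view
```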
An important caveat is that this loss is not valid in 3D disocclusion regions caused by motion. Analogous to the situation in 2D dense optical flow [235], there is no correct scene flow when a 3D location becomes occluded or disoccluded between frames. These regions are especially important as they occur at the boundaries of moving objects (see Fig. 7.2 for an illustration).

Figure 7.2: Scene flow disocclusion ambiguity. In this 2D orthographic example, a single blue object translates to the right by one pixel from the frame at time i to the frame at time j. Here, the correct scene flow at the point labeled a, i.e., f_{i→j}(a), points one unit to the right; however, for the scene flow f_{i→j}(c) (and similarly f_{j→i}(a)), there can be multiple answers. If f_{i→j}(c) = 0, the scene flow would incorrectly point to the foreground in the next frame, and if f_{i→j}(c) = 1, the scene flow would point to the freespace location d in the next frame.

To mitigate errors due to this ambiguity, we predict two extra continuous disocclusion weight fields w_{i→i+1} and w_{i→i−1} ∈ [0, 1], corresponding to f_{i→i+1} and f_{i→i−1} respectively. These weights serve as an unsupervised confidence measure of where the temporal photoconsistency loss should be applied; ideally, they should be low at disocclusions and close to 1 everywhere else. We apply these weights by volume rendering the weight along the ray r_i with the opacity from time j, and multiplying the accumulated weight into the loss at each 2D pixel:

\hat{W}_{j \to i}(\mathbf{r}_i) = \int_{t_n}^{t_f} T_j(t)\, \sigma_j(\mathbf{r}_{i \to j}(t))\, w_{i \to j}(\mathbf{r}_i(t))\, dt.    (7.7)

We avoid the trivial solution in which all predicted weights are zero by adding an ℓ1 regularization that encourages the predicted weights to be close to one, giving us a new weighted loss:

\mathcal{L}_{\text{pho}} = \sum_{\mathbf{r}_i} \sum_{j \in \mathcal{N}(i)} \hat{W}_{j \to i}(\mathbf{r}_i)\, \| \hat{C}_{j \to i}(\mathbf{r}_i) - C_i(\mathbf{r}_i) \|_2^2 + \beta_w \sum_{\mathbf{x}_i} \| w_{i \to j}(\mathbf{x}_i) - 1 \|_1,    (7.8)

where β_w is a regularization weight, which we set to 0.1 in all our experiments. We use N(i) = {i, i ± 1, i ± 2}, and chain scene flow and disocclusion weights for the i ± 2 case. Note that when j = i, no scene flow warping or disocclusion weights are involved (f_{i→j} = 0, Ŵ_{j→i}(r_i) = 1), meaning that Ĉ_{i→i}(r_i) = Ĉ_i(r_i), as in Fig. 7.1 (left). Comparing Fig. 7.3(e) and Fig. 7.3(d), we can see that adding this disocclusion weight improves rendering quality near motion boundaries.

Figure 7.3: Qualitative ablations. Results of our full method with different loss components removed. The odd rows show zoomed-in rendered color and the even rows show the corresponding pseudo depth. Each component reduces the overall quality in different ways.

Scene flow priors. To regularize the predicted scene flow fields, we add a 3D scene flow cycle consistency term encouraging that, at every sampled 3D point x_i, the predicted forward scene flow f_{i→j} is consistent with the backward scene flow f_{j→i} at the corresponding location sampled at time j (i.e., at position x_{i→j} = x_i + f_{i→j}). Note that this cycle consistency is also only valid outside 3D disocclusion regions, so we use the same predicted disocclusion weights to modulate this term, giving us:

\mathcal{L}_{\text{cyc}} = \sum_{\mathbf{x}_i} \sum_{j \in \mathcal{N}(i)} w_{i \to j}\, \| \mathbf{f}_{i \to j}(\mathbf{x}_i) + \mathbf{f}_{j \to i}(\mathbf{x}_{i \to j}) \|_1.    (7.9)

We additionally add low-level regularizations L_reg on the predicted scene flow fields (see the sketch below). First, following prior work [255], we enforce scene flow spatial smoothness by minimizing the ℓ1 difference between scene flows sampled at neighboring 3D positions along each ray. Second, we enforce scene flow temporal smoothness by encouraging 3D point trajectories to be piecewise linear [360]. Finally, we encourage scene flow to be small in most places [357] by applying an ℓ1 regularization term, since motion is isolated to the dynamic objects. All terms are weighted equally in all experiments shown in this paper.
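The following is a hedged sketch of these priors: the weighted cycle consistency of Eq. 7.9 and the three equally weighted L_reg terms. The tensor shapes, and the assumption that the backward flow has already been gathered at the displaced points (f_bwd_warped), are illustrative simplifications.

```python
import torch

def scene_flow_priors(f_fwd, f_bwd, f_bwd_warped, w_fwd):
    """f_fwd, f_bwd: (N, S, 3) flows at samples x_i along each ray;
    f_bwd_warped: backward flow queried at x_i + f_fwd; w_fwd: (N, S) weights."""
    # Cycle consistency (Eq. 7.9): the forward flow should cancel the backward
    # flow at the displaced location, modulated by the disocclusion weight.
    l_cyc = (w_fwd * (f_fwd + f_bwd_warped).abs().sum(-1)).mean()
    # Spatial smoothness: neighboring samples along a ray get similar flow.
    l_spatial = (f_fwd[:, 1:] - f_fwd[:, :-1]).abs().mean()
    # Temporal smoothness: piecewise-linear trajectories, i.e. the forward
    # flow should roughly mirror the negated backward flow.
    l_temporal = (f_fwd + f_bwd).abs().mean()
    # Minimal scene flow: motion should be near zero in most places.
    l_minimal = f_fwd.abs().mean() + f_bwd.abs().mean()
    return l_cyc, l_spatial + l_temporal + l_minimal  # (L_cyc, L_reg), terms weighted equally
```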
Please see the supplementary material for complete descriptions.

Data-driven priors. Since reconstruction of dynamic scenes with a monocular camera is highly ill-posed, the above losses can on occasion converge to sub-optimal local minima when randomly initialized. Therefore, we introduce two data-driven losses, a geometric consistency prior and a single-view depth prior: L_data = L_geo + β_z L_z. We set β_z = 2 in all our experiments.

The geometric consistency prior helps the model build more accurate correspondence associations between adjacent frames. In particular, it minimizes the reprojection error of scene-flow-displaced 3D points with respect to 2D optical flow, which we compute using FlowNet2 [146]. Suppose p_i is a 2D pixel position at time i. The corresponding 2D pixel location in the neighboring frame at time j, displaced through the 2D optical flow u_{i→j}, can be computed as p_{i→j} = p_i + u_{i→j}. To estimate the expected 2D point location p̂_{i→j} at time j displaced by the predicted scene flow fields, we first compute the expected scene flow F̂_{i→j}(r_i) and the expected 3D point location X̂_i(r_i) of the ray r_i through volume rendering. p̂_{i→j} is then computed by perspective projection of the expected 3D point location displaced by the scene flow (i.e., X̂_i(r_i) + F̂_{i→j}(r_i)) into the viewpoint corresponding to the frame at time j. The geometric consistency is computed as the ℓ1 difference between p̂_{i→j} and p_{i→j}:

\mathcal{L}_{\text{geo}} = \sum_{\mathbf{r}_i} \sum_{j \in \{i \pm 1\}} \| \hat{\mathbf{p}}_{i \to j}(\mathbf{r}_i) - \mathbf{p}_{i \to j}(\mathbf{r}_i) \|_1.    (7.10)

We also add a single-view depth prior that encourages the expected termination depth Ẑ_i computed along each ray to be close to the depth Z_i predicted by a pre-trained single-view depth network [281]. As single-view depth predictions are defined only up to an unknown scale and shift, we utilize a robust scale-shift-invariant loss [281]:

\mathcal{L}_z = \sum_{\mathbf{r}_i} \| \hat{Z}_i^*(\mathbf{r}_i) - Z_i^*(\mathbf{r}_i) \|_1,    (7.11)

where * is a whitening operation that normalizes the depth to have zero mean and unit scale (a sketch of this loss appears below).

From Fig. 7.3(b), we see that adding the data-driven priors helps the model learn correct scene geometry, especially for dynamic regions. However, as both of these data-driven priors are noisy (they can rely on inaccurate or incorrect predictions), we use them for initialization only, and linearly decay the weight of L_data to zero during training over a fixed number of iterations.
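As an illustration, here is a minimal sketch of the scale-shift-invariant depth loss of Eq. 7.11, assuming the whitening uses the mean for the shift and the mean absolute deviation for the scale; [281] uses robust estimators in a similar spirit, so the exact statistics chosen here are an assumption.

```python
import torch

def whiten(z, eps=1e-8):
    """The * operator in Eq. 7.11: normalize depth to zero mean and unit scale."""
    shift = z.mean()
    scale = (z - shift).abs().mean().clamp(min=eps)
    return (z - shift) / scale

def depth_prior_loss(z_rendered, z_single_view):
    """L1 between the whitened expected termination depth and the whitened
    single-view depth prediction (shapes: (N,) per-ray depths); averaged over
    the batch here, whereas Eq. 7.11 sums over rays."""
    return (whiten(z_rendered) - whiten(z_single_view)).abs().mean()
```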
7.3.3 Integrating a static scene representation

The method described so far already outperforms the state of the art, as shown in Tab. 7.1. However, unlike NeRF, our warping-based temporal loss can only be used within a local temporal neighborhood N(i), as dynamic components typically undergo too much deformation to reliably infer correspondence over larger temporal gaps. Rigid regions, however, should be consistent, and should leverage observations from all frames. Therefore, we propose to combine our dynamic (time-dependent) scene representation with a static (time-independent) one, and require that, when combined, the resulting volume-traced images match the input frames. We model each representation with its own MLP, where the dynamic scene component is represented with Eq. 7.4, and the static one is represented as a variant of Eq. 7.1:

(\mathbf{c}, \sigma, v) = F_\Theta^{st}(\mathbf{x}, \mathbf{d}),    (7.12)

where v is an unsupervised 3D blending weight field that linearly blends the RGBσ from the static and dynamic scene representations along each ray. Intuitively, v should assign a low weight to the dynamic representation in rigid regions, as these can be rendered in higher fidelity by the static representation, while assigning a low weight to the static representation in regions that are moving, as these can be better modeled by the dynamic representation. We found that adding the extra v leads to better results and more stable convergence than the configuration without v. The combined rendering equation is then written as:

\hat{C}_i^{cb}(\mathbf{r}_i) = \int_{t_n}^{t_f} T_i^{cb}(t)\, \sigma_i^{cb}(t)\, \mathbf{c}_i^{cb}(t)\, dt,    (7.13)

where σ_i^{cb}(t) c_i^{cb}(t) is a linear combination of the static and dynamic scene components, weighted by v(t):

\sigma_i^{cb}(t)\, \mathbf{c}_i^{cb}(t) = v(t)\, \mathbf{c}(t)\, \sigma(t) + (1 - v(t))\, \mathbf{c}_i(t)\, \sigma_i(t).    (7.14)

For clarity, we omit r_i in each prediction. We then train the combined scene representation by minimizing the MSE between Ĉ_i^{cb} and the corresponding input view:

\mathcal{L}_{cb} = \sum_{\mathbf{r}_i} \| \hat{C}_i^{cb}(\mathbf{r}_i) - C_i(\mathbf{r}_i) \|_2^2.    (7.15)

This loss is added to the previously defined losses on the dynamic representation, giving us the final combined loss:

\mathcal{L} = \mathcal{L}_{cb} + \mathcal{L}_{\text{pho}} + \beta_{\text{cyc}} \mathcal{L}_{\text{cyc}} + \beta_{\text{data}} \mathcal{L}_{\text{data}} + \beta_{\text{reg}} \mathcal{L}_{\text{reg}},    (7.16)

where the β coefficients weight each term. Fig. 7.4 shows separately rendered static and dynamic scene components, and Fig. 7.5 visually compares renderings with and without an integrated static scene representation.

Figure 7.4: Dynamic and static components. Our method learns static and dynamic components in the combined representation; from left to right: combined render, static only, dynamic only. Note that the person is almost still in the second example.

Figure 7.5: Static scene representation ablation. Adding a static scene representation yields higher-fidelity renderings, especially in static regions (a, c), when compared to the pure dynamic model (b).

7.3.4 Space-time view synthesis

To render novel views at a given time, we simply volume render each pixel using Eq. 7.5 (dynamic) or Eq. 7.13 (static+dynamic). However, we observe that while this approach produces good results at times corresponding to input views, the representation does not allow us to interpolate time-variant geometry at in-between times, leading instead to rendered results that look like linearly blended combinations of existing frames (Fig. 7.6).

Figure 7.6: Novel time synthesis. Rendering images by interpolating the time index (top) yields blending artifacts compared to our scene flow based rendering (bottom).

Instead, we render intermediate times by warping the scene based on the predicted scene flow. For efficient rendering, we propose a splatting-based plane-sweep volume tracing approach (a simplified sketch follows below). To render an image at an intermediate time i + δ_i, δ_i ∈ (0, 1), at a specified target viewpoint, we sweep over every ray emitted from the target viewpoint from front to back. At each sampled step t along the ray, we query point information through our models at both times i and i + 1, and displace all 3D points at time i by the scene flow, x_i + δ_i f_{i→i+1}(x_i), and similarly for time i + 1. We then splat the 3D displaced points onto a (c, α) accumulation buffer at the target viewpoint, and blend the splats from times i and i + 1 with linear weights 1 − δ_i and δ_i. The final rendered view is obtained by volume rendering the accumulation buffer (see the supplementary material for a diagram).
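For brevity, the sketch below replaces the scatter-style splatting with a gather-style (backward-warping) approximation: it advects radiance along the predicted flow toward time i + δ and blends the two bracketing time instances with linear weights. This conveys the idea but is not the buffer-based splatting used in the actual pipeline, and it assumes the flow is locally constant.

```python
import torch

def render_intermediate(mlp, pts, dirs, t_vals, i, delta):
    """Gather-style approximation of the splatting-based interpolation:
    a point reaching x at time i + delta approximately left
    x - delta * f_{i->i+1}(x) at time i (and analogously from time i+1)."""
    ti = i * torch.ones_like(pts[..., :1])
    tj = (i + 1) * torch.ones_like(pts[..., :1])
    _, _, f_fwd, _ = mlp(pts, dirs, ti)                 # f_{i->i+1} at the samples
    _, _, _, f_bwd = mlp(pts, dirs, tj)                 # f_{i+1->i} at the samples
    rgb_i, sig_i, _, _ = mlp(pts - delta * f_fwd, dirs, ti)
    rgb_j, sig_j, _, _ = mlp(pts - (1.0 - delta) * f_bwd, dirs, tj)
    # Blend the two bracketing times with linear weights (1 - delta, delta).
    rgb = (1.0 - delta) * rgb_i + delta * rgb_j
    sigma = (1.0 - delta) * sig_i + delta * sig_j
    return volume_render(rgb, sigma, t_vals)            # quadrature from the earlier sketch
```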
Figure 7.7: Qualitative comparisons on the Dynamic Scenes dataset. Compared with prior methods, our rendered images more closely match the ground truth and include fewer artifacts, as shown in the highlighted regions.

7.4 Experiments

Implementation details. We use COLMAP [304] to estimate camera intrinsics and extrinsics, and consider these fixed during optimization. As COLMAP assumes a static scene, we mask out features from regions associated with common classes of dynamic objects using off-the-shelf instance segmentation [131]. During training and testing, we sample 128 points along each ray and normalize the time indices to i ∈ [0, 1]. As with NeRF [241], we use positional encoding to transform the inputs, and parameterize scenes using normalized device coordinates. A separate model is trained for each scene using the Adam optimizer [177] with a learning rate of 0.0005. When integrating the static scene representation, we optimize the two networks simultaneously. Training a full model takes around two days per scene using two NVIDIA V100 GPUs, and rendering takes roughly 6 seconds per 512 × 288 frame. We refer readers to the supplemental material for our network architectures, hyperparameter settings, and other implementation details.

                          Dynamic Only                       Full
Methods              MV   SSIM (↑)  PSNR (↑)  LPIPS (↓)      SSIM (↑)  PSNR (↑)  LPIPS (↓)
SinSyn [375]         No   0.371     14.61     0.341          0.488     16.21     0.295
MPIs [351]           No   0.494     16.44     0.383          0.629     19.46     0.367
3D Ken Burns [262]   No   0.462     16.33     0.224          0.630     19.25     0.185
3D Photo [316]       No   0.486     16.73     0.217          0.614     19.29     0.215
NeRF [241]           Yes  0.532     16.98     0.314          0.893     24.90     0.098
Luo et al. [222]     Yes  0.530     16.97     0.207          0.746     21.37     0.141
Yoon et al. [399]    Yes  0.547     17.34     0.199          0.761     21.78     0.127
Ours (w/o static)    Yes  0.760     21.88     0.108          0.906     26.95     0.071
Ours (w/ static)     Yes  0.758     21.91     0.097          0.928     28.19     0.045

Table 7.1: Quantitative evaluation of novel view synthesis on the Dynamic Scenes dataset. MV indicates whether the approach makes use of multi-view information.

                          Dynamic Only                       Full
Methods                   SSIM (↑)  PSNR (↑)  LPIPS (↓)      SSIM (↑)  PSNR (↑)  LPIPS (↓)
NeRF [241]                0.522     16.74     0.328          0.862     24.29     0.113
[316] + [260]             0.490     16.97     0.216          0.616     19.43     0.217
[399] + [260]             0.498     16.85     0.201          0.748     21.55     0.134
Ours (w/o static)         0.720     21.51     0.149          0.875     26.35     0.090
Ours (w/ static)          0.724     21.58     0.143          0.892     27.38     0.066

Table 7.2: Quantitative evaluation of novel view and time synthesis. See Sec. 7.4.2 for a description of the baselines.

7.4.1 Baselines and error metrics

We compare our approach to state-of-the-art single-view and multi-view novel view synthesis algorithms. For single-view methods, we compare to MPIs [351] and SinSyn [375], which were trained on indoor real estate videos [418], and to 3D Photos [316] and 3D Ken Burns [262], which were trained mainly on images in the wild. Since these methods can only compute depth up to an unknown scale and shift, we align the predicted depths with the SfM sparse point clouds before rendering. For multi-view methods, we compare to a recent dynamic view synthesis method [399]. Since the authors do not provide source code, we reimplemented their approach based on the paper's description. We also compare to a video depth prediction method [222], performing novel view synthesis by rendering the point cloud into novel views while filling in disoccluded regions. Finally, we train a standard NeRF [241], with and without an added time input, on each dynamic scene.

We report the rendering quality of each approach with three standard error metrics: structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and perceptual similarity measured by LPIPS [405], computed over both the entire scene (Full) and dynamic regions only (Dynamic Only).
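As a hedged illustration of this masked evaluation protocol, the snippet below computes PSNR over the full frame and again restricted to dynamic regions via a binary mask; SSIM and LPIPS would be computed analogously with standard implementations (e.g., scikit-image and the lpips package).

```python
import numpy as np

def psnr(pred, gt, mask=None):
    """pred, gt: float images in [0, 1], shape (H, W, 3); mask: optional (H, W)
    binary mask selecting dynamic regions for the 'Dynamic Only' protocol."""
    if mask is not None:
        pred, gt = pred[mask.astype(bool)], gt[mask.astype(bool)]
    mse = float(np.mean((pred - gt) ** 2))
    return 10.0 * np.log10(1.0 / max(mse, 1e-10))
```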
7.4.2 Quantitative evaluation

We evaluate on the Nvidia Dynamic Scenes Dataset [399], which consists of 8 scenes with human and non-human motion recorded by 12 synchronized cameras. As in the original work [399], we simulate a moving monocular camera by extracting images sampled from each camera viewpoint at different time instances, and evaluate the result of view synthesis with respect to known held-out viewpoints and frames. For each scene, we extract 24 frames from the original videos for training and use the remaining 11 held-out images per time instance for evaluation.

Novel view synthesis. We first evaluate our approach and other baselines on the task of novel view synthesis (at the same time instances as the training sequences). The quantitative results are shown in Table 7.1. Our approach without the static scene representation (Ours w/o static) already significantly outperforms the other single-view and multi-view baselines, both in dynamic regions and on the entire scene. NeRF has the second best performance on the entire scene, but cannot model scene dynamics. Moreover, adding the static scene representation improves overall rendering quality by more than 30%, demonstrating the benefits of leveraging global multi-view information from rigid regions where possible.

Novel view and time synthesis. We also evaluate the task of novel view and time synthesis by extracting every other frame from the original Dynamic Scenes dataset videos for training, and evaluating on the held-out intermediate time instances at held-out camera viewpoints. Since we are not aware of prior monocular space-time view interpolation methods, we use two state-of-the-art view synthesis baselines [316, 399] to synthesize images at the testing camera viewpoints, followed by 2D frame interpolation [260] to render intermediate times, as well as NeRF evaluated directly at the novel space-time views. Table 7.2 shows that our method significantly outperforms all baselines, both in dynamic regions and on the entire scene.

Ablation study. We analyze the effect of each proposed system component on the task of novel view synthesis by removing (1) all added losses, which gives us NeRF extended to the temporal domain (NeRF (w/ time)); (2) the single-view depth prior (w/o L_z); (3) the geometric consistency prior (w/o L_geo); (4) the scene flow cycle consistency term (w/o L_cyc); (5) the scene flow regularization term (w/o L_reg); (6) the disocclusion weight fields (w/o W_i); and (7) the static representation (w/o static). The results, shown in Table 7.3, demonstrate the relative importance of each component, with the full system performing the best.

                      Dynamic Only                       Full
Methods               SSIM (↑)  PSNR (↑)  LPIPS (↓)      SSIM (↑)  PSNR (↑)  LPIPS (↓)
NeRF (w/ time)        0.630     18.89     0.159          0.875     24.33     0.081
w/o L_z               0.710     19.66     0.132          0.882     25.16     0.078
w/o L_geo             0.713     19.74     0.139          0.885     25.19     0.079
w/o L_cyc             0.721     20.26     0.121          0.889     26.08     0.076
w/o L_reg             0.740     21.22     0.110          0.892     26.27     0.074
w/o W_i               0.754     21.31     0.112          0.894     26.23     0.074
w/o static            0.760     21.88     0.108          0.906     26.95     0.071
Full (w/ static)      0.758     21.91     0.097          0.928     28.19     0.045

Table 7.3: Ablation study on the Dynamic Scenes dataset. See Sec. 7.4.2 for detailed descriptions of each of the ablations.

7.4.3 Qualitative evaluation

We provide qualitative comparisons on the Dynamic Scenes dataset (Fig. 7.7) and on monocular video clips collected in the wild from the Internet, featuring complex object motions such as jumping, running, or dancing with various occlusions (Fig. 7.8).
Figure 7.8: Qualitative comparisons on monocular video clips. Compared to the baselines, our approach more correctly synthesizes hidden content in disocclusions (shown in the last three rows) and locations with complex scene structure, such as the fence in the first row.

NeRF [241] correctly reconstructs most static regions, but produces ghosting in dynamic regions, since it treats all moving objects as view-dependent effects, leading to incorrect interpolation results. The state-of-the-art single-view method [316] tends to synthesize incorrect content at disocclusions, such as the bins and speaker in the last three rows of Fig. 7.8. In contrast, methods based on reconstructing explicit depth maps [222, 399] have difficulty modeling complex scene appearance and geometry, such as the thin structures in the third row of Fig. 7.7 and the first row of Fig. 7.8.

Figure 7.9: Limitations. Our method is unable to extrapolate content unseen in the training views (a), and has difficulty recovering high-frequency details if a video involves extreme object motions (b, c).

7.5 Discussion

Limitations. Monocular space-time view synthesis of dynamic scenes is very challenging, and we have only scratched the surface with our proposed method. In particular, there are several limitations to our approach. Similar to NeRF, training and rendering times are high, even at limited resolutions. Additionally, each scene has to be reconstructed from scratch, and our representation is unable to extrapolate content unseen in the training views (see Fig. 7.9(a)). Furthermore, we found that rendering quality degrades when either the length of the sequence is increased given the default number of model parameters (most of our training sequences are 1∼2 seconds long), or the amount of motion is extreme (see Fig. 7.9(b-c), where we train a model on a low-frame-rate video). Finally, our method can end up in a wrong local minimum if the object motion and camera motion are close to a degenerate case, e.g., collinear, as described in Park et al. [270].

Conclusion. We presented an approach for monocular novel view and time synthesis of complex dynamic scenes using Neural Scene Flow Fields, a new representation that implicitly models time-variant scene reflectance, geometry, and 3D motion. We have shown that our method can generate compelling space-time view synthesis results for scenes with natural in-the-wild scene motion. In the future, we hope that such methods can enable high-resolution views of dynamic scenes with larger scale and larger viewpoint changes.

CHAPTER 8
ETHICS IN DATA-DRIVEN COMPUTER VISION

8.1 Introduction

Deep learning has revolutionized almost all computer vision problems and has made significant progress in achieving automatic scene understanding of our physical world. The effectiveness of data-driven computer vision has also led to intense interest in technology transfer, ranging from autonomous vehicles and mixed reality to agriculture, transportation, health care, and education. Despite growing enthusiasm in this field over the past few years, there is also alarm about the ethical consequences of this rapid progress. For example, when we apply computer vision algorithms to large amounts of Internet visual data, how can we guarantee that sensitive or private information will be kept secret?
How can we ensure that these techniques will not create discrimination and inequality across gender and race? How do we guarantee that a model is trustworthy and transparent to humans while performing a task? To better understand these questions, in this chapter I focus on three major ethical implications of data-driven computer vision that are closely related to the topics presented in this thesis. Specifically, in Section 8.2 I explore privacy and security concerns raised by current computer vision and machine learning systems. In Section 8.3, fairness issues surrounding modern data-driven approaches are discussed in detail. In Section 8.4, concepts of interpretability that can be used to help build robust and safe vision systems are described. In each section, I provide motivation as well as potential opportunities and challenges we must face. I also categorize these ethical challenges based on different assumptions, and describe case studies that have attempted to resolve these issues. Lastly, in Section 8.5 I briefly discuss ethical aspects that are of concern in other fields such as social science, business, and law.

8.2 Privacy and Security

One of the most important ethical issues we must deal with when developing a data-driven computer vision system is privacy and security. On one hand, computer-vision-driven techniques help people enhance privacy and security in their daily-life activities [6]. For instance, FaceID technology has been shown to be a better and more secure authentication system for protecting people's privacy. On the other hand, these techniques also expose alarming privacy and security issues due to several factors, including inappropriate storage, tracking, and processing of personal data, and model vulnerability to adversarial attacks. These factors can cause security problems, in which attacks subvert the normal behavior of models by forcing them to make undesired predictions. They can also cause breaches of privacy. For instance, recent work has demonstrated that computer vision algorithms can passively recover sounds from a silent video captured with a consumer camera [74]; a press of a key on a mobile phone can also be recorded even if the victim is out of sight of the attacker [390]. Therefore, it is our responsibility to understand these potential problems with the aim of developing secure and privacy-preserving computer vision systems.

8.2.1 Security

Security implies that the system behaves normally and that its correctness, efficiency, and integrity hold, without being compromised by external adversarial attacks. Current security attacks in data-driven computer vision can be categorized according to the different phases in the life cycle of a model [12, 217, 119].

Figure 8.1: Facial expression manipulation. Computer vision has enabled control of facial expressions in arbitrary videos by using image and depth information. This has raised significant security concerns regarding misuse in fake news or propaganda. Figure adapted from Thies et al. [349].

The most prevalent type of security attack is the evasion attack, in which an estimated perturbation (i.e., noise) is added to input images to compromise performance during the inference stage [113, 345, 53, 268, 247]. Over the last few years, evasion attacks have quickly evolved from idealized white-box attacks to more practical black-box attacks, in which an attack can occur without access to the model and training data [267, 42, 60, 147, 148, 118].
Security breaches can also be achieved through poisoning attacks, in which attackers compromise model performance by adding malicious images to the training data, intentionally modifying the training data, or installing a backdoor [249, 393, 250, 183, 308, 420, 218, 65]. To defend against security attacks, a variety of techniques have been proposed. For example, the use of adversarial examples, robust statistics, or regularization in the case of adversarial training [383, 10, 332, 269, 224, 294, 243, 384, 279] is an effective way to increase a model's robustness. Detecting and rejecting poisoned examples (i.e., data sanitization) has also been introduced to defend against poisoning attacks [236, 239, 388, 265].

Image Forensics. Chapters 6 and 7 of this thesis exemplify rapid progress on image and view synthesis, and these methods demonstrate the power of generating or manipulating images or videos that are difficult to distinguish from real media. Despite such progress, their potential use in a wide variety of applications also raises significant security concerns. For example, several research methods and software products such as Face2Face [349] and DeepFakes technology [373] have shown that current computer vision and graphics algorithms are able to faithfully transfer and manipulate human facial expressions, or to replace one person's face with that of someone else (Figure 8.1). While these results are exciting from a technical standpoint, their significant security fallout has become apparent as such technologies have gone viral. At the individual and organizational levels, these techniques can enable cyber-criminals to engage in blackmail or financial fraud via impersonation [171]. Furthermore, there is evidence of their broader socio-political ramifications. For instance, a fabricated video made by the Gabon government showing the president's appearance triggered an attempted military coup in 2019 [41]. In addition, in an election campaign in India in 2020, a Delhi candidate used a similar technique to criticize the incumbent Delhi government in English in order to appeal to voter bases with different language backgrounds [70].

Beyond human faces, many neural rendering and generative modeling algorithms have emerged, demonstrating their potential for synthesizing and editing images of arbitrary objects or scenes in the wild [130, 76, 421, 350, 343, 169, 240, 241, 204]. For instance, video inpainting algorithms enable completion of corrupted regions of a video or removal of unwanted content from it. Despite their wide use in practice, they also raise significant security concerns, since these methods can be used in malicious ways, including for the spread of misinformation and fake news, financial fraud, and the introduction of counterfeit evidence in a court of law.

To resolve these potential security issues caused by image forgery techniques, significant efforts have been made by computer vision and graphics researchers. On the research end, special datasets and techniques have been introduced for image forensics. For instance, FaceForensics++ [295] is a large-scale dataset that helps researchers develop learning-based models that accurately detect facial forgeries. Several other works [144, 378, 363, 9, 364, 414] also demonstrate how to automatically detect and localize arbitrary image and video forgeries in order to prevent image synthesis and editing algorithms from being mishandled.
Social media platforms such as Twitter, YouTube, and Facebook have also already taken action to prevent misuse of image synthesis and manipulation [127]. In the United States, the states of Virginia, California, New York, and Texas have introduced legislation to combat misuse of manipulated images [184, 44]. Also, the Chinese government announced that, starting in 2020, the use of DeepFake-related technology without clear notice would be considered a crime [394].

8.2.2 Privacy

Preservation of privacy suggests that information from the models or the dataset that is considered confidential and sensitive cannot be traced or inferred. Privacy threats can be either direct or indirect, depending on whether the attackers have access to the original data. Direct privacy breaches are due mainly to unintentional handling of data on the service provider's side [67], lack of effective encryption mechanisms during transmission [266, 211, 182], or bypassing of authentication through backdoor attacks [371]. Indirect privacy breaches, on the other hand, can often be carried out via a membership inference attack, in which the adversary trains a surrogate model to speculate on whether an instance belongs to the training data [318, 397, 220]. For instance, Fredrikson et al. [93] showed that a facial image in the training set can be mostly recovered even if the attackers have access only to the model's predicted confidence score and the person's name.

Figure 8.2: Privacy-preserving image synthesis from 3D reconstruction. From left to right: original photo, synthesized image from a standard SfM reconstruction [304] through a technique from Pittaluga et al. [278], and synthesized image from a privacy-preserving SfM 3D reconstruction [105], which excludes sensitive visual information such as humans. Figure adapted from Geppert et al. [333].

To defend against privacy attacks, many strategies that incorporate knowledge from computer security, cryptography, and machine learning have been proposed. There are three major defense techniques widely used in data-driven computer vision. The first is homomorphic encryption (HE), which forces computation to be performed on the encrypted data without explicitly decrypting it [107, 136, 54]. The second approach is secure multi-party computation (SMC), in which multiple computing parties are involved but each party has access to only a portion of the entire set of private data [246, 215, 297]. Lastly, the notion of differential privacy (DP) was introduced to protect sensitive information from being inferred. DP defends against attacks by adding various kinds of randomized noise to different phases of a deep learning pipeline in order to increase its robustness to privacy attacks [1, 276, 266, 82, 163].

Privacy-preserving 3D vision. One concrete case study related to our discussion of privacy preservation involves the topics of visual 3D localization and reconstruction. Most systems proposed in this thesis depend on structure from motion (SfM) or simultaneous localization and mapping (SLAM) algorithms. SfM and SLAM are typical techniques for automatically estimating camera poses and for sensing the surrounding 3D environment from a set of 2D images. SfM and SLAM algorithms have been deployed in numerous real-world VR and AR applications with cloud or mobile services.
However, these applications usually require users to upload images to local platforms or to remote servers, which raises significant privacy issues regarding the potential disclosure of users' sensitive information. For instance, the images uploaded on the user side might reveal users' identities, and distributing such data is risky if confidential information can easily be obtained by criminals. Moreover, recent research has shown that even if the images are compressed into a latent feature represented by a low-dimensional vector, it is still possible to use them to reveal the essential contents of the scene. For instance, Pittaluga et al. [278] showed that images can be reconstructed from sparse SfM 3D point clouds of the corresponding scenes even if the original input images are discarded (center of Figure 8.2).

Fortunately, a number of methods have been developed to address these 3D privacy concerns. The goal is to preserve the original capabilities of persistent localization and image query from the environment while obscuring privacy-related structures and contents (image on the right side of Figure 8.2). In particular, recent work [333, 105] proposes hiding user information by transforming 2D or 3D image point features into randomly obfuscated 2D or 3D line features, as shown in Figure 8.3. Such a randomized oriented-line representation manages to hide most of the sensitive image contents while still implicitly including sufficient geometric information to enable robust and accurate localization, image query, and entire 3D mapping.

Figure 8.3: Privacy-preserving 3D representation. Instead of using points as a 3D representation, the use of randomized 3D lines to enable privacy-preserving localization has been proposed. Figure adapted from Speciale et al. [333].

8.3 Fairness and Bias

The second important question we must ask about a data-driven computer vision system is whether it ensures fairness. A model is fair if its outputs are independent of (i.e., have no correlation with) the inputs' attributes such as gender and skin tone. Why do we care about fairness in the age of deep learning? The reason is that data and model unfairness can harm individuals, especially minorities, and can exacerbate discrimination and inequality that already exist in our society. For instance, a recent report describes a job-recruiting platform based on data-driven models that tends to assign much higher rankings to men who are less qualified than to women who are more qualified in terms of skills [178, 190]. Vision-based applications such as image captioning and facial recognition have also been reported to have serious gender and ethnicity bias, where the systems perform much better on white men than on African American women [403, 48, 135] (Figure 8.4).

Figure 8.4: Unfairness in data-driven computer vision. State-of-the-art facial recognition systems all reveal gender and ethnicity bias in the model's predictions. These algorithms perform much better on light-skinned males than dark-skinned females. Figure adapted from Buolamwini et al. [48].

Technically speaking, unfairness suggests that a trained model tends to make biased predictions for certain groups of individuals with specific identities or attributes, while the predictions for some other groups can be less reliable or less accurate. One reason behind such unfairness is data. Most deep learning based approaches require a large amount of data in order to learn useful priors.
However, crowd-sourced data are usually unbalanced, skewed, tainted, or limited [16]. This can be due to external noise, inherent human bias, or sample-size disparity (i.e., disproportionately represented groups contribute less to the training set). Consequently, such intrinsic data bias can cause a model to downweight the importance of under-represented groups of individuals. For instance, fewer than 2 percent of the people pictured in ImageNet are above the age of 60 [81]. As a result, models trained on ImageNet can perform poorly on elderly people. Interestingly, recent work [408, 365] also shows that a deep neural network can implicitly learn correlations between targets and the underlying attributes of the individuals portrayed, and can amplify the stereotypes existing in our society, even if the training dataset is perfectly balanced. The authors show that some features are more critical, and more highly correlated with an individual's identity, than others for making predictions. As a result, it is easier for a deep neural network to memorize only these biased input features during training in order to obtain good accuracy, while ignoring possible sources of diversity.

In response, many metrics have been proposed to help measure and assess the fairness of a model [98, 358]. One typical categorization is based on individual fairness versus group fairness [83, 97, 29]. Individual fairness means that a model's outputs should be similar for pairs of individuals who have similar attributes. Group fairness, on the other hand, measures similarity for different groups separately, where each group includes individuals with similar sensitive attributes. In practice, group fairness is more widely adopted in current research and applications, and it can be further measured by demographic parity [36], equality of odds [124], and predictive quality parity [124].

To ensure model fairness, many techniques have been introduced recently, and most of them can be categorized based on when they are applied [27]. Pre-processing approaches focus on reducing model biases by improving the quality of the training dataset. This can be achieved by extending the training data with data augmentation, adding more diverse data sources, or adjusting the weights of the training samples. Masking or hiding relevant sensitive attributes of the inputs has also been shown to be useful for improving model fairness [366, 165]. In-processing strategies seek a solution that corrects the bias during training; most current solutions of this type are based on the idea of adding extra regularization terms by explicitly incorporating fairness metrics into model training [293, 7, 166]. Lastly, post-processing strategies are applied during the inference stage. For instance, the fairness metrics described above can be applied to calibrate the model during inference in order to reduce discrimination [123, 408].

Data bias in monocular depth prediction. The issue of data bias also appears in 3D vision. For example, Chapters 2 and 3 introduce two large-scale RGBD datasets that help
us achieve accurate monocular depth prediction, but more recent work [281] demonstrates that our trained models still exhibit bias due to a lack of diversity in the training datasets. For instance, in the MegaDepth dataset [207], most of the ground-truth depth is generated for outdoor buildings or statues, making the trained networks generalize poorly to indoor environments. On the other hand, the MannequinChallenge dataset was designed to yield more accurate estimates of the geometry of people, so the trained models might not generalize well to non-human objects such as animals or vehicles. Furthermore, most videos in the MannequinChallenge dataset include children and teenagers from North America. As a consequence, our trained model might not perform equally well on people of different ages and skin colors. This fairness issue is manifested in Figure 8.5. The current state-of-the-art depth prediction model alleviates this problem by training on more diverse training data from different sources, and it demonstrates less-biased depth prediction and better generalization to unseen settings.

Figure 8.5: Dataset bias in depth prediction. Qualitative comparison of the state-of-the-art depth prediction model [281] and the depth prediction models proposed in this thesis, which were trained on MegaDepth [207] and MannequinChallenge [203], using images from the Microsoft COCO dataset [210]. Figure adapted from Ranftl et al. [281].

8.4 Interpretability and Transparency

The third major issue we have to face is model interpretability, meaning that a model's reasoning should be understandable and analyzable by humans. It is clear that deep learning based approaches have dominated almost all the benchmarks in computer vision. However, if these data-driven methods work so well nowadays, why can't we simply trust the model when it makes a prediction? The reason is that the level of accuracy and error on the benchmarks provides an incomplete description of a model, as described by Doshi-Velez and Kim [78]. In order to gain better control and understanding of the computer vision systems they design, researchers and practitioners need to pay greater attention to interpretability.

Why is model interpretability important in real-world tasks? First, interpretability can build trust between humans and machines. For instance, in certain life-critical tasks such as medical diagnosis, the decision made by a model will be adopted only if its mechanism is transparent to humans. Furthermore, interpretability can also help humans identify underlying problems in order to address other issues such as unfairness and privacy, meaning that we can propose better solutions to defend against attacks once we have a better understanding of the inner workings of the models. Second, interpretability can enhance safety. The most typical example is the vision-based perception systems deployed in current self-driving vehicles. It is important that the object detection system in a self-driving car performs well for pedestrians and cyclists at all times of the day and night, in all types of weather, and in all seasons of the year, because failure to do so, even once, can have serious consequences. For instance, in a famous case of an Uber fatality that occurred in 2018, the vision system in the self-driving car did not detect a pedestrian at night and consequently did not issue warnings [372]. As a result, the Uber vehicle struck and killed a person. Thus, in order to guarantee safety in such life-critical applications, understanding when and why a method will fail is an important topic that researchers and developers should take into consideration.

Generally speaking, the interpretability of a model can be categorized into two main types, as proposed by Lipton et al. [212].
The first type is intrinsic transparency, meaning that users can walk through each step of reasoning in the system, and that the inner workings of each system component can be analyzed and understood before training. Most of the classical shallow learning models belong to this type. For example, people have the ability to understand each step of K-nearest neighbors, shallow decision trees, or support vector machine algorithms, and their mathematical properties can be analyzed even before training. The second type is post-hoc interpretability, meaning the model's mechanism is analyzed and understood by people after training. Most deep neural network models belong to this type, and numerous efforts have been made to analyze why deep neural networks succeed in many vision tasks. For example, a technique researchers often adopt is feature visualization [87, 264, 225, 325, 23], which is used to visualize learned CNN features by finding the inputs that maximize the hidden activations [264]. This method tells a story about what important information the network tries to learn from the training data. As an alternative, pixel attribution [401, 334, 307, 341] analyzes which part of the input image is responsible for a network making a certain prediction.

Uncertainty for interpretable inverse graphics. Last but not least, one particular case related to the topic of inverse graphics is the modeling of uncertainty. Many inverse graphics problems we discussed can be treated as a pixel-to-pixel transformation task, with the goal of predicting dense scene intrinsic properties from corresponding input RGB images. However, most of the proposed approaches do not explicitly capture model uncertainty, which prevents us from understanding model reliability in a transparent manner.

Figure 8.6: Uncertainty modeling in depth prediction. Uncertainty modeling can help in identifying confident regions during depth prediction, making the model robust to noise and potential attacks. From left to right: input image, ground-truth depth, depth prediction, estimated aleatoric uncertainty, and estimated epistemic uncertainty. Figure adapted from Kendall et al. [173].

In the classical machine learning and signal processing communities, uncertainty modeling is a well-studied problem [172], and has become a key building block in many applications, including flight navigation and aircraft landing. However, it is not straightforward to incorporate uncertainty modeling into current deep learning techniques, because of the extremely high dimensionality of the parameter space. In order to address this issue in different inverse graphics tasks, the Bayesian deep learning approach [173, 174] has recently been revived. In particular, the Bayesian deep learning approach introduces two types of uncertainty to help in understanding when and where the model is confident during prediction. The first is epistemic uncertainty, which captures the uncertainty due to the lack of sufficiency and diversity of the training data. This uncertainty is important for safety-critical scenarios where the training dataset is small, such as in medical diagnosis.
Figure 8.7: Uncertainty modeling in novel view synthesis. By modeling transient and sensitive objects in Internet photos as aleatoric uncertainty, we obtain novel view synthesis results with better rendering quality and privacy-preserving properties. From left to right: rendered static component, rendered transient component, composite rendered image, original photo, and estimated aleatoric uncertainty. Figure adapted from Martin-Brualla et al. [230].

The second is aleatoric uncertainty, which captures information that the model itself is not able to explain even if given enough training data. This uncertainty is important in situations where we have sufficient data or where real-time performance is required, such as in autonomous driving. Uncertainty modeling has recently been used in several inverse graphics tasks. For example, in the case of monocular depth prediction shown in Figure 8.6, uncertainty modeling can help people identify and understand where the model makes inaccurate predictions and what causes those errors. This is important for ensuring the safety and transparency of deep learning models in tasks that require a 3D understanding of the scene [173]. The idea of uncertainty modeling was also recently adopted in the task of novel view synthesis, shown in Figure 8.7, where the authors proposed modeling transient objects in Internet photos by incorporating aleatoric uncertainty. Since these transient objects usually carry private or sensitive information, they show that the uncertainty-modeling technique not only helps improve rendering quality but also ensures preservation of privacy [230].

8.5 Other Aspects

In this section, I discuss several other social implications that should be considered when developing a data-driven computer vision system. These topics are interdisciplinary and also play important roles in different aspects of our society.

8.5.1 Policy and Regulation

The first topic is the legal aspects surrounding deployment of data-driven systems. To promote the benefits of current techniques for every individual, and to properly manage the risks associated with these techniques, it is necessary to involve lawmakers in the development of vision systems. Recent work suggests that policy and regulation should be encoded before a system is deployed, in order to prevent foreseen problems, and should also be applied after the system is built, in order to prevent unforeseen problems [68]. As emphasized in several recent reports [337, 50, 376], legislative issues to which attention must be paid are as follows:

• Legal liability for possible negative outcomes on individuals by the system, and legal accountability for the use of personal data
• Proper regulation of potential misuse of data and models for cyber-security and cyber-privacy, as well as regulation of the safety and health of individuals
• Regulations on algorithmic discrimination and unlawful profiling by automated decision-making from data-driven systems
• Intellectual property and copyright laws applied to the ownership of emerging techniques and to the process of data collection

Fortunately, several countries have started to address these legal issues by establishing new policies for better system development. The most notable example is the General Data Protection Regulation (GDPR) [361], which is designed to protect the private information of European Union citizens as well as to ensure fairness and transparency while using personal data for research and product development. Other countries, such as China, have also announced plans to regulate and boost the healthy development of AI systems in the long run [197].

8.5.2 Employment and HCI

Automation and unemployment.
As a result of rapid progress in the development of data-driven computer vision, many industries have adopted these techniques in automated processes, and this has drastically improved efficiency and productivity. However, it has also raised concerns regarding harmful consequences for the workforce and employment, especially in the areas of transportation, financial services, and commerce. For instance, we can imagine autonomous Uber cars or driverless trucks taking over the roads in a couple of years. The fact that AI systems will take over many tasks that have traditionally been performed by humans has raised increasing concerns about massive unemployment and reductions in wages [80, 290]. On the other hand, as described by Petropoulos [275], technological innovations always affect employment in two ways: they can displace human labor, but they can also introduce new job opportunities and spark a demand for employees who are capable of dealing with the new technology. For example, deep learning research has created many job opportunities, such as the crowd-sourcing services provided through Amazon Mechanical Turk. Therefore, in order to adapt to this new era of automation by deep learning, different parties and individuals need better ways to accommodate the possible outcomes triggered by the recent deep learning revolution.

Human–computer collaboration. The importance of interaction and collaboration between humans and vision systems cannot be denied, since such collaboration enables their complementary strengths to be maximized [288]. On one hand, computer vision systems can assist humans by extending their abilities in daily life. For instance, vision-aided 3D SLAM systems have been used to assist visually impaired people in navigating unknown environments [137, 138]. On the other hand, humans can also assist computer vision systems. This is often related to the notion of human-in-the-loop, where the involvement of human intelligence can help improve model performance by increasing the quality of training data, and extra user guidance can lead to better decision-making on the part of machines. For instance, recent computer vision and graphics work has demonstrated that user guidance significantly helps deep learning models produce more desirable effects in a variety of image-editing applications, such as colorization [404], inpainting [410], segmentation [209], and content creation [271]. Thus, having a human-centered system will be a critical step in forming a virtuous cycle in human–AI relations that contributes to social welfare.

8.5.3 Control and Surveillance

As described earlier, collecting a large amount of data from diverse sources in the wild can help improve the performance of deep neural networks. Without proper regulation, however, this could lead to dystopian control of human liberty, freedom, and democracy. For example, facial-recognition systems have recently been criticized for their potential to bring about algorithmic authoritarian control and governance [248, 126]. As reported by the New York Times, millions of cameras have been set up in China as a governmental surveillance mechanism for criminal identification.
However, as pointed out by Hartzog [126], such invasive mass surveillance is intrinsically oppressive and will lead to a “Panopticon,” in which the civil liberties of individuals can be harmed, since people are more likely to follow the rules and act differently than they otherwise would if they are concerned that they may be under surveillance. Another typical consequence is the “surveillance capitalism” described by Shoshana Zuboff [429], in which personal-behavior data obtained from users of a product is incorporated into the prediction and shaping of future user behavior surrounding that product. This is usually due to massive surveillance underlying online services such as search engines and social media. For example, if you have recently browsed images of laptops, you have likely seen advertisements related to discounts on those laptops on different websites. In fact, this technique was used by Google, where personal data collected from users was exploited to grow the targeted product market [430]. Since surveillance capitalism has been widely used to translate personal information into predictions of future behavior for purposes of production and profit-making, its use to intervene in and steer the behavior of individuals in directions that favor capitalists has not gone unnoticed by the public. As warned by Zuboff [429], surveillance capitalism will encroach on the ability of humans to act autonomously and on the functioning of society as a democracy, and it will introduce new types of social inequality and injustice because of its unilateral nature and its asymmetric effects on different individuals. Thus, people should be aware of the potential damage caused by leakage of private data in terms of shaping our values, and governments are urged to enact legislation that protects human rights and democracy and prevents us from stepping into a system of totalitarianism.

8.5.4 Sustainability

Machine learning and computer vision techniques can also have a broad impact on sustainable development.

Social sustainable development. Data-driven vision systems have been shown to be useful for facilitating social equality. For example, computer vision models have recently been used as an aid in the identification of poverty and famine through satellite images [49, 11], as well as in the prediction of livelihood and socioeconomic attributes through street-view images [103, 196] and in the management and discovery of natural resources [392, 88]. However, they could have a negative effect on social cohesion. For instance, greatly increased use of data-driven media platforms could lead to the phenomenon of the “filter bubble” [431], meaning that social media can shape our views of the world in which we live, in that the algorithms could limit us to being exposed only to the texts, images, and voices that we would like to see, which could be different from what we should see [122]. That could in turn lead to sociopolitical polarization, increased social isolation, and the intensification of bias and hatred.

Economic sustainable development. Data-driven vision techniques also contribute to the development of greater efficiency in numerous areas, such as agricultural management, where computer vision techniques can be used for the analysis and prediction of yields of crops and fruits [121] and for early diagnosis of diseases in plants [245].
However, some recent reports [71] also show that data-driven techniques can potentially cause larger income gaps and increased inequality across countries because of unbalanced resource allocations. Furthermore, because of the limited transparency of current deep-learning-based approaches, the economic stability of products built on them cannot be guaranteed [30].

Environmental sustainable development. On the bright side, machine learning models can help support low-carbon energy systems with higher productivity and efficiency, and thus they are an important tool for addressing the problem of global climate warming [359]. We have also seen how these algorithms can improve the health of our ecosystem. For instance, some vision techniques are used to reduce environmental pollution by identifying the locations of oil spills in the oceans [175]. The trends of desertification and invasive species can also be analyzed and tracked by deep neural network models [244, 24, 86], which aid in policy-making on environmental protection and remediation. On the other hand, the high energy demands and expensive computational resources needed by many current data-driven studies and product designs lead to increasing demand for electricity and increasing waste of natural resources [162]. Therefore, designing more efficient and energy-preserving deep learning platforms is key to the sustainable development of our environment.

8.6 Discussion

In this chapter, I presented important ethical principles in modern data-driven computer vision. I discussed both the benefits and the potential problems that current techniques could bring to our society. In particular, I discussed the issues of privacy, security, fairness, and interpretability that we have to face in this age of the deep learning revolution. I presented different approaches that have been proposed to deal with these issues, pointing to concrete case studies. Furthermore, I discussed the broader financial, legal, and societal impacts of these techniques.

There are still many challenges we have to address in order to achieve ethical, fair, safe, and transparent data-driven computer vision systems. For example, we still don't have a unified and scalable framework for evaluating the interpretability of deep learning models, since current approaches often involve tedious human-in-the-loop inspections. Therefore, the development of an efficient interpretability approach that can extend to wide deployment at scale is an area that deserves further study. In addition, a deeper understanding of deep neural networks can also improve the effectiveness of strategies for defending against potential privacy and security attacks [149]. Researchers need to have a better understanding of the trade-offs among privacy, fairness, and efficiency, since previous work has demonstrated that decreasing biases and increasing security can compromise models' prediction performance [106]. Therefore, greater efforts should be devoted to building unbiased and privacy-preserving models that also achieve the desired prediction accuracy.

CHAPTER 9
CONCLUSION

In this thesis, I presented new approaches for addressing tasks in inverse graphics, with the goal of learning scene geometry, appearance, and dynamics. By using massive amounts of Internet visual data, my proposed approaches achieve state-of-the-art results in a variety of in-the-wild scenarios. In particular, in Chapters 2 and 3, I introduced new large-scale datasets derived from Internet photo collections and YouTube MannequinChallenge videos.
Based on these, I developed new models for learning better dense depth from a single RGB image or ordinary videos of dynamic scenes with moving people. In Chapters 4 and 5, I addressed the problem of intrinsic image decomposition, whose goal is to estimate scene surface material and illumination properties by decomposing an image into reflectance and shading maps. My work introduced two new datasets derived from Internet time-lapse videos and from synthetic images produced by a physically-based rendering engine. I proposed novel strategies for training deep neural networks on these datasets to obtain state-of-the-art decomposition results on photos of real-world scenes. In Chapters 6 and 7, I further demonstrated how to tackle the challenging problem of space-time view synthesis. Specifically, I showed that a large number of Internet photos can constitute a useful data source for training a DeepMPI scene representation that enables synthesizing photo-realistic novel views while modeling time-varying appearance changes of different landmarks around the world. Furthermore, scene dynamics from a monocular video can also be encoded into a neural scene flow fields representation that allows us to simultaneously perform novel view synthesis and create slow-motion effects. Lastly, in Chapter 8, I discussed several potential ethical problems that current deep learning and computer vision techniques can bring about. It served as a survey that reminds researchers and developers of the importance of their social and political influence, and guides people to develop more effective techniques to address these ethical issues.

Future directions. Although my work has made great advancements toward better understanding the space-time information of our physical world, a number of open challenges still remain to be solved. For example, I tend to solve each scene intrinsics estimation problem independently, and I also separate scene dynamics into two individual factors, namely illumination changes and object motions. But I believe the ultimate goal of inverse graphics is to bring scenes truly to life. In other words, I imagine a model that can jointly capture all the scene intrinsic properties and model both short-term and long-term temporal changes of our physical world. Thus, an interesting research direction is how to use both Internet photos and videos in deploying a unified system for holistic scene understanding and modeling from images in the wild.

APPENDIX A
CHAPTER 2 APPENDIX

A.1 Depth Map Refinement and Enhancement

In this section, we provide additional details for our depth map refinement and enhancement methods presented in Sections 3.2 and 3.3 of Chapter 2.

A.1.1 Modified MVS algorithm

Our modified MVS algorithm and semantic segmentation-based depth map filtering are summarized in Algorithm 1. Our algorithm first runs PatchMatch [35] using photometric consistency constraints, as implemented in COLMAP, to solve for an initial depth map $D^0$ (with some pixels whose depth could not be estimated marked as invalid). Next, $K$ iterations of PatchMatch using geometric consistency constraints are run. For each iteration $k$, we compare the depth values at each pixel before and after the update and keep the smaller (closer) of the two, to get an updated depth map $D^k$. After $K$ iterations of PatchMatch, we apply a median filter to $D^K$ and only keep depths whose values are stable, in that they are close to their median-filtered value.
Finally, we remove spurious depths from transient objects based on semantic segmentation, as described in Section 3.3 of Chapter 2. Regarding the parameters defined in Algorithm 1, we set $\tau_1 = \tau_2 = 1.15$ and $K = 3$. Two additional examples of depth maps with and without our refinements are shown in Figure A.1.

Algorithm 1 Depth Refinement and Semantic Cleaning
Input: Input image $I$, semantic segmentation map $L$ (divided into subregions $F$ (foreground), $B$ (background), and $S$ (sky)).
Output: Refined depth map $D$ for image $I$.
1: Run PatchMatch using photometric consistency constraints to solve for an initial depth estimate $D^0$. Pixels in $D^0$ without an assigned depth are instead assigned a NaN sentinel value.
2: for round $k = 1$ to $K$ do
3:   Run PatchMatch using geometric consistency constraints on $D^{k-1}$ to get updated depth estimate $D^k$.
4:   $R^k = D^k / D^{k-1}$ (element-wise)
5:   for each valid (non-NaN) pixel $p$ of $R^k$ do
6:     if $R^k_p > \tau_1$ then
7:       $D^k_p = D^{k-1}_p$
8:     else
9:       $D^k_p = D^k_p$
10: Apply a $5 \times 5$ median filter on $D^K$, storing the result in $\hat{D}^K$.
11: Filter (replace with NaN) unstable pixels of $D^K$ for which $\max(\hat{D}^K_p / D^K_p,\, D^K_p / \hat{D}^K_p) > \tau_2$.
12: for each connected component $C$ from $F$ do
13:   if the fraction of valid depths in $C$ is $> 50\%$ then
14:     keep depths in region $C$ from $D^K$.
15:   else
16:     remove all depths in region $C$ from $D^K$.
17: Filter out all depths in the sky region $S$.
18: Apply morphological erosion followed by a small connected components removal operation on $D^K$ to obtain the final depth map $D$.

A.1.2 Foreground and background classes

In this subsection, we provide details of the foreground object classes used to define the foreground mask $F$ for each image, and similarly the background object classes used to define the background mask $B$. These classes are subsets of the classes recognized by our semantic segmentation module, as described in Section 3.3 of Chapter 2.

Foreground classes. $F$ = {person, table, chair, seat, signboard, flower, book, bench, boat, bus, truck, streetlight, booth, poster, van, ship, fountain, bag, minibike, ball, animal, bicycle, sculpture, traffic light, bulletin board}

Background classes. $B$ = {building, house, skyscraper, hill, tower, waterfall, mountain}.

Figure A.1: Additional example comparisons between MVS depth maps with and without our proposed refinement/cleaning methods. (a) Input photo, (b) raw depth, (c) refined depth. Column (b) (before filtering): the plinth of the statue in the first row and the “Statue of Liberty” in the second row both show depth bleeding effects. Column (c) (after filtering): our refinement method corrects or removes such depth values.

A.1.3 Automatic ordinal depth labeling

In this subsection, we provide additional details for our automatic ordinal depth labeling method. Recall that $O$ (“Ordinal”) is the subset of photos that do not satisfy the “no selfies” criterion described in Chapter 2. Recall that the “no selfies” criterion rejects images $I$ for which $< 30\%$ of the pixels (ignoring the sky region $S$) consist of valid depth values; these rejected images are added to the set $O$. For each image $I \in O$, and given the foreground and background regions $F$ and $B$ in $I$ as defined above, we compute two regions, $F_{ord} \subseteq F$ and $B_{ord} \subseteq B$, such that all pixels in $F_{ord}$ are likely in front of all pixels in $B_{ord}$. In particular, we assign any connected component $C$ of $F$ to $F_{ord}$ if the area of $C$ is larger than 5% of the image. We assign a pixel $p \in B$ to $B_{ord}$ if it satisfies the following conditions:

1. $p$ belongs to the background region $B$,
2. the area of $p$'s connected component in $B$ is larger than 5% of the image, and

3. $p$ has a valid depth value that lies in the last quartile of the full range of depths for $I$.

Originally, we considered a more complex approach involving geometric reasoning (e.g., estimating where foreground objects touch the ground), but we found that the simple approach above works very well ($> 95\%$ accuracy in pairwise ordinal relationships), likely because natural photos tend to be composed in certain common ways. Additional examples of our automatic ordinal depth labels are shown in Figure A.2.

Figure A.2: Additional examples of automatic ordinal labeling. Blue mask: foreground ($F_{ord}$) derived from semantic segmentation. Red mask: background ($B_{ord}$) derived from reconstructed depth.

A.2 SfM Disagreement Rate (SDR)

In this section, we provide additional details for our SfM Disagreement Rate (SDR) error metric defined in Section 5.1 of Chapter 2. SDR is based on the rate of disagreement between a predicted depth map and the ordinal depth relationships derived from estimated ground truth SfM points. We use sparse SfM points for this purpose rather than dense MVS depths for two reasons: (1) we found that sparse SfM points can capture some structures not reconstructed by MVS (e.g., complex objects such as lampposts), and (2) we can select a robust subset of SfM points based on measures from SfM such as the number of observing cameras or the uncertainty of the estimated depth computed by bundle adjustment.

We define $\mathrm{SDR}(D, D^*)$, the ordinal disagreement rate between the predicted (non-log) depth map $D = \exp(L)$ and ground-truth SfM depths $D^*$, as:

$$\mathrm{SDR}(D, D^*) = \frac{1}{n} \sum_{i,j \in \mathcal{P}} \mathbb{1}\left( \mathrm{ord}(D_i, D_j) \neq \mathrm{ord}(D^*_i, D^*_j) \right) \tag{A.1}$$

where $\mathcal{P}$ is the set of pairs of pixels with available SfM depths to compare, $n$ is the total number of pairwise comparisons, and $\mathrm{ord}(\cdot, \cdot)$ is one of three depth relations (further-than, closer-than, and same-depth-as):

$$\mathrm{ord}(D_i, D_j) = \begin{cases} 1 & \text{if } D_i / D_j > 1 + \delta \\ -1 & \text{if } D_i / D_j < 1 - \delta \\ 0 & \text{if } 1 - \delta \le D_i / D_j \le 1 + \delta \end{cases} \tag{A.2}$$

In other words, SDR is the rate of disagreement between predicted and ground-truth depths in terms of pairwise depth orderings (a minimal sketch of this computation appears at the end of this section). Note that SDR is an unweighted measure for simplicity (all measurements count the same towards the cost), but we can also integrate depth uncertainty derived from bundle adjustment as a weight. We also define $\mathrm{SDR}_{=}$ and $\mathrm{SDR}_{\neq}$ as the disagreement rates over pairs with $\mathrm{ord}(D^*_i, D^*_j) = 0$ and $\mathrm{ord}(D^*_i, D^*_j) \neq 0$, respectively. In our experiments, we set $\delta = 0.1$ for tolerance to uncertainty in SfM points.

Because SDR is based on point pairs and hence takes $O(n^2)$ time to compute, for efficiency we subsample SfM points by splitting each image into $15 \times 15$ blocks and, for each block, randomly sampling an SfM point (if any exist). We then use these sampled points to create a clique of ordinal relations, where each edge connecting two features is augmented with the ordinal depth label. To obtain reliable sparse points, we only sample SfM points seen by $> 5$ cameras and with reprojection error $< 3$ pixels. Figure A.3 shows several examples of SfM points we sample for evaluating SDR.

Figure A.3: Examples of sampled SfM points. Red circles indicate sampled SfM points, with the radius indicating the estimated depth derived from SfM; small radius = small (close) depth, large radius = large (far) depth.
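As referenced above, the following is a minimal NumPy sketch of the SDR computation in Eqs. A.1–A.2, assuming the predicted and SfM depths have already been sampled at matched pixel locations. The function name and flat-array interface are illustrative assumptions, not the exact implementation used in this chapter.

```python
import numpy as np

def sdr(d_pred, d_gt, delta=0.1):
    """SfM Disagreement Rate over all pairs of sampled points (Eqs. A.1-A.2).

    d_pred, d_gt: 1D arrays of predicted and SfM depths at sampled pixels.
    Returns (SDR, SDR_eq, SDR_neq) following the definitions above.
    """
    def ord_rel(d):
        # Pairwise ratio matrix; ord = 1 (further), -1 (closer), 0 (same).
        r = d[:, None] / d[None, :]
        return np.where(r > 1 + delta, 1, np.where(r < 1 - delta, -1, 0))

    o_pred, o_gt = ord_rel(d_pred), ord_rel(d_gt)
    iu = np.triu_indices(len(d_pred), k=1)   # count each unordered pair once
    disagree = o_pred[iu] != o_gt[iu]
    eq, neq = (o_gt[iu] == 0), (o_gt[iu] != 0)
    sdr_all = disagree.mean()
    sdr_eq = disagree[eq].mean() if eq.any() else 0.0
    sdr_neq = disagree[neq].mean() if neq.any() else 0.0
    return sdr_all, sdr_eq, sdr_neq
```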
APPENDIX B
CHAPTER 3 APPENDIX

B.1 Derivations of depth from motion parallax

Here we provide detailed derivations of depth from motion parallax using the Plane+Parallax representation (Section 4.1). Recall that in Chapter 3, we define the relative camera pose as $R \in SO(3)$, $t \in \mathbb{R}^3$ from source image $I_s$ to reference image $I_r$, with common intrinsics matrix $K$. We denote the forward flow from $I_r$ to $I_s$ as $f_{fwd}$, and the backward flow from $I_s$ to $I_r$ as $f_{bwd}$. Let $\Pi$ denote a real or virtual planar surface, let $d'_\Pi$ denote the distance between the camera center of source image $I_s$ and the plane $\Pi$, and let $h$ be the distance between the 3D scene point corresponding to 2D pixel $p$ and $\Pi$. It can be shown (see the Appendix of [152] for full intermediate derivations) that

$$p = p_w + \frac{h\, t_z}{D_{pp}(p)\, d'_\Pi}\, p_w - \frac{h}{D_{pp}(p)\, d'_\Pi}\, K t \tag{B.1}$$

$$= p_w + \frac{h}{D_{pp}(p)\, d'_\Pi} \left( t_z p_w - K t \right) \tag{B.2}$$

where $D_{pp}(p)$ is the estimated depth at $p$ in the reference image $I_r$, $t_z$ is the third component of the translation vector $t$, and $p_w$ is the 2D image point in $I_r$ that results from warping the corresponding 2D pixel in $I_s$ (obtained via optical flow $f_{fwd}$) by a homography $A$:

$$p_w = \frac{A p'}{a_3^T p'}, \quad \text{where } A = K \left( R + \frac{t\, n'^T}{d'_\Pi} \right) K^{-1} \tag{B.3}$$

where $p' = p + f_{fwd}(p)$, $a_3^T$ is the third row of $A$, and $n'$ is the normal of plane $\Pi$ with respect to the camera of source image $I_s$. Note that the original paper [152] divides the P+P representation into two cases depending on whether $t_z = 0$, but we combine these two cases into the single equation shown in Equation B.2 by algebraic manipulation. Now, if we set the plane $\Pi$ at infinity, using L'Hôpital's rule, we can cancel out $h$ and $d'_\Pi$ and obtain the following equations:

$$p = p_w + \frac{t_z p_w - K t}{D_{pp}(p)}, \qquad D_{pp}(p) = \frac{\| t_z p_w - K t \|_2}{\| p - p_w \|_2}, \tag{B.4}$$

where $p_w = \frac{A' p'}{a_3'^T p'}$ and $A' = K R K^{-1}$.
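To make the plane-at-infinity case concrete, here is a minimal NumPy sketch that evaluates $D_{pp}(p)$ for a single pixel via Eq. B.4, under the pose and flow conventions defined above. The function name and input layout are illustrative assumptions rather than the implementation used in Chapter 3.

```python
import numpy as np

def depth_from_parallax(p, flow, K, R, t):
    """Recover depth from motion parallax with the plane at infinity (Eq. B.4).

    p:    homogeneous pixel (3,) in the reference image I_r.
    flow: forward optical flow (2,) at p, from I_r to I_s.
    K, R, t: shared intrinsics and relative pose, following the
    conventions defined in the derivation above.
    """
    # Warp p' = p + f_fwd(p) by the infinite homography A' = K R K^-1.
    A_inf = K @ R @ np.linalg.inv(K)
    p_prime = p + np.array([flow[0], flow[1], 0.0])
    pw = A_inf @ p_prime
    pw = pw / pw[2]   # divide by third component, i.e., a_3'^T p'
    # D_pp(p) = ||t_z * p_w - K t||_2 / ||p - p_w||_2 (2D image-plane norms).
    num = (t[2] * pw - K @ t)[:2]
    den = (p - pw)[:2]
    return np.linalg.norm(num) / np.linalg.norm(den)
```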
B.2 Derivation of error metrics

Recall that in Section 5 of Chapter 3 we define five different depth error metrics based on the scale-invariant RMSE (si-RMSE). Here we provide definitions of each error metric. Note that we can use algebraic manipulations similar to those proposed in [206] to evaluate all terms in time linear in the number of pixels. As in Chapter 3, we denote with $\hat{D}$ the predicted depth, and with $D_{gt}$ the ground truth depth. We define $R(p) = \log \hat{D}(p) - \log D_{gt}(p)$, i.e., the difference between computed and ground truth log-depth. We also denote human regions as $\mathcal{H}$ (with $N_h$ valid pixels), non-human (environment) regions as $\mathcal{E}$ (with $N_e$ valid pixels), and the full image region as $\mathcal{I} = \mathcal{H} \cup \mathcal{E}$ (with $N = N_e + N_h$ valid pixels). Our error metrics are defined as follows:

si-full measures the si-RMSE between all pairs of pixels, giving the overall accuracy across the entire image:

$$\text{si-full} = \frac{1}{N^2} \sum_{p \in \mathcal{I}} \sum_{q \in \mathcal{I}} (R(p) - R(q))^2 \tag{B.5}$$
$$= \frac{1}{N^2} \sum_{p \in \mathcal{I}} \sum_{q \in \mathcal{I}} \left( R(p)^2 + R(q)^2 - 2 R(p) R(q) \right) \tag{B.6}$$
$$= \frac{2}{N^2} \left( N \sum_{p \in \mathcal{I}} R(p)^2 - \sum_{p \in \mathcal{I}} R(p) \sum_{q \in \mathcal{I}} R(q) \right) \tag{B.7}$$
$$= \frac{2}{N} \sum_{p \in \mathcal{I}} R(p)^2 - \frac{2}{N^2} \left( \sum_{p \in \mathcal{I}} R(p) \right)^2 \tag{B.8}$$

si-env measures pairs of pixels in non-human regions $\mathcal{E}$, thus computing the accuracy of the depth in the environment:

$$\text{si-env} = \frac{1}{N_e^2} \sum_{p \in \mathcal{E}} \sum_{q \in \mathcal{E}} (R(p) - R(q))^2 \tag{B.9}$$
$$= \frac{2}{N_e^2} \left( N_e \sum_{p \in \mathcal{E}} R(p)^2 - \sum_{p \in \mathcal{E}} R(p) \sum_{q \in \mathcal{E}} R(q) \right) \tag{B.10}$$

si-hum measures pairs where one pixel lies in the human region $\mathcal{H}$ and one lies anywhere in the image, thus computing overall depth accuracy for the people in the scene:

$$\text{si-hum} = \frac{1}{N N_h} \sum_{p \in \mathcal{H}} \sum_{q \in \mathcal{I}} \left( R(p)^2 + R(q)^2 - 2 R(p) R(q) \right) \tag{B.11}$$
$$= \frac{1}{N N_h} \left( N \sum_{p \in \mathcal{H}} R(p)^2 + N_h \sum_{q \in \mathcal{I}} R(q)^2 - 2 \sum_{p \in \mathcal{H}} R(p) \sum_{q \in \mathcal{I}} R(q) \right) \tag{B.12}$$

si-hum can further be divided into the sum of two error measures:

si-intra measures si-RMSE within $\mathcal{H}$, or human accuracy independent of the environment:

$$\text{si-intra} = \frac{1}{N_h^2} \sum_{p \in \mathcal{H}} \sum_{q \in \mathcal{H}} (R(p) - R(q))^2 \tag{B.13}$$
$$= \frac{2}{N_h^2} \left( N_h \sum_{p \in \mathcal{H}} R(p)^2 - \sum_{p \in \mathcal{H}} R(p) \sum_{q \in \mathcal{H}} R(q) \right) \tag{B.14}$$

si-inter measures si-RMSE between pixels in $\mathcal{H}$ and in $\mathcal{E}$, or human accuracy w.r.t. the environment:

$$\text{si-inter} = \frac{1}{N_e N_h} \sum_{p \in \mathcal{H}} \sum_{q \in \mathcal{E}} \left( R(p)^2 + R(q)^2 - 2 R(p) R(q) \right) \tag{B.15}$$
$$= \frac{1}{N_e N_h} \left( N_e \sum_{p \in \mathcal{H}} R(p)^2 + N_h \sum_{q \in \mathcal{E}} R(q)^2 - 2 \sum_{p \in \mathcal{H}} R(p) \sum_{q \in \mathcal{E}} R(q) \right) \tag{B.16}$$
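To illustrate the payoff of these expansions, the following is a minimal NumPy sketch that evaluates si-full via Eq. B.8 in time linear in the number of valid pixels; the other four metrics follow the same pattern with the appropriate regions. The function name and flat-array inputs are illustrative assumptions.

```python
import numpy as np

def si_full(log_d_pred, log_d_gt):
    """si-full (Eq. B.8) in time linear in the number of valid pixels.

    log_d_pred, log_d_gt: 1D arrays of log predicted / ground-truth depth
    over valid pixels of the full image region I.
    """
    r = log_d_pred - log_d_gt   # R(p) = log D_hat(p) - log D_gt(p)
    n = r.size
    # (2/N) * sum R^2 - (2/N^2) * (sum R)^2, from expanding the pairwise sum.
    return (2.0 / n) * np.sum(r**2) - (2.0 / n**2) * np.sum(r)**2
```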
B.3 Network Architecture

Figure B.1: Network Architecture. Each block with a different color (id) in (a) indicates a convolutional layer. The block labeled H indicates a $3 \times 3$ convolutional layer, and all other blocks are implemented as a variant of an Inception module [344], as shown in (b). Parameters for each type of layer are shown in (c). We use bilinear interpolation to upsample features in the network. Figures modified from Chen et al. [63].

Our network architecture is a variant of the hourglass network proposed by Chen et al. [63], and is shown in Figure B.1. Specifically, our network has a standard encoder-decoder U-Net structure with matching input and output resolution, consisting of approximately 5M parameters. In addition, an Inception module [344] is used in each convolutional layer of the network. We replaced the nearest-neighbor upsampling layers with bilinear upsampling layers, which we found produced sharper depth maps while slightly improving overall accuracy.

APPENDIX C
CHAPTER 4 APPENDIX

C.1 Hyperparameter Settings

For all experiments, we set our hyperparameters as follows. For the overall energy function defined in Equation 3 of Chapter 4, we set $w_1 = 1$, $w_2 = 6$, and $w_3 = 2$. For Equation 8 describing the affinity between pixels, we define a covariance matrix $\Sigma$ between reflectance feature vectors $f_p$ and $f_q$ as follows: we set $\Sigma$ to be a diagonal matrix for simplicity, and define $\Sigma = \mathrm{diag}(0.1^2, 0.1^2, 0.1^2, 0.025^2, 0.025^2)$. Lastly, for Equations 11 and 12 relating to shading smoothness, we set $\lambda_{med} = 20$ and $\lambda_{med} = 4$.

C.2 All-Pairs Weighted Least Squares (APWLS)

In this section, we provide a detailed derivation of our proposed All-Pairs Weighted Least Squares computation (APWLS), as described in Section 5.5 of Chapter 4. Suppose that we have an image sequence with $m$ images, and each image has $n$ pixels. Now suppose each image $I^i$ is associated with two matrices $P^i$ and $Q^i$ and two predictions $X^i$ and $Y^i$. We can then write APWLS as

$$\text{APWLS} = \sum_{i=1}^{m} \sum_{j=1}^{m} \| P^i \otimes Q^j \otimes (X^i - Y^j) \|_F^2 \tag{C.1}$$
$$= \sum_{p=1}^{n} \sum_{i=1}^{m} \sum_{j=1}^{m} \left( P^i_p\, Q^j_p\, (X^i_p - Y^j_p) \right)^2 \tag{C.2}$$
$$= \sum_{p=1}^{n} \sum_{i=1}^{m} \sum_{j=1}^{m} (P^i_p)^2 (Q^j_p)^2 (X^i_p - Y^j_p)^2 \tag{C.3}$$
$$= \sum_{p=1}^{n} \sum_{i=1}^{m} (P^i_p)^2 \left( \sum_{j=1}^{m} (Q^j_p)^2 (X^i_p)^2 + \sum_{j=1}^{m} (Q^j_p)^2 (Y^j_p)^2 - 2 X^i_p \sum_{j=1}^{m} (Q^j_p)^2 Y^j_p \right) \tag{C.4}$$
$$= \mathbf{1}^\top \left( \Sigma_{Q^2} \otimes \Sigma_{P^2 X^2} + \Sigma_{P^2} \otimes \Sigma_{Q^2 Y^2} - 2\, \Sigma_{P^2 X} \otimes \Sigma_{Q^2 Y} \right) \mathbf{1} \tag{C.5}$$

where $\Sigma_{Q^2} = \sum_{i=1}^{m} Q^i \otimes Q^i$; $\Sigma_{P^2} = \sum_{i=1}^{m} P^i \otimes P^i$; $\Sigma_{P^2 X^2} = \sum_{i=1}^{m} P^i \otimes P^i \otimes X^i \otimes X^i$; $\Sigma_{Q^2 Y^2} = \sum_{i=1}^{m} Q^i \otimes Q^i \otimes Y^i \otimes Y^i$; $\Sigma_{P^2 X} = \sum_{i=1}^{m} P^i \otimes P^i \otimes X^i$; and $\Sigma_{Q^2 Y} = \sum_{i=1}^{m} Q^i \otimes Q^i \otimes Y^i$.

C.3 Additional details for SAW evaluation metrics

In this section, we reiterate the two improvements we made to the metric used to evaluate results on SAW annotations (described in Section 6.2 of Chapter 4) and provide more detailed explanations.

First, the original SAW error metric, as described by Kovacs et al. [185], is based on classifying a pixel $p$ as having smooth/nonsmooth shading based on the gradient magnitude of the predicted shading image, $\|\nabla S\|_2$, normalized to the range $[0, 1]$. Instead, we measure the gradient magnitude in the log domain. We do this because of the scale ambiguity inherent to shading and reflectance, and because it is possible to have very bright values in the shading channel (e.g., due to strong sunlight); in such cases, if we normalize shading to $[0, 1]$, then most of the resulting values will be close to 0. In contrast, computing the gradient magnitude of log shading, $\|\nabla \log S\|_2$, achieves scale invariance, resulting in fairer comparisons for all methods. As in [185], we sweep a threshold $\tau$ to create a precision-recall (PR) curve that captures how well each method captures smooth and non-smooth shading.

Second, Kovacs et al. [185] apply a $10 \times 10$ maximum filter to the shading gradient magnitude image before computing PR curves, because many shadow boundary annotations are not precisely localized. However, this maximum filter can result in degraded performance for smooth shading regions. Consider adding 1% salt-and-pepper noise to the shading estimate: applying a maximum filter to this noisy gradient magnitude image would make it seem as if there are large changes everywhere. Moreover, we found that several annotated smooth regions are close to the boundaries of shading changes caused by depth/normal discontinuities, and if we apply a maximum filter, we might integrate incorrect shading information from outside the annotated regions into our evaluation. Instead, we create two maps: the original $\|\nabla \log S\|_2$, and a $10 \times 10$ maximum-filtered version of $\|\nabla \log S\|_2$, which we denote $\|\nabla \log S\|_2^{max}$. We use $\|\nabla \log S\|_2$ to classify smooth shading annotations and $\|\nabla \log S\|_2^{max}$ to classify non-smooth annotations.
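A minimal sketch of how the two maps might be constructed is shown below, assuming a NumPy shading image and SciPy's maximum filter; the function name, the epsilon guard, and the use of finite differences via np.gradient are illustrative assumptions rather than the exact evaluation code.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def shading_gradient_maps(shading, eps=1e-6):
    """Build the two gradient-magnitude maps used by the modified SAW metric.

    Returns ||grad log S||_2 (for classifying smooth-shading annotations)
    and its 10x10 max-filtered version (for non-smooth annotations).
    """
    log_s = np.log(shading + eps)        # log domain for scale invariance
    gy, gx = np.gradient(log_s)
    grad_mag = np.sqrt(gx**2 + gy**2)
    grad_mag_max = maximum_filter(grad_mag, size=10)
    return grad_mag, grad_mag_max
```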
APPENDIX D
CHAPTER 5 APPENDIX

D.1 Additional details for training losses

D.1.1 Ordinal term for CGINTRINSICS

Recall $L_{CGI}$ (Equation 2 in Chapter 5), the loss defined for our CGINTRINSICS training data:

$$L_{CGI} = L_{sup} + \lambda_{ord} L_{ord} + \lambda_{rec} L_{reconstruct} \tag{D.1}$$

We now provide the full formula for $L_{ord}$, the ordinal loss term. In particular, for a given CGINTRINSICS training image and predicted reflectance $R$, we accumulate losses for each pair of pixels $(i, j)$ generated from a set of pixels $P$, where one pixel is sampled at random from oversegmented regions in that image:

$$L_{ord}(R) = \sum_{\substack{(i,j) \in P \times P \\ i \neq j}} f_{i,j}(R), \tag{D.2}$$

where

$$f_{i,j}(R) = \begin{cases} (\log R_i - \log R_j)^2, & -\tau_1 < P^*_{i,j} < \tau_1 \\ \left( \max(0,\, \tau_2 - \log R_i + \log R_j) \right)^2, & P^*_{i,j} > \tau_2 \\ \left( \max(0,\, \tau_2 - \log R_j + \log R_i) \right)^2, & P^*_{i,j} < -\tau_2 \\ 0, & \text{otherwise} \end{cases} \tag{D.3}$$

where $R^*$ is the rendered ground truth reflectance, $P^*_{i,j} = \log R^*_i - \log R^*_j$, $\tau_1 = \log(1.05)$, and $\tau_2 = \log(1.5)$. The intuition is that we categorize pairs of ground truth reflectances at pixels $(i, j)$ as having an “equal,” “greater than,” or “less than” relationship, and then add a penalty if the predicted reflectances at those pixels do not satisfy the same relationship.

D.1.2 Additional hyperparameter settings

In all experiments described in Chapter 5, we set $\lambda_{IIW} = \lambda_{SAW} = 2$, $\lambda_{ord} = \lambda_{rs} = \lambda_{S/NS} = 1$, $\lambda_{rec} = 2$, and $\lambda_{ss} = 4$. The number of image scales is $L = 4$. The margin in Equation 7 of Chapter 5 is $m = 0.425$. For simplicity, the covariance matrix $\Sigma$ defined in $L_{rsmooth}$ (Equation 10 in Chapter 5) is a diagonal matrix, defined as $\Sigma = \mathrm{diag}(\sigma_p^2, \sigma_p^2, \sigma_I^2, \sigma_c^2, \sigma_c^2)$, where $\sigma_p = 0.1$, $\sigma_I = 0.12$, $\sigma_c = 0.03$.

APPENDIX E
CHAPTER 6 APPENDIX

E.1 Priors on the Plenoptic Function

Our scene representation and approach to training are motivated by simple priors on structure in the plenoptic function. Given that our crowdsampling of a scene is unpredictable and unregistered in time, we focus primarily on periodic changes in reflectance and illumination; most notably, this includes how the appearance of a scene changes from day to night. For most scenes, this change is dominated by the motion of the sun. However, we also see scenes where, for example, visible lights turn off and on throughout the day (e.g., a cityscape or an attraction that lights up at night). Below we describe two priors that motivate the design of our representation and training.

E.1.1 Constant Visibility and Light Field Gradients

Even in non-Lambertian scenes with changing reflectance and illumination, we may expect the structure of visibility to remain relatively constant over time. If we consider slices of the plenoptic function at different times, we can think of this expectation in terms of gradients in the respective light fields. To see this, consider that every scene point corresponds to some 4D hyperplane in the light field. If the light transport function around our point is smooth, then we can expect that this hyperplane will be locally constant. Gradients then primarily occur at boundaries between hyperplanes, which include occlusion boundaries and edges in reflectance (e.g., surface texture) or illumination (e.g., shadows). The structure of visibility in a scene determines the adjacency of these hyperplanes, thereby limiting the set of gradients that can be introduced by changing reflectance or illumination. Our approach leverages this prior that different slices of the plenoptic function share visibility structure by fixing the alpha values of our DeepMPI. Recall that each voxel of an MPI can be interpreted as a floating semi-transparent surface point. This corresponds to a constant hyperplane in our reconstructed light field, just as an analogous real surface point would. Fixing the alphas determines the visibility of such points, and therefore the adjacency of their corresponding hyperplanes in the reconstructed light field.
E.1.2 Common Light Sources, Material Properties, and Normals

We can think of the light transport function around every point in our scene as mapping incoming light to outgoing light. The appearance of a point in a particular image is then a sample from this transport function. Without explicitly modeling the transport function, we can reason about correlations among the samples provided by different scene points and across different viewing conditions. For example, it is reasonable to expect that many visible points in a given scene will share the same material properties, and that the relationship between surface normals at different points will remain constant (this is true so long as surface geometry does not change). Furthermore, we can expect correlation due to different points being lit by the same source (often, the sun). We learn how to leverage these many sources of correlation by training our DeepMPI with feature vectors attached to each voxel. Intuitively, this creates a latent space where surface points with highly correlated appearance end up with similar feature vectors, preserving important correlations in our generated MPIs.

E.2 Scene Statistics

Table E.1 shows statistics for each scene, including the number of valid images, the field of view (FoV) of the reference DeepMPI, and the depths of the near and far MPI planes. We adopt the same method as that of Zhou et al. [418] to estimate the scale of each scene in order to set the near and far plane depths. All data, including original images, registered poses, and SfM reconstructions, will be released to the research community.

Table E.1: Scene statistics. We include (1) the total number of images, (2) the field of view (FoV) of the reference DeepMPI, and (3) the depths of the near and far MPI planes. The first five scenes are used for evaluation in Chapter 6.

Scene              # images   FoV (°)   (near/far) plane depth
Trevi Fountain       3453       70        1/4
Sacre Coeur          2112       65        1/20
The Pantheon         1917       65        1/25
Top of the Rock      2232       75        1/75
Piazza Navona         606       70        1/25
Mount Rushmore       3075       30        1/4
Lincoln Memorial     2582       45        1/4
Eiffel Tower         1999       65        1/20

E.3 Losses

E.3.1 Losses optimizing DeepMPI color and α planes

Recall that in Section 3.3 of Chapter 6, we compare the rendered base color image $\hat{B}^k$ and the real photo $I^k$ at the target viewpoint using a reconstruction loss $L_{recon}$. $L_{recon}$ consists of a pixel-wise $\ell_1$ loss and a multi-scale gradient consistency loss [203]:

$$L_{recon} = \| \hat{B}^k(0) - I^k(0) \|_{1,1} + w_{grad} \sum_{s=1}^{S} \| \nabla \hat{B}^k(s) - \nabla I^k(s) \|_{1,1}, \tag{E.1}$$

where $S$ is the number of scales we create for calculating the gradient consistency loss, and $I(s)$ is the image at scale $s$ (where $s = 0$ is equivalent to the original resolution). In our experiments, we set $S = 3$ and use nearest-neighbor downsampling to create image pyramids for both rendered and ground truth images.

E.3.2 Training Losses

Recall that in Section 3.4 of Chapter 6, to train the rendering network $G$, the appearance encoder $E$, and the latent features in the DeepMPI, we compute losses between output views and ground-truth exemplar views. Specifically, our training loss is composed of three terms:

$$L = L_{VGG} + w_{GAN} L_{GAN} + w_{style} L_{style}, \tag{E.2}$$

$L_{VGG}$ is a normalized VGG perceptual loss similar to that used in [418, 62]:

$$L_{VGG} = \sum_l w_l \| \phi_l(\hat{I}^k) - \phi_l(I^k) \|_1, \tag{E.3}$$

where $\phi_l(x)$ indicates the output of VGG layer $l \in \{$conv1_2, conv2_2, conv3_2, conv4_2, conv5_2$\}$ with input $x$, and the weight $w_l$ is proportional to the reciprocal of the number of neurons in the corresponding VGG layer.
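Stepping back to Eq. E.1 above, here is a minimal PyTorch sketch of the reconstruction loss, assuming NCHW tensors, nearest-neighbor pyramids as described, and finite differences for the image gradients; the function and parameter names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def recon_loss(b_pred, img, num_scales=3, w_grad=0.25):
    """Reconstruction loss of Eq. E.1: pixel-wise l1 at full resolution plus
    a multi-scale l1 gradient consistency term over nearest-neighbor pyramids.

    b_pred, img: (N, 3, H, W) rendered base color and ground-truth photo.
    """
    def grads(x):
        gx = x[..., :, 1:] - x[..., :, :-1]   # horizontal finite differences
        gy = x[..., 1:, :] - x[..., :-1, :]   # vertical finite differences
        return gx, gy

    loss = torch.mean(torch.abs(b_pred - img))
    for s in range(1, num_scales + 1):
        p = F.interpolate(b_pred, scale_factor=0.5**s, mode="nearest")
        g = F.interpolate(img, scale_factor=0.5**s, mode="nearest")
        for gp, gg in zip(grads(p), grads(g)):
            loss = loss + w_grad * torch.mean(torch.abs(gp - gg))
    return loss
```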
Furthermore, we add an adversarial loss $L_{GAN}$ to improve the realism of the rendered images. In particular, $L_{GAN}$ is computed from multi-scale discriminators [367] with an objective similar to LSGAN [227]:

$$L_{GAN} = L_{GAN}(D) + L_{GAN}(G), \tag{E.4}$$
$$L_{GAN}(D) = \mathbb{E}_{I^k \sim p(I)} \left[ \left( D(I^k) - 1 \right)^2 \right] + \mathbb{E}_{z \sim p_z(I^k)} \left[ D\!\left( G(D^k, z) \right)^2 \right], \tag{E.5}$$
$$L_{GAN}(G) = \mathbb{E}_{z \sim p_z(I^k)} \left[ \left( D\!\left( G(D^k, z) \right) - 1 \right)^2 \right], \tag{E.6}$$

where $L_{GAN}(D)$ is the loss for the discriminator, and $L_{GAN}(G)$ is the loss for our neural renderer. To further enforce that the appearance of the rendered images matches the appearance of the exemplar images, we add a style loss $L_{style}$, which compares the $\ell_{1,1}$ norm of the difference between Gram matrices constructed from VGG features at different layers:

$$L_{style} = \sum_l \| g(\phi_l(\hat{I}^k)) - g(\phi_l(I^k)) \|_{1,1}, \tag{E.7}$$

where $g(x)$ is the Gram matrix of a VGG feature $x$.

E.4 Network Architecture

Let $D$ and $C$ denote the number of depth planes and channels in our DeepMPI, and let $H^k$ and $W^k$ denote the height and width of the view at target viewpoint $c^k$.

Appearance encoder. Our appearance encoder consists of two encoders, denoted $E_1$ and $E_2$. $E_1$ takes as input a reference feature buffer $\Phi_{rs}$ with a fixed resolution of $512 \times 512$, and produces a latent feature vector $z_1 \in \mathbb{R}^{512}$. We adopt the encoder implemented by Park et al. [271] for $E_1$. $E_2$ takes as input an exemplar image $I^s$ with varying aspect ratios, and produces a latent feature vector $z_2 \in \mathbb{R}^{256}$. We adopt the encoder from Huang et al. [143] as $E_2$. The two latent vectors are then passed through a fully connected layer in order to produce the final latent appearance vector $z \in \mathbb{R}^{16}$ we describe in Chapter 6.

Figure E.1: Visual examples of reference base color images. These are over-composited from base color planes of the reference DeepMPI.

Neural renderer. We adopt a U-Net modified from Zhu et al. [423] as our neural rendering network. In summary, during training and evaluation, we feed the DeepMPI at the target viewpoint, with size $D \times C \times H^k \times W^k$, to the rendering network, and the network predicts RGB MPI planes with size $D \times 3 \times H^k \times W^k$. However, the rendering network operates on each depth slice of the DeepMPI at the target viewpoint independently (with size $C \times H^k \times W^k$), and predicts the corresponding RGB color image (with size $3 \times H^k \times W^k$). Therefore, our rendering network independently processes every depth slice of the DeepMPI, without considering interactions between them. Our rendering network consists of five convolutional layers in both the encoder and the decoder. Each layer of the encoder consists of a $3 \times 3$ stride-2 convolutional layer followed by Instance Normalization [352] and leaky ReLU. Each layer of the decoder consists of bilinear upsampling followed by a $3 \times 3$ convolutional layer. Adaptive Instance Normalization (AdaIN) layers [142] are embedded between the bilinear upsampling and the feature concatenation of skip connections.

Discriminator. We adopt the network architecture from Huang et al. [143] as the discriminator used for our GAN loss. In particular, the discriminator takes as input images at three scales, and predicts scores for each patch of the input image.

E.5 Training and Implementation

We implement our framework using PyTorch. In all our experiments, we empirically set the hyperparameters $w_{grad} = 0.25$, $w_{GAN} = 0.2$, and $w_{style} = 5$. We set the resolution of the reference DeepMPI to $784 \times 784$. In the first stage, we optimize the base color and α planes in the reference DeepMPI for 100 epochs in total (70 epochs in phase one, and 30 epochs in phase two) using a single Tesla T40 GPU.
We adopt the Adam optimizer [177] with an initial learning rate of $1 \times 10^{-3}$ for this optimization. In the second stage, we use 4 Tesla T40 GPUs to jointly train the rendering network $G$, the appearance encoder $E$, and the latent features $F^r$ in the DeepMPI for 50 epochs. During training, we adopt the Adam optimizer [177] and set a learning rate of $3 \times 10^{-4}$ for $E$, $G$, and $F^r$, and a learning rate of $1 \times 10^{-5}$ for the discriminator. In addition, since Internet photos have varying aspect ratios and orientations, we resize their width and height to multiples of 32, depending on each image's original aspect ratio. Due to GPU memory limits, during training we randomly crop a $256 \times 256$ patch from the resized images, and we only render a view corresponding to that patch. However, our method can render a full image with resolution up to $640 \times 480$ at inference time on a single GPU.

E.6 Visual Illustrations

Examples of mean RGB PSV and base color. Figure E.2 shows examples of the reference mean RGB color PSV at different depth layers from different scenes. These are used for initializing the base color planes, as described in Section 3.3 of Chapter 6. In addition, Figure E.1 shows estimated reference base colors, which are over-composited from the base color planes in the reference DeepMPI.

Figure E.2: Visual illustration of reference mean RGB PSV. Different images in each row indicate different depth planes of the plane sweep volume (PSV). The mean RGB images at different depth planes have different in-focus regions.

Examples of rectified RGB images. Figure E.3 shows visual examples of rectified RGB images in the feature buffer $\Phi_{rs}$ aligned with the reference viewpoint, as described in Section 3.4 of Chapter 6.

Figure E.3: Visual examples of rectified RGB images. The reference rectified images are geometrically stable and globally aligned up to disocclusion.

E.7 User Study

Tables E.2–E.4 show scores from 1104 votes (46 participants × 24 comparisons) for each of the following questions, respectively (where users are shown results from multiple algorithms to choose from):

Q1: “Which one looks most photo-realistic? e.g., which video best reproduces the details of geometry and illumination you would expect of a real-world scene?”

Table E.2: User study, share of votes on Q1.

Scene             MUNIT [143]   NRW [238]   Ours
Trevi Fountain        2%           12%       86%
Piazza Navona         6%           25%       69%
Top of the Rock       0%            8%       92%
Sacre Coeur           1%           14%       85%
The Pantheon          4%           18%       78%
Total                 3%           15%       82%

Q2: “Which one appears to be most consistent across viewpoints, with the least jitter or flicker across frames?”

Table E.3: User study, share of votes on Q2.

Scene             MUNIT [143]   NRW [238]   Ours
Trevi Fountain        1%            7%       93%
Piazza Navona         1%           28%       71%
Top of the Rock       0%            9%       91%
Sacre Coeur           1%           10%       89%
The Pantheon          0%            7%       93%
Total                 1%           11%       88%

Q3: “Which one is most faithful to the appearance of the source image? For instance, which image best resembles the illumination and shading on the building in the source image?”

Table E.4: User study, share of votes on Q3.

Scene             MUNIT [143]   NRW [238]   Ours
Trevi Fountain        4%           13%       83%
Piazza Navona         9%           30%       61%
Top of the Rock       0%            9%       91%
Sacre Coeur           3%           16%       81%
The Pantheon          4%           27%       69%
Total                 4%           19%       77%

The user study contains three sets of video comparisons and two sets of image comparisons randomly selected from each scene. Our method received the majority of votes on all questions across all five scenes.
APPENDIX F
CHAPTER 7 APPENDIX

F.1 Scene Flow Regularization Details

Recall that $L_{reg}$ is used as a regularization loss for the predicted scene flow fields, consisting of three terms with equal weights: $L_{reg} = L_{sp} + L_{temp} + L_{min}$, corresponding to spatial smoothness, temporal smoothness, and small scene flow.

Scene flow spatial smoothness [255] minimizes the weighted $\ell_1$ difference between scene flows sampled at neighboring 3D positions along each ray $\mathbf{r}_i$. In particular, the spatial smoothness term is written as:

$$L_{sp} = \sum_{\mathbf{x}_i} \sum_{\mathbf{y}_i \in \mathcal{N}(\mathbf{x}_i)} \sum_{j \in \{i \pm 1\}} w_{dist}(\mathbf{x}_i, \mathbf{y}_i)\, \| f_{i \rightarrow j}(\mathbf{x}_i) - f_{i \rightarrow j}(\mathbf{y}_i) \|_1, \tag{F.1}$$

where $\mathcal{N}(\mathbf{x}_i)$ is the set of neighboring points of $\mathbf{x}_i$ sampled along the ray $\mathbf{r}_i$, and the weights are computed from the Euclidean distance between the two points: $w_{dist}(\mathbf{x}, \mathbf{y}) = \exp(-2 \|\mathbf{x} - \mathbf{y}\|_2)$.

Scene flow temporal smoothness, inspired by Vo et al. [360], encourages 3D point trajectories to be piecewise linear. This is equivalent to minimizing the sum of the forward scene flow and backward scene flow from each sampled 3D point along the ray:

$$L_{temp} = \frac{1}{2} \sum_{\mathbf{x}_i} \| f_{i \rightarrow i+1}(\mathbf{x}_i) + f_{i \rightarrow i-1}(\mathbf{x}_i) \|_2^2 \tag{F.2}$$

Finally, we encourage the scene flow to be minimal in most of 3D space [357] by applying an $\ell_1$ regularization term to each predicted scene flow:

$$L_{min} = \sum_{\mathbf{x}_i} \sum_{j \in \{i \pm 1\}} \| f_{i \rightarrow j}(\mathbf{x}_i) \|_1 \tag{F.3}$$

F.2 Data-Driven Prior Details

Geometric consistency prior. Recall that the geometric consistency prior minimizes the reprojection error of scene-flow-displaced 3D points w.r.t. the derived 2D optical flow. Suppose $\mathbf{p}_i$ is a 2D pixel position at time $i$. The corresponding 2D pixel location in the neighboring frame at time $j$, displaced through the 2D optical flow $\mathbf{u}_{i \rightarrow j}$, can be computed as $\mathbf{p}_{i \rightarrow j} = \mathbf{p}_i + \mathbf{u}_{i \rightarrow j}$. To estimate the expected 2D point location $\hat{\mathbf{p}}_{i \rightarrow j}$ at time $j$ displaced by the predicted scene flow fields, we first compute the expected scene flow $\hat{\mathcal{F}}_{i \rightarrow j}(\mathbf{r}_i)$ and the expected 3D point location $\hat{\mathbf{X}}_i(\mathbf{r}_i)$ of the ray $\mathbf{r}_i$ through volume rendering:

$$\hat{\mathcal{F}}_{i \rightarrow j}(\mathbf{r}_i) = \int_{t_n}^{t_f} T_i(t)\, \sigma_i(\mathbf{r}_i(t))\, f_{i \rightarrow j}(\mathbf{r}_i(t))\, dt, \tag{F.4}$$
$$\hat{\mathbf{X}}_i(\mathbf{r}_i) = \int_{t_n}^{t_f} T_i(t)\, \sigma_i(\mathbf{r}_i(t))\, \mathbf{x}_i(\mathbf{r}_i(t))\, dt. \tag{F.5}$$

$\hat{\mathbf{p}}_{i \rightarrow j}$ is then computed by perspective projection of the expected 3D point location displaced by the scene flow (i.e., $\hat{\mathbf{X}}_i(\mathbf{r}_i) + \hat{\mathcal{F}}_{i \rightarrow j}(\mathbf{r}_i)$) into the viewpoint corresponding to the frame at time $j$:

$$\hat{\mathbf{p}}_{i \rightarrow j}(\mathbf{r}_i) = \pi\left( K \left( R^j \left( \hat{\mathbf{X}}_i(\mathbf{r}_i) + \hat{\mathcal{F}}_{i \rightarrow j}(\mathbf{r}_i) \right) + \mathbf{t}^j \right) \right), \tag{F.6}$$

where $(R^j, \mathbf{t}^j) \in SE(3)$ is the rigid body transformation that maps 3D points from the world coordinate system to the coordinate system of the frame at time $j$, $K$ is a camera intrinsics matrix shared among all the frames, and $\pi$ is the perspective division operation. The geometric consistency prior is applied by comparing the $\ell_1$ difference between $\hat{\mathbf{p}}_{i \rightarrow j}$ and $\mathbf{p}_{i \rightarrow j}$:

$$L_{geo} = \sum_{\mathbf{r}_i} \sum_{j \in \mathcal{N}(i)} \| \hat{\mathbf{p}}_{i \rightarrow j}(\mathbf{r}_i) - \mathbf{p}_{i \rightarrow j}(\mathbf{r}_i) \|_1. \tag{F.7}$$

Single-view depth prior. The single-view depth prior encourages the expected termination depth $\hat{Z}_i$ computed along each ray to be close to the depth $Z_i$ predicted by a pre-trained single-view depth network [281]. As single-view depth predictions are defined up to an unknown scale and shift, we utilize a robust scale-shift-invariant loss [281]:

$$L_z = \sum_{\mathbf{r}_i} \| \hat{Z}^*_i(\mathbf{r}_i) - Z^*_i(\mathbf{r}_i) \|_1 \tag{F.8}$$

We normalize the depths to have zero translation and unit scale using robust estimators:

$$Z^*(\mathbf{r}_i) = \frac{Z(\mathbf{r}_i) - \mathrm{shift}(Z)}{\mathrm{scale}(Z)}, \quad \mathrm{shift}(Z) = \mathrm{median}(Z), \quad \mathrm{scale}(Z) = \mathrm{mean}(|Z - \mathrm{shift}(Z)|). \tag{F.9}$$

Due to computational limits, we are not able to normalize the entire depth image during training, so we normalize the depth values using the shift and scale estimated from the currently sampled points in each training iteration.
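The robust normalization of Eq. F.9 is simple to implement; below is a minimal PyTorch sketch applied to the depths sampled in one training iteration. The function name and the small epsilon guard are illustrative assumptions.

```python
import torch

def normalize_depth(z):
    """Robust scale-and-shift normalization of Eq. F.9.

    z: 1D tensor of (expected or single-view-predicted) depths at the rays
    sampled in the current training iteration.
    """
    shift = torch.median(z)
    scale = torch.mean(torch.abs(z - shift))
    return (z - shift) / (scale + 1e-8)   # eps guards a degenerate scale
```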
Furthermore, since we reconstruct the entire scene in normalized device coordinate (NDC) space, and the MiDaS model [281] predicts disparity in Euclidean space with an unknown scale and shift, we can use the NDC ray space derivation from NeRF [241] to show that depth in NDC space is equal to negative disparity in Euclidean space up to scale and shift, so our single-view term is implemented as:

$$L_z = \sum_{\mathbf{r}_i} \left\| \hat{Z}^*_i(\mathbf{r}_i) + \left( \frac{1}{Z_i} \right)^{\!*}\!(\mathbf{r}_i) \right\|_1 \tag{F.10}$$

Figure F.1: Space-time view synthesis. We propose a 3D splatting-based approach to perform space-time interpolation at a specified target viewpoint (shown as a green camera) at an intermediate time $i + \delta_i$. Specifically, we sweep a plane over every ray $\mathbf{r}$ emitted from the specified target viewpoint, from front to back. At each sampled step $t$ along the ray, we query the color and density information $(\mathbf{c}, \alpha)$, and the scene flows at times $i$ and $i + 1$. We then displace the 3D points along the ray by the scaled scene flows $\delta_i f_{i \rightarrow i+1}$ and $(1 - \delta_i) f_{i+1 \rightarrow i}$, respectively (left). The 3D displaced points are then splatted from times $i$ and $i + 1$ onto a $(\mathbf{c}, \alpha)$ accumulation buffer at the target viewpoint, and the splats are blended with linear weights $1 - \delta_i$ and $\delta_i$ (middle). The final rendered view is obtained by volume rendering the accumulation buffer (right).

F.3 Space-Time Interpolation Visualization

In Sec. 3.4 of Chapter 7, we propose a splatting-based plane-sweep volume tracing approach to perform space-time interpolation, synthesizing novel views at novel viewpoints and in between input time indices. We show a visual illustration of this in Fig. F.1. In practice, we use the CUDA implementation of average splatting from Niklaus et al. [258] to efficiently perform forward splatting of the 3D points through the scene flow fields.

F.4 Volume Rendering Equation Approximation

Recall that in Sec. 3.3, the combined rendering equation is written as:

$$\hat{C}^{cb}_i(\mathbf{r}_i) = \int_{t_n}^{t_f} T^{cb}_i(t)\, \sigma^{cb}_i(t)\, \mathbf{c}^{cb}_i(t)\, dt, \tag{F.11}$$

where $\sigma^{cb}_i(t)\, \mathbf{c}^{cb}_i(t)$ is a linear combination of the static scene components $\mathbf{c}(\mathbf{r}_i(t), \mathbf{d}_i)\, \sigma(\mathbf{r}_i(t))$ and the dynamic scene components $\mathbf{c}_i(\mathbf{r}_i(t), \mathbf{d}_i)\, \sigma_i(\mathbf{r}_i(t))$, weighted by $v(\mathbf{r}_i(t))$:

$$\sigma^{cb}_i(t)\, \mathbf{c}^{cb}_i(t) = v(t)\, \mathbf{c}(t)\, \sigma(t) + (1 - v(t))\, \mathbf{c}_i(t)\, \sigma_i(t). \tag{F.12}$$

We approximate this combined rendering equation using the same quadrature approximation technique described in prior work [230, 241]. Suppose $\{t_l\}_{l=1}^{L}$ are the points sampled within the near and far bounds, and denote the distance between consecutive sampled points as $\delta_l = t_{l+1} - t_l$; the discrete approximation of Eq. F.11 is then written as:

$$\hat{C}^{cb}_i(\mathbf{r}_i) = \sum_{l=1}^{L} T^{cb}_i(t_l) \left( v(t_l)\, \alpha(\sigma(t_l) \delta_l)\, \mathbf{c}(t_l) + (1 - v(t_l))\, \alpha(\sigma_i(t_l) \delta_l)\, \mathbf{c}_i(t_l) \right),$$
$$\text{where } T^{cb}_i(t_l) = \exp\left( - \sum_{l'=1}^{l-1} \left( v(t_{l'})\, \sigma(t_{l'}) + (1 - v(t_{l'}))\, \sigma_i(t_{l'}) \right) \delta_{l'} \right) \text{ and } \alpha(x) = 1 - \exp(-x). \tag{F.13}$$
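For concreteness, here is a minimal PyTorch sketch of the discrete combined rendering in Eq. F.13 for a batch of rays, assuming the per-sample static and dynamic densities, colors, and blending weights have already been queried from the two MLPs; the function and argument names are illustrative assumptions.

```python
import torch

def render_combined(sigma_st, c_st, sigma_dy, c_dy, v, deltas):
    """Discrete combined volume rendering of Eq. F.13.

    sigma_st, sigma_dy: (R, L) static / dynamic densities along each ray.
    c_st, c_dy:         (R, L, 3) static / dynamic colors.
    v:                  (R, L) blending weights of the static component.
    deltas:             (R, L) distances between consecutive samples.
    """
    alpha_st = 1.0 - torch.exp(-sigma_st * deltas)   # alpha(sigma * delta)
    alpha_dy = 1.0 - torch.exp(-sigma_dy * deltas)
    # Combined transmittance T_i^cb(t_l): exp of minus the cumulative sum
    # of blended densities over the preceding samples l' < l.
    tau = (v * sigma_st + (1.0 - v) * sigma_dy) * deltas
    shifted = torch.cat([torch.zeros_like(tau[:, :1]), tau[:, :-1]], dim=1)
    trans = torch.exp(-torch.cumsum(shifted, dim=1))
    w_st = trans * (v * alpha_st)           # static per-sample weights
    w_dy = trans * ((1.0 - v) * alpha_dy)   # dynamic per-sample weights
    rgb = (w_st[..., None] * c_st + w_dy[..., None] * c_dy).sum(dim=1)
    return rgb
```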
F.5 Network Architecture

Our network architecture is a variant of the original NeRF, which adopts MLPs as a backbone. Our full model consists of two separate MLPs, corresponding to a static (time-independent) scene representation (Fig. F.2) and a dynamic (time-dependent) scene representation (Fig. F.3).

Figure F.2: Network architecture of the static (time-invariant) scene representation. Modified from the original NeRF architecture diagram. We predict an extra blending weight field $v$ from the intermediate features, along with the opacity $\sigma$.

Figure F.3: Network architecture of the dynamic (time-variant) scene representation. Modified from the original NeRF architecture diagram. We encode and input time indices $i$ into the MLP and predict time-dependent scene flow fields $\mathcal{F}_i$ and disocclusion weight fields $\mathcal{W}_i$ from the intermediate features, along with the opacity $\sigma_i$.

F.6 Implementation Details

Initialization. We denote the initialization stage as the first $1000N$ iterations of training, where $N$ is the number of training views. To warm up the optimization, during the initialization stage we compute all the temporal losses only in a temporal window of size 3, i.e., $j \in \{i, i \pm 1\}$, and switch to a temporal window of size 5, i.e., $j \in \mathcal{N}(i) = \{i, i \pm 1, i \pm 2\}$, after the initialization stage. Additionally, as both of the data-driven priors are noisy (in that they rely on inaccurate or incorrect predictions), we use them for initialization only, and linearly decay the weight of $L_{data}$ to zero during training over a fixed number of iterations. In particular, we linearly decrease the weight by a factor of 10 every $1000N$ iterations.

Hard mining sampling. Optionally, to sufficiently initialize the depth and scene flows of small, fast-moving objects such as the limbs of a person, we precompute a coarse binary motion segmentation mask for each frame, and sample an additional 512 points from the motion mask regions during the initialization stage. These additionally sampled points are added to the loss used in our dynamic (time-variant) scene representation. Similar to prior work [379], we compute the above coarse binary motion segmentation using a combination of physical and semantic estimates of rigidity. In particular, the physical mask marks pixels where the distance between the optical flow [146] at each pixel and its corresponding epipolar line from the neighboring frame at time $j \in \mathcal{N}(i)$ is greater than 1 pixel, and the semantic mask is computed using an off-the-shelf instance segmentation network [131] to label all pixels corresponding to possibly moving objects such as people and animals. Finally, we take the union of the two masks, followed by morphological dilation, to obtain the final binary mask. Note that this coarse motion segmentation is mainly used to slightly increase the number of samples for the data-driven priors during initialization, and does not need to be very accurate.

Hyperparameters and evaluation details. We implement our framework using PyTorch. We empirically set $\beta_{cyc} = 1$, $\beta_{data} = 0.04$, and $\beta_{reg} = 0.1$ in our experiments. When we evaluate the baselines, we resize their rendered images to the same size as our rendered images before performing the evaluation. In addition, recall that we evaluate all the methods quantitatively both on entire scenes and in dynamic regions only. In order to accurately determine which regions are moving, we compute ground truth dynamic masks for numerical evaluation (Dynamic Only, described in Chapter 7) from the multi-view videos through optical flow between two consecutive time instances at the same viewpoint, and segment out the dynamic regions where the flow magnitude is larger than one pixel.

BIBLIOGRAPHY

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.

[2] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. Trans. on Pattern Analysis and Machine Intelligence, 34(11), 2012.

[3] Edward H. Adelson and James R. Bergen.
The plenoptic function and the elements of early vision. In Computational Models of Visual Processing, pages 3–20. MIT Press, 1991.

[4] Edward H Adelson, James R Bergen, et al. The plenoptic function and the elements of early vision, volume 2. Vision and Modeling Group, Media Laboratory, Massachusetts Institute of . . . , 1991.

[5] Edward H Adelson and Alex P Pentland. The perception of shading and reflectance. Perception as Bayesian inference, pages 409–423, 1996.

[6] Sadia Afroz and Rachel Greenstadt. Phishzoo: Detecting phishing websites by looking at them. In 2011 IEEE Fifth International Conference on Semantic Computing, pages 368–375. IEEE, 2011.

[7] A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, and H. Wallach. A reductions approach to fair classification. ArXiv, abs/1803.02453, 2018.

[8] Sameer Agarwal, Noah Snavely, Ian Simon, Steven M. Seitz, and Richard Szeliski. Building Rome in a day. In Proc. Int. Conf. on Computer Vision (ICCV), 2009.

[9] Shruti Agarwal, Hany Farid, Tarek El-Gaaly, and Ser-Nam Lim. Detecting deepfake videos from appearance and behavior. In 2020 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–6. IEEE, 2020.

[10] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pages 274–283. PMLR, 2018.

[11] Kumar Ayush, Burak Uzkent, Marshall Burke, David Lobell, and Stefano Ermon. Efficient poverty mapping using deep reinforcement learning. arXiv preprint arXiv:2006.04224, 2020.

[12] Ho Bae, Jaehee Jang, Dahuin Jung, Hyemi Jang, Heonseok Ha, and Sungroh Yoon. Security and privacy issues in deep learning. arXiv preprint arXiv:1807.11655, 2018.

[13] Mohammad Haris Baig and Lorenzo Torresani. Coupled depth learning. In Proc. Winter Conf. on Computer Vision (WACV), 2016.

[14] Aayush Bansal, Minh Vo, Yaser Sheikh, Deva Ramanan, and Srinivasa Narasimhan. 4D visualization of dynamic events from unconstrained multi-view videos. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5366–5375, 2020.

[15] Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In Proc. Computer Vision and Pattern Recognition (CVPR), June 2019.

[16] Solon Barocas and Andrew D Selbst. Big data's disparate impact. Calif. L. Rev., 104:671, 2016.

[17] Jonathan T Barron, Andrew Adams, YiChang Shih, and Carlos Hernández. Fast bilateral-space stereo for synthetic defocus. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 4466–4474, 2015.

[18] Jonathan T Barron and Jitendra Malik. Intrinsic scene properties from a single RGB-D image. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 17–24, 2013.

[19] Jonathan T Barron and Jitendra Malik. Shape, illumination, and reflectance from shading. Trans. on Pattern Analysis and Machine Intelligence, 37(8):1670–1687, 2015.

[20] Jonathan Tilton Barron. Shapes, Paint, and Light. University of California, Berkeley, 2013.

[21] Tali Basha, Shai Avidan, Alexander Hornung, and Wojciech Matusik. Structure and motion from scene registration. In Proc. Computer Vision and Pattern Recognition (CVPR), 2012.

[22] Tali Basha, Yael Moses, and Nahum Kiryati. Multi-view scene flow estimation: A view centered variational approach. Int. J. of Computer Vision, 2013.

[23] David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Antonio Torralba.
[24] Haluk Bayram, Nikolaos Stefas, Kazim Selim Engin, and Volkan Isler. Tracking wildlife with multiple UAVs: System design, safety and field experiments. In 2017 International symposium on multi-robot and multi-agent systems (MRS), pages 97–103. IEEE, 2017.
[25] Shida Beigpour, Marc Serra, Joost van de Weijer, Robert Benavente, María Vanrell, Olivier Penacchio, and Dimitris Samaras. Intrinsic image evaluation on synthetic complex scenes. Int. Conf. on Image Processing, 2013.
[26] Sean Bell, Kavita Bala, and Noah Snavely. Intrinsic images in the wild. ACM Trans. Graphics, 33(4):159, 2014.
[27] R. Bellamy, K. Dey, M. Hind, Samuel C. Hoffman, S. Houde, K. Kannan, Pranay Lohia, J. Martino, Sameep Mehta, A. Mojsilovic, Seema Nagar, K. Ramamurthy, J. Richards, Diptikalyan Saha, P. Sattigeri, M. Singh, Kush R. Varshney, and Y. Zhang. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. ArXiv, abs/1810.01943, 2018.
[28] Mojtaba Bemana, Karol Myszkowski, Hans-Peter Seidel, and Tobias Ritschel. X-Fields: Implicit neural view-, light- and time-image interpolation. ACM Trans. Graphics, 39(6), 2020.
[29] Alex Beutel, J. Chen, T. Doshi, Hai Qian, Allison Woodruff, Christine Luu, Pierre Kreitmann, Jonathan Bischof, and Ed Huai hsin Chi. Putting fairness principles into practice: Challenges, metrics, and improvements. Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019.
[30] David Bholat. The impact of machine learning and AI on the UK economy - conference overview. Available at SSRN 3602563, 2020.
[31] S. Bi, Zexiang Xu, Pratul P. Srinivasan, Ben Mildenhall, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, D. Kriegman, and R. Ramamoorthi. Neural reflectance fields for appearance acquisition. ArXiv, abs/2008.03824, 2020.
[32] Sai Bi, Xiaoguang Han, and Yizhou Yu. An L1 image transform for edge-preserving smoothing and scene-level intrinsic decomposition. ACM Trans. Graph., 34:78:1–78:12, 2015.
[33] Sai Bi, Zexiang Xu, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David Kriegman, and Ravi Ramamoorthi. Deep reflectance volumes: Relightable reconstructions from multi-view photometric images. Proc. European Conf. on Computer Vision (ECCV), 2020.
[34] Sai Bi, Zexiang Xu, Kalyan Sunkavalli, David Kriegman, and Ravi Ramamoorthi. Deep 3d capture: Geometry and reflectance from sparse multi-view images. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5960–5969, 2020.
[35] Michael Bleyer, Christoph Rhemann, and Carsten Rother. Patchmatch stereo - stereo matching with slanted support windows. In Proc. British Machine Vision Conf. (BMVC), 2011.
[36] Anthony E Boardman. Another analysis of the EEOCC "four-fifths" rule. Management Science, 25(8):770–776, 1979.
[37] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter V. Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image. In Proc. European Conf. on Computer Vision (ECCV), pages 561–578, 2016.
[38] Nicolas Bonneel, Balazs Kovacs, Sylvain Paris, and Kavita Bala. Intrinsic decompositions for image editing. Computer Graphics Forum (Eurographics State of the Art Reports 2017), 36(2), 2017.
[39] Ivaylo Boyadzhiev, Sylvain Paris, and Kavita Bala. User-assisted image compositing for photographic lighting. ACM Trans. Graphics, 32:36:1–36:12, 2013.
[40] Aljaz Bozic, Michael Zollhofer, Christian Theobalt, and Matthias Niessner. DeepDeform: Learning non-rigid RGB-D reconstruction with semi-supervised data. In Proc. Computer Vision and Pattern Recognition (CVPR), June 2020.
[41] Ali Breland. The bizarre and terrifying case of the "deepfake" video that helped bring an African nation to the brink. Mother Jones, 2019.
[42] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017.
[43] Fabian Brickwedde, Steffen Abraham, and Rudolf Mester. Mono-SF: Multi-view geometry meets single-view depth for monocular scene flow estimation of dynamic traffic scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2780–2790, 2019.
[44] Nina Brown. Congress wants to solve deepfakes by 2020. https://slate.com/technology/congress-deepfake-regulation-230-2020.html, 2020.
[45] Michael Broxton, Jay Busch, Jason Dourgarian, Matthew DuVall, Daniel Erickson, Dan Evangelakos, John Flynn, Ryan Overbeck, Matt Whalen, and Paul Debevec. A low cost multi-camera array for panoramic light field video capture. In SIGGRAPH Asia 2019 Posters, SA '19, New York, NY, USA, 2019. Association for Computing Machinery.
[46] Michael Broxton, John Flynn, Ryan Overbeck, Daniel Erickson, Peter Hedman, Matthew Duvall, Jason Dourgarian, Jay Busch, Matt Whalen, and Paul Debevec. Immersive light field video with a layered mesh representation. ACM Trans. Graph., 39(4), July 2020.
[47] Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael Cohen. Unstructured lumigraph rendering. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 425–432, 2001.
[48] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pages 77–91. PMLR, 2018.
[49] Marshall Burke, Anne Driscoll, David Lobell, and Stefano Ermon. Using satellite imagery to understand and promote sustainable development. Technical report, National Bureau of Economic Research, 2020.
[50] TJ Burke and S Trazo. Emerging legal issues in an AI-driven world. Gowling WLG, 2018.
[51] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In Proc. European Conf. on Computer Vision (ECCV), 2012.
[52] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In Proc. European Conf. on Computer Vision (ECCV), pages 611–625, 2012.
[53] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE symposium on security and privacy (SP), pages 39–57. IEEE, 2017.
[54] Hervé Chabanne, Amaury de Wargny, Jonathan Milgram, Constance Morel, and Emmanuel Prouff. Privacy-preserving classification on deep neural network. IACR Cryptol. ePrint Arch., 2017:35, 2017.
[55] Jin-Xiang Chai, Xin Tong, Shing-Chow Chan, and Heung-Yeung Shum. Plenoptic sampling. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '00, pages 307–318, USA, 2000. ACM Press/Addison-Wesley Publishing Co.
[56] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. Int. Conf. on 3D Vision (3DV), pages 667–676, 2017.
[57] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
[58] Gaurav Chaurasia, Sylvain Duchene, Olga Sorkine-Hornung, and George Drettakis. Depth synthesis and local warps for plausible image-based navigation. ACM Trans. Graphics, 32(3):1–12, 2013.
[59] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[60] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM workshop on artificial intelligence and security, pages 15–26, 2017.
[61] Qifeng Chen and Vladlen Koltun. A simple model for intrinsic image decomposition with depth cues. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 241–248, 2013.
[62] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1511–1520, 2017.
[63] Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-image depth perception in the wild. In Neural Information Processing Systems, pages 730–738, 2016.
[64] Weifeng Chen, Donglai Xiang, and Jia Deng. Surface normals in the wild. Proc. Int. Conf. on Computer Vision (ICCV), pages 1557–1566, 2017.
[65] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.
[66] Zhang Chen, Anpei Chen, Guli Zhang, Chengyuan Wang, Yu Ji, Kiriakos N Kutulakos, and Jingyi Yu. A neural rendering framework for free-viewpoint relighting. arXiv preprint arXiv:1911.11530, 2019.
[67] Long Cheng, Fang Liu, and Danfeng Yao. Enterprise data breach: causes, challenges, prevention, and future directions. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(5):e1211, 2017.
[68] M Chessen. Encoded laws, policies, and virtues. Cornell Policy Review, 2018.
[69] Inchang Choi, Orazio Gallo, Alejandro Troccoli, Min H Kim, and Jan Kautz. Extreme view synthesis. In Proc. Int. Conf. on Computer Vision (ICCV), pages 7781–7790, 2019.
[70] N Christopher. We've just seen the first use of deepfakes in an Indian election campaign. Vice News, 2020.
[71] Iain M Cockburn, Rebecca Henderson, and Scott Stern. The impact of artificial intelligence on innovation. Technical report, National Bureau of Economic Research, 2018.
[72] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas A Funkhouser, and Matthias Niessner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5828–5839, 2017.
[73] Abe Davis, Marc Levoy, and Frédo Durand. Unstructured light fields. Comput. Graph. Forum, 31:305–314, 2012.
[74] Abe Davis, Michael Rubinstein, Neal Wadhwa, Gautham J Mysore, Fredo Durand, and William T Freeman. The visual microphone: Passive recovery of sound from video. ACM Trans. Graphics (SIGGRAPH), 2014.
[75] Paul E Debevec, Camillo J Taylor, and Jitendra Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 11–20, 1996.
[76] Tali Dekel, Michael Rubinstein, Ce Liu, and William T Freeman. On the effectiveness of visible watermarks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2146–2154, 2017.
[77] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[78] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv: Machine Learning, 2017.
[79] Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip L. Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts, Christoph Rhemann, David Kim, Jonathan Taylor, Pushmeet Kohli, Vladimir Tankovich, and Shahram Izadi. Fusion4D: Real-time performance capture of challenging scenes. ACM Trans. Graphics, 35:114:1–114:13, 2016.
[80] Kevin Drum. The AI revolution is coming—and it will take your job sooner than you think. https://www.motherjones.com/kevin-drum/2017/10/you-will-lose-your-job-to-a-robot-and-sooner-than-you-think-2/, 2018.
[81] Chris Dulhanty and A. Wong. Auditing ImageNet: Towards a model-driven framework for annotating demographic attributes of large-scale image datasets. ArXiv, abs/1905.01347, 2019.
[82] Cynthia Dwork. Differential privacy: A survey of results. In International conference on theory and applications of models of computation, pages 1–19. Springer, 2008.
[83] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226, 2012.
[84] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2650–2658, 2015.
[85] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems, pages 2366–2374, 2014.
[86] Selim Engin and Volkan Isler. Active localization of multiple targets from noisy relative measurements. In Algorithmic Foundations of Robotics XIV: Proceedings of the Fourteenth Workshop on the Algorithmic Foundations of Robotics 14, pages 398–413. Springer International Publishing, 2021.
[87] D. Erhan, Yoshua Bengio, Aaron C. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. Technical report, Université de Montréal, 2009.
[88] Stefano Ermon, Jon Conrad, Carla Gomes, and Bart Selman. Risk-sensitive policies for sustainable renewable resource allocation. In Twenty-Second International Joint Conference on Artificial Intelligence. Citeseer, 2011.
[89] SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta Garnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene representation and rendering. Science, 360(6394):1204–1210, 2018.
[90] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, Noah Snavely, and Richard Tucker. DeepView: View synthesis with learned gradient descent. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2367–2376, 2019.
[91] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. DeepStereo: Learning to predict new views from the world's imagery. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5515–5524, 2016.
[92] Jan-Michael Frahm, Pierre Fite Georgel, David Gallup, Tim Johnson, Rahul Raguram, Changchang Wu, Yi-Hung Jen, Enrique Dunn, Brian Clipp, and Svetlana Lazebnik. Building Rome on a cloudless day. In Proc. European Conf. on Computer Vision (ECCV), 2010.
[93] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1322–1333, 2015.
[94] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[95] Yasutaka Furukawa, Brian Curless, Steven M Seitz, and Richard Szeliski. Towards internet-scale multi-view stereo. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1434–1441, 2010.
[96] A Gaidon, Q Wang, Y Cabon, and E Vig. Virtual worlds as proxy for multi-object tracking analysis. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 4340–4349, 2016.
[97] Pratik Gajane. On formalizing fairness in prediction with machine learning. ArXiv, abs/1710.03184, 2017.
[98] Pratik Gajane and Mykola Pechenizkiy. On formalizing fairness in prediction with machine learning. arXiv preprint arXiv:1710.03184, 2017.
[99] Elena Garces, Adolfo Munoz, Jorge Lopez-Moreno, and Diego Gutierrez. Intrinsic images by clustering. Computer Graphics Forum (Proc. EGSR 2012), 31(4), 2012.
[100] Rahul Garg, Hao Du, Steven M Seitz, and Noah Snavely. The dimensionality of scene appearance. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1917–1924. IEEE, 2009.
[101] Ravi Garg, Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proc. European Conf. on Computer Vision (ECCV), pages 740–756, 2016.
[102] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2414–2423, 2016.
[103] Timnit Gebru, Jonathan Krause, Yilun Wang, Duyun Chen, Jia Deng, Erez Lieberman Aiden, and Li Fei-Fei. Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States. Proceedings of the National Academy of Sciences, 114(50):13108–13113, 2017.
[104] Andreas Geiger. Are we ready for autonomous driving? The KITTI Vision Benchmark Suite. In Proc. Computer Vision and Pattern Recognition (CVPR), 2012.
[105] Marcel Geppert, Viktor Larsson, Pablo Speciale, Johannes L Schönberger, and Marc Pollefeys. Privacy preserving structure-from-motion. In Proc. European Conf. on Computer Vision (ECCV), pages 333–350. Springer, 2020.
[106] Gabriel Ghinita. Understanding the privacy-efficiency trade-off in location based queries. In Proceedings of the SIGSPATIAL ACM GIS 2008 International Workshop on Security and Privacy in GIS and LBS, pages 1–5, 2008.
[107] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. In Proc. Int. Conf. on Machine Learning, pages 201–210. PMLR, 2016.
[108] Yotam I. Gingold, Ariel Shamir, and Daniel Cohen-Or. Micro perceptual human computation for visual tasks. ACM Trans. Graphics, 2012.
[109] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[110] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[111] Michael Goesele, Noah Snavely, Brian Curless, Hugues Hoppe, and Steven M Seitz. Multi-view stereo for community photo collections. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1–8, 2007.
[112] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Neural Information Processing Systems, pages 2672–2680, 2014.
[113] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[114] Steven J Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F Cohen. The lumigraph. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 43–54, 1996.
[115] Roger Grosse, Micah K Johnson, Edward H Adelson, and William T Freeman. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In Proc. Int. Conf. on Computer Vision (ICCV), 2009.
[116] Riza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense Human Pose Estimation In The Wild. Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[117] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Neural Information Processing Systems, pages 5767–5777, 2017.
[118] Chuan Guo, Jacob Gardner, Yurong You, Andrew Gordon Wilson, and Kilian Weinberger. Simple black-box adversarial attacks. In International Conference on Machine Learning, pages 2484–2493. PMLR, 2019.
[119] Trung Ha, Tran Khanh Dang, Hieu Le, and Tuan Anh Truong. Security and privacy issues in deep learning: a brief review. SN Computer Science, 1(5):1–15, 2020.
[120] Marc Habermann, Weipeng Xu, Michael Zollhoefer, Gerard Pons-Moll, and Christian Theobalt. LiveCap: Real-time human performance capture from monocular video. ACM Transactions on Graphics (TOG), 38(2):14, 2019.
[121] Nicolai Häni, Pravakar Roy, and Volkan Isler. Apple counting using convolutional neural networks. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2559–2565. IEEE, 2018.
[122] Mark Hansen, Meritxell Roca-Sales, Jonathan M Keegan, and George King. Artificial intelligence: Practice and implications for journalism. Tow Center for Digital Journalism, Columbia University, 2017.
[123] Moritz Hardt, E. Price, and Nathan Srebro. Equality of opportunity in supervised learning. In NIPS, 2016.
[124] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. arXiv preprint arXiv:1610.02413, 2016.
[125] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003.
[126] Woodrow Hartzog and Evan Selinger. Facial recognition is the perfect tool for oppression. Medium, 2018.
[127] D Harvey. Help us shape our approach to synthetic and manipulated media. Twitter, 2019.
[128] Daniel Hauagge, Scott Wehrwein, Kavita Bala, and Noah Snavely. Photometric ambient occlusion. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2515–2522, 2013.
[129] Daniel Cabrini Hauagge, Scott Wehrwein, Paul Upchurch, Kavita Bala, and Noah Snavely. Reasoning about photo collections using models of outdoor illumination. In Proc. British Machine Vision Conf. (BMVC), 2014.
[130] James Hays and Alexei A Efros. Scene completion using millions of photographs. ACM Trans. Graphics, 26(3):4–es, 2007.
[131] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2961–2969, 2017.
[132] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proc. Int. Conf. on Computer Vision (ICCV), 2015.
[133] Peter Hedman, Suhib Alsisan, Richard Szeliski, and Johannes Kopf. Casual 3d photography. ACM Trans. Graphics, 36:234:1–234:15, 2017.
[134] Peter Hedman, Julien Philip, True Price, Jan-Michael Frahm, George Drettakis, and Gabriel Brostow. Deep blending for free-viewpoint image-based rendering. ACM Trans. Graphics, 37(6):1–15, 2018.
[135] Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In Proceedings of the European Conference on Computer Vision (ECCV), pages 771–787, 2018.
[136] Ehsan Hesamifard, Hassan Takabi, and Mehdi Ghasemi. CryptoDL: Deep neural networks over encrypted data. arXiv preprint arXiv:1711.05189, 2017.
[137] Joel A Hesch and Stergios I Roumeliotis. An indoor localization aid for the visually impaired. In Proceedings 2007 IEEE International Conference on Robotics and Automation, pages 3545–3551. IEEE, 2007.
[138] Joel A Hesch and Stergios I Roumeliotis. Design and analysis of a portable indoor localization aid for the visually impaired. The International Journal of Robotics Research, 29(11):1400–1415, 2010.
[139] Derek Hoiem, Alexei A Efros, and Martial Hebert. Geometric context from a single image. In Proc. Int. Conf. on Computer Vision (ICCV), volume 1, pages 654–661, 2005.
[140] Ian P Howard. Seeing in depth, Vol. 1: Basic mechanisms. University of Toronto Press, 2002.
[141] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. DeepMVS: Learning multi-view stereopsis. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[142] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1501–1510, 2017.
[143] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proc. European Conf. on Computer Vision (ECCV), pages 172–189, 2018.
[144] Minyoung Huh, Andrew Liu, Andrew Owens, and Alexei A Efros. Fighting fake news: Image splice detection via learned self-consistency. In Proceedings of the European Conference on Computer Vision (ECCV), pages 101–117, 2018.
[145] Junhwa Hur and Stefan Roth. Self-supervised monocular scene flow estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 7396–7405, 2020.
[146] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2462–2470, 2017.
[147] Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning, pages 2137–2146. PMLR, 2018.
[148] Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Prior convictions: Black-box adversarial attacks with bandits and priors. arXiv preprint arXiv:1807.07978, 2018.
[149] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175, 2019.
[150] Matthias Innmann, Michael Zollhöfer, Matthias Niessner, Christian Theobalt, and Marc Stamminger. VolumeDeform: Real-time volumetric non-rigid reconstruction. In Proc. European Conf. on Computer Vision (ECCV), 2016.
[151] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. Int. Conf. on Machine Learning, pages 448–456, 2015.
[152] Michal Irani and Prabu Anandan. Parallax geometry of pairs of points for 3d scene analysis. In Proc. European Conf. on Computer Vision (ECCV), pages 17–30, 1996.
[153] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1125–1134, 2017.
[154] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1125–1134, 2017.
[155] Nathan Jacobs, Nathaniel Roman, and Robert Pless. Consistent temporal variations in many outdoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1–6, 2007.
[156] Wenzel Jakob. Mitsuba renderer, 2010. http://www.mitsuba-renderer.org.
[157] Michael Janner, Jiajun Wu, Tejas Kulkarni, Ilker Yildirim, and Joshua B Tenenbaum. Self-Supervised Intrinsic Image Decomposition. In Neural Information Processing Systems, 2017.
[158] Junho Jeon, Sunghyun Cho, Xin Tong, and Seungyong Lee. Intrinsic image decomposition using structure-texture separation and surface normals. In Proc. European Conf. on Computer Vision (ECCV), 2014.
[159] Hanqing Jiang, Haomin Liu, Ping Tan, Guofeng Zhang, and Hujun Bao. 3d reconstruction of dynamic scenes with multiple handheld cameras. In Proc. European Conf. on Computer Vision (ECCV), 2012.
[160] Huaizu Jiang, Deqing Sun, Varun Jampani, Zhaoyang Lv, Erik Learned-Miller, and Jan Kautz. SENSE: A shared encoder network for scene-flow estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 3195–3204, 2019.
[161] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik G. Learned-Miller, and Jan Kautz. Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 9000–9008, 2018.
[162] Nicola Jones. How to stop data centres from gobbling up the world's electricity. Nature, 561(7722):163–167, 2018.
[163] James Jordon, Jinsung Yoon, and Mihaela Van Der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. In International Conference on Learning Representations, 2018.
[164] Nima Khademi Kalantari, Ting-Chun Wang, and Ravi Ramamoorthi. Learning-based view synthesis for light field cameras. ACM Trans. Graphics, 35(6):1–10, 2016.
[165] F. Kamiran and T. Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33:1–33, 2011.
[166] Toshihiro Kamishima, S. Akaho, Hideki Asoh, and J. Sakuma. Fairness-aware classifier with prejudice remover regularizer. In ECML/PKDD, 2012.
[167] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[168] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[169] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 4401–4410, 2019.
[170] Kevin Karsch, Ce Liu, and Sing Bing Kang. Depth extraction from video using non-parametric sampling. In Proc. European Conf. on Computer Vision (ECCV), pages 775–788, 2012.
[171] Kaspersky. Deepfake and fake videos - how to protect yourself? https://usa.kaspersky.com/resource-center/threats, 2013.
[172] Steven M Kay. Fundamentals of statistical signal processing. Prentice Hall PTR, 1993.
[173] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017.
[174] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491, 2018.
[175] Iphigenia Keramitsoglou, Constantinos Cartalis, and Chris T Kiranoudis. Automatic identification of oil spills on satellite images. Environmental modelling & software, 21(5):640–652, 2006.
[176] Seungryong Kim, Kihong Park, Kwanghoon Sohn, and Stephen Lin. Unified depth prediction and intrinsic image decomposition from a single image via joint convolutional neural fields. In Proc. European Conf. on Computer Vision (ECCV), pages 143–159, 2016.
[177] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[178] Svetlana Kiritchenko and Saif M. Mohammad. Examining gender and race bias in two hundred sentiment analysis systems. In *SEM@NAACL-HLT, 2018.
[179] Felix Klose, Oliver Wang, Jean-Charles Bazin, Marcus Magnor, and Alexander Sorkine-Hornung. Sampling based scene-space video processing. ACM Transactions on Graphics (TOG), 34(4):67, 2015.
[180] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Trans. Graphics, 36(4), 2017.
[181] Philip A Knight, Daniel Ruiz, and Bora Uçar. A symmetry preserving algorithm for matrix scaling. SIAM Journal on Matrix Analysis and Applications, 35(3):931–955, 2014.
[182] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, et al. Spectre attacks: Exploiting speculative execution. In 2019 IEEE Symposium on Security and Privacy (SP), pages 1–19. IEEE, 2019.
[183] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pages 1885–1894. PMLR, 2017.
[184] Kirsten Korosec. 'Deepfake' revenge porn is now illegal in Virginia. https://techcrunch.com/2019/07/01/deepfake-revenge-porn-is-now-illegal-in-virginia/, 2019.
[185] Balazs Kovacs, Sean Bell, Noah Snavely, and Kavita Bala. Shading annotations in the wild. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 850–859, 2017.
[186] Suryansh Kumar, Yuchao Dai, and Hongdong Li. Monocular dense 3d reconstruction of a complex dynamic scene from two perspective frames. Proc. Int. Conf. on Computer Vision (ICCV), pages 4659–4667, 2017.
[187] Pierre-Yves Laffont and Jean-Charles Bazin. Intrinsic decomposition of image sequences from local temporal variations. In Proc. Int. Conf. on Computer Vision (ICCV), pages 433–441, 2015.
[188] Pierre-Yves Laffont, Adrien Bousseau, Sylvain Paris, Frédo Durand, and George Drettakis. Coherent intrinsic images from photo collections. In ACM Trans. Graphics (SIGGRAPH), 2012.
[189] Pierre-Yves Laffont, Adrien Bousseau, Sylvain Paris, Frédo Durand, and George Drettakis. Coherent intrinsic images from photo collections. ACM Trans. Graphics, 31:202:1–202:11, 2012.
[190] Preethi Lahoti, Krishna P Gummadi, and Gerhard Weikum. iFair: Learning individually fair data representations for algorithmic decision making. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1334–1345. IEEE, 2019.
[191] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In Int. Conf. on 3D Vision (3DV), pages 239–248, 2016.
[192] Edwin H Land and John J McCann. Lightness and retinex theory. JOSA, 61(1):1–11, 1971.
[193] Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J. Black, and Peter V. Gehler. Unite the people: Closing the loop between 3D and 2D human representations. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[194] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 4681–4690, 2017.
[195] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In Proc. European Conf. on Computer Vision (ECCV), pages 35–51, 2018.
[196] Jihyeon Janel Lee, Dylan Grosz, Sicheng Zheng, Burak Uzkent, Marshall Burke, David Lobell, and Stefano Ermon. Predicting livelihood indicators from crowdsourced street level images. arXiv preprint arXiv:2006.08661, 2020.
[197] K Lee and Paul Triolo. China's artificial intelligence revolution: Understanding Beijing's structural advantages. China Embraces AI, Eurasia Group, 2017.
[198] Anat Levin and Frédo Durand. Linear view synthesis using a dimensionality gap light field prior. Proc. Computer Vision and Pattern Recognition (CVPR), pages 1831–1838, 2010.
[199] Marc Levoy and Pat Hanrahan. Light field rendering. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 31–42, 1996.
[200] Bo Li, Chunhua Shen, Yuchao Dai, Anton van den Hengel, and Mingyi He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1119–1127, 2015.
[201] Yunpeng Li, Noah Snavely, Daniel P. Huttenlocher, and Pascal Fua. Worldwide pose estimation using 3D point clouds. In Large-Scale Visual Geo-Localization, pages 147–163. Springer, 2016.
[202] Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, and Noah Snavely. MannequinChallenge Dataset. https://google.github.io/mannequinchallenge/, 2019.
[203] Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, and William T Freeman. Learning the depths of moving people by watching frozen people. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 4521–4530, 2019.
[204] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. arXiv preprint arXiv:2011.13084, 2020.
[205] Zhengqi Li and Noah Snavely. Learning intrinsic image decomposition from watching the world. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[206] Zhengqi Li and Noah Snavely. Learning intrinsic image decomposition from watching the world. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[207] Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2041–2050, 2018.
[208] Zhengqi Li, Wenqi Xian, Abe Davis, and Noah Snavely. Crowdsampling the plenoptic function. In Proc. European Conf. on Computer Vision (ECCV), 2020.
[209] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Interactive image segmentation with latent diversity. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 577–585, 2018.
[210] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[211] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, et al. Meltdown: Reading kernel memory from user space. In 27th USENIX Security Symposium (USENIX Security 18), pages 973–990, 2018.
[212] Zachary Chase Lipton. The mythos of model interpretability. Queue, 16:31–57, 2018.
[213] Fayao Liu, Chunhua Shen, and Guosheng Lin. Deep convolutional neural fields for depth estimation from a single image. In Proc. Computer Vision and Pattern Recognition (CVPR), 2015.
[214] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields. Trans. on Pattern Analysis and Machine Intelligence, 38:2024–2039, 2016.
[215] Jian Liu, Mika Juuti, Yao Lu, and Nadarajah Asokan. Oblivious neural network predictions via MiniONN transformations. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 619–631, 2017.
[216] Miaomiao Liu, Mathieu Salzmann, and Xuming He. Discrete-continuous depth estimation from a single image. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 716–723, 2014.
[217] Ximeng Liu, Lehui Xie, Yaopeng Wang, Jian Zou, Jinbo Xiong, Zuobin Ying, and Athanasios V Vasilakos. Privacy and security issues in deep learning: A survey. IEEE Access, 2020.
[218] Yingqi Liu, Shiqing Ma, Yousra Aafer, W. Lee, Juan Zhai, Weihang Wang, and X. Zhang. Trojaning attack on neural networks. In NDSS, 2018.
[219] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Trans. Graphics, 38(4):65, 2019.
[220] Yunhui Long, Vincent Bindschaedler, and Carl A Gunter. Towards measuring membership privacy. arXiv preprint arXiv:1712.09136, 2017.
[221] Erika Lu, F. Cole, Tali Dekel, Weidi Xie, Andrew Zisserman, D. Salesin, W. Freeman, and M. Rubinstein. Layered neural rendering for retiming people in video. ACM Trans. Graphics (SIGGRAPH Asia), 2020.
[222] Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ACM Trans. Graph., 39(4), July 2020.
[223] Zhaoyang Lv, Kihwan Kim, Alejandro Troccoli, Deqing Sun, James M Rehg, and Jan Kautz. Learning rigidity in dynamic scenes with a moving camera for 3d motion field estimation. In Proc. European Conf. on Computer Vision (ECCV), pages 468–484, 2018.
[224] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[225] Aravindh Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5188–5196, 2015.
[226] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[227] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2794–2802, 2017.
[228] Ricardo Martin-Brualla, David Gallup, and Steven M Seitz. 3d time-lapse reconstruction from internet photos. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1332–1340, 2015.
[229] Ricardo Martin-Brualla, David Gallup, and Steven M Seitz. Time-lapse mining from internet photos. ACM Trans. Graphics, 34(4):1–8, 2015.
[230] Ricardo Martin-Brualla, N. Radwan, Mehdi S. M. Sajjadi, J. Barron, A. Dosovitskiy, and Daniel Duckworth. NeRF in the wild: Neural radiance fields for unconstrained photo collections. ArXiv, abs/2008.02268, 2020.
[231] Yasuyuki Matsushita, Stephen Lin, Sing Bing Kang, and Heung-Yeung Shum. Estimating intrinsic images from image sequences with biased illumination. In Proc. European Conf. on Computer Vision (ECCV), pages 274–286, 2004.
[232] Kevin Matzen and Noah Snavely. Scene chronology. In Proc. European Conf. on Computer Vision (ECCV), pages 615–630. Springer, 2014.
[233] Oier Mees, Andreas Eitel, and Wolfram Burgard. Choosing Smartly: Adaptive Multimodal Fusion for Object Detection in Changing Environments. In Int. Conf. on Intelligent Robots and Systems (IROS), 2016.
[234] Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera. ACM Trans. Graphics, 36:44:1–44:14, 2017.
[235] Simon Meister, Junhwa Hur, and Stefan Roth. UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. arXiv preprint arXiv:1711.07837, 2017.
[236] Dongyu Meng and Hao Chen. MagNet: A two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, pages 135–147, 2017.
[237] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proc. Computer Vision and Pattern Recognition (CVPR), 2015.
[238] Moustafa Meshry, Dan B. Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey, Noah Snavely, and Ricardo Martin-Brualla. Neural rerendering in the wild. Proc. Computer Vision and Pattern Recognition (CVPR), pages 6871–6880, 2019.
[239] Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017.
[240] Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Trans. Graphics, 38(4):1–14, 2019.
[241] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Proc. European Conf. on Computer Vision (ECCV), 2020.
[242] Himangi Mittal, Brian Okorn, and David Held. Just go with the flow: Self-supervised scene flow estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 11177–11185, 2020.
[243] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
[244] Abdolreza Mohamadi, Zahedeh Heidarizadi, Hadi Nourollahi, et al. Assessing the desertification trend using neural network classification and object-oriented techniques. J. Fac. Istanb. Univ, 66:683–690, 2016.
[245] Sharada P Mohanty, David P Hughes, and Marcel Salathé. Using deep learning for image-based plant disease detection. Frontiers in plant science, 7:1419, 2016.
[246] Payman Mohassel and Yupeng Zhang. SecureML: A system for scalable privacy-preserving machine learning. In 2017 IEEE Symposium on Security and Privacy (SP), pages 19–38. IEEE, 2017.
[247] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1765–1773, 2017.
[248] Paul Mozur. Inside China's dystopian dreams: AI, shame and lots of cameras. The New York Times, 8:2018, 2018.
[249] Luis Muñoz-González, Battista Biggio, Ambra Demontis, Andrea Paudice, Vasin Wongrassamee, Emil C Lupu, and Fabio Roli. Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 27–38, 2017.
[250] Luis Muñoz-González, Bjarne Pfitzner, Matteo Russo, Javier Carnerero-Cano, and Emil C Lupu. Poisoning attacks with generative adversarial nets. arXiv preprint arXiv:1906.07773, 2019.
[251] Raul Mur-Artal and Juan D Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
[252] Takuya Narihira, Michael Maire, and Stella X Yu. Direct intrinsics: Learning albedo-shading decomposition by convolutional regression. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2992–2992, 2015.
[253] Takuya Narihira, Michael Maire, and Stella X Yu. Learning lightness from human judgement on relative reflectance. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2965–2973, 2015.
[254] Thomas Nestmeyer and Peter V Gehler. Reflectance adaptive filtering improves intrinsic image estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[255] Richard A Newcombe, Dieter Fox, and Steven M Seitz. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proc. Computer Vision and Pattern Recognition (CVPR), 2015.
[256] Bingbing Ni, Gang Wang, and Pierre Moulin. RGBD-HuDaAct: A color-depth video database for human daily activity recognition. In Proc. ICCV Workshops, 2011.
[257] Simon Niklaus and Feng Liu. Context-aware synthesis for video frame interpolation. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1701–1710, 2018.
[258] Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5436–5445, 2020.
[259] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive convolution. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2270–2279, 2017.
[260] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In Proc. Int. Conf. on Computer Vision (ICCV), pages 261–270, 2017.
[261] Simon Niklaus, Long Mai, and Oliver Wang. Revisiting adaptive convolutions for video frame interpolation. arXiv preprint arXiv:2011.01280, 2020.
[262] Simon Niklaus, Long Mai, Jimei Yang, and F. Liu. 3d Ken Burns effect from a single image. ACM Trans. Graphics, 38:1–15, 2019.
[263] David Novotny, Diane Larlus, and Andrea Vedaldi. Learning 3d object categories by looking around them. Proc. Int. Conf. on Computer Vision (ICCV), pages 5218–5227, 2017.
[264] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. https://distill.pub/2017/feature-visualization.
[265] Tianyu Pang, Chao Du, Yinpeng Dong, and Jun Zhu. Towards robust detection of adversarial examples. arXiv preprint arXiv:1706.00633, 2017.
[266] Nicolas Papernot, Martín Abadi, Ulfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755, 2016.
[267] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.
[268] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European symposium on security and privacy (EuroS&P), pages 372–387. IEEE, 2016.
[269] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE symposium on security and privacy (SP), pages 582–597. IEEE, 2016.
[270] Hyun Soo Park, Takaaki Shiratori, Iain Matthews, and Yaser Sheikh. 3d reconstruction of a moving point from a series of 2d projections. In Proc. European Conf. on Computer Vision (ECCV), pages 158–171. Springer, 2010.
[271] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2337–2346, 2019.
[272] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
[273] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, and Kostas Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. Proc. Computer Vision and Pattern Recognition (CVPR), pages 7025–7034, 2017.
[274] Eric Penner and Li Zhang. Soft 3d reconstruction for view synthesis. ACM Trans. Graphics, 36(6):1–11, 2017.
[275] Georgios Petropoulos. The impact of artificial intelligence on employment. Praise for Work in the Digital Age, 119, 2018.
[276] NhatHai Phan, Yue Wang, Xintao Wu, and Dejing Dou. Differential privacy preservation for deep auto-encoders: an application of human behavior prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[277] Julien Philip, Michaël Gharbi, Tinghui Zhou, Alexei A Efros, and George Drettakis. Multi-view relighting using a geometry-aware network. ACM Trans. Graphics, 38(4):1–14, 2019.
[278] Francesco Pittaluga, Sanjeev J Koppal, Sing Bing Kang, and Sudipta N Sinha. Revealing scenes by inverting structure from motion reconstructions. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 145–154, 2019.
[279] Chongli Qin, James Martens, Sven Gowal, Dilip Krishnan, Krishnamurthy Dvijotham, Alhussein Fawzi, Soham De, Robert Stanforth, and Pushmeet Kohli. Adversarial robustness through local linearization. arXiv preprint arXiv:1907.02610, 2019.
[280] Ravi Ramamoorthi and Pat Hanrahan. A signal-processing framework for inverse rendering. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 117–128, 2001.
[281] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. Trans. on Pattern Analysis and Machine Intelligence, 2020.
[282] Rene Ranftl, Vibhav Vineet, Qifeng Chen, and Vladlen Koltun. Dense monocular depth estimation in complex dynamic scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
[283] Erik Reinhard, Michael Stark, Peter Shirley, and James Ferwerda. Photographic tone reproduction for digital images. In ACM Trans. Graphics (SIGGRAPH), 2002.
[284] Konstantinos Rematas, Ira Kemelmacher-Shlizerman, Brian Curless, and Steve Seitz. Soccer on your tabletop. In Proc. Computer Vision and Pattern Recognition (CVPR), June 2018.
[285] Christian Richardt, Hyeongwoo Kim, Levi Valgaerts, and Christian Theobalt. Dense wide-baseline scene flow from two handheld video cameras. In 2016 Fourth International Conference on 3D Vision (3DV), pages 276–285. IEEE, 2016.
[286] Stephan R Richter, Zeeshan Hayder, and Vladlen Koltun. Playing for benchmarks. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2232–2241, 2017.
[287] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In Proc. European Conf. on Computer Vision (ECCV), pages 102–118, 2016.
[288] Mark O Riedl. Human-centered artificial intelligence and machine learning. Human Behavior and Emerging Technologies, 1(1):33–36, 2019.
[289] Gernot Riegler and V. Koltun. Free view synthesis. Proc. European Conf. on Computer Vision (ECCV), 2020.
[290] Greg Robinson. Are we prepared for the rise of automation? https://info.aiim.org/aiim-blog/are-we-prepared-for-the-rise-of-automation, 2018.
[291] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Int. Conf. on Medical Image Computing and Computer-Assisted Intervention, pages 234–241, 2015.
[292] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. The SYNTHIA Dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 3234–3243, 2016.
[293] A. Ross, M. Hughes, and Finale Doshi-Velez. Right for the right reasons: Training differentiable models by constraining their explanations. In IJCAI, 2017.
[294] Andrew Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[295] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11, 2019.
[296] Carsten Rother, Martin Kiefel, Lumin Zhang, Bernhard Schölkopf, and Peter V Gehler. Recovering intrinsic images with a global sparsity prior on reflectance. In Neural Information Processing Systems, pages 765–773, 2011.
[297] Bita Darvish Rouhani, M Sadegh Riazi, and Farinaz Koushanfar. DeepSecure: Scalable provably-secure deep learning. In Proceedings of the 55th Annual Design Automation Conference, pages 1–6, 2018.
[298] Anirban Roy and Sinisa Todorovic. Monocular depth estimation using neural regression forest. In Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
[299] Chris Russell, Rui Yu, and Lourdes Agapito. Video pop-up: Monocular 3d reconstruction of dynamic scenes. In Proc. European Conf. on Computer Vision (ECCV), pages 583–598. Springer, 2014.
[300] Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. Scribbler: Controlling deep image synthesis with sketch and color. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5400–5409, 2017.
[301] Ashutosh Saxena, Sung H Chung, and Andrew Y Ng. Learning depth from single monocular images. In Neural Information Processing Systems, volume 18, pages 1–8, 2005.
[302] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3D: Learning 3D scene structure from a single still image. Trans. on Pattern Analysis and Machine Intelligence, 31(5), 2009.
[303] Carolin Schmitt, Simon Donné, Gernot Riegler, Vladlen Koltun, and Andreas Geiger. On joint estimation of pose, geometry and SVBRDF from a handheld scanner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3493–3503, 2020.
[304] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016.
[305] Johannes L. Schönberger, Filip Radenovic, Ondrej Chum, and Jan-Michael Frahm. From single image query to detailed 3D reconstruction. In Proc. Computer Vision and Pattern Recognition (CVPR), 2015.
[306] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In Proc. European Conf. on Computer Vision (ECCV), pages 501–518, 2016.
[307] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
[308] Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. Poison frogs! Targeted clean-label poisoning attacks on neural networks. arXiv preprint arXiv:1804.00792, 2018.
[309] Qi Shan, Riley Adams, Brian Curless, Yasutaka Furukawa, and Steven M. Seitz. The visual Turing test for scene reconstruction. Int. Conf. on 3D Vision (3DV), pages 25–32, 2013.
[310] Li Shen, Ping Tan, and Stephen Lin. Intrinsic image decomposition with non-local texture cues. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1–7, 2008.
[311] Li Shen and Chuohao Yeo. Intrinsic images decomposition using a local and global sparse representation of reflectance. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 697–704, 2011.
[312] Lu Sheng, Ziyi Lin, Jing Shao, and Xiaogang Wang. Avatar-Net: Multi-scale zero-shot style transfer by feature decoration. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2018.
[313] Jian Shi, Yue Dong, Hao Su, and Stella X Yu. Learning non-Lambertian object intrinsics across ShapeNet categories. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5844–5853, 2017.
[314] Lixin Shi, Haitham Hassanieh, Abe Davis, Dina Katabi, and Frédo Durand. Light field reconstruction using sparsity in the continuous Fourier domain. ACM Trans. Graphics, 34:12:1–12:13, 2014.
[315] Lixin Shi, Haitham Hassanieh, Abe Davis, Dina Katabi, and Fredo Durand. Light field reconstruction using sparsity in the continuous Fourier domain. ACM Trans. Graphics, 34(1), December 2015.
[316] Meng-Li Shih, Shih-Yang Su, Johannes Kopf, and Jia-Bin Huang. 3d photography using context-aware layered depth inpainting. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 8028–8038, 2020.
[317] Yichang Shih, Sylvain Paris, Frédo Durand, and William T Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG), 32(6):200, 2013.
[318] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE, 2017.
[319] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from Simulated and Unsupervised Images through Adversarial Training. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[320] Zhixin Shu, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman, and Dimitris Samaras. Neural face editing with intrinsic image disentangling. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5444–5453, 2017.
[321] Nathan Silberman and Rob Fergus. Indoor scene segmentation using a structured light sensor. In ICCV Workshops, 2011.
[322] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Proc. European Conf. on Computer Vision (ECCV), pages 746–760, 2012.
[323] Ian Simon, Noah Snavely, and Steven M Seitz. Scene summarization for online image collections. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1–8. IEEE, 2007.
[324] Tomas Simon, Jack Valmadre, Iain A. Matthews, and Yaser Sheikh. Kronecker-Markov Prior for Dynamic 3D Reconstruction. Trans. on Pattern Analysis and Machine Intelligence, 39:2201–2214, 2017.
[325] K. Simonyan, A. Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2014.
[326] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. DeepVoxels: Learning persistent 3d feature embeddings. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2437–2446, 2019.
[327] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In Neural Information Processing Systems, pages 1119–1130, 2019.
[328] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3D. In ACM Trans. Graphics (SIGGRAPH), 2006.
[329] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3D. In ACM Trans. Graphics (SIGGRAPH), 2006.
[330] Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[331] Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 190–198, 2017.
[332] Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766, 2017.
[333] Pablo Speciale, Johannes L. Schönberger, Sudipta N. Sinha, and Marc Pollefeys. Privacy preserving image queries for camera localization. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1486–1496, 2019.
[334] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
[335] Pratul P. Srinivasan, Richard Tucker, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng, and Noah Snavely. Pushing the boundaries of view extrapolation with multiplane images. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 175–184, 2019.
[336] Pratul P. Srinivasan, Tongzhou Wang, Ashwin Sreelal, Ravi Ramamoorthi, and Ren Ng. Learning to synthesize a 4D RGBD light field from a single image. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2243–2251, 2017.
[337] Mirjana Stankovic, Ravi Gupta, R. B. Andre, G. Myers, and Marco Nicoli. Exploring legal, ethical and policy implications of artificial intelligence. White paper of the Global Forum on Law, Justice and Development, 2017.
[338] Timo Stich, Christian Linz, Georgia Albuquerque, and Marcus Magnor. View and time interpolation in image space. In Computer Graphics Forum, volume 27, pages 1781–1787. Wiley Online Library, 2008.
[339] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 573–580, 2012.
[340] Deqing Sun, Erik B. Sudderth, and Hanspeter Pfister. Layered RGBD scene flow estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 548–556, 2015.
[341] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328. PMLR, 2017.
[342] Kalyan Sunkavalli, Wojciech Matusik, Hanspeter Pfister, and Szymon Rusinkiewicz. Factored time-lapse video. In ACM Transactions on Graphics (TOG), volume 26, page 101. ACM, 2007.
[343] Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics (ToG), 36(4):1–13, 2017.
[344] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
[345] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[346] Richard Szeliski and Polina Golland. Stereo matching with transparency and matting. Int. J. of Computer Vision, 1998.
[347] Dean Takahashi. How Pixar made Monsters University, its latest technological marvel. https://venturebeat.com/2013/04/24/the-making-of-pixars-latest-technological-marvel-monsters-university/, 2013.
[348] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: Image synthesis using neural textures. ACM Trans. Graphics, 38(4):1–12, 2019.
[349] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2387–2395, 2016.
[350] Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, and Ming-Hsuan Yang. Deep image harmonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3789–3797, 2017.
[351] Richard Tucker and Noah Snavely. Single-view view synthesis with multiplane images. In Proc. Computer Vision and Pattern Recognition (CVPR), June 2020.
[352] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[353] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 6924–6932, 2017.
[354] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. DeMoN: Depth and motion network for learning monocular stereo. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[355] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. DeMoN: Depth and motion network for learning monocular stereo. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5622–5631, 2017.
[356] Suren Vagharshakyan, Robert Bregovic, and Atanas P. Gotchev. Light field reconstruction using shearlet transform. Trans. on Pattern Analysis and Machine Intelligence, 40:133–147, 2015.
[357] Jack Valmadre and Simon Lucey. General trajectory prior for non-rigid reconstruction. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1394–1401, 2012.
[358] Sahil Verma and Julia Rubin. Fairness definitions explained. In 2018 IEEE/ACM International Workshop on Software Fairness (FairWare), pages 1–7. IEEE, 2018.
[359] Ricardo Vinuesa, L. Fdez. de Arévalo, M. Luna, and H. Cachafeiro. Simulations and experiments of heat loss from a parabolic trough absorber tube over a range of pressures and gas compositions in the vacuum chamber. Journal of Renewable and Sustainable Energy, 8(2):023701, 2016.
[360] Minh Vo, Srinivasa Narasimhan, and Yaser Sheikh. Spatiotemporal bundle adjustment for dynamic 3D reconstruction. Proc. Computer Vision and Pattern Recognition (CVPR), pages 1710–1718, 2016.
[361] Paul Voigt and Axel Von dem Bussche. The EU General Data Protection Regulation (GDPR). A Practical Guide, 1st Ed., Cham: Springer International Publishing, 10:3152676, 2017.
[362] Chaoyang Wang, José Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[363] Sheng-Yu Wang, Oliver Wang, Andrew Owens, Richard Zhang, and Alexei A. Efros. Detecting Photoshopped faces by scripting Photoshop. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10072–10081, 2019.
[364] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8695–8704, 2020.
[365] Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5309–5318, 2019.
[366] Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5310–5319, 2019.
[367] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 8798–8807, 2018.
[368] Xiaolong Wang and Abhinav Gupta. Generative image modeling using style and structure adversarial networks. In Proc. European Conf. on Computer Vision (ECCV), pages 318–335. Springer, 2016.
[369] Andreas Wedel, Thomas Brox, Tobi Vaudrey, Clemens Rabe, Uwe Franke, and Daniel Cremers. Stereoscopic scene flow computation for 3D motion understanding. Int. J. of Computer Vision, 95(1):29–51, 2011.
[370] Yair Weiss. Deriving intrinsic images from image sequences. In Proc. Int. Conf. on Computer Vision (ICCV), volume 2, pages 68–75, 2001.
[371] Alma Whitten and J. Doug Tygar. Why Johnny can't encrypt: A usability evaluation of PGP 5.0. In USENIX Security Symposium, volume 348, pages 169–184, 1999.
[372] Wikipedia. Death of Elaine Herzberg. https://en.wikipedia.org/wiki/Death_of_Elaine_Herzberg, 2018.
[373] Wikipedia. Deepfake. https://en.wikipedia.org/wiki/Deepfake, 2018.
[374] Wikipedia. Mannequin Challenge. https://en.wikipedia.org/wiki/Mannequin_Challenge, 2018.
[375] Olivia Wiles, Georgia Gkioxari, Richard Szeliski, and Justin Johnson. SynSin: End-to-end view synthesis from a single image. Proc. Computer Vision and Pattern Recognition (CVPR), pages 7465–7475, 2020.
[376] Bernd W. Wirtz, Jan C. Weyerer, and Carolin Geyer. Artificial intelligence and the public sector—applications and challenges. International Journal of Public Administration, 42(7):596–615, 2019.
[377] Changchang Wu. Towards linear-time incremental structure from motion. In Int. Conf. on 3D Vision (3DV), 2013.
[378] Yue Wu, Wael AbdAlmageed, and Premkumar Natarajan. ManTra-Net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9543–9552, 2019.
[379] Jonas Wulff, Laura Sevilla-Lara, and Michael J. Black. Optical flow in mostly rigid scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 4671–4680, 2017.
[380] Ke Xian, Chunhua Shen, Zhiguo Cao, Hao Lu, Yang Xiao, Ruibo Li, and Zhenbo Luo. Monocular relative depth perception with web stereo data supervision. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[381] Wenqi Xian, Patsorn Sangkloy, Varun Agrawal, Amit Raj, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. TextureGAN: Controlling deep image synthesis with texture patches. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 8456–8465, 2018.
[382] Jianxiong Xiao, Andrew Owens, and Antonio Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1625–1632, 2013.
[383] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991, 2017.
[384] Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan L. Yuille, and Kaiming He. Feature denoising for improving adversarial robustness. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 501–509, 2019.
[385] Junyuan Xie, Ross B. Girshick, and Ali Farhadi. Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In Proc. European Conf. on Computer Vision (ECCV), 2016.
[386] Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[387] Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. Monocular depth estimation using multi-scale continuous CRFs as sequential deep networks. Trans. on Pattern Analysis and Machine Intelligence, 2018.
[388] Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.
[389] Weipeng Xu, Avishek Chatterjee, Michael Zollhöfer, Helge Rhodin, Dushyant Mehta, Hans-Peter Seidel, and Christian Theobalt. MonoPerfCap: Human performance capture from monocular video. ACM Transactions on Graphics (ToG), 37(2):27, 2018.
[390] Yi Xu, Jared Heinly, Andrew M. White, Fabian Monrose, and Jan-Michael Frahm. Seeing double: Reconstructing obscured typed input from repeated compromising reflections. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pages 1063–1074, 2013.
[391] Zexiang Xu, Sai Bi, Kalyan Sunkavalli, Sunil Hadap, Hao Su, and Ravi Ramamoorthi. Deep view synthesis from sparse photometric images. ACM Trans. Graphics, 38(4), 2019.
[392] Yexiang Xue, Stefano Ermon, Carla P. Gomes, and Bart Selman. Uncovering hidden structure through parallel problem decomposition for the set basis problem: Application to materials discovery. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[393] Chaofei Yang, Qing Wu, Hai Li, and Yiran Chen. Generative poisoning attack method against neural networks. arXiv preprint arXiv:1703.01340, 2017.
[394] Yingzhi Yang, B. Gog, and E. Gibbs. China seeks to root out fake news and deepfakes with new online content rules. https://www.reuters.com/article/us-china-technology/china-seeks-to-root-out-fake-news-and-deepfakes-with-new-online-content-rules-idUSKBN1Y30VU, 2019.
[395] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. Proc. European Conf. on Computer Vision (ECCV), pages 767–783, 2018.
[396] Mao Ye and Ruigang Yang. Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In Proc. Computer Vision and Pattern Recognition (CVPR), 2014.
[397] Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pages 268–282. IEEE, 2018.
[398] Zhichao Yin and Jianping Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[399] Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, and Jan Kautz. Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5336–5345, 2020.
[400] Ye Yu and William A. P. Smith. InverseRenderNet: Learning single image inverse rendering. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 3155–3164, 2019.
[401] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
[402] Chenxi Zhang, Jizhou Gao, Oliver Wang, Pierre Fite Georgel, Ruigang Yang, James Davis, Jan-Michael Frahm, and Marc Pollefeys. Personal photograph enhancement using internet photo collections. IEEE Trans. on Visualization and Computer Graphics, 2014.
[403] Maggie Zhang. Google Photos tags two African-Americans as gorillas through facial recognition software. Forbes, July 2015.
[404] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
[405] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018.
[406] Yinda Zhang, Shuran Song, Ersin Yumer, Manolis Savva, Joon-Young Lee, Hailin Jin, and Thomas Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5057–5065, 2017.
[407] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[408] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In EMNLP, 2017.
[409] Qi Zhao, Ping Tan, Qiang Dai, Li Shen, Enhua Wu, and Stephen Lin. A closed-form solution to Retinex with nonlocal texture constraints. Trans. on Pattern Analysis and Machine Intelligence, 34(7):1437–1444, 2012.
[410] Yinan Zhao, Brian Price, Scott Cohen, and Danna Gurari. Guided image inpainting: Replacing an image region by pulling content from another image. In Proc. Winter Conf. on Computer Vision (WACV), pages 1514–1523. IEEE, 2019.
[411] Enliang Zheng, Dinghuang Ji, Enrique Dunn, and Jan-Michael Frahm. Sparse dynamic 3D reconstruction from unsynchronized videos. Proc. Int. Conf. on Computer Vision (ICCV), pages 4435–4443, 2015.
[412] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[413] Huizhong Zhou, Benjamin Ummenhofer, and Thomas Brox. DeepTAM: Deep tracking and mapping. In Proc. European Conf. on Computer Vision (ECCV), 2018.
[414] Peng Zhou, Ning Yu, Zuxuan Wu, Larry S. Davis, Abhinav Shrivastava, and Ser-Nam Lim. Deep video inpainting detection. arXiv preprint arXiv:2101.11080, 2021.
[415] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[416] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[417] Tinghui Zhou, Philipp Krähenbühl, and Alexei A. Efros. Learning data-driven reflectance priors for intrinsic image decomposition. In Proc. Int. Conf. on Computer Vision (ICCV), pages 3469–3477, 2015.
[418] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. ACM Trans. Graphics, 37:1–12, 2018.
[419] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros. View synthesis by appearance flow. In Proc. European Conf. on Computer Vision (ECCV), pages 286–301. Springer, 2016.
[420] Chen Zhu, W. Ronny Huang, Hengduo Li, Gavin Taylor, Christoph Studer, and Tom Goldstein. Transferable clean-label poisoning attacks on deep neural nets. In International Conference on Machine Learning, pages 7614–7623. PMLR, 2019.
[421] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Learning a discriminative model for the perception of realism in composite images. In Proceedings of the IEEE International Conference on Computer Vision, pages 3943–3951, 2015.
[422] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2223–2232, 2017.
[423] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Neural Information Processing Systems, pages 465–476, 2017.
[424] Yu Zhu, Wenbin Chen, and Guodong Guo. Evaluating spatiotemporal interest point features for depth-based action recognition. Image and Vision Computing, 32(8):453–464, 2014.
[425] C. Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski. High-quality video view interpolation using a layered representation. ACM Trans. Graphics, 23:600–608, 2004.
[426] C. Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon A. J. Winder, and Richard Szeliski. High-quality video view interpolation using a layered representation. In SIGGRAPH 2004, 2004.
[427] Michael Zollhöfer, Matthias Nießner, Shahram Izadi, Christoph Rehmann, Christopher Zach, Matthew Fisher, Chenglei Wu, Andrew Fitzgibbon, Charles Loop, Christian Theobalt, et al. Real-time non-rigid reconstruction using an RGB-D camera. ACM Trans. Graphics, 33(4):156, 2014.
[428] Daniel Zoran, Phillip Isola, Dilip Krishnan, and William T. Freeman. Learning ordinal relationships for mid-level vision. In Proc. Int. Conf. on Computer Vision (ICCV), pages 388–396, 2015.
[429] Shoshana Zuboff. Big other: Surveillance capitalism and the prospects of an information civilization. Journal of Information Technology, 30(1):75–89, 2015.
[430] Shoshana Zuboff, Norma Möllers, David Murakami Wood, and David Lyon. Surveillance capitalism: An interview with Shoshana Zuboff. Surveillance & Society, 17(1/2):257–266, 2019.
[431] Frederik Zuiderveen Borgesius, Damian Trilling, Judith Möller, Balázs Bodó, Claes H. de Vreese, and Natali Helberger. Should we worry about filter bubbles? Internet Policy Review. Journal on Internet Regulation, 5(1), 2016.