LEARNING GEOMETRY, APPEARANCE AND MOTION IN THE WILD

A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by Zhengqi Li
May 2021

© 2021 Zhengqi Li
ALL RIGHTS RESERVED

LEARNING GEOMETRY, APPEARANCE AND MOTION IN THE WILD
Zhengqi Li, Ph.D.
Cornell University 2021

Physics-based computer vision can be formulated as the inverse process of a graphics rendering engine: we seek to take RGB images and recover the intrinsic properties of a scene, including geometry, material, illumination, and object motion. Computer vision as inverse graphics plays an important role in numerous real-world applications such as virtual reality, in which recovered scene intrinsics can be used to render images at novel viewpoints with plausible lighting. However, most previous techniques either require a multi-camera setup or assume that the underlying scene is static, i.e., that its appearance and geometry do not change over time. In contrast, the photos we see on the Internet constitute only a single-view observation of each scene, and the videos often involve dynamics due to a variety of time-varying factors such as illumination changes and object motion.

In this thesis, I address these problems in in-the-wild scenarios by leveraging a compelling source of data: the massive quantities of unlabeled photos and videos that people take and upload to the Internet every day. I demonstrate how to use such massive but noisy visual data to capture scene geometry, appearance, lighting, and motion from a single RGB image or from videos of dynamic scenes, which in turn enables the synthesis of photo-realistic novel views in both space and time.

BIOGRAPHICAL SKETCH

Zhengqi Li is a CS Ph.D. candidate at Cornell Tech, Cornell University, where he is advised by Prof. Noah Snavely. He will become a research scientist at Google Research starting in Summer 2021. He received a Bachelor of Computer Engineering with High Distinction from the University of Minnesota, Twin Cities, where he was advised by Prof. Stergios Roumeliotis and was a research assistant at the MARS Lab and Google Project Tango. He was also a member of the Robotic Sensor Networks (RSN) Lab, where he worked closely with Prof. Volkan Isler on agricultural robotics and vision. His research interests span 3D and 4D computer vision, inverse graphics, and novel view synthesis for images and videos in the wild. He is a recipient of a CVPR 2019 Best Paper Honorable Mention, a 2020 Google Ph.D. Fellowship, a 2020 Adobe Research Fellowship, and a 2016 CRA Outstanding Undergraduate Researcher Honorable Mention.

This thesis is dedicated to my parents and all my other family members.

ACKNOWLEDGEMENTS

First and foremost, I would like to give my highest appreciation and gratitude to my adviser, Professor Noah Snavely, for his invaluable support, inspiration, and advice throughout my five-year Ph.D. journey. Noah is the best advisor and the nicest person I have ever known, and he always gave me freedom and support in both research and life. It is Noah's guidance that has helped me grow over these five years into an independent researcher. I would also like to thank my thesis committee members, Professor Serge Belongie and Professor Mor Naaman, who provided feedback and suggestions on my thesis and dissertation defense.
I also want to thank my collaborators and colleagues at Cornell Tech and Cornell University, including Kai Zhang, Hadar Averbuch-Elor, Qianqian Wang, Jin Sun, Wenqi Xian, and Abe Davis. They are great to work with, and I always learned a lot from my discussions with them. In addition, I was fortunate to have had the chance to work with many great researchers outside Cornell University on exciting research projects. I would like to give special thanks to Tali Dekel, Forrester Cole, Richard Tucker, Ce Liu, and William T. Freeman of the VisCam team at Google Research, Fernando De La Torre at Facebook Reality Lab, and Oliver Wang and Simon Niklaus at Adobe Research. They all offered me great advice on research projects during my three summer internships. Last but not least, I dedicate my dissertation to my parents and my other family members, since my Ph.D. journey would not have been possible without their countless love and support.

TABLE OF CONTENTS

Biographical Sketch
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

1  Introduction

2  Learning Single View Depth Prediction from Internet Photos
   2.1  Introduction
   2.2  Related work
   2.3  The MegaDepth Dataset
        2.3.1  Photo calibration and reconstruction
        2.3.2  Depth map refinement
        2.3.3  Depth enhancement via semantic segmentation
        2.3.4  Creating a dataset
   2.4  Depth estimation network
        2.4.1  Network architecture
        2.4.2  Loss function
   2.5  Evaluation
        2.5.1  Evaluation and ablation study on MD test set
        2.5.2  Generalization to other datasets
   2.6  Discussion

3  Learning the Depths of Moving People by Watching Frozen People
   3.1  Introduction
   3.2  Related Work
   3.3  The MannequinChallenge Dataset
   3.4  Depth Prediction Model
        3.4.1  Depth from motion parallax
        3.4.2  Depth confidence
        3.4.3  Keypoints
        3.4.4  Losses
   3.5  Results
        3.5.1  Evaluation on the MC test set
        3.5.2  Evaluation on the TUM RGBD dataset
        3.5.3  Internet videos of dynamic scenes
   3.6  Discussion

4  Learning Intrinsic Image Decomposition from Watching the World
   4.1  Introduction
   4.2  Related work
   4.3  Overview and network architecture
   4.4  Dataset
   4.5  Approach
        4.5.1  Image reconstruction loss
        4.5.2  Reflectance consistency
        4.5.3  Dense spatio-temporal reflectance smoothness
        4.5.4  Multi-scale shading smoothness
        4.5.5  All-pairs weighted least squares (APWLS)
   4.6  Evaluation
        4.6.1  Evaluation on IIW
        4.6.2  Evaluation on SAW
        4.6.3  Qualitative results on IIW and SAW
        4.6.4  Evaluation on MIT intrinsic images
   4.7  Discussion

5  Learning Better Intrinsic Images through Physically-Based Rendering
   5.1  Introduction
   5.2  Related work
   5.3  CGINTRINSICS Dataset
   5.4  Learning Cross-Dataset Intrinsics
        5.4.1  Supervised losses
        5.4.2  Smoothness losses
        5.4.3  Reconstruction loss
        5.4.4  Network architecture
   5.5  Evaluation
        5.5.1  Evaluation on IIW
        5.5.2  Evaluation on SAW
        5.5.3  Evaluation on MIT intrinsic images
   5.6  Discussion

6  Crowdsampling the Plenoptic Function
   6.1  Introduction
   6.2  Related Work
   6.3  Approach
        6.3.1  Collecting Crowdsampled Data
        6.3.2  The DeepMPI Scene Representation
        6.3.3  Stage 1: Optimizing DeepMPI Color and α Planes
        6.3.4  Stage 2: Learning How Appearance Changes with Time
   6.4  Experiments
   6.5  Discussion

7  Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes
   7.1  Introduction
   7.2  Related Work
   7.3  Approach
        7.3.1  Neural scene flow fields for dynamic scenes
        7.3.2  Optimization
        7.3.3  Integrating a static scene representation
        7.3.4  Space-time view synthesis
   7.4  Experiments
        7.4.1  Baselines and error metrics
        7.4.2  Quantitative evaluation
        7.4.3  Qualitative evaluation
   7.5  Discussion

8  Ethics in Data-driven Computer Vision
   8.1  Introduction
   8.2  Privacy and Security
        8.2.1  Security
        8.2.2  Privacy
   8.3  Fairness and Bias
   8.4  Interpretability and Transparency
   8.5  Other Aspects
        8.5.1  Policy and Regulation
        8.5.2  Employment and HCI
        8.5.3  Control and Surveillance
        8.5.4  Sustainability
   8.6  Discussion

9  Conclusion

A  Chapter 2 Appendix
   A.1  Depth Map Refinement and Enhancement
        A.1.1  Modified MVS algorithm
        A.1.2  Foreground and background classes
        A.1.3  Automatic ordinal depth labeling
   A.2  SfM Disagreement Rate (SDR)

B  Chapter 3 Appendix
   B.1  Derivations of depth from motion parallax
   B.2  Derivation of error metrics
   B.3  Network Architecture

C  Chapter 4 Appendix
   C.1  Hyperparameters Setting
   C.2  All-Pairs Weighted Least Squares (APWLS)
   C.3  Additional details for SAW evaluation metrics

D  Chapter 5 Appendix
   D.1  Additional details for training losses
        D.1.1  Ordinal term for CGINTRINSICS
        D.1.2  Additional hyperparameter settings

E  Chapter 6 Appendix
   E.1  Priors on the Plenoptic Function
        E.1.1  Constant Visibility and Light Field Gradients
        E.1.2  Common Light Sources, Material Properties, and Normals
   E.2  Scene Statistics
   E.3  Losses
        E.3.1  Losses optimizing DeepMPI color and α planes
        E.3.2  Training Losses
   E.4  Network Architecture
   E.5  Training and Implementation
   E.6  Visual Illustrations
   E.7  User Study

F  Chapter 7 Appendix
   F.1  Scene Flow Regularization Details
   F.2  Data Driven Prior Details
   F.3  Space-Time Interpolation Visualization
   F.4  Volume Rendering Equation Approximation
   F.5  Network Architecture
   F.6  Implementation Details

Bibliography

LIST OF TABLES

2.1  Results on the MD test set (places unseen during training) for several network architectures. For VGG∗ we use the same loss and network architecture as in [84] for comparison to [84]. Lower is better.
2.2  Results on MD test set (places unseen during training) for different loss configurations. Lower is better.
2.3  Results on three different test sets with and without our depth refinement methods. Raw MD indicates raw depth data; Clean MD indicates depth data using our refinement methods. Lower is better for all error measures.
2.4  Results on Make3D for various training datasets and methods. The first column indicates the training dataset. Errors for "Ours" are averaged over four models trained/validated on MD. Lower is better for all metrics.
2.5  Results on the KITTI test set for various training datasets and approaches. Columns are as in Table 2.4.
2.6  Results on the DIW test set for various training datasets and approaches. Columns are as in Table 2.4.
3.1  Quantitative comparisons on the MC test set. Different input configurations of our model: (I) single image; (II) optical flow masked in the human region (F), confidence, and human mask; (III) masked input depth and human mask; (IV) additional confidence; and in (V), we also input human keypoints. The last row indicates the error for the depth estimated from motion parallax between two frames in all image regions (human and non-human); this serves as an oracle and can only be measured if the entire scene is static. Lower is better for all metrics.
3.2  Results on the TUM RGBD dataset. Different si-RMSE metrics as well as standard RMSE and relative error (Rel) are reported. We evaluate our models (light gray background) under different input configurations, as described in Table 3.1. Raw depth indicates the model is trained using raw MVS depth predictions as supervision, without our depth cleaning method. A dataset denoted as "-" indicates that the method is not learning-based. Lower is better for all error metrics.
4.1  Results on the IIW test set. Lower is better for the Weighted Human Disagreement Rate (WHDR). The second column indicates the training data each learning-based method uses; "-" indicates the method is optimization-based.
∗ indicates WHDR is evaluated based on CNN classifier outputs for pairs of pixels rather than full decompositions.
4.2  Results on the SAW test set. Higher is better for AP%. The second column is described in Table 4.1. Note that none of the methods use annotations from SAW.
4.3  Results on MIT intrinsics. For all error metrics, lower is better. ST = Sintel dataset and SN = ShapeNet dataset. The second column shows the dataset used for training. GT indicates whether the method uses ground truth for training.
5.1  Comparisons of existing intrinsic image datasets with our CGINTRINSICS dataset. PB indicates physically-based rendering and non-PB indicates non-physically-based rendering.
5.2  Numerical results on the IIW test set. Lower is better for WHDR. The "Training set" column specifies the training data used by each learning-based method; "-" indicates an optimization-based method. IIW(O) indicates original IIW annotations and IIW(A) indicates augmented IIW comparisons. "All" indicates CGI+IIW(A)+SAW. † indicates the network was validated on CGI while others were validated on IIW. ∗ indicates CNN predictions are post-processed with a guided filter [254].
5.3  Quantitative results on the SAW test set. Higher is better for AP%. The second column is described in Table 5.2. The third and fourth columns show performance on the unweighted SAW benchmark and our more challenging gradient-weighted benchmark, respectively.
5.4  Quantitative results on the MIT intrinsics test set. For all error metrics, lower is better. The second column shows the dataset used for training. ⋆ indicates models fine-tuned on MIT.
6.1  Quantitative comparisons on our test set. Lower is better for l1 and LPIPS, and higher is better for PSNR. l1 errors are scaled by 10 for ease of presentation.
7.1  Quantitative evaluation of novel view synthesis on the Dynamic Scenes dataset. MV indicates whether the approach makes use of multi-view information.
7.2  Quantitative evaluation of novel view and time synthesis. See Sec. 7.4.2 for a description of the baselines.
7.3  Ablation study on the Dynamic Scenes dataset. See Sec. 7.4.2 for detailed descriptions of each of the ablations.
E.1  Scene statistics. We include (1) the total number of images, (2) the field of view (FoV) of the reference DeepMPI, and (3) the depth of the near and far MPI planes. The first five scenes are used for evaluation in Chapter 6.
E.2  User study: share of votes on Q1.
E.3  User study: share of votes on Q2.
E.4  User study: share of votes on Q3.

LIST OF FIGURES

1.1  Learning geometry, appearance and motion in the wild. Massive numbers of photos and videos are uploaded by people every day (left). My work in this thesis shows how to leverage such visual data to learn scene geometry, appearance, lighting, and motion; based on that, we further demonstrate how to synthesize novel views in both space and time (right).
1.2  Inverse graphics as an ill-posed problem.
From a single image (a), a likely explanation of the underlying scene is shown in (b), but this image could also be a painting (c), a sculpture (d), or an effect created by a set of lights (e). Figure adapted from Adelson and Pentland [5] and Jon Barron [20].
2.1  Comparison between MVS depth maps with and without our proposed refinement/cleaning methods. The raw MVS depth maps (middle) exhibit depth bleeding (top) or incorrect depth on people (bottom). Our methods (right) can correct or remove such outlier depths.
2.2  Examples of automatic ordinal labeling. Blue mask: foreground (F_ord) derived from semantic segmentation. Red mask: background (B_ord) derived from reconstructed depth.
2.3  Effect of the L_grad term. L_grad encourages predictions to match the ground truth depth gradient.
2.4  Effect of the L_ord term. L_ord tends to correct ordinal depth relations for hard-to-reconstruct objects such as the person in the first row and the tree in the second row.
2.5  Depth predictions on the MD test set. (Blue = near, red = far.) For visualization, we mask out the detected sky region. (a) Input photo. (b) Ground truth COLMAP depth map (GT). (c) VGG∗ prediction using the loss and network of [84]. (d) Depth prediction from a ResNet [191]. (e) Depth prediction from an hourglass (HG) network [63].
2.6  Depth predictions on Make3D. The last four columns show results from the best models trained on non-Make3D datasets (the final column is our result).
2.7  Depth predictions on KITTI. (Blue = near, red = far.) None of the models were trained on KITTI data. From left to right: (a) input image, (b) ground truth (GT), (c) model trained on DIW [63], (d) model trained on Make3D [191], (e) ours, trained on MD.
2.8  Depth predictions on the DIW test set. (Blue = near, red = far.) Captions are as described in Figure 2.7. None of the models were trained on DIW data.
3.1  Our model predicts dense depth when both an ordinary camera and people in the scene are freely moving (right). We train our model on our new MannequinChallenge dataset: a collection of Internet videos of people imitating mannequins, i.e., freezing in diverse, natural poses, while a camera tours the scene (left). Because the people are stationary, geometric constraints hold; this allows us to use multi-view stereo to estimate depth, which serves as supervision during training. In all figures, we use inverse depth maps for visualization purposes, and refer to them as depth maps.
3.2  Traditional stereo vs. our setup. Left: a person is observed at the same time instant from two different views. The 3D position of points can be computed using triangulation. Right: when both the camera and the objects in the scene are moving, triangulation is no longer possible since the epipolar constraint does not apply.
3.3  Sample images from Mannequin Challenge videos. Each image is a frame from a video sequence in which the camera is moving but the humans are all static. The videos span a variety of natural scenes, poses, and configurations of people.
3.4  Effect of depth cleaning. (a-b) Raw MVS depth maps, D_MVS, may contain errors and outliers, especially in untextured regions (see regions circled in yellow). (c) Our depth cleaning method effectively filters out such erroneous depth values.
3.5  Sample frames from clips removed during filtering. (a) Videos captured with fisheye cameras; (b) videos with synthetic backgrounds; (c) sequences with truly moving objects (pairs of frames shown in each column).
3.6  System overview. Our model takes as input the RGB frame, a human segmentation mask, masked depth from motion parallax (via optical flow and SfM pose), and an associated confidence map. We ask the network to use these inputs to predict depths that match the ground truth MVS depth.
3.7  System inputs and training data. The input to our network consists of: (a) an RGB image, (b) a human mask, (c) a masked depth map computed from motion parallax w.r.t. a selected source image, and (d) a masked confidence map. Low-confidence regions (dark circles) in the first two rows indicate the vicinity of the camera epipole, where depth from parallax is unreliable and removed. The network is trained to regress to MVS depth (e).
3.8  Examples of keypoint images. The top row shows examples of input images and the bottom row shows the corresponding detected human keypoint images, where different colors indicate different joints. We apply morphological dilation to the keypoint maps to make each keypoint location more visible.
3.9  Qualitative results on the MC test set. From top to bottom: reference images and their corresponding MVS depth (pseudo ground truth); our depth predictions using our single-view model (third row) and our two-frame model (fourth row). The additional network inputs give improved performance in both human and non-human regions.
3.10  Qualitative comparisons on the TUM RGBD dataset. (a) Reference images, (b) ground truth sensor depth, (c) results of the single-view depth prediction method DORN [94], (d) results of the two-frame motion stereo method DeMoN [354], (e-f) depth predictions from our single-view and two-frame models, respectively.
3.11  Comparisons on Internet video clips with moving cameras and people. From left to right: (a) reference input image, (b) results of DORN [94], (c) results of Chen et al. [63], (d) results of DeMoN [354], (e) results of our full method.
3.12  Depth-based visual effects. Using our predicted depth maps, we can apply depth-aware visual effects to (a) input images; we show (b) defocus, (c) object insertion, and (d) anaglyph effects.
3.13  Depth-based image inpainting. We use depth prediction and camera poses to warp the pixels of nearby frames for image inpainting and people removal. The top row shows original images and the bottom row shows inpainted images.
3.14  Failure cases. From left to right: (a) input RGB image, (b) depth predicted by our single-view method, (c) depth predicted by our proposed full method.
Our proposed full method can fail for reasons including (1) a failure to generalize to complex human poses (first three rows), or (2) non-human movers such as animals, cars, and shadows (last three rows). In some of these cases, our single-view method can outperform our full two-view method, because added complexities can sometimes arise in the presence of multiple views.
4.1  To train, our method learns from unlabeled videos with a fixed viewpoint but varying illumination (top). At test time (bottom), our network produces an intrinsic image decomposition (R, S) from a single image I.
4.2  System overview and network. During training, our network input is an image sequence I, and the outputs are reflectance images R and shading images S for the sequence. Each block in the network depicts a convolutional/deconvolutional layer. E is an encoder, and D_R and D_S are decoders for the reflectance and shading images. For the innermost feature maps, we have one side output c representing the illumination color. E is an energy function measuring the cost of the decomposition.
4.3  Examples of challenging images in our dataset. The first two images depict colorful illumination. The last two images show strong sunlight/shadows.
4.4  Failure cases for intrinsic image estimation algorithms. We applied a state-of-the-art multi-image intrinsic image decomposition algorithm [128] to our dataset. This method fails to produce decomposition results suitable for training, due to strong assumptions that hold primarily for outdoor/laboratory scenes.
4.5  Effect of v_med in the shading smoothness term. (White = large weight, black = small weight.) Adding the extra v_med can help capture smoothness in textured regions, such as the pillows in the first row and the floor in the second row. The last column shows the final smoothness weight v_pq.
4.6  Qualitative comparisons for intrinsic image decomposition on the IIW/SAW test sets. Our network predictions achieve results comparable to state-of-the-art intrinsic image decomposition algorithms (Bell et al. [26] and Zhou et al. [417]).
4.7  Qualitative comparisons on the MIT intrinsics test set. Odd-numbered rows show predicted reflectance; even-numbered rows show predicted shading. (a) Input image, (b) ground truth (GT), (c) SIRFS [19], (d) Direct Intrinsics (DI) [252], (e) Shi et al. [313], (f) our method.
5.1  Overview and network architecture. Our work integrates physically-based rendered images from our CGINTRINSICS dataset and reflectance/shading annotations from IIW and SAW in order to train a better intrinsic decomposition network.
5.2  Visualization of ground truth from our CGINTRINSICS dataset. Top row: rendered RGB images. Middle: ground truth reflectance. Bottom: ground truth shading. Note that light sources are masked out when creating the ground truth decomposition.
5.3  Visual comparisons between our CGI and the original SUNCG dataset. Top row: images from SUNCG/PBRS. Bottom row: images from our CGI dataset. The images in our dataset have higher SNR and are more realistic.
5.4  Examples of predictions with and without IIW training data. Adding real IIW data can qualitatively improve reflectance and shading predictions.
Note, for instance, how the quilt highlighted in the first row has a more uniform reflectance after incorporating IIW data, and similarly for the floor highlighted in the second row.
5.5  Examples of predictions with and without SAW training data. Adding SAW training data can qualitatively improve reflectance and shading predictions. Note the pictures/TV highlighted in the decompositions in the first row, and the improved assignment of texture to the reflectance channel for the paintings and sofa in the second row.
5.6  Precision-recall (PR) curves for shading images on the SAW test set. Left: PR curves generated using the unweighted SAW error metric of [205]. Right: curves generated using our more challenging gradient-weighted metric.
5.7  Qualitative comparisons on the IIW/SAW test sets. Our predictions show significant improvements compared to state-of-the-art algorithms (Bell et al. [26] and Zhou et al. [417]). In particular, our predicted shading channels include significantly less surface texture in several challenging settings.
5.8  Qualitative comparisons on the MIT intrinsics test set. Odd rows: reflectance predictions. Even rows: shading predictions. ⋆ indicates predictions fine-tuned on MIT.
6.1  Crowdsampled plenoptic slices. Given a large number of tourist photos taken at different times of day, our system learns to construct a continuous set of light fields and to synthesize novel views capturing all-times-of-day scene appearance.
6.2  Registered photo collections. Example SfM reconstructions of clusters of Internet photos sharing similar viewpoints, labeled as red dots.
6.3  Renderings of base color and alpha. From left to right: (a) original photos at target viewpoint c_k, (b) our estimated base color at c_k, (c) pseudo-depth computed from the RGBα MPI at c_k using our two-phase approach, (d) pseudo-depth from the baseline. For depth maps, red = close and blue = far.
6.4  Learning framework. Our method builds a reference DeepMPI D^r, consisting of base color, alpha, and latent feature components organized into planar layers. A rendering network G takes a DeepMPI projected to a target viewpoint c_k, and predicts corresponding RGB color layers. The appearance of these layers is modulated by an appearance vector z_s produced by encoder E. The over operation O is applied to the resulting RGBα MPI to render a view. We jointly train the encoder E, rendering network G, and latent features F^r in the DeepMPI by comparing a rendered view with an original exemplar image I_k = I_s. During inference, given an exemplar photo I_s, we can synthesize novel views close to the reference viewpoint, while also preserving the exemplar's appearance.
6.5  Comparisons of images reconstructed with different configurations of our method. The images rendered by our full approach (e) are more similar to the ground truth images (a) than those of other configurations. In particular, the images rendered by the models without AdaIN (b) or the DeepMPI (c) are less realistic, and the model that does not feed the deep buffer Φ^r_s to the encoder (d) fails to capture accurate scene appearance, as indicated in the highlighted regions.
6.6  Appearance transfer comparison. From left to right: (a) exemplar images used to extract appearance vectors, (b) predictions from MUNIT [143], (c) predictions from NRW [238], (d) predictions from our method. Compared to the baselines, our rendered images are more photo-realistic and are more faithful to the appearance of the exemplar images. Please zoom in to the highlighted regions for better visual comparison.
6.7  Appearance interpolation. The left- and rightmost exemplar images indicate the start and end appearance. Intermediate images are generated by linearly interpolating the latent vectors of the two images. Odd rows show interpolation results from NRW [238], and even rows from our method. Moving shadows are indicated in the highlighted regions.
6.8  4D photos. We demonstrate an application of creating 4D photos by performing spatial-temporal interpolation in which both camera viewpoint and scene illumination change simultaneously.
6.9  Limitations. Some failure cases include: (a) input photo collections that do not span the full range of desired viewpoints, or (b) intrinsic limitations of MPIs leading to poor extrapolation to large camera motions. In addition, as shown in (c) (an exemplar image with a strong shadow) and (d) (the resulting rendering), our method can fail to model strong cast shadows produced by occluders outside the reference field of view.
7.1  Scene flow field warping. To render a frame at time i, we perform volume tracing along each ray r_i with the RGBσ at time i, giving us the pixel color Ĉ_i(r_i) (left). To warp the scene from time j to i, we offset each step along r_i using the scene flow f_i→j and volume trace with the associated color and opacity (c_j, σ_j) (right).
7.2  Scene flow disocclusion ambiguity. In this 2D orthographic example, a single blue object translates to the right by one pixel from the frame at time i to the frame at time j. Here, the correct scene flow at the point labeled a, i.e., f_i→j(a), points one unit to the right; however, for the scene flow f_i→j(c) (and similarly f_j→i(a)), there can be multiple answers. If f_i→j(c) = 0, the scene flow would incorrectly point to the foreground in the next frame, and if f_i→j(c) = 1, the scene flow would point to the freespace location d in the next frame.
7.3  Qualitative ablations. Results of our full method with different loss components removed. The odd rows show zoomed-in rendered color and the even rows show the corresponding pseudo-depth. Removing each component reduces the overall quality in different ways.
7.4  Dynamic and static components. Our method learns static and dynamic components within the combined representation. Note that the person is almost still in the second example.
7.5  Static scene representation ablation. Adding a static scene representation yields higher-fidelity renderings, especially in static regions (a,c), when compared to the purely dynamic model (b).
7.6  Novel time synthesis. Rendering images by interpolating the time index (top) yields blending artifacts compared to our scene-flow-based rendering (bottom).
7.7  Qualitative comparisons on the Dynamic Scenes dataset.
Compared with prior methods, our rendered images more closely match the ground truth, and include fewer artifacts, as shown in the highlighted regions.
7.8  Qualitative comparisons on monocular video clips. When compared to the baselines, our approach more correctly synthesizes hidden content in disocclusions (shown in the last three rows), and at locations with complex scene structure, such as the fence in the first row.
7.9  Limitations. Our method is unable to extrapolate content unseen in the training views (a), and has difficulty recovering high-frequency details if a video involves extreme object motion (b,c).
8.1  Facial expression manipulation. Computer vision has enabled control of facial expressions in arbitrary videos by using image and depth information. This has raised significant security concerns regarding misuse in fake news or propaganda. Figure adapted from Thies et al. [349].
8.2  Privacy-preserving image synthesis from 3D reconstruction. From left to right: original photo, an image synthesized from a standard SfM reconstruction [304] using a technique from Pittaluga et al. [278], and an image synthesized from a privacy-preserving SfM 3D reconstruction [105], which excludes sensitive visual information such as humans. Figure adapted from Geppert et al. [333].
8.3  Privacy-preserving 3D representation. Instead of using points as a 3D representation, the use of randomized 3D lines to enable privacy-preserving localization has been proposed. Figure adapted from Speciale et al. [333].
8.4  Unfairness in data-driven computer vision. State-of-the-art facial recognition systems all reveal gender and ethnicity bias in their predictions. These algorithms perform much better on light-skinned males than on dark-skinned females. Figure adapted from Buolamwini et al. [48].
8.5  Dataset bias in depth prediction. Qualitative comparison of a state-of-the-art depth prediction model [281] and the depth prediction models proposed in this thesis, which were trained on MegaDepth [207] and MannequinChallenge [203], using images from the Microsoft COCO dataset [210]. Figure adapted from Ranftl et al. [281].
8.6  Uncertainty modeling in depth prediction. Uncertainty modeling can help identify confident regions during depth prediction, making the model robust to noise and potential attacks. From left to right: input image, ground-truth depth, depth prediction, estimated aleatoric uncertainty, and estimated epistemic uncertainty. Figure adapted from Kendall et al. [173].
8.7  Uncertainty modeling in novel view synthesis. By modeling transient and sensitive objects in Internet photos as aleatoric uncertainty, we obtain novel view synthesis results with better rendering quality and privacy-preserving properties. From left to right: rendered static component, rendered transient component, composite rendered image, original photo, and estimated aleatoric uncertainty. Figure adapted from Martin-Brualla et al. [230].
A.1  Additional example comparisons between MVS depth maps with and without our proposed refinement/cleaning methods.
Column (b) (before filtering): the plinth of the statue in the first row and the "Statue of Liberty" in the second row both show the depth bleeding effect. Column (c) (after filtering): our refinement method corrects or removes such depth values.
A.2  Additional examples of automatic ordinal labeling. Blue mask: foreground (F_ord) derived from semantic segmentation. Red mask: background (B_ord) derived from reconstructed depth.
A.3  Examples of sampled SfM points. Red circles indicate sampled SfM points, with the radius indicating the estimated depth derived from SfM; small radius = small (close) depth, large radius = large (far) depth.
B.1  Network Architecture. Each block with a different color (id) in (a) indicates a convolutional layer. The block labeled H indicates a 3×3 convolutional layer, and all other blocks are implemented as a variant of an Inception module [344], as shown in (b). Parameters for each type of layer are shown in (c). We use bilinear interpolation to upsample features in the network. Figures modified from Chen et al. [63].
E.1  Visual examples of reference base color images. These are over-composited from the base color planes of the reference DeepMPI.
E.2  Visual illustration of the reference mean RGB PSV. Different images in each row indicate different depth planes of the plane sweep volume (PSV). The mean RGB images at different depth planes have different in-focus regions.
E.3  Visual examples of rectified RGB images. The reference rectified images are geometrically stable and globally aligned up to disocclusion.
F.1  Space-time view synthesis. We propose a 3D splatting-based approach to perform space-time interpolation at a specified target viewpoint (shown as a green camera) at an intermediate time i + δ_i. Specifically, we sweep a plane over every ray r emitted from the specified target viewpoint, from front to back. At each sampled step t along the ray, we query the color and density information (c, α), and the scene flows at times i and i + 1. We then displace the 3D points along the ray by the scaled scene flows δ_i f_i→i+1 and (1 − δ_i) f_i→i−1, respectively (left). The displaced 3D points are then splatted from times i and i + 1 onto a (c, α) accumulation buffer at the target viewpoint, and the splats are blended with linear weights 1 − δ_i and δ_i (middle). The final rendered view is obtained by volume rendering the accumulation buffer (right).
F.2  Network architecture of the static (time-invariant) scene representation. Modified from the original NeRF architecture diagram. We predict an extra blending weight field v from intermediate features, along with opacity σ.
F.3  Network architecture of the dynamic (time-variant) scene representation. Modified from the original NeRF architecture diagram. We encode and input time indices i into the MLP, and predict time-dependent scene flow fields F_i and disocclusion weight fields W_i from the intermediate features, along with opacity σ_i.

CHAPTER 1

INTRODUCTION

We live in a complex 4D world, where we rely on our eyes to perceive color, light, and dynamics around us.
For example, when we drive on a highway, we can determine how far away the car in front of us is and how fast it is traveling, while also being able to distinguish whether a dark region is due to shadow or to painted texture. Although these tasks are simple for humans, it is very challenging for a machine to understand such physical properties from images. Therefore, a natural question comes to mind: how can we build a machine capable of modeling a scene from cameras, and of inferring the physical properties of that scene in space and across time?

One way to formulate this problem is as physics-based computer vision, or inverse graphics. As illustrated in Figure 1.1, inverse graphics can be treated as the inverse process of a traditional graphics engine, with the goal of recovering scene geometry, materials, illumination, and motion from one or more images [280]. Inverse graphics techniques are useful in practice, and play an important role in numerous real-world applications. For instance, if we can accurately recover the space-time structure of a scene, then we can enable immersive virtual reality (VR) and augmented reality (AR) experiences. In fact, Google's ARCore, Microsoft's Hololens, and Facebook's Oculus rely on accurate estimates of geometry, appearance, and illumination from their vision algorithms to render graphics from the correct viewpoint with plausible lighting. Inverse graphics also enables many image and video editing functionalities such as object insertion and 3D photography, which have been adopted by products such as Adobe Photoshop and After Effects.

In the past, inverse graphics problems have often been tackled by relying on multiple synchronized cameras or extra sensors [255, 219, 303, 46]. But such setups prevent these methods from scaling to the images and videos we see and take in our everyday lives.

Figure 1.1: Learning geometry, appearance and motion in the wild. Massive numbers of photos and videos are uploaded by people every day (left). My work in this thesis shows how to leverage such visual data to learn scene geometry, appearance, lighting, and motion; based on that, we further demonstrate how to synthesize novel views in both space and time (right).

At the same time, solving inverse graphics from images and videos in the wild is usually a highly ill-posed problem. For example, in the case of single-image inference shown in Figure 1.2, there can be many different physical explanations for a single image observation. However, many of these explanations are very unlikely in the real world. Hence, deep learning is a promising approach for this task, since it can potentially learn useful priors over likely scenes by training on large amounts of data.

This raises another important question: how do we get and use data for learning geometry, appearance, and motion in the wild? Unlike with object recognition datasets such as ImageNet [77] and COCO [210], it is much more difficult to collect accurate human-labeled data via crowdsourcing for inverse graphics problems, since it is hard for humans to accurately label quantities such as depth, object motion, and illumination at scale. My work instead leverages a compelling source of data for solving this problem: massive numbers of unlabeled photos and videos people upload to the Internet every day (Figure 1.1, left). On one hand, the physical properties of our world are implicitly encoded in these photos and videos.
On the other hand, Internet data is extremely massive, unstructured, and uncalibrated, meaning the photos and videos were taken by unknown cameras, at unknown locations, from unknown viewpoints, and at unknown times. By leveraging classical physically-based tools from vision as well as graphics, and combining them with the power of machine learning, my work has made key advances on this problem. In particular, in this thesis, I address three main challenges in capturing scene geometry, appearance, and dynamics from images and videos in the wild.

Figure 1.2: Inverse graphics as an ill-posed problem. From a single image (a), a likely explanation of the underlying scene is shown in (b), but this image could also be a painting (c), a sculpture (d), or an effect created by a set of lights (e). Figure adapted from Adelson and Pentland [5] and Jon Barron [20].

Learning depth in the wild. Estimating the 3D geometry of scenes in the wild is one of the most important tasks for visual reasoning. Although there has been incredible progress in using multi-view geometry for 3D reconstruction of objects, buildings, and even whole cities, classical multi-view methods have key limitations that give rise to exciting research challenges. One key limitation of classical 3D methods is that they require multiple images as input. In contrast, depth estimation from single images is a long-standing, ill-posed problem. While single-view depth perception is a hot topic in computer vision, datasets for this problem are very limited, and typically focus on single domains, such as indoor or street scenes. Machine learning models trained on such data do not generalize to other kinds of photos, such as those in one's personal photo collections. Therefore, in Chapter 2, I explore the use of multi-view Internet photo collections, a virtually unlimited data source, to generate a large number of image/depth pairs as 3D supervision for training a better single-view depth prediction model.

Another key assumption of classical methods is that scenes are static. However, in practice, moving objects, especially people, are very common in many scenarios, such as augmented reality (AR) and video processing. Thus, in Chapter 3, I present a new approach to the challenging problem of dense depth prediction for dynamic scenes with moving people. We take a data-driven approach and learn to predict the depth of moving people from a surprising data source: thousands of YouTube videos of people imitating mannequins. Because the people featured in such videos are stationary, geometric constraints hold, and training data can be generated using classical multi-view reconstruction methods. Further, I propose a novel approach that uses motion parallax cues available in the videos to improve depth prediction. I demonstrate that our technique can be used to create a variety of visual effects such as video defocus, image inpainting, and object insertion for AR applications.1

1 https://www.youtube.com/watch?v=fj_fK74y5_0

Learning material and illumination in the wild. Recovering the material and illumination of a scene, often in the form of intrinsic image decomposition (IID), is another key topic in inverse graphics, involving factorizing an input image into a product of two other images with a physical interpretation: a reflectance image, R, containing the material color, and a shading image, S, that modulates the scene appearance via illumination.
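Concretely, the decomposition assumes the standard Lambertian image formation model, under which each pixel p factors as a product; taking logarithms, as is common in practice, turns the product into a sum (a minimal statement of the model, not the full formulation used in Chapter 4):

\[
I(p) = R(p)\, S(p), \qquad \log I(p) = \log R(p) + \log S(p).
\]

The ill-posedness is immediate from this equation: any per-pixel rescaling of R can be absorbed into S while reproducing the same image I, so additional priors or supervision are needed to pin down a unique decomposition.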
While the community has seen significant progress in IID, it remains an extremely challenging problem due to its ill-posedness. Hence, as with depth prediction, the use of learning is an appealing proposition. Unfortunately, unlike with depth prediction, where we can use depth sensors or multi-view geometry based approaches, there is no device or reliable multi-view algorithm for collecting accurate ground truth for intrinsic images. Thus, in Chapter 4, my work explores a weaker, but readily available, source of training data for IID, Internet time-lapse videos, and leverages consistency between the frames of a given sequence as a supervision signal. Based on this idea, I introduce BigTime, a dataset of unlabeled time-lapse image sequences collected from the Internet. While the sequences in BigTime do not provide ground truth intrinsic images, I can instead exploit rich indirect consistency cues during training by specifying that the model should predict decompositions where the scene appearance itself is consistent across a sequence, and any variation is due to changes in illumination. This work is the first of its kind to show the power of unsupervised learning from Internet time-lapse video for material and illumination estimation in the wild [206].

In Chapter 5, I further explore the use of synthetic data generated by physics-based rendering engines for learning a better intrinsic images model. I introduce CGIntrinsics, a large-scale synthetic dataset with full ground-truth decompositions. We find that a decomposition network trained solely on our data outperforms the state of the art, demonstrating the surprising effectiveness of synthetic data for the intrinsic images task.

Reconstructing the Plenoptic Function in the wild. The intrinsic physical properties estimated by inverse graphics techniques can further enable image-based rendering, the problem of synthesizing images at novel viewpoints from a set of 2D images. Most previous work focuses on rendering novel views in 3D, often described as light fields [199]. These methods usually require synchronized camera arrays, or the assumption that the captured scene is completely static, with time-invariant appearance and geometry. However, if we want to capture the entire world and then render it in mixed reality, it would be impossible to do so by putting such camera arrays in every corner of the world. More importantly, we know our world is dynamic: people, cars, and animals move freely through scenes, and photos and videos can be taken in different seasons or at different times of day. So another question arises: how can we synthesize photo-realistic novel views in both space and time?

In the literature, such space-time view synthesis can be characterized with a function called the plenoptic function [4]. The plenoptic function is a hypothetical function representing everything we can ever see, describing the light reaching an observer at any point in space and time. In other words, we can define this function as the light passing through the camera center at every 3D location, at every possible viewing angle, and at any given time. In Chapters 6 and 7, I demonstrate how to use unlabeled Internet visual data to reconstruct the plenoptic function in the wild.
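In its classic form due to Adelson and Bergen [4], the plenoptic function can be written as a seven-dimensional map from viewing conditions to observed radiance (a standard formulation; in practice the wavelength dimension is often replaced by discrete RGB channels):

\[
P = P(x, y, z, \theta, \phi, \lambda, t),
\]

where (x, y, z) is the position of the observer, (θ, φ) is the viewing direction, λ is the wavelength of light, and t is the time of observation.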
First, to enable modeling of scene appearance and illumination changes over time, in Chapter 6 I present a new approach to novel view synthesis under time-varying illumination from Internet photo collections, without requiring temporal registration. Our approach builds on the recent multi-plane image (MPI) format for representing local light fields under fixed viewing conditions. I introduce a DeepMPI representation, motivated by observations on the sparsity structure of the plenoptic function, that allows for real-time synthesis of photorealistic views that are continuous both in space and across changes in lighting. My method can synthesize the same compelling parallax and view-dependent effects as previous methods, while simultaneously interpolating along changes in reflectance and illumination over time.

Another important temporal factor that leads to appearance change over time is scene dynamics. Thus, in Chapter 7, I present a new method for novel view and time synthesis of complex dynamic scenes, requiring only a monocular video with known camera poses as input. To do this, I introduce Neural Scene Flow Fields, a new representation that models a dynamic scene as a time-variant continuous function of appearance, geometry, and 3D scene motion. This representation is optimized through a neural network to fit the observed input views. I show that it can handle challenging dynamic scenes, including thin structures, view-dependent effects, and natural degrees of motion.

In Chapter 8, I discuss important ethical concerns in computer vision and deep learning, and present concrete case studies related to the topics in this thesis. I then conclude in Chapter 9 with a discussion of future research directions.

CHAPTER 2

LEARNING SINGLE VIEW DEPTH PREDICTION FROM INTERNET PHOTOS

2.1 Introduction

Predicting 3D shape from a single image is an important capability of visual reasoning, with applications in robotics, graphics, and other vision tasks such as intrinsic images. While single-view depth estimation is a challenging, underconstrained problem, deep learning methods have recently driven significant progress. Such methods thrive when trained with large amounts of data. Unfortunately, fully general training data in the form of (RGB image, depth map) pairs is difficult to collect. Commodity RGB-D sensors such as Kinect have been widely used for this purpose [322], but are limited to indoor use. Laser scanners have enabled important datasets such as Make3D [302] and KITTI [237], but such devices are cumbersome to operate (in the case of industrial scanners) or produce sparse depth maps (in the case of LIDAR). Moreover, both Make3D and KITTI were collected in specific scenarios (a university campus, and atop a car, respectively). Training data can also be generated through crowdsourcing, but this approach has so far been limited to gathering sparse ordinal relationships or surface normals [108, 63, 64].

In this chapter, we explore the use of a nearly unlimited source of data for this problem: images from the Internet taken from overlapping viewpoints, from which structure-from-motion (SfM) and multi-view stereo (MVS) methods can automatically produce dense depth. Such images have been widely used in research on large-scale 3D reconstruction [328, 111, 8, 92]. We propose to use the outputs of these systems as the inputs to machine learning methods for single-view depth prediction.
By using large amounts of diverse training data from photos taken around the world, we seek to learn to predict depth with high accuracy and generalizability. Based on this idea, we introduce MegaDepth (MD), a large-scale depth dataset generated from Internet photo collections. To our knowledge, ours is the first use of Internet SfM+MVS data for single-view depth prediction. Our main contribution is the MD dataset itself. In addition, in creating MD, we found that care must be taken in preparing a dataset from noisy MVS data, and so we also propose new methods for processing raw MVS output, along with a corresponding new loss function for training models with this data. Notably, because MVS tends not to reconstruct dynamic objects (people, cars, etc.), we augment our dataset with ordinal depth relationships automatically derived from semantic segmentation, and train with a joint loss that includes an ordinal term. In our experiments, we show that by training on MD, we can learn a model that works well not only on images of new scenes, but that also generalizes well to completely different datasets, including Make3D, KITTI, and DIW—achieving better generalization than prior datasets.

2.2 Related work

Single-view depth prediction. A variety of methods have been proposed for single-view depth prediction, most recently by utilizing machine learning [139, 301]. A standard approach is to collect RGB images with ground truth depth, and then train a model (e.g., a CNN) to predict depth from RGB [85, 213, 214, 298, 13, 191]. Most such methods are trained on a few standard datasets, such as NYU [321, 322], Make3D [302], and KITTI [104], which are captured using RGB-D sensors (such as Kinect) or laser scanning. Such scanning methods have important limitations, as discussed in the introduction. Recently, Novotny et al. [263] trained a network on 3D models derived from SfM+MVS on videos to learn 3D shapes of single objects; however, their method is limited to images of objects, rather than scenes. Multiple views of a scene can also be used as an implicit source of training data for single-view depth prediction, by utilizing view synthesis as a supervisory signal [385, 101, 110, 415]. However, view synthesis is only a proxy for depth, and may not always yield high-quality learned depth. Ummenhofer et al. [355] trained from overlapping image pairs taken with a single camera, and learned to predict image matches, camera poses, and depth; however, their method requires two input images at test time.

Ordinal depth prediction. Another way to collect depth data for training is to ask people to manually annotate depth in images. While labeling absolute depth is challenging, people are good at specifying relative (ordinal) depth relationships (e.g., closer-than, further-than) [108]. Zoran et al. [428] used such relative depth judgments to predict ordinal relationships between points using CNNs. Chen et al. leveraged crowdsourcing of ordinal depth labels to create a large dataset called "Depth in the Wild" [63]. While useful for predicting depth ordering (and so we incorporate ordinal data automatically generated from our imagery), the Euclidean accuracy of depth learned solely from ordinal data is limited.

Depth estimation from Internet photos. Estimating geometry from Internet photo collections has been an active research area for a decade, with advances in both structure from motion [328, 8, 377, 304] and multi-view stereo [111, 95, 306]. These techniques generally operate on 10s to 1000s of images.
Using such methods, past work has used retrieval and SfM to build a 3D model seeded from a single image [305], or registered a photo to an existing 3D model to transfer depth [402]. However, this work requires either having a detailed 3D model of each location in advance, or building one at run-time. Instead, we use SfM+MVS to train a network that generalizes to novel locations and scenarios.

2.3 The MegaDepth Dataset

In this section, we describe how we construct our dataset. We first download Internet photos from Flickr for a set of well-photographed landmarks from the Landmarks10K dataset [201]. We then reconstruct each landmark in 3D using state-of-the-art SfM and MVS methods. This yields an SfM model as well as a dense depth map for each reconstructed image. However, these depth maps have significant noise and outliers, and training a deep network on this raw depth data will not yield a useful predictor. Therefore, we propose a series of processing steps that prepare these depth maps for use in learning, and additionally use semantic segmentation to automatically generate ordinal depth data.

2.3.1 Photo calibration and reconstruction

We build a 3D model from each photo collection using COLMAP, a state-of-the-art SfM system [304] (for reconstructing camera poses and sparse point clouds) and MVS system [306] (for generating dense depth maps). We use COLMAP because we found that it produces high-quality 3D models via its careful incremental SfM procedure, but other such systems could be used. COLMAP produces a depth map D for every reconstructed photo I (where some pixels of D can be empty if COLMAP was unable to recover a depth), as well as other outputs, such as camera parameters and sparse SfM points plus camera visibility.

Figure 2.1: Comparison between MVS depth maps with and without our proposed refinement/cleaning methods. (a) Input photo; (b) raw depth; (c) refined depth. The raw MVS depth maps (middle) exhibit depth bleeding (top) or incorrect depth on people (bottom). Our methods (right) can correct or remove such outlier depths.

2.3.2 Depth map refinement

The raw depth maps from COLMAP contain many outliers from a range of sources, including: (1) transient objects (people, cars, etc.) that appear in a single image but nonetheless are assigned (incorrect) depths, (2) noisy depth discontinuities, and (3) bleeding of background depths into foreground objects. Other MVS methods exhibit similar problems due to inherent ambiguities in stereo matching. Figure 2.1(b) shows two example depth maps produced by COLMAP that illustrate these issues. Such outliers have a highly negative effect on the depth prediction networks we seek to train. To address this problem, we propose two new depth refinement methods designed to generate high-quality training data.

First, we devise a modified MVS algorithm based on COLMAP, but more conservative in its depth estimates, based on the idea that we would prefer less training data over bad training data. COLMAP computes depth maps iteratively, at each stage trying to ensure geometric consistency between nearby depth maps. One adverse effect of this strategy is that background depths can tend to "eat away" at foreground objects, because one way to increase consistency between depth maps is to consistently predict the background depth (see Figure 2.1 (top)). To counter this effect, at each depth inference iteration in COLMAP, we compare the depth values at each pixel before and after the update and keep the smaller (closer) of the two.
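As a minimal sketch (assuming each iteration's depth maps are available as NumPy arrays, with zeros marking pixels that COLMAP left empty), this conservative update amounts to an element-wise minimum over pixels that are valid in both iterations:

```python
import numpy as np

def conservative_depth_update(depth_prev: np.ndarray,
                              depth_new: np.ndarray) -> np.ndarray:
    """Keep the closer (smaller) of the pre- and post-update depths.

    Zeros denote pixels with no reconstructed depth; a pixel valid in
    only one of the two maps keeps its single available value.
    """
    both_valid = (depth_prev > 0) & (depth_new > 0)
    out = np.where(depth_prev > 0, depth_prev, depth_new)  # fall back to whichever exists
    out[both_valid] = np.minimum(depth_prev[both_valid], depth_new[both_valid])
    return out
```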
We then apply a median filter to remove unstable depth values. We describe our modified MVS algorithm in detail in the Appendix.

Second, we utilize semantic segmentation to enhance and filter the depth maps, and to yield large amounts of ordinal depth comparisons as additional training data. The second row of Figure 2.1 shows an example depth map computed with our object-aware filtering. We now describe our use of semantic segmentation in detail.

2.3.3 Depth enhancement via semantic segmentation

Multi-view stereo methods can have problems with a number of object types, including transient objects such as people and cars, difficult-to-reconstruct objects such as poles and traffic signals, and sky regions. However, if we can understand the semantic layout of an image, then we can attempt to mitigate these issues, or at least identify problematic pixels. We have found that deep learning methods for semantic segmentation are starting to become reliable enough for this use [407]. We propose three new uses of semantic segmentation in the creation of our dataset. First, we use such segmentations to remove spurious MVS depths in foreground regions. Second, we use the segmentation as a criterion to categorize each photo as providing either Euclidean depth or ordinal depth data. Finally, we combine semantic information and MVS depth to automatically annotate ordinal depth relationships, which can be used to help training in regions that cannot be reconstructed by MVS.

Semantic filtering. To process a given photo I, we first run semantic segmentation using PSPNet [407], a recent segmentation method, trained on the MIT Scene Parsing dataset (consisting of 150 semantic categories) [412]. We then divide the pixels into three subsets by predicted semantic category:

1. Foreground objects, denoted F, corresponding to objects that often appear in the foreground of scenes, including static foreground objects (e.g., statues, fountains) and dynamic objects (e.g., people, cars).
2. Background objects, denoted B, including buildings, towers, mountains, etc. (See the Appendix for full details of the foreground/background classes.)
3. Sky, denoted S, which is treated as a special case in the depth filtering described below.

We use this semantic categorization of pixels in several ways, as sketched after this paragraph. As illustrated in Figure 2.1 (bottom), transient objects such as people can result in spurious depths. To remove these from each image I, we consider each connected component C of the foreground mask F. If < 50% of pixels in C have a reconstructed depth, we discard all depths from C. We use a threshold of 50%, rather than simply removing all foreground depths, because pixels on certain objects in F (such as sculptures) can indeed be accurately reconstructed (and we found that PSPNet can sometimes mistake sculptures and people for one another). This simple filtering of foreground depths yields large improvements in depth map quality. Additionally, we remove reconstructed depths that fall inside the sky region S, as such depths tend to be spurious.
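A minimal sketch of this foreground filtering, assuming a boolean foreground mask F and sky mask S from the segmentation, and a depth map with zeros at unreconstructed pixels (scipy's connected-component labeling stands in for whatever implementation is actually used):

```python
import numpy as np
from scipy import ndimage

def filter_depth_by_semantics(depth: np.ndarray,
                              foreground: np.ndarray,
                              sky: np.ndarray,
                              keep_ratio: float = 0.5) -> np.ndarray:
    """Drop MVS depths on mostly-unreconstructed foreground components
    and in the sky region; zeros mark removed/invalid depths."""
    out = depth.copy()
    labels, num = ndimage.label(foreground)   # connected components of F
    for c in range(1, num + 1):
        component = labels == c
        valid_frac = (out[component] > 0).mean()  # fraction with reconstructed depth
        if valid_frac < keep_ratio:
            out[component] = 0.0                  # discard all depths from C
    out[sky] = 0.0                                # sky depths tend to be spurious
    return out
```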
Euclidean vs. ordinal depth. For each 3D model we have thousands of reconstructed Internet photos, and ideally we would use as much of this depth data as possible for training. However, some depth maps are more reliable than others, due to factors such as the accuracy of the estimated camera pose or the presence of large occluders. Hence, we found that it is beneficial to limit training to a subset of highly reliable depth maps. We devise a simple but effective way to compute a subset of high-quality depth maps, by thresholding on the fraction of reconstructed pixels. In particular, if ≥ 30% of an image I (ignoring the sky region S) consists of valid depth values, then we keep that image as training data for learning Euclidean depth. This criterion prefers images without large transient foreground objects (e.g., "no selfies"). At the same time, such foreground-heavy images are extremely useful for another purpose: automatically generating training data for learning ordinal depth relationships.

Automatic ordinal depth labeling. As noted above, transient or difficult-to-reconstruct objects, such as people, cars, and street signs, are often missing from MVS reconstructions. Therefore, using Internet-derived data alone, we will lack ground truth depth for such objects, and will likely do a poor job of learning to reconstruct them. To address this issue, we propose a novel method of automatically extracting ordinal depth labels from our training images based on their estimated 3D geometry and semantic segmentation. Let us denote as O ("Ordinal") the subset of photos that do not satisfy the "no selfies" criterion described above. For each image I ∈ O, we compute two regions, F_ord ⊆ F (based on semantic information) and B_ord ⊆ B (based on 3D geometry information), such that all pixels in F_ord are likely closer to the camera than all pixels in B_ord. Briefly, F_ord consists of large connected components of F, and B_ord consists of large components of B that also contain valid depths in the last quartile of the full depth range for I (see the Appendix for full details). We found this simple approach works very well (> 95% accuracy in pairwise ordinal relationships), likely because natural photos tend to be composed in certain common ways. Several examples of our automatic ordinal depth labels are shown in Figure 2.2.

Figure 2.2: Examples of automatic ordinal labeling. Blue mask: foreground (F_ord) derived from semantic segmentation. Red mask: background (B_ord) derived from reconstructed depth.
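Sketched concretely (with hypothetical helper masks following the notation above, and the thresholds stated in the text), the per-image categorization and an ordinal pair sample might look like:

```python
import numpy as np

def categorize_image(depth: np.ndarray, sky: np.ndarray,
                     euclidean_thresh: float = 0.3) -> str:
    """Label an image as Euclidean training data if >= 30% of its
    non-sky pixels have valid (nonzero) MVS depth; otherwise ordinal."""
    frac_valid = (depth[~sky] > 0).mean()
    return "euclidean" if frac_valid >= euclidean_thresh else "ordinal"

def sample_ordinal_pair(f_ord: np.ndarray, b_ord: np.ndarray,
                        rng: np.random.Generator):
    """Pick one pixel from F_ord and one from B_ord; by construction
    the F_ord pixel is labeled as likely closer than the B_ord pixel."""
    fg = np.argwhere(f_ord)
    bg = np.argwhere(b_ord)
    i = tuple(fg[rng.integers(len(fg))])
    j = tuple(bg[rng.integers(len(bg))])
    return i, j  # ordinal label: depth(i) < depth(j)
```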
2.3.4 Creating a dataset

We use the approach above to densely reconstruct 200 3D models from landmarks around the world, representing about 150K reconstructed images. After our proposed filtering, we are left with 130K valid images. Of these 130K photos, around 100K images are used for Euclidean depth data, and the remaining 30K images are used to derive ordinal depth data. We also include images from [180] in our training set. Together, this data comprises the MegaDepth (MD) dataset, available at http://www.cs.cornell.edu/projects/megadepth/.

2.4 Depth estimation network

This section presents our end-to-end deep learning algorithm for predicting depth from a single photo.

2.4.1 Network architecture

We evaluated three networks used in prior work on single-view depth prediction: VGG [84], the "hourglass" network [63], and a ResNet architecture [191]. Of these, the hourglass network performed best, as described in Section 2.5.

2.4.2 Loss function

The 3D data produced by SfM+MVS is only up to an unknown scale factor, so we cannot compare predicted and ground truth depths directly. However, as noted by Eigen and Fergus [85], the ratios of pairs of depths are preserved under scaling (or, in the log-depth domain, the difference between pairs of log-depths). Therefore, we solve for a depth map in the log domain and train using a scale-invariant loss function L_si, which combines three terms:

\[ \mathcal{L}_{si} = \mathcal{L}_{\text{data}} + \alpha \mathcal{L}_{\text{grad}} + \beta \mathcal{L}_{\text{ord}}. \tag{2.1} \]

Scale-invariant data term. We adopt the loss of Eigen and Fergus [85], which computes the mean square error (MSE) of the difference between all pairs of log-depths in linear time. Suppose we have a predicted log-depth map L and a ground truth log-depth map L*, where L_i and L*_i denote corresponding individual log-depth values indexed by pixel position i. Writing R_i = L_i − L*_i, we define:

\[ \mathcal{L}_{\text{data}} = \frac{1}{n} \sum_{i=1}^{n} R_i^2 \;-\; \frac{1}{n^2} \Big( \sum_{i=1}^{n} R_i \Big)^{2} \tag{2.2} \]

where n is the number of valid depths in the ground truth depth map.

Multi-scale scale-invariant gradient matching term. To encourage smoother gradient changes and sharper depth discontinuities in the predicted depth map, we introduce a multi-scale scale-invariant gradient matching term L_grad, defined as an ℓ1 penalty on differences in log-depth gradients between the predicted and ground truth depth maps:

\[ \mathcal{L}_{\text{grad}} = \frac{1}{n} \sum_{k} \sum_{i} \left( \left| \nabla_x R_i^{k} \right| + \left| \nabla_y R_i^{k} \right| \right) \tag{2.3} \]

where R_i^k is the value of the log-depth difference map at position i and scale k. Because the loss is computed at multiple scales, L_grad captures depth gradients across large image distances. In our experiments, we use four scales. We illustrate the effect of L_grad in Figure 2.3.

Figure 2.3: Effect of the L_grad term. L_grad encourages predictions to match the ground truth depth gradients.

Ordinal depth loss. Inspired by Chen et al. [63], our ordinal depth loss term L_ord utilizes the automatic ordinal relations described in Section 2.3.3. During training, for each image in our ordinal set O, we pick a single pair of pixels (i, j), drawn from the foreground region F_ord and the background region B_ord. L_ord is designed to be robust to the small number of incorrectly ordered pairs:

\[ \mathcal{L}_{\text{ord}} = \begin{cases} \log\left(1 + \exp(P_{ij})\right) & \text{if } P_{ij} \le \tau \\ \log\left(1 + \exp\left(\sqrt{P_{ij}}\right)\right) + c & \text{if } P_{ij} > \tau \end{cases} \tag{2.4} \]

where P_ij = −r*_ij (L_i − L_j), and r*_ij is the automatically labeled ordinal depth relation between i and j (r*_ij = 1 if pixel i is further than j, and −1 otherwise). c is a constant set so that L_ord is continuous. L_ord encourages the depth difference of a pair of points to be large (and correctly ordered) if our automatic labeling method judged the pair to have a likely depth ordering. We illustrate the effect of L_ord in Figure 2.4. In our tests, we set τ = 0.25 based on cross-validation.

Figure 2.4: Effect of the L_ord term. L_ord tends to correct ordinal depth relations for hard-to-reconstruct objects, such as the person in the first row and the tree in the second row.
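A minimal PyTorch sketch of the data term (Eq. 2.2) in its linear-time form, assuming log-depth tensors and a boolean validity mask; this illustrates the formula rather than reproducing the exact training code:

```python
import torch

def scale_invariant_data_loss(log_pred: torch.Tensor,
                              log_gt: torch.Tensor,
                              valid: torch.Tensor) -> torch.Tensor:
    """Eq. 2.2: mean of squared log-depth residuals minus the squared
    mean residual, computed only over pixels with valid ground truth."""
    r = (log_pred - log_gt)[valid]   # R_i = L_i - L*_i over valid pixels
    n = r.numel()
    return (r ** 2).sum() / n - (r.sum() ** 2) / (n ** 2)
```

The gradient term (Eq. 2.3) applies the same residual R at several pyramid scales and penalizes |∇x R| + |∇y R| at each one.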
2.5 Evaluation

In this section, we evaluate our networks on a number of datasets and compare to several state-of-the-art depth prediction algorithms trained on a variety of training data. In our evaluation, we seek to answer several questions, including:

• How well does our model trained on MD generalize to new Internet photos from never-before-seen locations?
• How important is our depth map processing? What is the effect of the terms in our loss function?
• How well does our model trained on MD generalize to other types of images from other datasets?

The third question is perhaps the most interesting, because the promise of training on large amounts of diverse data is good generalization. Therefore, we run a set of experiments training on one dataset and testing on another, and show that our MD dataset gives the best generalization performance. We also show that our depth refinement strategies are essential for achieving good generalization, and that our proposed loss function—combining scale-invariant data terms with an ordinal depth loss—improves prediction performance both quantitatively and qualitatively.

Experimental setup. Out of the 200 reconstructed models in our MD dataset, we randomly select 46 to form a test set (locations not seen during training). For the remaining 154 models, we randomly split the images from each model into training and validation sets with a ratio of 96% to 4%. We set α = 0.5 and β = 0.1 using the MD validation set. We implement our networks in PyTorch [272], and train using Adam [177] for 20 epochs with batch size 32. For fair comparison, we train and validate our network using MD data for all experiments. Due to variance in the performance of cross-dataset testing, we train four models on MD and compute the average error (see the Appendix for the performance of each individual model).

2.5.1 Evaluation and ablation study on MD test set

In this subsection, we describe experiments where we train on our MD training set and test on the MD test set.

Error metrics. For numerical evaluation, we use two scale-invariant error measures (as with our loss function, we use scale-invariant measures due to the scale-free nature of SfM models). The first measure is the scale-invariant RMSE (si-RMSE) (Equation 2.2), which measures precise numerical depth accuracy. The second measure is based on the preservation of depth ordering. In particular, we use a measure similar to [428, 63] that we call the SfM Disagreement Rate (SDR). SDR is based on the rate of disagreement with ordinal depth relationships derived from estimated SfM points. We use sparse SfM points rather than dense MVS because we found that sparse SfM points capture some structures not reconstructed by MVS (e.g., complex objects such as lampposts). We define SDR(D, D*), the ordinal disagreement rate between the predicted (non-log) depth map D = exp(L) and ground-truth SfM depths D*, as:

\[ \text{SDR}(D, D^*) = \frac{1}{n} \sum_{i,j \in \mathcal{P}} \mathbb{1}\left( \text{ord}(D_i, D_j) \neq \text{ord}(D_i^*, D_j^*) \right) \tag{2.5} \]

where P is the set of pairs of pixels with available SfM depths to compare, n is the total number of pairwise comparisons, and ord(·, ·) is one of three depth relations (further-than, closer-than, and same-depth-as):

\[ \text{ord}(D_i, D_j) = \begin{cases} 1 & \text{if } D_i / D_j > 1 + \delta \\ -1 & \text{if } D_i / D_j < 1 - \delta \\ 0 & \text{if } 1 - \delta \le D_i / D_j \le 1 + \delta \end{cases} \tag{2.6} \]

We also define SDR= and SDR≠ as the disagreement rates over pairs with ord(D*_i, D*_j) = 0 and ord(D*_i, D*_j) ≠ 0, respectively. In our experiments, we set δ = 0.1 for tolerance to uncertainty in SfM points. For efficiency, we sample SfM points from the full set to compute this error term.
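A direct NumPy transcription of Eqs. 2.5–2.6 over a sampled set of SfM point pairs (a sketch; `pred` and `gt` are flattened depth arrays, and pairs are sampled rather than enumerated, as in the text):

```python
import numpy as np

def ord_rel(d_i: np.ndarray, d_j: np.ndarray, delta: float = 0.1) -> np.ndarray:
    """Eq. 2.6: +1 further-than, -1 closer-than, 0 same-depth-as."""
    ratio = d_i / d_j
    return np.where(ratio > 1 + delta, 1, np.where(ratio < 1 - delta, -1, 0))

def sdr(pred: np.ndarray, gt: np.ndarray, pairs: np.ndarray,
        delta: float = 0.1) -> float:
    """Eq. 2.5: fraction of sampled pairs whose predicted ordinal relation
    disagrees with the SfM-derived relation. `pairs` holds (i, j) index
    pairs into the flattened depth arrays."""
    i, j = pairs[:, 0], pairs[:, 1]
    disagree = ord_rel(pred[i], pred[j], delta) != ord_rel(gt[i], gt[j], delta)
    return float(disagree.mean())
```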
Effect of network and loss variants. We evaluate three popular network architectures for depth prediction on our MD test set: the VGG network used by Eigen et al. [84], an "hourglass" (HG) network [63], and ResNet [191]. To compare our loss function to that of Eigen et al. [84], we also test the same network and loss function as [84] trained on MD; [84] uses a VGG network with a scale-invariant loss plus a single-scale gradient matching term. Quantitative results are shown in Table 2.1 and qualitative comparisons in Figure 2.5.

Network         si-RMSE   SDR=%   SDR≠%   SDR%
VGG* [84]       0.116     31.28   28.63   29.78
VGG (full)      0.114     29.34   26.91   27.53
ResNet (full)   0.112     26.25   24.23   25.14
HG (full)       0.103     28.00   23.74   25.59

Table 2.1: Results on the MD test set (places unseen during training) for several network architectures. For VGG* we use the same loss and network architecture as in [84] for comparison to [84]. Lower is better.

We also evaluate variants of our method trained using only some of our loss terms: (1) a version with only the scale-invariant data term L_data (the same loss as in [85]), (2) a version that adds our multi-scale gradient matching loss L_grad, and (3) the full version including L_grad and the ordinal depth loss L_ord. Results are shown in Table 2.2.

Method             si-RMSE   SDR=%   SDR≠%   SDR%
L_data only        0.146     32.32   29.96   30.08
+ L_grad           0.111     25.17   27.32   26.11
+ L_grad + L_ord   0.103     28.00   23.74   25.59

Table 2.2: Results on the MD test set (places unseen during training) for different loss configurations. Lower is better.

As shown in Tables 2.1 and 2.2, the HG architecture achieves the best performance of the three architectures, and training with our full loss yields better performance than the other loss variants, including that of [84] (first row of Table 2.1). Note that adding L_ord significantly improves SDR≠, while slightly increasing SDR=. Figure 2.5 shows that our joint loss helps preserve the structure of the depth map and capture nearby objects such as people and buses.

Figure 2.5: Depth predictions on the MD test set. (Blue=near, red=far.) For visualization, we mask out the detected sky region. (a) Input photo. (b) Ground truth COLMAP depth map (GT). (c) VGG* prediction using the loss and network of [84]. (d) Depth prediction from a ResNet [191]. (e) Depth prediction from an hourglass (HG) network [63].

Finally, we experiment with training our network on MD with and without our proposed depth refinement methods, testing on three datasets: KITTI, Make3D, and DIW. The results, shown in Table 2.3, show that networks trained on raw MVS depth do not generalize well; our proposed refinements significantly boost prediction performance.

Test set   Error measure   Raw MD   Clean MD
Make3D     RMS             11.41    5.322
           Abs Rel         0.614    0.364
           log10           0.386    0.152
KITTI      RMS             12.15    6.621
           RMS(log)        0.582    0.369
           Abs Rel         0.433    0.307
           Sq Rel          3.927    2.546
DIW        WHDR%           31.32    24.55

Table 2.3: Results on three different test sets with and without our depth refinement methods. Raw MD indicates raw depth data; Clean MD indicates depth data produced using our refinement methods. Lower is better for all error measures.

2.5.2 Generalization to other datasets

A powerful application of our 3D-reconstruction-derived training data is generalization to outdoor images beyond landmark photos. To evaluate this capability, we train our model on MD and test on three standard benchmarks: Make3D [301], KITTI [104], and DIW [63]—without seeing training data from these datasets. Since our depth prediction is defined only up to a scale factor, for each dataset we align each prediction with the ground truth using a single scaling factor computed from the ratio between ground truth and predicted depths, as sketched below.
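A minimal sketch of this alignment step. The exact choice of ratio statistic is not specified above; the median ratio over valid pixels is assumed here, with a least-squares fit in log space being a similar alternative:

```python
import numpy as np

def align_scale(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Scale a prediction to the ground truth with a single factor,
    here the median ratio of ground-truth to predicted depth."""
    valid = gt > 0
    s = np.median(gt[valid] / pred[valid])
    return s * pred
```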
Make3D. To test on Make3D, we follow the protocol of prior work [214, 191], resizing all images to 345 × 460 and removing ground truth depths larger than 70m (since Make3D data is unreliable at large distances). We train our network only on MD using our full loss. Table 2.4 shows numerical results, including comparisons to several methods trained on both Make3D and non-Make3D data, and Figure 2.6 visualizes depth predictions from our model and several other non-Make3D-trained models. Our network trained on MD has the best performance among all non-Make3D-trained models, and even outperforms several models trained directly on Make3D. Finally, the last row of Table 2.4 shows that our model fine-tuned on Make3D achieves better performance than the state of the art.

Training set   Method                RMS     Abs Rel   log10
Make3D         Karsch et al. [170]   9.2     0.355     0.127
               Liu et al. [216]      9.49    0.335     0.137
               Liu et al. [213]      8.6     0.314     0.119
               Li et al. [200]       7.19    0.278     0.092
               Laina et al. [191]    4.45    0.176     0.072
               Xu et al. [386]       4.38    0.184     0.065
NYU            Eigen et al. [84]     6.96    0.427     0.180
               Liu et al. [213]      7.96    0.438     0.186
               Laina et al. [191]    7.99    0.466     0.195
KITTI          Zhou et al. [415]     10.47   0.383     0.478
               Godard et al. [110]   11.76   0.544     0.193
DIW            Chen et al. [63]      5.59    0.424     0.176
MD             Ours                  5.32    0.364     0.152
MD+Make3D      Ours                  4.26    0.176     0.069

Table 2.4: Results on Make3D for various training datasets and methods. The first column indicates the training dataset. Errors for "Ours" are averaged over four models trained/validated on MD. Lower is better for all metrics.

Figure 2.6: Depth predictions on Make3D. The last four columns show results from the best models trained on non-Make3D datasets (the final column is our result).

KITTI. Next, we evaluate our model on the KITTI test set based on the split of [85]. As with our Make3D experiments, we do not use images from KITTI during training. The KITTI dataset is very different from ours, consisting of driving sequences that include objects, such as sidewalks, cars, and people, that are difficult to reconstruct with SfM/MVS. Nevertheless, as shown in Table 2.5, our MD-trained network still outperforms approaches trained on non-KITTI datasets. In particular, our performance is similar to that of Zhou et al. [415] trained on the Cityscapes (CS) dataset, which, like KITTI, consists of driving image sequences; in contrast, our MD dataset contains much more diverse scenes. Finally, the last row of Table 2.5 shows that we can achieve state-of-the-art performance by fine-tuning our network on KITTI training data. Figure 2.7 shows visual comparisons between our results and models trained on other non-KITTI datasets. We achieve much better visual quality than the other non-KITTI-trained models, and our predictions reasonably capture nearby objects such as traffic signs, cars, and trees, thanks to our ordinal depth loss.

Figure 2.7: Depth predictions on KITTI. (Blue=near, red=far.) None of the models were trained on KITTI data. From left to right: (a) input image, (b) ground truth (GT), (c) model trained on DIW [63], (d) model trained on Make3D [191], (e) ours trained on MD.

Training set   Method                RMS     RMS(log)   Abs Rel   Sq Rel
KITTI          Liu et al. [214]      6.52    0.275      0.202     1.614
               Eigen et al. [85]     6.31    0.282      0.203     1.548
               Zhou et al. [415]     6.86    0.283      0.208     1.768
               Godard et al. [110]   5.93    0.247      0.148     1.334
Make3D         Laina et al. [191]    8.50    0.397      0.311     3.201
               Liu et al. [213]      11.88   0.416      0.365     7.591
NYU            Eigen et al. [84]     10.47   0.492      0.367     3.716
               Liu et al. [213]      10.19   0.446      0.321     3.118
               Laina et al. [191]    10.58   0.508      0.390     3.939
CS             Zhou et al. [415]     7.58    0.334      0.267     2.686
DIW            Chen et al. [63]      7.11    0.471      0.409     3.270
MD             Ours                  6.62    0.369      0.307     2.546
MD+KITTI       Ours                  5.90    0.241      0.141     1.328

Table 2.5: Results on the KITTI test set for various training datasets and approaches.
Columns are as in Table 2.4.

DIW. Finally, we test our network on the DIW dataset [63]. DIW consists of Internet photos with general scene structures; each image in DIW has a single pair of points with a human-labeled ordinal depth relationship. As with Make3D and KITTI, we do not use DIW data during training. For DIW, quality is computed via the Weighted Human Disagreement Rate (WHDR), which measures the frequency of disagreement between predicted depth maps and human annotations on a test set. Numerical results are shown in Table 2.6. Our MD-trained network again has the best performance among all non-DIW-trained models.

Training set   Method                WHDR%
DIW            Chen et al. [63]      22.14
KITTI          Zhou et al. [415]     31.24
               Godard et al. [110]   30.52
NYU            Eigen et al. [84]     25.70
               Laina et al. [191]    45.30
               Liu et al. [213]      28.27
Make3D         Laina et al. [191]    31.65
               Liu et al. [213]      29.58
MD             Ours                  24.55

Table 2.6: Results on the DIW test set for various training datasets and approaches. Columns are as in Table 2.4.

Figure 2.8 visualizes our predictions and those of other non-DIW-trained networks on DIW test images. Our predictions achieve visually better depth relationships, and our method works reasonably well even for challenging scenes such as offices and close-ups.

Figure 2.8: Depth predictions on the DIW test set. (Blue=near, red=far.) Column layout follows Figure 2.7. None of the models were trained on DIW data.

2.6 Discussion

In this chapter, we presented a new use for Internet-derived SfM+MVS data: generating large amounts of training data for single-view depth prediction. We demonstrated that this data can be used to predict state-of-the-art depth maps for locations never observed during training, and that the resulting models generalize very well to other datasets. However, our method also has a number of limitations. MVS methods still do not perfectly reconstruct even static scenes, particularly when there are oblique surfaces (e.g., ground), thin or complex objects (e.g., lampposts), and difficult materials (e.g., shiny glass). Our method does not predict metric depth; future work in SfM could use learning or semantic information to correctly scale scenes. Our dataset is currently biased towards outdoor landmarks, though by scaling to much larger input photo collections we will find more diverse scenes. Despite these limitations, our work points towards the Internet as an intriguing, useful source of data for geometric learning problems.

CHAPTER 3
LEARNING THE DEPTHS OF MOVING PEOPLE BY WATCHING FROZEN PEOPLE

3.1 Introduction

A hand-held camera capturing video of a dynamic scene is a common scenario. Recovering dense geometry in this case is a challenging task: moving objects violate the epipolar constraint commonly used in 3D vision (Figure 3.2), and are often treated as noise or outliers in existing structure-from-motion (SfM) and multi-view stereo (MVS) methods. Human depth perception, however, is not easily fooled by object motion—rather, we maintain a feasible interpretation of objects' geometry and depth ordering even when both the observer and the objects are moving, and even when the scene is observed with just one eye [140]. In this work, we take a step towards achieving this ability computationally. We focus on the task of predicting accurate, dense depth from ordinary videos where both the camera and people in the scene are naturally moving. We focus on humans for two reasons: i) in many application areas, such as augmented reality, humans constitute the salient objects in the scene, and ii) human motion is articulated and difficult to model.
By taking a data-driven approach, we avoid the need to explicitly impose assumptions on the shape or deformation of people, and instead learn these priors from data. Where do we get data to train such a method? Generating high-quality synthetic data where both the camera and the people in the scene are naturally moving is very challenging. One approach would be to record real scenes with an RGBD sensor (e.g., a Microsoft Kinect), but such data is typically limited to indoor environments and requires significant manual work to capture and process. In addition, if such a dataset is captured in the lab, a model trained on it may have difficulty generalizing to real scenes. It is also difficult to gather a diverse collection of people in diverse poses at scale.

Instead, we derive data from a surprising source: YouTube videos in which people imitate mannequins, i.e., freeze in elaborate, natural poses, while a hand-held camera tours the scene (Figure 3.3). These videos comprise our new MannequinChallenge (MC) dataset, which we have released for the research community [202]. Because the entire scene in such videos is stationary—including the people—we can accurately estimate camera poses and depth using modern SfM and MVS algorithms, and then use this derived 3D data as supervision for training a model to predict depth for moving scenes.

In particular, we design and train a deep neural network that takes an input RGB image, a mask indicating human regions, and an initial depth defined for the static environment (i.e., the non-human regions), and outputs a dense depth map over the entire image—both the environment and the people. Note that the initial environmental depth is computed using motion parallax between two video frames, providing the network with information not available from a single frame. Once trained, our model can handle natural videos with arbitrary camera and human motion. Figure 3.1 illustrates our approach.

Figure 3.1: Our model predicts dense depth when both an ordinary camera and people in the scene are freely moving (right). We train our model on our new MannequinChallenge dataset—a collection of Internet videos of people imitating mannequins, i.e., freezing in diverse, natural poses, while a camera tours the scene (left). Because people are stationary, geometric constraints hold; this allows us to use multi-view stereo to estimate depth, which serves as supervision during training. In all figures, we use inverse depth maps for visualization purposes, and refer to them as depth maps.

We demonstrate our method on a variety of real-world Internet videos shot with a hand-held camera and depicting complex human actions such as walking, running, and dancing. Our model predicts depth with higher accuracy than state-of-the-art monocular depth prediction and motion stereo methods. We further show how our predicted depth maps can be used to produce various 3D effects such as synthetic depth-of-field, depth-aware inpainting, and insertion of virtual objects into 3D scenes with correct occlusion.

In summary, our contributions are: i) a new source of data for depth prediction consisting of a large number of Internet videos in which the camera moves around people
“frozen” in natural poses, along with a methodology for generating accurate depth maps and camera poses; and ii) a deep-network-based model that makes use of motion parallax cues from video sequences, and that is designed and trained to predict dense depth maps in the challenging case of simultaneous camera motion and complex human motion.

3.2 Related Work

Learning-based depth prediction. Numerous algorithms, based on both supervised and unsupervised learning, have recently been proposed for predicting dense depth from a single RGB image [387, 191, 94, 85, 63, 207, 306, 109, 416, 398, 226, 362]. However, because these methods use a single RGB image, they ignore useful motion parallax cues present in video sequences. Some recent learning-based methods also consider multiple images for depth estimation, either assuming known camera poses [141, 395] or simultaneously predicting camera poses along with depth [354, 413]. However, these methods assume that the captured scenes are completely static; they are not designed to estimate depth for dynamic objects, which is the focus of our work.

Figure 3.2: Traditional stereo (static scene, stereo camera) vs. our setup (moving camera, moving people). Left: a person is observed at the same time instant from two different views, and the 3D position of points can be computed using triangulation. Right: when both the camera and the objects in the scene are moving, triangulation is no longer possible, since the epipolar constraint does not apply.

Depth estimation for dynamic scenes. Depth information captured from RGBD sensors or stereo cameras has been widely used for 3D modeling of dynamic scenes [255, 427, 396, 79, 150, 369, 159, 285, 22, 21]. However, only a few methods attempt to estimate depth from a monocular camera. Several methods have sought to reconstruct sparse geometry for dynamic scenes using either a single monocular camera [270, 411, 324] or multiple unsynchronized cameras [360]. Russell et al. [299] and Ranftl et al. [282] propose motion/object segmentation–based algorithms that decompose a dynamic scene into piecewise rigid parts before inferring depth ordering. However, these methods impose strong assumptions about object motion that can be violated by articulated human motion. More recently, Rematas et al. [284] predict depth for moving soccer players using synthetic training data from FIFA video games, but their method is limited to soccer players and cannot handle general people in the wild.

RGBD datasets for learning depth. There are a number of RGBD datasets of indoor scenes, captured using depth sensors [322, 56, 72, 382] or rendered from synthetic data [330]. However, none of these datasets provide depth supervision for moving people in natural environments. In particular, several action recognition methods use depth sensors to capture human actions [424, 319, 233, 256], but most of these use a static camera and provide only a limited number of indoor scenes. REFRESH [223] is a recent semi-synthetic scene flow dataset created by overlaying animated people on NYUv2 images. Here, too, the data is limited to interior scenes and consists of synthetic humans placed in unrealistic configurations with respect to their surroundings; the resulting trained models thus have limited ability to generalize to real scenarios.

Human shape and pose prediction. Recovery of a posed 3D human mesh from a single RGB image has attracted significant attention [193, 116, 167, 37, 273, 234].
Recent methods achieve impressive results on natural images spanning a variety of poses, and some can also model fine details such as hair and clothing [389, 120]. However, such approaches do not model geometric relations between the people and the static parts of the scene. Finally, many of these methods rely on correctly detecting human keypoints, requiring most of the body to be visible in each video frame.

3.3 The MannequinChallenge Dataset

The Mannequin Challenge [374] is a popular video trend in which people freeze in place—often in interesting poses—while the camera operator moves around the scene filming them. Thousands of such videos have been created and uploaded to YouTube since late 2016. These videos comprise our new MannequinChallenge (MC) Dataset [202], which spans a wide range of scenes, with people of different ages naturally posing in different group configurations (see Figure 3.3). To the extent that people succeed in staying still during the videos, we can assume the scenes are static and obtain accurate camera poses and depth information by processing them with SfM and MVS algorithms. However, recovering accurate geometry from such raw Internet videos is challenging, and requires careful filtering of noisy video clips and individual frames in each clip. After processing, we obtain around 2,000 candidate videos, from which we derive 4,690 sequences comprising a total of more than 170K valid image-depth pairs. We now describe in detail how we process the raw videos and derive our training data.

Figure 3.3: Sample images from Mannequin Challenge videos. Each image is a frame from a video sequence in which the camera is moving but the humans are all static. The videos span a variety of natural scenes, poses, and configurations of people.

Estimating camera poses. Following a similar approach to Zhou et al. [418], we use ORB-SLAM2 [251] to identify trackable sequences in each video and to estimate an initial camera pose for each frame. At this stage, we process a lower-resolution version of the video for efficiency, and set the field of view to 60 degrees (a typical value for modern cell-phone cameras). We then reprocess each sequence at a higher resolution using a visual SfM system [304], which refines the initial camera poses and intrinsic parameters. This method extracts and matches features across frames in the videos, then performs a global bundle adjustment optimization. Finally, sequences with non-smooth camera motion are removed using the technique of Zhou et al. [418], as we observe that such sequences often have erroneous camera poses.

Computing dense depth with MVS. Once the camera poses for each clip are estimated, we reconstruct each scene's dense geometry. In particular, we recover per-frame dense depth maps using COLMAP, a state-of-the-art MVS system [306]. Because our data consists of challenging Internet videos that exhibit camera motion blur, shadows, reflections, etc., the raw depth maps estimated by MVS are often too noisy for use in training a model. We address this issue with a careful depth cleaning procedure. We first filter outlier depths using the depth refinement method proposed by Li and Snavely [207]. We further remove erroneous depth values by considering the consistency between the MVS depth and the depth obtained from motion parallax between pairs of frames.
Specifically, for each frame, we compute a normalized error ∆(p) for every valid pixel p:

\[ \Delta(p) = \frac{\left| D_{\text{MVS}}(p) - D_{pp}(p) \right|}{D_{\text{MVS}}(p) + D_{pp}(p)} \tag{3.1} \]

where D_MVS is the depth map obtained by MVS and D_pp is the depth map computed from two-frame motion parallax (see Section 3.4.1). Depth values for which ∆(p) > δ are removed, where we empirically set δ = 0.2. Figure 3.4 shows examples of MVS depth maps before and after our proposed depth cleaning method; the regions circled in yellow illustrate that our method can effectively remove incorrect depth regions. Because these depth maps serve as supervision during training, this filtering has a significant impact on our model's performance, as shown in our experiments (Sec. 3.5.2). Figure 3.7 shows additional examples of our processed sequences with corresponding estimated MVS depths after cleaning.

Figure 3.4: Effect of depth cleaning. (a-b) Raw MVS depth maps, D_MVS, may contain errors and outliers, especially in untextured regions (see regions circled in yellow). (c) Our depth cleaning method effectively filters out such erroneous depth values.

Filtering clips. Several factors can make a video clip unsuitable for training. For example, people may "unfreeze" (start moving) at some point in the video, or the video may contain synthetic graphical elements in the background. Dynamic objects and synthetic backgrounds do not obey multi-view geometric constraints and hence are treated as outliers and filtered out by MVS, potentially leaving few valid pixels. Therefore, we remove frames where < 20% of pixels have valid MVS depth after our two-pass cleaning stage. Further, we remove frames where the estimated radial distortion coefficient |k1| > 0.1 (indicative of a fisheye camera), or where the estimated focal length is ≤ 0.6 or ≥ 1.2 (indicating that the camera parameters are likely inaccurate). We keep sequences that are at least 30 frames long, have an aspect ratio of 16:9, and have a width of ≥ 1600 pixels. Finally, we visually inspect the trajectories and point clouds of the remaining sequences and remove obviously incorrect reconstructions. Figure 3.5 shows examples of images filtered out of the raw Mannequin Challenge video clips by our data creation pipeline; these examples include images captured by fisheye cameras, as well as images with large regions of synthetic background or moving objects.

Figure 3.5: Sample frames from clips removed during filtering. (a) Videos captured with fisheye cameras; (b) videos with synthetic backgrounds; (c) sequences with truly moving objects (pairs of frames shown in each column).

After processing, we obtain 4,690 sequences with a total of more than 170K valid image-depth pairs. We split our MC dataset into training, validation, and testing sets with an 80:3:17 split over clips.
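A minimal NumPy sketch of the parallax-consistency test in Eq. 3.1, assuming both depth maps are arrays with zeros at invalid pixels:

```python
import numpy as np

def clean_mvs_depth(d_mvs: np.ndarray, d_pp: np.ndarray,
                    delta: float = 0.2) -> np.ndarray:
    """Remove MVS depths that disagree with two-frame parallax depth
    (Eq. 3.1); zeros mark invalid pixels in the inputs and output."""
    out = d_mvs.copy()
    valid = (d_mvs > 0) & (d_pp > 0)
    err = np.zeros_like(d_mvs)
    err[valid] = np.abs(d_mvs[valid] - d_pp[valid]) / (d_mvs[valid] + d_pp[valid])
    out[valid & (err > delta)] = 0.0
    return out
```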
3.4 Depth Prediction Model

We train our depth prediction model on our MannequinChallenge dataset in a supervised manner, i.e., by regressing to the depth generated by the SfM and MVS pipeline. A key question is how to structure the input to the network to allow training on frozen people but inference on moving people. One possible approach is to regress to depth from a single RGB image (RGB-to-depth), but this approach disregards geometric information about the static regions of the scene that is available by considering more than a single frame. To benefit from such information, we design a two-frame model that uses depth estimated from motion parallax for the static, non-human regions of the scene (Figure 3.6).

Figure 3.6: System overview. Our model takes as input the RGB frame, a human segmentation mask, masked depth from motion parallax (via optical flow and SfM pose), and an associated confidence map. We ask the network to use these inputs to predict depths that match the ground truth MVS depth.

The full input to our network (Figure 3.7) includes 1) a reference image Ir, 2) a binary mask M indicating human regions, 3) an initial depth map Dpp estimated from motion parallax, with human regions removed, 4) a confidence map C, and 5) an optional human keypoint map K. We assume known, accurate camera poses from SfM during both training and inference; in an online inference-time setting, accurate camera poses can also be obtained using visual-inertial odometry. Given these inputs, the network predicts a full depth map for the entire scene. To match the MVS depth values, the network must inpaint the depth in human regions, refine the depth in non-human regions from the estimated Dpp, and finally make the depth of the entire scene consistent.

Figure 3.7: System inputs and training data. The input to our network consists of: (a) an RGB image, (b) a human mask, (c) a masked depth map computed from motion parallax w.r.t. a selected source image, and (d) a masked confidence map. Low-confidence regions (dark circles) in the first two rows indicate the vicinity of the camera epipole, where depth from parallax is unreliable and removed. The network is trained to regress to MVS depth (e).

Our network architecture is a variant of the hourglass network proposed by Chen et al. [63]. Specifically, the network has a standard encoder-decoder U-Net structure, with matching input and output resolution, consisting of approximately 5M parameters. In addition, an Inception module variant [344] is used in each convolutional layer of the network. We replace nearest-neighbor upsampling layers with bilinear upsampling layers, which we found to produce sharper depth maps while slightly improving overall accuracy. We refer readers to the Appendix and to Chen et al. [63] for full details of our network architecture.
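As a minimal PyTorch sketch (names are illustrative, not the released code), the network can consume these inputs as a single channel-stacked tensor:

```python
import torch

def build_network_input(rgb: torch.Tensor,          # (3, H, W) reference image I_r
                        human_mask: torch.Tensor,   # (1, H, W) binary mask M
                        log_d_pp: torch.Tensor,     # (1, H, W) log parallax depth D_pp
                        confidence: torch.Tensor,   # (1, H, W) confidence map C
                        keypoints: torch.Tensor = None  # (1, H, W) optional map K
                        ) -> torch.Tensor:
    """Concatenate the model inputs along the channel dimension; the
    depth and confidence channels are zeroed inside human regions."""
    env = 1.0 - human_mask
    channels = [rgb, human_mask, log_d_pp * env, confidence * env]
    if keypoints is not None:
        channels.append(keypoints)
    return torch.cat(channels, dim=0)  # (6, H, W), or (7, H, W) with keypoints
```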
The following sections describe our model inputs and training losses in detail.

3.4.1 Depth from motion parallax

Motion parallax between two video frames provides an initial depth estimate for the static regions of the scene. We assume humans are dynamic while the rest of the scene is static. Specifically, for each reference frame Ir, we select another frame Is in the video and estimate an optical flow field from Ir to Is using FlowNet2.0 [146]. Given the estimated flow field and the relative camera poses between the two views, we then compute an initial depth map using the Plane-Plus-Parallax (P+P) representation [152, 379]. Note that P+P is typically used to estimate the relative structure of a scene with respect to a reference plane, either a plane in the scene or a virtual reference plane. In our case, we use it as a means to cancel out relative camera rotation, as described below.

Formally, suppose we have a relative camera pose relating Is and Ir, consisting of a 3D rotation R ∈ SO(3) and a 3D translation t ∈ R³, with shared intrinsics matrix K. Given an arbitrary planar surface Π, the geometric relation between a 2D image point p ∈ Ir and its corresponding point p′ ∈ Is (expressed in homogeneous coordinates) can be represented as a combination of a planar component and a residual parallax component:

\[ p = p_w + \mu, \tag{3.2} \]

where p_w is the 2D image point in Ir that results from warping p′ ∈ Is by a homography A, which aligns the plane Π between the two views, and µ is the remaining 2D parallax motion. We refer readers to the Appendix for a detailed definition of p_w and µ. One can show that when setting the reference plane Π to the plane at infinity, the expression in Eq. 3.2 can be written as:

\[ p = p_w + \frac{t_z\, p_w - K t}{D_{pp}(p)}, \tag{3.3} \]

where D_pp(p) is the depth value at p in the coordinate system of the reference view Ir, and t_z is the third component of the translation vector t. In addition, the homography A in this case is computed as A = K R K⁻¹. From Eq. 3.3, we can estimate the depth D_pp(p) as:

\[ D_{pp}(p) = \frac{\left\lVert t_z\, p_w - K t \right\rVert_2}{\left\lVert p - p_w \right\rVert_2}. \tag{3.4} \]

We found this computation to be more efficient and robust for dense depth estimation than standard triangulation methods, which are usually applied to sparse correspondences. See the Appendix for a detailed derivation of Eq. 3.4. In some cases, such as forward/backward relative camera motion, ‖p − p_w‖₂ will be close to zero in some image regions (i.e., near the camera epipole), resulting in ill-defined depth values. We detect and remove these image regions as described in Sec. 3.4.2.

Keyframe selection. Depth from motion parallax can be ill-posed if the 2D displacement between two views is small or well-approximated by a homography (e.g., in the case of pure camera rotation). To avoid such cases, we use a heuristic to select a reference frame Ir and a corresponding source keyframe Is. We want the two views to have significant overlap while having sufficient baseline (i.e., distance between camera centers). In particular, for each Ir, we find the index s of Is as:

\[ s = \arg\max_j \; d_{rj}\, o_{rj} \tag{3.5} \]

where d_rj is the L2 distance between the camera centers of Ir and a neighboring frame Ij. The term o_rj is the fraction of co-visible SfM features in Ir and Ij:

\[ o_{rj} = \frac{2\, \lvert V^r \cap V^j \rvert}{\lvert V^r \rvert + \lvert V^j \rvert}, \tag{3.6} \]

where V^j is the set of features visible in Ij. We discard pairs of frames for which o_rj < τ_o, i.e., the fraction of co-visible features should be larger than a threshold τ_o (we set τ_o = 0.6), and limit the maximum frame interval to 10. We found these view selection criteria to work well in our experiments.
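A minimal per-pixel sketch of Eqs. 3.3–3.4, assuming NumPy inputs in homogeneous pixel coordinates (a dense implementation would vectorize this over the whole flow field):

```python
import numpy as np

def depth_from_parallax(p: np.ndarray,    # (3,) pixel in I_r, homogeneous [x, y, 1]
                        p_w: np.ndarray,  # (3,) p' warped by A = K R K^-1, normalized
                        K: np.ndarray,    # (3, 3) shared intrinsics
                        t: np.ndarray) -> float:  # (3,) relative translation
    """Eq. 3.4: depth of p in the reference view via plane-plus-parallax
    with the reference plane at infinity."""
    t_z = t[2]
    numerator = np.linalg.norm(t_z * p_w - K @ t)
    denominator = np.linalg.norm(p - p_w)   # parallax magnitude
    if denominator < 1e-6:                  # near the epipole: ill-defined
        return 0.0
    return numerator / denominator
```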
3.4.2 Depth confidence

Our data consists of challenging Internet video clips with camera motion blur, shadows, low lighting, and reflections. In such cases, optical flow is often noisy [380], leading to uncertainty in the input depth map Dpp. We thus estimate, and feed to the network, a confidence map C. This map allows the network to rely more on the input depth in high-confidence regions, and potentially to improve its prediction in low-confidence regions. The confidence value at each pixel p in the non-human regions is defined as:

\[ C(p) = C_{lr}(p)\, C_{ep}(p)\, C_{pa}(p), \tag{3.7} \]

where the individual terms are defined as follows.

Flow consistency. The term C_lr measures "left-right" consistency between the forward and backward flow fields. Specifically, we denote the forward flow from Ir to Is as f_fwd, and the backward flow from Is to Ir as f_bwd. C_lr is then defined as:

\[ C_{lr}(p) = \max\left(0,\; 1 - r(p)^2 / \bar{\sigma}^2\right) \tag{3.8} \]

where r(p) = ‖f_fwd(p) + f_bwd(p′)‖₂ is the forward-backward optical flow warping error, and σ̄ is a tolerance parameter. For perfectly consistent forward and backward flows C_lr = 1, while C_lr = 0 when the error is greater than σ̄ pixels (we set σ̄ = 1px in our experiments).

Geometric consistency. The term C_ep measures how well the flow field complies with the epipolar constraint between the views [125]. C_ep gives low confidence to pixels where the flow field and the epipolar constraint disagree:

\[ C_{ep}(p) = \max\left(0,\; 1 - \left(\gamma(p)/\bar{\gamma}\right)^2\right) \tag{3.9} \]

where γ̄ controls the epipolar distance tolerance (we set γ̄ = 2px in our experiments), and the geometric epipolar distance γ(p) is defined as:

\[ \gamma(p) = \frac{\left| p'^{\top} F p \right|}{\sqrt{(Fp)_x^2 + (Fp)_y^2}} \tag{3.10} \]

where F = K⁻ᵀ [t]× R K⁻¹ is the fundamental matrix relating the two views, and (Fp)_x and (Fp)_y are the first and second elements of Fp, respectively.

Parallax confidence. The term C_pa assigns low confidence to pixels for which the parallax between the views is small [306]:

\[ C_{pa}(p) = 1 - \left( \frac{\min(\bar{\beta}, \beta(p)) - \bar{\beta}}{\bar{\beta}} \right)^2 \tag{3.11} \]

where

\[ \beta(p) = \cos^{-1}\left( \frac{v(p) \cdot v(p')}{\lVert v(p) \rVert_2\, \lVert v(p') \rVert_2} \right) \tag{3.12} \]

is the angle between the camera rays meeting at pixel p, and v(p) = K⁻¹p and v(p′) = K⁻¹p′ are viewing direction vectors at p in Ir and p′ in Is, respectively. β̄ is the angle tolerance (we use β̄ = 1° in our experiments). Figure 3.7(d) shows examples of computed confidence maps. Note that human regions, as well as regions for which the confidence C(p) < 0.25, are masked out.

3.4.3 Keypoints

We optionally use human keypoints as an additional input to the network, providing the network with explicit information about the poses of the people featured. In particular, we apply the Mask-RCNN [131] human keypoint detection algorithm to each frame. This algorithm detects, for each person, a set of keypoints at salient points such as joint locations. We encode these detections as an image for use as a network input by simply setting the image pixel value at each keypoint location to the corresponding keypoint index (normalized to lie between 0 and 1), and the rest of the pixels to zero. Figure 3.8 shows examples of human keypoints predicted by Mask-RCNN. Adding keypoints as an input can boost depth prediction performance for people, as shown in Tables 3.1 and 3.2.

Figure 3.8: Examples of keypoint images. The top row shows examples of input images and the bottom row shows the corresponding detected human keypoint images, where different colors indicate different joints. We apply morphological dilation to the keypoint maps to make each keypoint location more visible.
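A compact NumPy sketch of these confidence terms for a single pixel (Eqs. 3.8–3.12; the thresholds are the ones stated above, and the fundamental matrix F is assumed precomputed):

```python
import numpy as np

def confidence(p, p_prime, f_fwd_p, f_bwd_p_prime, F, K_inv,
               sigma=1.0, gamma_bar=2.0, beta_bar_deg=1.0) -> float:
    """Per-pixel confidence C(p) = C_lr * C_ep * C_pa (Eq. 3.7).
    p, p_prime: homogeneous pixel coordinates in I_r and I_s."""
    # Flow consistency (Eq. 3.8): forward-backward warping error.
    r = np.linalg.norm(f_fwd_p + f_bwd_p_prime)
    c_lr = max(0.0, 1.0 - (r / sigma) ** 2)

    # Geometric consistency (Eqs. 3.9-3.10): epipolar distance of the match.
    Fp = F @ p
    gamma = abs(p_prime @ Fp) / np.hypot(Fp[0], Fp[1])
    c_ep = max(0.0, 1.0 - (gamma / gamma_bar) ** 2)

    # Parallax confidence (Eqs. 3.11-3.12): angle between the camera rays.
    v, v_p = K_inv @ p, K_inv @ p_prime
    cos_b = v @ v_p / (np.linalg.norm(v) * np.linalg.norm(v_p))
    beta = np.arccos(np.clip(cos_b, -1.0, 1.0))
    beta_bar = np.deg2rad(beta_bar_deg)
    c_pa = 1.0 - ((min(beta_bar, beta) - beta_bar) / beta_bar) ** 2

    return c_lr * c_ep * c_pa
```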
3.4.4 Losses

We train our network to regress to the depth maps computed by our proposed data pipeline. Because the estimated depth values from SfM and MVS have an arbitrary scale, we use a scale-invariant depth regression loss; that is, our loss is computed on log-space depth values. Our loss function consists primarily of three terms:

\[ \mathcal{L}_{si} = \mathcal{L}_{\text{MSE}} + \alpha_1 \mathcal{L}_{\text{grad}} + \alpha_2 \left( \mathcal{L}_{sm1} + \mathcal{L}_{sm2} \right). \tag{3.13} \]

We compute our losses with respect to the reference image Ir; to simplify notation, we drop the superscript r in the loss equations.

Scale-invariant MSE. L_MSE denotes the scale-invariant mean square error (MSE) adopted from [85]. This loss term computes the squared, log-space difference in depth between two pixels in the prediction and the same two pixels in the ground truth, averaged over all pairs of valid pixels. That is, it penalizes differences in the depth ratio between any two pixels in the prediction and the ground truth. Further, this loss can be computed in time linear in the number of pixels, as derived in the Appendix:

\[ \mathcal{L}_{\text{MSE}} = \frac{1}{2N^2} \sum_{p \in I} \sum_{q \in I} \left( R(p) - R(q) \right)^2 \tag{3.14} \]
\[ = \frac{1}{N} \sum_{p \in I} R(p)^2 \;-\; \frac{1}{N^2} \Big( \sum_{p \in I} R(p) \Big)^2 \tag{3.15} \]

where R(p) = log D̂(p) − log D_gt(p), and D̂ and D_gt denote the predicted and ground truth depth, respectively.

Multi-scale gradient consistency term. To improve depth predictions, we use a multi-scale gradient consistency term that encourages smoother gradient changes and sharper depth discontinuities in the predicted depth images [207]:

\[ \mathcal{L}_{\text{grad}} = \sum_{s=0}^{S-1} \frac{1}{N_s} \sum_{p \in I_s} \left( \left| \nabla_x R_s(p) \right| + \left| \nabla_y R_s(p) \right| \right) \tag{3.16} \]

where the subscript s on R_s and I_s indicates that images are computed at scale s, and N_s denotes the number of valid pixels at scale s.

Multi-scale edge-aware smoothness terms. To encourage smooth interpolation of depth in texture-less regions where MVS fails to recover depth, we add smoothness terms at multiple scales based on first- and second-order image derivatives [362], with the smoothness weight at each pixel modulated by the corresponding image derivative magnitude:

\[ \mathcal{L}_{sm1} = \sum_{s=0}^{S-1} \frac{1}{N_s} \sum_{p \in I_s} \exp\left(-\left|\nabla I_s(p)\right|\right) \left| \nabla \log \hat{D}_s(p) \right| \tag{3.17} \]
\[ \mathcal{L}_{sm2} = \sum_{s=0}^{S-1} \frac{1}{N_s} \sum_{p \in I_s} \exp\left(-\left|\nabla^2 I_s(p)\right|\right) \left| \nabla^2 \log \hat{D}_s(p) \right| \tag{3.18} \]

For the L_grad, L_sm1, and L_sm2 terms, we create S = 5-scale image pyramids for both the predicted and ground truth depth images using nearest-neighbor down-sampling, since we find that, compared with bilinear interpolation, nearest-neighbor down-sampling leads to much sharper depth predictions.

Figure 3.9: Qualitative results on the MC test set. From top to bottom: reference images and their corresponding MVS depth (pseudo ground truth); our depth predictions using our single-view model (third row) and our two-frame model (fourth row). The additional network inputs give improved performance in both human and non-human regions.

3.5 Results

We test our method quantitatively and qualitatively and compare it with several state-of-the-art single-view and motion-based depth prediction algorithms. We show additional qualitative results on challenging Internet videos with complex human motion and natural camera motion, and demonstrate how our predicted depth maps can be used for several visual effects.

Implementation details. We use FlowNet2.0 [146] to estimate optical flow, since it handles large displacements well and preserves sharp motion discontinuities. We use Mask-RCNN [131] to generate human masks and human keypoints. The predicted masks sometimes have errors and miss small parts of people, so we apply a morphological dilation operation to the binary human masks to ensure that the masks are conservative and include all human regions. When keypoints are used, we normalize their values to lie between 0 and 1 before feeding them to the network. Our network predicts log depth at both the training and inference stages. During training, we randomly normalize the input log-depth before feeding it to the network by subtracting a value sampled from between the 40th and 60th percentiles of the valid input log Dpp; during inference, we normalize the input log-depth by subtracting the median of log Dpp. Additionally, during training, we randomly zero out the initial input depth and confidence (with probability 0.1) to address the situation where input depth is unavailable at inference time (e.g., the camera is nearly static, or the estimated optical flow is completely incorrect).
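A minimal sketch of this input-depth normalization and dropout augmentation (illustrative names; zeros mark invalid depths):

```python
import numpy as np

def normalize_input_log_depth(log_d_pp: np.ndarray, valid: np.ndarray,
                              training: bool, rng: np.random.Generator,
                              drop_prob: float = 0.1) -> np.ndarray:
    """Shift valid log-depths by a random 40th-60th percentile at training
    time (the median at inference), occasionally dropping the depth input."""
    if training and rng.random() < drop_prob:
        return np.zeros_like(log_d_pp)       # simulate missing parallax depth
    q = rng.uniform(40.0, 60.0) if training else 50.0
    shift = np.percentile(log_d_pp[valid], q)
    return np.where(valid, log_d_pp - shift, 0.0)
```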
When using human keypoints as input, we also use the depth from motion parallax Dpp with high confidence (Clr > 0, Cep > 0 and Cpa > 0.5) at these locations as ground truth if MVS depth DMVS is not available. In our experiments, we set the hyperparameters in our loss terms to α₁ = 0.5 and α₂ = 0.05 based on the validation set. We train our networks for 20 epochs from scratch using the Adam [177] optimizer with an initial learning rate of 0.0004, halving the learning rate every 8 epochs. During training, we downsample all images to a resolution of 532×299, use a mini-batch size of 16, and perform data augmentation through random flips and central crops, so that the input image resolution to the networks is 512×288.

Network inputs             si-full   si-env   si-hum   si-intra   si-inter
I.   I                     0.333     0.338    0.317    0.264      0.384
II.  IFCM                  0.330     0.349    0.312    0.260      0.381
III. IDppM                 0.255     0.229    0.264    0.243      0.285
IV.  IDppCM                0.232     0.188    0.237    0.221      0.268
V.   IDppCMK               0.227     0.189    0.230    0.212      0.263
Unmasked Dpp (oracle)      0.202     0.206    0.200    0.192      0.213

Table 3.1: Quantitative comparisons on the MC test set. Different input configurations of our model: (I) single image; (II) optical flow masked in the human region (F), confidence, and human mask; (III) masked input depth and human mask; (IV) additional confidence; in (V), we also input human keypoints. The last row indicates the error for the depth estimated from motion parallax between two frames in all image regions (human and non-human); this serves as an oracle and can only be measured if the entire scene is static. Lower is better for all metrics.

Error metrics. We measure error using the scale-invariant RMSE (si-RMSE), equivalent to √LMSE as described in Section 3.4.4. We evaluate si-RMSE on five different regions: 1) si-full measures the error between all pairs of pixels, giving the overall accuracy across the entire image; 2) si-env measures pairs of pixels in non-human regions E, providing the depth accuracy of the environment; and 3) si-hum measures pairs where at least one pixel lies in the human region H, providing depth accuracy for people. si-hum can further be divided into two error measures: 4) si-intra measures si-RMSE within H, i.e., human accuracy independent of the environment; and 5) si-inter measures si-RMSE between pixels in H and pixels in E, i.e., human accuracy w.r.t. the environment. We include derivations in the Appendix (a short sketch follows Table 3.2).

Methods                Dataset    two-view?  si-full  si-env  si-hum  si-intra  si-inter  RMSE   Rel
Russell et al. [299]   -          Yes        2.146    2.021   2.207   2.206     2.093     2.520  0.772
DeMoN [354]            RGBD+MVS   Yes        0.338    0.302   0.360   0.293     0.384     0.866  0.220
Chen et al. [63]       NYU+DIW    No         0.441    0.398   0.458   0.408     0.470     1.004  0.262
Laina et al. [191]     NYU        No         0.358    0.356   0.349   0.270     0.377     0.947  0.223
Xu et al. [387]        NYU        No         0.427    0.419   0.411   0.302     0.451     1.085  0.274
Fu et al. [94]         NYU        No         0.351    0.357   0.334   0.257     0.360     0.925  0.194
I                      MC         No         0.318    0.334   0.294   0.227     0.319     0.840  0.204
IFCM                   MC         Yes        0.316    0.330   0.302   0.228     0.323     0.843  0.206
IDppM                  MC         Yes        0.246    0.225   0.260   0.233     0.273     0.635  0.136
IDppCM (raw depth)     MC         Yes        0.272    0.238   0.293   0.258     0.282     0.688  0.147
IDppCM                 MC         Yes        0.232    0.203   0.252   0.224     0.262     0.570  0.129
IDppCMK                MC         Yes        0.221    0.195   0.238   0.215     0.247     0.541  0.125

Table 3.2: Results on the TUM RGBD dataset. Different si-RMSE metrics as well as the standard RMSE and relative error (Rel) are reported. We evaluate our models (the bottom six rows) under different input configurations, as described in Table 3.1. "Raw depth" indicates the model trained using raw MVS depth predictions as supervision, without our depth cleaning method. A dataset entry of '-' indicates that the method is not learning-based. Lower is better for all error metrics.
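The region-based si-RMSE metrics admit the same kind of closed-form expansion as Eq. 3.15. The sketch below evaluates them from the log-space residual R = log D̂ − log Dgt over valid pixels; the closed form for si-inter follows from expanding the all-pairs sum between the two sets, though the exact normalization used in the thesis is derived in the Appendix, so treat this as an assumed variant.

```python
import numpy as np

def si_rmse(R, idx):
    # All-pairs si-RMSE within one pixel set (sqrt of the Eq. 3.15 form);
    # used for si-full, si-env (E), and si-intra (H).
    r = R[idx]
    return np.sqrt(np.mean(r**2) - np.mean(r)**2)

def si_inter(R, idx_h, idx_e):
    # Average over all cross pairs (p in H, q in E) of (R(p) - R(q))^2
    # expands to the closed form below, again computable in linear time.
    rh, re = R[idx_h], R[idx_e]
    return np.sqrt(np.mean(rh**2) + np.mean(re**2)
                   - 2.0 * np.mean(rh) * np.mean(re))
```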
3.5.1 Evaluation on the MC test set

We evaluated our method on our MC test set, which consists of more than 29K images taken from 756 video clips. Processed MVS depth values DMVS obtained by our pipeline (see Section 3.3) are treated as ground truth. To quantify the importance of each component of the model's input, we compare the performance of several models, each trained on our MC dataset with a different input configuration. The two main configurations are: (i) a single-view model (the input is an RGB image) and (ii) our full two-frame model, where the input includes a reference image, an initial masked depth map Dpp, a confidence map C, and a human mask M. We also perform ablation studies by replacing the input depth with optical flow F, removing C from the input, and adding the human keypoint map K.

Quantitative evaluations are shown in Table 3.1. Comparing rows (I), (III) and (IV), it is clear that adding the initial depth of the environment as well as the confidence map significantly improves performance in both human and non-human regions. Adding human keypoint locations to the network input further improves performance. Note that if we input an optical flow field to the network instead of depth (II), the performance is only on par with the single-view method. The mapping from 2D optical flow to depth depends on the relative camera poses, which are not provided to the network; this result indicates that the network is unable to implicitly learn the relative poses and extract depth information.

Finally, we report the errors for full (unmasked) depth maps computed from motion parallax between two frames (last row of Table 3.1). Note that these depth maps can only be computed if the entire scene, including people, is static (thus, this baseline serves as an oracle and cannot be used at test time). As can be seen from the second column (si-env), our model yields a 20% improvement over this baseline in non-human regions, which suggests that our model refines the initial input depth (Dpp) rather than just copying it. In human regions, where our model has no input depth information, our performance is only 15% below that of depth from motion parallax (si-hum).

Figure 3.9 shows qualitative comparisons between our single-view model (I) and our full model (IDppCMK). Our full model's results are more accurate in both human regions (first column) and non-human regions (second column). In addition, the depth relationships between people and their surroundings are improved in all examples.

3.5.2 Evaluation on the TUM RGBD dataset

We also evaluate on a subset of the TUM RGBD dataset [339], which contains indoor scenes featuring people performing complex actions, captured from different camera poses. Sample images from this dataset are shown in Figure 3.10(a-b).

Figure 3.10: Qualitative comparisons on the TUM RGBD dataset. (a) Reference images, (b) ground truth sensor depth, (c) results of the single-view depth prediction method DORN [94], (d) results of the two-frame motion stereo method DeMoN [354], (e-f) depth predictions from our single-view and two-frame models, respectively.
To run our model, we first estimate camera poses using ORB-SLAM2, because we found that the ORB-SLAM2 estimates were better synchronized with the RGB images than the ground truth poses provided with the TUM dataset. In some cases, due to low image quality and motion blur, the estimated camera poses can be incorrect; we manually filter such failures by inspecting the camera trajectory and point cloud. In total, we obtain 11 valid image sequences with 1,815 images for evaluation. We downsample these images to 512×384 resolution in order to preserve their original aspect ratio (our model is fully convolutional and can therefore be applied to different image resolutions at test time).

Figure 3.11: Comparisons on Internet video clips with moving cameras and people. From left to right: (a) reference input image, (b) results of DORN [94], (c) results of Chen et al. [63], (d) results of DeMoN [354], (e) results of our full method.

We compare our depth predictions (using our MC-trained models) with several state-of-the-art monocular depth prediction methods trained on the indoor NYUv2 [191, 387, 94] and Depth in the Wild (DIW) [63] datasets, as well as with a recent two-frame stereo model, DeMoN [354], which assumes a static scene. We also compare with Video-Popup [299], which handles dynamic scenes. We use the same image pairs that were used for computing Dpp as inputs to DeMoN and Video-Popup.

Figure 3.12: Depth-based visual effects. Using our predicted depth maps, we can apply depth-aware visual effects to (a) input images; we show (b) defocus, (c) object insertion, and (d) anaglyph effects.

Figure 3.13: Depth-based image inpainting. We use depth predictions and camera poses to warp pixels from nearby frames for image inpainting and people removal. The top row shows original images and the bottom row shows inpainted images.

Quantitative comparisons are shown in Table 3.2, where we report five different scale-invariant error measures as well as the standard RMSE metric and relative error; the last two are computed by applying a single scaling factor that best aligns the predicted and ground-truth depths in the least-squares sense. Our single-view model already outperforms the other single-view models, demonstrating the benefit of the MC dataset for training. Note that Video-Popup [299] failed to produce meaningful results due to the challenging camera and object motion present in the data. Our full model, by making use of the initial (masked) depth map, significantly improves performance on all error measures. Consistent with our MC test set results, when we use optical flow as input (instead of the initial depth map), the performance is only slightly better than the single-view network. Finally, we show the importance of the depth cleaning methods we apply to the training data (see Eq. 3.1): the same model trained using the raw MVS depth estimates as supervision ("raw depth") suffers a drop of about 15% in performance.

Figure 3.10 shows a qualitative comparison between these different methods. Our models' depth predictions (Figure 3.10(e-f)) strongly resemble the ground truth and show a high level of detail, as well as sharp depth discontinuities.
This result is a notable improvement over competing methods, which often produce significant errors in both human regions (e.g., the legs in the second row of Figure 3.10) and non-human regions (e.g., the table and ceiling in the last two rows).

3.5.3 Internet videos of dynamic scenes

We tested our method on challenging Internet videos (downloaded from YouTube and Shutterstock) that involve simultaneous natural camera motion and human motion. Our SLAM/SfM pipeline was used to generate sequences ranging from 5 to 15 seconds with smooth and accurate camera trajectories, after which we apply our method to obtain the required network input buffers.

We qualitatively compare our full model (IDppCMK) with several recent learning-based depth prediction models: DORN [94], Chen et al. [63], and DeMoN [354]. For fair comparison, we use DORN with a model trained on NYUv2 for indoor videos and a model trained on KITTI for outdoor videos; for Chen et al. [63], we use the models trained on both NYUv2 and DIW. For all of our predictions, we use a single model trained from scratch on our MC dataset.

As illustrated in Figure 3.11, our depth predictions are significantly better than those of the baseline methods. In particular, DORN [94] generalizes poorly to Internet videos, and Chen et al. [63], which is mainly trained on Internet photos, is not able to capture accurate depth. DeMoN often produces incorrect depth, especially in human regions, as it is designed for static scenes. Our predicted depth maps capture accurate depth ordering both between people and other objects in the scene (e.g., between the people and buildings in the fourth row of Figure 3.11) and within human regions (such as the arms and legs of the people in the first three rows of Figure 3.11).

Depth-based visual effects. Our depth predictions can be used to apply a range of depth-based visual effects to video. Figure 3.12 shows depth-based defocus, insertion of synthetic 3D graphics, and stereo pairs displayed as anaglyph images. In Figure 3.13, we show an example of image inpainting that removes nearby humans using our predicted depths. The depth estimates are sufficiently stable over time to allow inpainting from frames elsewhere in the video. To use a frame for inpainting, we construct a triangle heightfield from the depth map, texture the heightfield with the video frame, and render the heightfield from the target frame using the relative camera transformation. Figure 3.13 shows the results of inpainting two street scenes: humans near the camera are removed using the human mask M, and holes are filled with colors from up to 200 frames later in the video. Some artifacts are visible in areas that the human mask misses, such as shadows on the ground.
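The geometric core of this warp is standard depth-based reprojection. The following NumPy sketch is illustrative only: it maps individual source pixels into the target view, whereas the thesis rasterizes a textured triangle heightfield, which handles occlusions and holes more gracefully.

```python
import numpy as np

def reproject(depth_src, K, T_src_to_tgt):
    """Map source pixels into a target view given depth and relative pose.

    depth_src: (H, W) depths; K: (3, 3) intrinsics; T_src_to_tgt: (4, 4).
    Returns (H, W, 2) target pixel coordinates and (H, W) target-frame depths.
    """
    H, W = depth_src.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).T  # (3, HW)
    pts = np.linalg.inv(K) @ pix * depth_src.reshape(-1)   # backproject to 3D
    pts = np.vstack([pts, np.ones((1, pts.shape[1]))])     # homogeneous
    pts_tgt = (T_src_to_tgt @ pts)[:3]                     # into target frame
    proj = K @ pts_tgt
    uv = (proj[:2] / np.clip(proj[2:], 1e-6, None)).T.reshape(H, W, 2)
    return uv, pts_tgt[2].reshape(H, W)
```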
3.6 Discussion

We have demonstrated the power of a learning-based approach for predicting dense depth in dynamic scenes where both a monocular camera and people are freely moving. We make a new source of data available for training: a large corpus of Mannequin Challenge videos from YouTube, in which the camera moves around while people are "frozen" in natural poses. We showed how to obtain reliable depth supervision from such noisy data, and demonstrated that by using the motion parallax cues available in a video sequence, our models can significantly improve over prior state-of-the-art methods.

Our approach has a number of limitations. First, we assume known and accurate camera poses, which can be difficult to compute if moving objects cover most of the scene or the camera motion is close to a pure rotation. Second, our model can fail to generalize to non-standard human poses, as shown in the first three rows of Fig. 3.14. Third, the depths predicted by our model may be inaccurate for non-human moving regions such as animals, cars, and shadows, as shown in the last three rows of Fig. 3.14. Finally, our approach uses just two views, rather than operating on an entire video sequence, which can lead to temporally inconsistent depth estimates and reconstructions across a video. Despite these limitations, we hope that our work can guide and enable further progress in dense reconstruction of dynamic scenes.

Figure 3.14: Failure cases (top rows: complex poses; bottom rows: non-human movers). From left to right: (a) input RGB image, (b) depth predicted by our single-view method, (c) depth predicted by our proposed full method. Our full method can fail due to (1) a failure to generalize to complex human poses (first three rows), or (2) non-human movers such as animals, cars, and shadows (last three rows). In some of these cases, our single-view method can outperform our full two-view method, because added complexities can sometimes arise in the presence of multiple views.

CHAPTER 4
LEARNING INTRINSIC IMAGE DECOMPOSITION FROM WATCHING THE WORLD

4.1 Introduction

Intrinsic image decomposition is the problem of factorizing an input image I into the product of a reflectance image and a shading image: I = R · S. While the vision community has seen significant advances in single-image intrinsic image decomposition, it remains a challenging, highly ill-posed problem, which makes the use of machine learning for this task an appealing prospect. Unfortunately, it is also difficult to gather direct ground truth training data. Previous work has collected ground truth via painting objects [115], synthetic renderings [51, 57], and manual annotation [26, 185], but each of these methods has significant limitations.

Inspired by how humans can learn by simply observing the world and formulating consistent explanations, we consider an alternative, readily available source of training data for learning intrinsic images: image sequences from the Internet for which the viewpoint is fixed but the illumination varies. Based on this idea, we introduce BIGTIME (BT), a large dataset of time-lapse image sequences. While the sequences in BT do not provide ground truth, they allow us to incorporate useful constraints during training by specifying that the model should predict outputs consistent with the sequence. Although we train on image sequences, our model applies to a single image at inference time, as illustrated in Figure 4.1.

Figure 4.1: To train, our method learns from unlabeled videos with fixed viewpoint but varying illumination (top). At test time (bottom), our network produces an intrinsic image decomposition (R, S) from a single image I.

Although a number of prior methods estimate intrinsic images from sequences, our concept is quite different: we train on sequences, but learn to infer decompositions from single views. In a sense, our method lies between optimization-based intrinsic images methods and machine learning approaches.
In particular, our training loss incorporates priors similar to those of optimization-based approaches, but within a feed-forward prediction framework. To fully utilize the information present in image sequences, we also introduce two new methods for computing losses over whole sequences, and show how to implement these losses efficiently inside a deep network. The first is an all-pairs weighted least squares loss that considers all pairs of images. The second is a dense, spatio-temporal smoothness loss that jointly considers all of the pixels in the entire sequence. While we use these losses for training intrinsic images, they could also be applied to other problems that involve image sequences, such as video segmentation.

In our evaluation, our method yields competitive or superior performance on two standard real-world benchmarks, IIW and SAW, even when trained on BT without access to annotations from those datasets. We further show improved results on the MIT intrinsic images dataset, even compared to learning methods that utilize fully supervised ground truth.

4.2 Related work

Intrinsic images through optimization. Intrinsic images has been studied for nearly fifty years, often within an optimization framework. Because the problem is ill-posed, additional priors must be applied. For instance, the seminal Retinex algorithm [192] assumes that large image gradients correspond to changes in reflectance, while smaller gradients are due to shading. Subsequently, many different priors have been proposed to guide the decomposition [310, 409, 296, 311, 99], and many new optimization tools, such as inference in dense CRFs, have been deployed [26]. Some recent approaches make use of surface normals from RGB-D cameras [61, 18, 158]. Surface normals can improve shading estimates, but such methods assume depth maps are available during optimization.

Intrinsic images from multiple observations. A number of methods, starting with Weiss [370], estimate intrinsic images from time-lapse sequences by assuming constant reflectance but varying shading over time [231, 342, 128, 188, 187]. Such an approach is similar to our training regime, although a crucial distinction is that once our model is trained, we can run it on a single image. These methods rely on priors derived from statistics of image sequences or lighting sources. We found that in practice these methods require a) a large number of input images and b) images taken outdoors or in controlled laboratory environments. In contrast, our method can learn from much shorter and less controlled sequences.

Intrinsic images via supervised learning. Barron and Malik [19] proposed a unified learning-based method that incorporates a number of complex priors on shape, albedo, and illumination. However, their method only applies to single objects and does not generalize well to real-world scenes. Recently, several approaches have used deep learning to predict albedo and shading via direct supervision. These methods train on the synthetic Sintel [176, 52], object-centric MIT [115], or synthetic ShapeNet [57, 157] datasets. However, Sintel and ShapeNet are highly synthetic datasets, and networks trained on them do not generalize well to real-world scenes. The MIT dataset consists of real images, but these images depict objects captured in the lab, not realistic scenes, and the dataset contains just 20 objects with ground truth.

Recently, two datasets have been created for real-world scenes.
Intrinsic Images in the Wild (IIW) [26] is a dataset of sparse, human-labeled relative reflectance judgments. Shading Annotations in the Wild (SAW) [185] similarly contains sparse shading annotations. Several methods [417, 428, 253, 185] train CNNs on sparse annotations from IIW/SAW and use the predictions as priors for intrinsic images. However, it is difficult to collect such annotations at scale, especially for shading relationships, which can be challenging to perceive. Further, these datasets are limited to sparse annotations. We propose an alternative form of training data that is much easier to capture and provides full-image constraints.

Figure 4.2: System overview and network. During training, our network input is an image sequence I, and the outputs are reflectance images R and shading images S for the sequence. Each block in the network depicts a convolutional/deconvolutional layer. E is an encoder, and DR and DS are decoders for the reflectance and shading images. For the innermost feature maps, we have one side output c representing the illumination color. E is an energy function measuring the cost of the decomposition.

4.3 Overview and network architecture

Our work makes two main contributions: a new dataset, BIGTIME, of image sequences for learning intrinsic images (Sec. 4.4), and a new approach to learning single-view intrinsic images from this data (Sec. 4.5). Because we train from image sequences, one learning approach would be to use existing sequence-based intrinsic images algorithms to produce approximate ground truth decompositions, and then use these algorithmic outputs as supervision. However, we found that for many image sequences, existing sequence-based algorithms perform poorly because their assumptions are not met, as discussed in Sec. 4.4. Hence, during training, our CNN directly takes an image sequence as input and processes it in a feed-forward fashion to produce reflectance and shading for each image in the sequence, as shown in Figure 4.2. Because the network processes each image independently, multiple images are not required at test time, i.e., we can use the network to produce a decomposition for a single image. During training, the input images interact through our novel loss function (Sec. 4.5), which evaluates the predicted decompositions jointly for the entire sequence.

For our network, we use a variant of the U-Net architecture [291, 153] (Figure 4.2). Our network has one encoder and two decoders, one for log-reflectance and one for log-shading, with skip connections for both decoders. Each layer of the encoder consists mainly of a 4×4 stride-2 convolutional layer followed by batch normalization [151] and a leaky ReLU [132]. Each layer of the two decoders is composed of a 4×4 deconvolutional layer followed by a ReLU. In addition to the decoders for reflectance and shading, the network predicts one side output from the innermost feature maps: a single RGB vector for each image corresponding to the predicted illumination color.
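A minimal PyTorch sketch of such a one-encoder, two-decoder U-Net follows. The network depth and channel widths are illustrative assumptions on our part; the text above fixes only the layer types (4×4 stride-2 convolutions with batch normalization and leaky ReLUs in the encoder, 4×4 deconvolutions with ReLUs in the decoders, skip connections, and the RGB side output c).

```python
import torch
import torch.nn as nn

class IntrinsicsUNet(nn.Module):
    """Sketch of the two-decoder U-Net variant; widths are assumptions."""

    def __init__(self, widths=(32, 64, 128, 256)):
        super().__init__()
        self.enc, cin = nn.ModuleList(), 3
        for w in widths:
            self.enc.append(nn.Sequential(
                nn.Conv2d(cin, w, 4, stride=2, padding=1),
                nn.BatchNorm2d(w), nn.LeakyReLU(0.2)))
            cin = w

        def make_decoder(cout):
            layers, c = nn.ModuleList(), widths[-1]
            for w in reversed(widths[:-1]):
                layers.append(nn.Sequential(
                    nn.ConvTranspose2d(c, w, 4, stride=2, padding=1),
                    nn.ReLU()))
                c = 2 * w                      # concatenated skip features
            layers.append(nn.ConvTranspose2d(c, cout, 4, stride=2, padding=1))
            return layers

        self.dec_R = make_decoder(3)           # log-reflectance (RGB)
        self.dec_S = make_decoder(1)           # log-shading (grayscale)
        self.color = nn.Linear(widths[-1], 3)  # illumination color c

    def _decode(self, dec, feats):
        x = feats[-1]
        for layer, skip in zip(dec[:-1], feats[-2::-1]):
            x = torch.cat([layer(x), skip], dim=1)
        return dec[-1](x)

    def forward(self, img):
        feats, x = [], img
        for e in self.enc:
            x = e(x)
            feats.append(x)
        c = self.color(x.mean(dim=(2, 3)))     # side output from innermost maps
        return self._decode(self.dec_R, feats), self._decode(self.dec_S, feats), c
```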
4.4 Dataset

To create the BIGTIME dataset, we collected videos and image sequences depicting both indoor and outdoor scenes under varying illumination. While many time-lapse datasets primarily capture outdoor scenes, we explicitly wanted representation from indoor scenes as well. Our indoor sequences were gathered from YouTube, Vimeo, Flickr, Shutterstock, and Boyadzhiev et al. [39], and our outdoor sequences were collected from the AMOS [155] and Time Hallucination [317] datasets. For each video, we masked out the sky as well as dynamic objects such as pets, people, and cars via automatic semantic segmentation [407] or manual annotation. We collected 145 sequences from indoor scenes and 50 from outdoor scenes, yielding a total of ∼6,500 training images.

Challenges with Internet videos. Most outdoor scenes in our dataset are from time-lapse sequences in which the sun moves evenly over time, and many existing algorithms for multi-image intrinsic image decomposition work well on such data. However, we found that indoor image sequences are much more challenging, because illumination changes in indoor scenes tend to be less even and less continuous than in outdoor scenes. In particular, we observed that:

1. most relevant video clips cover a short period of time and do not show large changes in light direction;
2. several video clips consist of a light turning on/off in a room, producing a limited number (<8) of valid images with different lighting conditions; and
3. the dynamic range of indoor scenes can be high, with strong sunlight or shadows leading to saturation/clipping that can break intrinsic image algorithms.

Figure 4.3: Examples of challenging images in our dataset. The first two images depict colorful illumination. The last two images show strong sunlight/shadows.

These properties make our dataset even more complex than the IIW and SAW datasets. Several difficult examples are shown in Fig. 4.3. We found that prior intrinsic image methods designed for image sequences often fail on our indoor videos, as their assumptions tend to hold only for outdoor or lab-captured sequences. Example failure cases are shown in Fig. 4.4. However, as we show in our evaluation, our approach is robust to such strong illumination conditions, and networks trained on BT generalize well to IIW and SAW.

Figure 4.4: Failure cases for intrinsic image estimation algorithms. We applied a state-of-the-art multi-image intrinsic image decomposition algorithm [128] to our dataset. This method fails to produce decomposition results suitable for training, due to strong assumptions that hold primarily for outdoor/laboratory scenes.

4.5 Approach

In this section, we describe our framework for learning reflectance and shading from Internet time-lapse video clips. During training, we formulate the problem as a continuous, densely connected conditional random field (dense CRF) and learn a deep neural network that directly predicts a decomposition from single views in a feed-forward fashion.

Image formation model. Let I denote an input image, and let R and S denote the predicted reflectance (albedo) and shading. Assuming an image of a Lambertian scene, we can write the image decomposition in the log domain as:

log I = log R + log S + N    (4.1)

where N models image noise as well as deviations from the Lambertian assumption. In our model, S is a single-channel (grayscale) image, while R is an RGB image. However, modeling S with a single channel assumes white light. In practice, the illumination color can vary across each input video (for instance, red illumination at sunset/sunrise). Hence, we also allow for colored light in our model:

log I = log R + log S + c + N    (4.2)

where c is a single RGB vector that is added at every pixel of the image.
For simplicity, we use Eq. 4.1 in the following sections; without loss of generality, we treat c as folded into the predicted shading.

Each training instance is a stack of m input images with n pixels taken from a fixed viewpoint under varying illumination. We denote such an image sequence by I = {I^i | i = 1...m}, and denote the corresponding predicted reflectances and shadings by R = {R^i | i = 1...m} and S = {S^i | i = 1...m}, respectively. Additionally, for each image I^i we have a binary mask M^i indicating which pixels are valid (which we use to exclude saturated pixels, sky, dynamic objects, etc.).

We wish to devise a method for learning single-view intrinsic image decomposition that leverages having multiple views during training. Hence, we propose to combine learning and estimation by encoding our priors into the training loss function. Essentially, we learn a feed-forward predictor for single-image intrinsic images, trained on image sequences with a loss that incorporates these priors, in particular priors that operate at the sequence level. This loss should also be differentiable and efficient to evaluate, considerations which guide our design below.

Energy/loss function. During training, we formulate the problem as a dense CRF over an image sequence I, where our goal is to maximize the posterior probability p(R, S | I) = (1/Z(I)) exp(−E(R, S, I)), where Z(I) is the partition function. Maximizing p(R, S | I) is equivalent to minimizing the energy function E(R, S, I). Because we use a feed-forward network to predict the decomposition, we also use this energy function as our training loss. We define E as:

E(R, S, I) = Lreconstruct + w1 Lconsistency + w2 Lrsmooth + w3 Lssmooth

We now describe each of these terms in detail.

4.5.1 Image reconstruction loss

Given an input sequence I, for each image I^i ∈ I we expect the predicted reflectance and shading for I^i to approximately reconstruct I^i via our image formation model. Moreover, since reflectance is constant over time, we should be able to use the reflectance R^j predicted for any image I^j ∈ I to reconstruct I^i when paired with S^i (and masked by the valid image regions indicated by the binary masks M^i and M^j). This yields a term involving all pairs of images:

Lreconstruct = Σ_{i=1}^{m} Σ_{j=1}^{m} ‖L^i ⊗ M^i ⊗ M^j ⊗ (log I^i − log R^j − log S^i)‖²_F    (4.3)

where ⊗ is the Hadamard product. Similar to [61], we weight our reconstruction loss by the input pixel luminance L^i = lum(I^i)^{1/8}, since dark pixels tend to be noisy and image differences in dark regions are magnified in log-space. We found that including such an all-pairs connected image reconstruction loss improves prediction results, perhaps because it creates more communication between predictions. A direct implementation of this loss takes time O(m²n). In Sec. 4.5.5 we introduce a computational trick that reduces this to O(mn) time, which is key to making training tractable.

4.5.2 Reflectance consistency

We also include a reflectance consistency loss that directly encodes the assumption that the predicted reflectances should be identical across the image sequence:

Lconsistency = Σ_{i=1}^{m} Σ_{j=1}^{m} ‖M^i ⊗ M^j ⊗ (log R^i − log R^j)‖²_F    (4.4)

As above, this can be computed directly in time O(m²n), but Sec. 4.5.5 shows how to reduce this to O(mn). A direct reference implementation of both all-pairs terms is sketched below.
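This sketch is for exposition only: it evaluates Eqs. 4.3 and 4.4 with the direct O(m²n) double loop, while Sec. 4.5.5 gives the equivalent linear-time computation used in practice. Tensor shapes are our assumption; masks and luminance weights are taken to be broadcast over the channel dimension.

```python
import torch

def all_pairs_losses(logI, logR, logS, M, L):
    """Direct O(m^2 n) reference for Eqs. 4.3 and 4.4.

    logI, logR, logS: (m, C, H, W) stacks; M (validity mask) and
    L (luminance weight) are (m, 1, H, W) and broadcast over channels.
    """
    m = logI.shape[0]
    rec = logI.new_zeros(())
    cons = logI.new_zeros(())
    for i in range(m):
        for j in range(m):
            w = M[i] * M[j]
            rec = rec + ((L[i] * w * (logI[i] - logR[j] - logS[i])) ** 2).sum()
            cons = cons + ((w * (logR[i] - logR[j])) ** 2).sum()
    return rec, cons
```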
4.5.3 Dense spatio-temporal reflectance smoothness

Our reflectance smoothness term Lrsmooth is based on the similarity of chromaticity and intensity between pixels. Because we see a sequence of images at training time, we can define a reflectance smoothness term that acts jointly on all of the images in each sequence at once, allowing us to express smoothness in a richer way. Accordingly, we introduce a novel spatio-temporal, densely connected reflectance smoothness term that considers the similarity of the predicted reflectance at each pixel in the sequence to all other pixels in the sequence. Our method is inspired by the bilateral-space stereo method of Barron et al. [17], but we show how to apply their single-image dense solver to an entire image sequence and how to implement it inside a deep network. We define our smoothness term as:

Lrsmooth = (1/2) Σ_{I^i, I^j} Σ_{p∈I^i} Σ_{q∈I^j} Ŵ_pq (log R^i_p − log R^j_q)²    (4.5)

where p and q indicate pixels in the image sequence, and Ŵ is a (bistochastic) weight matrix capturing the affinity between any two pixels p and q. Computing this equation directly is very expensive because it involves all pairs of pixels in the sequence, hence we need a more efficient approach.

First, note that if Ŵ is a bistochastic matrix, we can rewrite Eq. 4.5 in the following simplified matrix form:

Lrsmooth = rᵀ(I − Ŵ)r    (4.6)

where r is a stacked vector representation (of length mn) of all of the predicted log-reflectance images in the sequence: r = [r¹ r² ··· rᵐ]ᵀ, where r^i is a vector containing the values in log R^i. However, now we have a potentially dense affinity matrix Ŵ ∈ R^{mn×mn}. We can approximately evaluate this term much more efficiently if the pixel-wise affinities are Gaussian, i.e.,

W_pq = exp(−(f_p − f_q)ᵀ Σ⁻¹ (f_p − f_q))    (4.7)

where f_p and f_q are feature vectors for pixels p and q respectively, and Σ is a covariance matrix. We can approximately minimize Eq. 4.6 in bilateral space by factorizing the Gaussian affinity matrix as W ≈ SᵀB̄S, where B̄ = B₀B₁···B_d + B_dB_{d−1}···B₀ is a symmetric matrix constructed as a product of sparse matrices representing blur operations in bilateral space, d is the dimension of the feature vector f_p, and S is a sparse splat/slicing matrix that transforms between image space and bilateral space. Finally, let Ŵ = NWN be a bistochastic representation of W, where N is a diagonal matrix that bistochasticizes W [181]. This bilateral embedding allows us to write the loss in Eq. 4.6 as:

Lrsmooth ≈ rᵀ(I − NSᵀB̄SN)r    (4.8)

Note that Lrsmooth is differentiable, and N and S are both sparse matrices that can be computed efficiently. Our final form of Lrsmooth (Eq. 4.8) can be computed in time O((d+1)mn), rather than O(m²n²).

We define the feature vector used to compute the affinities in Eq. 4.7 as f_p = [x_p, y_p, I_p, c1, c2]ᵀ, where (x_p, y_p) is the spatial position of pixel p in the image, I_p is the intensity of p, and c1 = R/(R+G+B) and c2 = G/(R+G+B) are the first two elements of the L1 chromaticity of p.
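For intuition, the following brute-force sketch evaluates Eqs. 4.5 and 4.7 directly over a small set of pixels. It deliberately omits the bistochasticization and bilateral-space machinery of Eq. 4.8 that make the full term tractable, so W below is the raw Gaussian affinity rather than Ŵ.

```python
import torch

def dense_rsmooth_bruteforce(logR, feats, Sigma_inv):
    """Brute-force O(n^2) evaluation of Eqs. 4.5/4.7 for intuition only.

    logR: (n,) log-reflectance values drawn from the sequence;
    feats: (n, d) feature vectors f_p; Sigma_inv: (d, d).
    """
    d = feats[:, None, :] - feats[None, :, :]            # all pairs f_p - f_q
    W = torch.exp(-torch.einsum('pqi,ij,pqj->pq', d, Sigma_inv, d))
    diff = logR[:, None] - logR[None, :]
    return 0.5 * (W * diff ** 2).sum()
```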
4.5.4 Multi-scale shading smoothness

In addition to the reflectance smoothness term, our loss incorporates a shading smoothness term, Lssmooth, summed over each predicted shading image: Lssmooth = Σ_{i=1}^{m} Lssmooth(S^i), where Lssmooth(S^i) is defined as a weighted L2 term over neighboring pixels:

Lssmooth(S^i) = Σ_{p∈I^i} Σ_{q∈N(p)} v_pq (log S^i_p − log S^i_q)²    (4.9)

where N(p) denotes the 8-connected neighborhood around pixel p, and v_pq is a weight on each edge. Our insight is to leverage all of the input images to compute the weights for each individual image. We are inspired by Weiss [370], who derives a multi-image intrinsic images algorithm based on median image derivatives over the sequence. Essentially, we expect the median image derivative over the input sequence (in the log domain) to approximate the derivative of the reflectance image. If we denote J_pq = log I_p − log I_q (dropping the image index i for convenience), then this suggests a weight of the form:

v^med_pq = exp(−λ_med (J_pq − median{J_pq})²)    (4.10)

where median{J_pq} is the median value of J_pq over the image sequence, and λ_med is a parameter defining the strength of v^med_pq. This weight discourages shading smoothness where the gradient of a particular image is very different from the median (as would happen, e.g., at a shadow boundary).

We found that v^med_pq works well as a weight for texture-less regions (for instance, it captures the effect of a cast shadow on a flat wall well), but, due to noise present in dark image regions, it does not always capture the desired shading smoothness for textured surfaces. Figure 4.5 (bottom) illustrates such a case with a checkerboard pattern on the floor. To address this issue, we define an additional weight v̄^med_pq that is normalized by the median derivative:

v̄^med_pq = exp(−λ_med ((J_pq − median{J_pq}) / median{J_pq})²)    (4.11)

We combine these weights as follows:

v_pq = max{v^med_pq, v̄^med_pq} · (1 − median{W_pq})    (4.12)

This final shading smoothness weight is more robust to textured regions while still distinguishing shadow discontinuities. The last factor, (1 − median{W_pq}), reflects the belief that we should enforce stronger shading smoothness on reflectance edges such as textures, and weaker smoothness in regions of constant reflectance.

Figure 4.5: Effect of v̄^med_pq in the shading smoothness term (white = large weight, black = small weight). Adding the extra v̄^med_pq helps capture smoothness in textured regions such as the pillows in the first row and the floor in the second row. The last column shows the final smoothness weight v_pq.

Ideally, our shading smoothness term would be densely connected. However, the median operator is nonlinear and cannot be integrated into a pixel-wise densely connected term. Instead, to introduce longer-range shading constraints, we compute the shading smoothness term at multiple image scales, repeatedly downsizing each predicted shading image by a factor of two. We set the number of scales to 4, and each scale l is weighted by a factor of 1/l.

4.5.5 All-pairs weighted least squares (APWLS)

Direct implementations of the all-pairs image reconstruction and reflectance consistency terms from Sections 4.5.1 and 4.5.2 take O(m²n) time; this quadratic complexity would make training intractable for large m. Here, we propose a closed-form version of this all-pairs weighted least squares loss (APWLS) that is linear in m. While we apply this tool to our scenario, it can be used in other situations involving all-pairs computation on image sequences.

In general, suppose each image I^i is associated with two matrices P^i and Q^i and two prediction images X^i and Y^i. We can then write APWLS as (see Appendix for a detailed derivation):

APWLS = Σ_{i=1}^{m} Σ_{j=1}^{m} ‖P^i ⊗ Q^j ⊗ (X^i − Y^j)‖²_F    (4.13)
      = 1ᵀ(Σ_{Q²} ⊗ Σ_{P²X²} + Σ_{P²} ⊗ Σ_{Q²Y²} − 2Σ_{P²X} ⊗ Σ_{Q²Y})1    (4.14)

where Σ_Z denotes the sum over all images of the Hadamard product indicated in the subscript Z. Evaluating Eq. 4.13 requires time O(m²n), but rewritten as Eq. 4.14, just O(mn).
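A direct implementation of Eq. 4.14 is only a few lines; the sketch below assumes all inputs are stacked as (m, H, W) tensors, and with the substitutions described next it evaluates both Lreconstruct and Lconsistency.

```python
import torch

def apwls(P, Q, X, Y):
    """Closed-form APWLS (Eq. 4.14); equals the double loop of Eq. 4.13.

    P, Q, X, Y: (m, H, W) stacks. Runs in O(mn) rather than O(m^2 n).
    """
    sQ2   = (Q ** 2).sum(0)            # Sigma_{Q^2}
    sP2   = (P ** 2).sum(0)            # Sigma_{P^2}
    sP2X2 = (P ** 2 * X ** 2).sum(0)   # Sigma_{P^2 X^2}
    sQ2Y2 = (Q ** 2 * Y ** 2).sum(0)   # Sigma_{Q^2 Y^2}
    sP2X  = (P ** 2 * X).sum(0)        # Sigma_{P^2 X}
    sQ2Y  = (Q ** 2 * Y).sum(0)        # Sigma_{Q^2 Y}
    return (sQ2 * sP2X2 + sP2 * sQ2Y2 - 2 * sP2X * sQ2Y).sum()
```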
We use this derivation to implement our image reconstruction loss Lreconstruct (Eq. 4.3) by making the substitutions P^i = L^i ⊗ M^i, Q^j = M^j, X^i = log I^i − log S^i, and Y^j = log R^j, and our reflectance consistency loss Lconsistency (Eq. 4.4) by substituting P^i = M^i, Q^j = M^j, X^i = log R^i, and Y^j = log R^j.

4.6 Evaluation

In this section we evaluate our approach by training solely on our BIGTIME dataset and testing on two standard datasets, IIW and SAW. The performance of machine learning approaches can suffer from cross-dataset domain shift due to dataset bias; for example, we show that networks trained on Sintel, MIT, or ShapeNet do not generalize well to IIW and SAW. Our method, however, though not trained on IIW or SAW data, still produces competitive results on both datasets. We also evaluate on the MIT intrinsic images dataset [115], which has full ground truth. Rather than using this ground truth during training, we train the network on the image sequences provided by the MIT dataset.

Training details. We implement our method in PyTorch [272]. In total, we have 195 image sequences for training. We perform data augmentation via random rotations, flips, and crops. When feeding images into the network, we resize them to 256×384, 384×256, or 256×256, depending on the original aspect ratio. For all evaluations, we train the network from scratch using Adam [177].

4.6.1 Evaluation on IIW

To evaluate on the IIW dataset, we train our network on BT (without using IIW training data) and directly apply the trained model to the IIW test split provided by [253]. Numerical comparisons between our method and other optimization-based and learning-based approaches are shown in Table 4.1.

Method                              Training set   WHDR%
Retinex-Color [115]                 -              26.9
Garces et al. [99]                  -              24.8
Zhao et al. [409]                   -              23.8
Bell et al. [26]                    -              20.6
Narihira et al. [253]*              IIW            18.1*
Zhou et al. [417]*                  IIW            15.7*
Zhou et al. [417]                   IIW            19.9
DI [252]                            Sintel+MIT     37.3
Shi et al. [313]                    ShapeNet       59.4
Ours (w/ per-image Lreconstruct)    BT             25.9
Ours (w/ local Lrsmooth)            BT             27.4
Ours (w/ grayscale S)               BT             22.3
Ours (full method)                  BT             20.3

Table 4.1: Results on the IIW test set. Lower is better for the Weighted Human Disagreement Rate (WHDR). The second column indicates the training data each learning-based method uses; '-' indicates that the method is optimization-based. * indicates that WHDR is evaluated based on CNN classifier outputs for pairs of pixels rather than full decompositions.

Method                     Training set   AP%
Retinex-Color [115]        -              91.93
Garces et al. [99]         -              96.89
Zhao et al. [409]          -              97.11
Bell et al. [26]           -              97.37
Zhou et al. [417]          IIW            96.24
DI [252]                   Sintel+MIT     95.04
Shi et al. [313]           ShapeNet       86.30
Ours (w/ local Lssmooth)   BT             97.03
Ours (w/o Eq. 4.11)        BT             97.15
Ours (full method)         BT             97.90

Table 4.2: Results on the SAW test set. Higher is better for AP%. The second column is as described in Table 4.1. Note that none of the methods use annotations from SAW.

Our method is competitive with both optimization-based methods [26] and learning-based methods [417]. Note that the best WHDR values in the table (marked *) are achieved using CNN classifier outputs on pairs of pixels, rather than full image decompositions; in contrast, our results are based on full decompositions.

Figure 4.6: Qualitative comparisons for intrinsic image decomposition on the IIW/SAW test sets. From left to right: (a) image, (b-c) reflectance and shading of Bell et al. [26], (d-e) reflectance and shading of Zhou et al. [417], (f-g) our reflectance and shading.
Our network predictions achieve results comparable to state-of-the-art intrinsic image decomposition algorithms (Bell et al. [26] and Zhou et al. [417]).

Additionally, as we show in the next subsection, the best-performing method on IIW (Zhou et al. [417]), which primarily evaluates reflectance, falls behind on SAW, which evaluates shading, suggesting that their method tends to overfit to reflectance at the expense of shading accuracy. We also see that networks trained on Sintel, MIT, or ShapeNet perform poorly on IIW, likely due to dataset bias.

We also perform an ablation study on different configurations of our framework. First, we modify the image reconstruction loss to an alternate loss that considers each image independently, rather than considering all pairs of images in a sequence. Second, we evaluate a modified reflectance smoothness loss that uses local pairwise smoothness (between neighboring pixels) rather than our proposed dense spatio-temporal smoothness. Finally, we try using grayscale shading rather than our colored shading. The results, shown in the last four rows of Table 4.1, demonstrate that our full method significantly improves reflectance predictions on the IIW test set compared to these simpler configurations.

                                 MSE               LMSE              DSSIM
Method      Training set  GT    refl.    shading  refl.    shading  refl.    shading
SIRFS [19]  MIT           Yes   0.0147   0.0083   0.0416   0.0168   0.1238   0.0985
DI [252]    MIT+ST        Yes   0.0277   0.0154   0.0585   0.0295   0.1526   0.1328
Shi [313]   MIT+SN        Yes   0.0278   0.0126   0.0503   0.0240   0.1465   0.1200
Ours        MIT           No    0.0147   0.0135   0.0341   0.0253   0.1398   0.1266

Table 4.3: Results on MIT intrinsics. For all error metrics, lower is better. ST=Sintel dataset; SN=ShapeNet dataset. The second column shows the dataset used for training; GT indicates whether the method uses ground truth for training.

4.6.2 Evaluation on SAW

Next, we test our network on SAW [185], again training without using data from SAW. We also propose two improvements to the metric used to evaluate results on SAW.

First, the original SAW error metric is based on classifying a pixel p as having smooth/non-smooth shading based on the gradient magnitude of the predicted shading image, ‖∇S‖₂, normalized to the range [0, 1]. Instead, we measure the gradient magnitude in the log domain. We do this because of the scale ambiguity inherent to shading and reflectance, and because it is possible to have very bright values in the shading channel (e.g., due to strong sunlight); in such cases, if we normalize shading to [0, 1], most of the resulting values will be close to 0. In contrast, computing the gradient magnitude of the log shading, ‖∇ log S‖₂, achieves scale invariance, resulting in fairer comparisons for all methods. As in [185], we sweep a threshold τ to create a precision-recall (PR) curve that captures how well each method captures smooth and non-smooth shading.

Second, Kovacs et al. [185] apply a 10×10 maximum filter to the shading gradient magnitude image before computing PR curves, because many shadow boundary annotations are not precisely localized. However, this maximum filter can degrade performance in smooth shading regions. Instead, we use the max-filtered log-gradient-magnitude image when classifying non-smooth annotations, but the unfiltered log-gradient image when classifying smooth annotations (see Appendix for details).
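A sketch of this modified classification rule follows, assuming SciPy's maximum filter for the 10×10 window; the function name and the epsilon guard are our illustrative choices.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def shading_smoothness_labels(S, tau, eps=1e-6):
    """Classify pixels as smooth / non-smooth shading via ||grad log S||_2.

    Gradients are taken in the log domain for scale invariance; the 10x10
    maximum filter is applied only when scoring non-smooth annotations.
    """
    gy, gx = np.gradient(np.log(np.maximum(S, eps)))
    mag = np.sqrt(gx ** 2 + gy ** 2)
    nonsmooth = maximum_filter(mag, size=10) > tau  # for shadow boundaries
    smooth = mag <= tau                             # unfiltered for smooth labels
    return smooth, nonsmooth
```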
All methods, including our own, are trained without use of SAW data. Average precision (AP) scores are shown in Table 4.2 (please see the Appendix for full precision-recall curves). Our method has the best performance among all prior methods we tested, and our full loss outperforms variants with terms removed. In particular, our method outperforms the best optimization-based algorithm [26] on both IIW and SAW. On the other hand, Zhou et al. [417] tends to overfit to IIW, as their performance on SAW ranks below several other methods. Again, networks trained on Sintel, MIT, and ShapeNet data perform poorly on SAW.

4.6.3 Qualitative results on IIW and SAW

Figure 4.6 shows qualitative results from our method and two other state-of-the-art intrinsic image decomposition algorithms, Zhou et al. [417] and Bell et al. [26], on test images from IIW and SAW. Our results are visually comparable to these methods. One observation is that our shading predictions for dark pixels can be quite dark, leading to reduced contrast in the reflectance images; however, this loss of contrast does not hurt numerical performance. Additionally, like other CNN approaches [252, 313], the direct predictions from our network may not strictly satisfy I = R · S, since the two decoders predict R and S simultaneously at test time. As future work, it would be interesting to use our predictions as priors for optimization to address these issues.

4.6.4 Evaluation on MIT intrinsic images

The MIT intrinsic images dataset [115] contains 20 objects with ground truth reflectance and shading, as well as an associated image sequence taken under 11 different directional light sources. We use the same training-test split as Barron et al. [19], but instead of training our network on the ground truth provided by the MIT dataset, we train only on the provided image sequences using our learning approach. In this case, we configure our network to produce grayscale shading outputs, since the MIT dataset only contains grayscale shading ground truth images.

We compare our approach to several supervised learning methods, including SIRFS [19], Direct Intrinsics (DI) [252], and Shi et al. [313]. These prior methods all train using ground truth reflectance and shading images; additionally, DI [252] and Shi et al. [313] pretrain on Sintel [51] and ShapeNet [57], respectively. In contrast, we train our network from scratch and use only the provided image sequences during training. We adopt the same metrics as [313]: mean squared error (MSE), local mean squared error (LMSE), and the structural dissimilarity index (DSSIM).

Numerical results are shown in Table 4.3 and qualitative comparisons in Figure 4.7. Averaged over reflectance and shading, our results numerically outperform both prior CNN-based supervised learning methods [252, 313]. In particular, our albedo estimates are significantly better, while our shading estimates are comparable (slightly better than [252] and slightly worse than [313]). SIRFS has the best numerical results on the MIT dataset, but SIRFS's priors only apply to single objects, and the algorithm performs much more poorly on full images of real-world scenes [252, 313].

Figure 4.7: Qualitative comparisons on the MIT intrinsics test set. Odd-numbered rows show predicted reflectance; even-numbered rows show predicted shading. (a) Input image, (b) ground truth (GT), (c) SIRFS [19], (d) Direct Intrinsics (DI) [252], (e) Shi et al. [313], (f) our method.
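For reference, the MSE above follows the scale-invariant convention of the MIT benchmark: the prediction is aligned to the ground truth with a single least-squares scale factor before computing the error. The sketch below shows only this global-scale variant; LMSE applies the same idea per local window, and DSSIM adds structural terms, both of which are omitted here.

```python
import numpy as np

def scaled_mse(pred, gt, eps=1e-12):
    # The scale factor minimizing ||gt - c * pred||^2 is
    # c = <pred, gt> / <pred, pred>.
    c = np.sum(pred * gt) / max(np.sum(pred ** 2), eps)
    return np.mean((gt - c * pred) ** 2)
```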
4.7 Discussion

In this chapter, we presented a new method for learning intrinsic images, supervised not by ground truth decompositions, but instead by simply observing image sequences with varying illumination over time, and learning to produce decompositions that are consistent with these sequences. The trained model can then be run on single images, producing competitive results on several benchmarks. Our results illustrate the power of learning decompositions simply from watching large amounts of video. In the future, we plan to combine our approach with other kinds of annotations (IIW, SAW, etc.) to measure how well they perform when used together, and to use our outputs as inputs to optimization-based methods.

CHAPTER 5
LEARNING BETTER INTRINSIC IMAGES THROUGH PHYSICALLY-BASED RENDERING

5.1 Introduction

Intrinsic images is a classic vision problem involving decomposing an input image I into a product of reflectance (albedo) and shading images, R · S. Recent years have seen remarkable progress on this problem, but it remains challenging due to its ill-posedness. An attractive proposition has been to replace traditional hand-crafted priors with learned, CNN-based models. For such learning methods data is key, but collecting ground truth data for intrinsic images is extremely difficult, especially for images of real-world scenes.

One way to generate large amounts of training data for intrinsic images is to render synthetic scenes. However, existing synthetic datasets are limited to images of single objects [157, 313] (e.g., via ShapeNet [57]) or images of CG animation that utilize simplified, unrealistic illumination (e.g., via Sintel [52]). An alternative is to collect ground truth for real images using crowdsourcing, as in the Intrinsic Images in the Wild (IIW) and Shading Annotations in the Wild (SAW) datasets [26, 185]. However, the annotations in such datasets are sparse and difficult to collect accurately at scale.

Inspired by recent efforts to use synthetic images of scenes as training data for indoor and outdoor scene understanding [287, 292, 96, 286], we present the first large-scale scene-level intrinsic images dataset based on high-quality physically-based rendering, which we call CGINTRINSICS (CGI). CGI consists of over 20,000 images of indoor scenes, based on the SUNCG dataset [331]. Our aim with CGI is to help drive significant progress towards solving the intrinsic images problem for Internet photos of real-world scenes.

Figure 5.1: Overview and network architecture. Our work integrates physically-based rendered images from our CGINTRINSICS dataset and reflectance/shading annotations from IIW and SAW in order to train a better intrinsic decomposition network.

We find that high-quality physically-based rendering is essential for our task. While SUNCG provides physically-based scene renderings [406], our experiments show that the details of how images are rendered are of critical importance, and certain choices can lead to massive improvements in how well CNNs trained for intrinsic images on synthetic data generalize to real data.

We also propose a new partially supervised learning method for training a CNN to directly predict reflectance and shading, by combining ground truth from CGI and sparse annotations from IIW/SAW.
Through evaluations on IIW and SAW, we find that, surprisingly, decomposition networks trained solely on CGI can achieve state-of-the-art performance on both datasets. Combined training using both CGI and IIW/SAW leads to even better performance. Finally, we find that CGI generalizes better than existing datasets by evaluating on MIT Intrinsic Images, a very different, object-centric dataset.

5.2 Related work

Optimization-based methods. The classical approach to intrinsic images is to integrate various priors (smoothness, reflectance sparseness, etc.) into an optimization framework [192, 409, 296, 311, 99, 26]. However, for images of real-world scenes, such hand-crafted priors are difficult to design and are often violated. Several recent methods seek to improve decomposition quality by integrating surface normals or depths from RGB-D cameras [61, 18, 158] into the optimization process. However, these methods assume depth maps are available during optimization, preventing their use on a wide range of consumer photos.

Learning-based methods. Learning methods for intrinsic images have recently been explored as an alternative to models with hand-crafted priors, or as a way to set the parameters of such models automatically. Barron and Malik [19] learn parameters of a model that utilizes sophisticated priors on reflectance, shape, and illumination. This approach works on images of objects (such as those in the MIT dataset), but does not generalize to real-world scenes. More recently, CNN-based methods have been deployed, including work that regresses directly to the output decomposition based on various training datasets, such as Sintel [252, 176], MIT intrinsics, and ShapeNet [313, 157]. Shu et al. [320] also propose a CNN-based method specifically for the domain of facial images, where ground truth geometry can be obtained through model fitting. However, as we show in our evaluation, networks trained on such prior datasets perform poorly on images of real-world scenes.

Two recent datasets are based on images of real-world scenes: Intrinsic Images in the Wild (IIW) [26] and Shading Annotations in the Wild (SAW) [185] consist of sparse, crowd-sourced reflectance and shading annotations on real indoor images. Subsequently, several papers have trained CNN-based classifiers on these sparse annotations and used the classifier outputs as priors to guide decomposition [185, 417, 428, 253]. However, we find these annotations alone are insufficient for training a direct regression approach, likely because they are sparse and derived from just a few thousand images. Finally, very recent work has explored the use of time-lapse imagery as training data for intrinsic images [205], although this provides a very indirect source of supervision.

Synthetic datasets for real scenes. Synthetic data has recently been utilized to improve predictions on real-world images across a range of problems. For instance, [287, 286] created a large-scale dataset and benchmark based on video games for the purpose of autonomous driving, and [25, 38] use synthetic imagery to form small benchmarks for intrinsic images. SUNCG [406] is a recent, large-scale synthetic dataset for indoor scene understanding. However, many of the images in the PBRS database of physically-based renderings derived from SUNCG have low signal-to-noise ratio (SNR) and unrealistic sensor properties. We show that higher-quality renderings yield much better training data for intrinsic images.
5.3 CGINTRINSICS Dataset

To create our CGINTRINSICS (CGI) dataset, we started from the SUNCG dataset [331], which contains over 45,000 3D models of indoor scenes. We first considered the PBRS dataset of physically-based renderings of scenes from SUNCG [406]. For each scene, PBRS samples cameras from good viewpoints and uses the physically-based Mitsuba renderer [156] to generate realistic images under reasonably realistic lighting (including a mix of indoor and outdoor illumination sources), with global illumination. Using such an approach, we can also generate ground truth data for intrinsic images by rendering a standard RGB image I, asking the renderer to produce a reflectance map R from the same viewpoint, and dividing to get the shading image S = I/R. Examples of such ground truth decompositions are shown in Figure 5.2.

Figure 5.2: Visualization of ground truth from our CGINTRINSICS dataset. Top row: rendered RGB images. Middle: ground truth reflectance. Bottom: ground truth shading. Note that light sources are masked out when creating the ground truth decomposition.

Note that we automatically mask out light sources (including illumination from windows looking outside) when creating the decomposition, and do not consider those pixels when training the network. However, we found that the PBRS renderings are not ideal for training real-world intrinsic image decomposition networks. In fact, certain details of how images are rendered have a dramatic impact on learning performance:

Rendering quality. Mitsuba and other high-quality renderers support a range of rendering algorithms, including various flavors of path tracing that sample many light paths for each output pixel. In PBRS, the authors note that bidirectional path tracing works well but is very slow, and opt for Metropolis Light Transport (MLT) with a sample rate of 512 samples per pixel [406]. In contrast, for our purposes we found that bidirectional path tracing (BDPT) with a very large number of samples per pixel was the only algorithm that gave consistently good results for rendering SUNCG images. Comparisons between selected renderings from PBRS and our new CGI images are shown in Figure 5.3; note the significantly decreased noise in our renderings.

Figure 5.3: Visual comparisons between our CGI and the original SUNCG dataset. Top row: images from SUNCG/PBRS. Bottom row: images from our CGI dataset. The images in our dataset have higher SNR and are more realistic.

This extra quality comes at a cost. We find that BDPT with 8,192 samples per pixel yields acceptable quality for most images, but it increases the render time per image significantly, from a reported 31s [406] to approximately 30 minutes. (While high, this is still a fair ways off of the reported render times for animated films; for instance, each frame of Pixar's Monsters University took a reported 29 hours to render [347].) One reason for the need for large numbers of samples is that SUNCG scenes are often challenging from a rendering perspective: the illumination is often indirect, coming from open doorways or otherwise constrained by geometry. However, rendering is highly parallelizable, and over the course of about six months we rendered over ten thousand images on a cluster of about 10 machines.

Tone mapping from HDR to LDR. We found that another critical factor in image generation is how rendered images are tone mapped. Renderers like Mitsuba generally produce high dynamic range (HDR) outputs that encode raw, linear radiance estimates
Tone mapping from HDR to LDR. We found that another critical factor in image generation is how rendered images are tone mapped. Renderers like Mitsuba generally produce high dynamic range (HDR) outputs that encode raw, linear radiance estimates for each pixel. In contrast, real photos are usually low dynamic range (LDR). The process that takes an HDR input and produces an LDR output is called tone mapping, and in real cameras the analogous operations are the auto-exposure, gamma correction, etc., that yield a well-exposed, high-contrast photograph.

PBRS uses the tone mapping method of Reinhard et al. [283], which is inspired by photographers such as Ansel Adams, but which can produce images that are very different in character from those of consumer cameras. We find that a simpler tone mapping method produces more natural-looking results. Again, Figure 5.3 shows comparisons between PBRS renderings and our own. Note how color and illumination features, such as shadows, are better captured in our renderings (we noticed that shadows often disappear with the Reinhard tone mapper). In particular, to tone map a linear HDR radiance image $I_{HDR}$, we find the 90th percentile intensity value $r_{90}$, then compute the image $I_{LDR} = \alpha I_{HDR}^{\gamma}$, where $\gamma = 1/2.2$ is a standard gamma correction factor and $\alpha$ is computed such that $r_{90}$ maps to the value 0.8. The final image is then clipped to the range $[0, 1]$. This mapping ensures that at most 10% of the image pixels (and usually many fewer) are saturated after tone mapping, and tends to result in natural-looking LDR images.

Using the above rendering approach, we re-rendered ∼20,000 images from PBRS. We also integrated 152 realistic renderings from [38] into our dataset. Table 5.1 compares our CGI dataset to prior intrinsic image datasets.

Dataset               Size     Setting    Rendered/Real  Illumination       GT type
MPI Sintel [51]       890      Animation  non-PB         spatially-varying  full
MIT Intrinsics [115]  110      Object     Real           single global      full
ShapeNet [313]        2M+      Object     PB             single global      full
IIW [26]              5,230    Scene      Real           spatially-varying  sparse
SAW [185]             6,677    Scene      Real           spatially-varying  sparse
CGINTRINSICS          20,000+  Scene      PB             spatially-varying  full

Table 5.1: Comparison of existing intrinsic image datasets with our CGINTRINSICS dataset. PB indicates physically-based rendering and non-PB indicates non-physically-based rendering.

Sintel is a dataset created for an animated film and does not use physically-based rendering. Other datasets, such as ShapeNet and MIT, are object-centric, whereas CGI focuses on images of indoor scenes, which have more sophisticated structure and illumination (cast shadows, spatially-varying lighting, etc.). Compared to IIW and SAW, which include images of real scenes, CGI has full ground truth and is much more easily collected at scale.
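The tone-mapping operator described above is simple to implement. A minimal NumPy sketch, assuming an (H, W, 3) array of linear radiance (the function name and layout are our own):

```python
import numpy as np

def tone_map(hdr, gamma=1.0 / 2.2):
    """Map linear HDR radiance to LDR: gamma-correct, scale so that the
    90th-percentile intensity r90 maps to 0.8, then clip to [0, 1]."""
    r90 = np.percentile(hdr, 90.0)            # 90th-percentile linear intensity
    alpha = 0.8 / max(r90 ** gamma, 1e-8)     # chosen so r90 -> 0.8 after gamma
    ldr = alpha * np.power(np.maximum(hdr, 0.0), gamma)
    return np.clip(ldr, 0.0, 1.0)
```

By construction, at most 10% of pixels can exceed 1.0 before clipping, which is what keeps the results from looking washed out or overly saturated.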
5.4 Learning Cross-Dataset Intrinsics

In this section, we describe how we use CGINTRINSICS to jointly train an intrinsic decomposition network end-to-end, incorporating additional sparse annotations from IIW and SAW. Our full training loss considers training data from each dataset:

$L = L_{CGI} + \lambda_{IIW} L_{IIW} + \lambda_{SAW} L_{SAW}$,   (5.1)

where $L_{CGI}$, $L_{IIW}$, and $L_{SAW}$ are the losses we use for training from the CGI, IIW, and SAW datasets respectively. The most direct way to train would be to simply incorporate supervision from each dataset. In the case of CGI, this supervision consists of full ground truth. For IIW and SAW, this supervision takes the form of sparse annotations for each image, as illustrated in Figure 5.1. However, in addition to supervision, we found that incorporating smoothness priors into the loss also improves performance. Our full loss functions thus incorporate a number of terms:

$L_{CGI} = L_{sup} + \lambda_{ord} L_{ord} + \lambda_{rec} L_{reconstruct}$   (5.2)
$L_{IIW} = \lambda_{ord} L_{ord} + \lambda_{rs} L_{rsmooth} + \lambda_{ss} L_{ssmooth} + L_{reconstruct}$   (5.3)
$L_{SAW} = \lambda_{S/NS} L_{S/NS} + \lambda_{rs} L_{rsmooth} + \lambda_{ss} L_{ssmooth} + L_{reconstruct}$   (5.4)

We now describe each term in detail.

5.4.1 Supervised losses

CGIntrinsics-supervised loss. Since the images in our CGI dataset are equipped with a full ground truth decomposition, the learning problem for this dataset can be formulated as a direct regression problem from input image $I$ to output images $R$ and $S$. However, because the decomposition is only defined up to an unknown scale factor, we use a scale-invariant supervised loss, $L_{siMSE}$ (for "scale-invariant mean-squared-error"). In addition, we add a gradient-domain multi-scale matching term $L_{grad}$. For each training image in CGI, our supervised loss is defined as $L_{sup} = L_{siMSE} + L_{grad}$, where

$L_{siMSE} = \frac{1}{N} \sum_{i=1}^{N} \left[ (R_i - c_r R^*_i)^2 + (S_i - c_s S^*_i)^2 \right]$   (5.5)

$L_{grad} = \sum_{l=1}^{L} \frac{1}{N_l} \sum_{i=1}^{N_l} \left[ \left\| \nabla R_{l,i} - c_r \nabla R^*_{l,i} \right\|_1 + \left\| \nabla S_{l,i} - c_s \nabla S^*_{l,i} \right\|_1 \right].$   (5.6)

Here $R_{l,i}$ ($R^*_{l,i}$) and $S_{l,i}$ ($S^*_{l,i}$) denote the reflectance and shading prediction (resp. ground truth) at pixel $i$ and scale $l$ of an image pyramid. $N_l$ is the number of valid pixels at scale $l$ and $N = N_1$ is the number of valid pixels at the original image scale. The scale factors $c_r$ and $c_s$ are computed via least squares.

In addition to the scale-invariance of $L_{siMSE}$, another important aspect is that we compute the MSE in the linear intensity domain, as opposed to the all-pairs pixel comparisons in the log domain used in [252]. In the log domain, pairs of pixels with large absolute log-difference tend to dominate the loss. As we show in our evaluation, computing $L_{siMSE}$ in the linear domain significantly improves performance. Finally, the multi-scale gradient matching term $L_{grad}$ encourages decompositions to be piecewise smooth with sharp discontinuities.

Ordinal reflectance loss. IIW provides sparse ordinal reflectance judgments between pairs of points (e.g., "point i has brighter reflectance than point j"). We introduce a loss based on this ordinal supervision. For a given IIW training image and predicted reflectance $R$, we accumulate losses for each pair of annotated pixels $(i, j)$ in that image: $L_{ord}(R) = \sum_{(i,j)} e_{i,j}(R)$, where

$e_{i,j}(R) = \begin{cases} w_{i,j} (\log R_i - \log R_j)^2, & r_{i,j} = 0 \\ w_{i,j} \left( \max(0, m - \log R_i + \log R_j) \right)^2, & r_{i,j} = +1 \\ w_{i,j} \left( \max(0, m - \log R_j + \log R_i) \right)^2, & r_{i,j} = -1 \end{cases}$   (5.7)

and $r_{i,j}$ is the ordinal relation from IIW, indicating whether point $i$ is darker (-1), $j$ is darker (+1), or they have equal reflectance (0). $w_{i,j}$ is the confidence of the annotation, provided by IIW. Example predictions with and without IIW data are shown in Figure 5.4.

Figure 5.4: Examples of predictions with and without IIW training data (columns: image, CGI (R), CGI (S), CGI+IIW (R), CGI+IIW (S)). Adding real IIW data can qualitatively improve reflectance and shading predictions. Note for instance how the quilt highlighted in the first row has a more uniform reflectance after incorporating IIW data, and similarly for the floor highlighted in the second row.

We also found that adding a similar ordinal term derived from CGI data can improve reflectance predictions. For each image in CGI, we over-segment it using superpixel segmentation [2]. Then in each training iteration, we randomly choose one pixel from every segmented region, and for each pair of chosen pixels, we evaluate $L_{ord}$ as in Eq. 5.7, with $w_{i,j} = 1$ and the ordinal relation derived from the ground truth reflectance.
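The scale-invariant loss has a closed-form inner minimization: the scale factors in Eq. 5.5 are solved by least squares before computing the MSE. A minimal PyTorch sketch, with tensor shapes and function names as illustrative assumptions:

```python
import torch

def scale_invariant_mse(pred, gt, mask):
    """Scale-invariant MSE between a predicted and ground-truth map (Eq. 5.5,
    one channel). pred, gt, mask: (H, W) tensors; mask is 1 for valid pixels
    (e.g., light sources are masked out).
    """
    n = mask.sum().clamp(min=1.0)
    # argmin_c sum(mask * (pred - c * gt)^2)  =>  c = <pred, gt> / <gt, gt>
    c = (mask * pred * gt).sum() / (mask * gt * gt).sum().clamp(min=1e-8)
    return (mask * (pred - c * gt) ** 2).sum() / n

def l_simse(R, S, R_gt, S_gt, mask):
    # Independent scale factors c_r and c_s for reflectance and shading.
    return scale_invariant_mse(R, R_gt, mask) + scale_invariant_mse(S, S_gt, mask)
```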
SAW shading loss. The SAW dataset provides images containing annotations of smooth (S) shading regions and non-smooth (NS) shading points, as depicted in Figure 5.1. These annotations can be further divided into three types: regions of constant shading, shadow boundaries, and depth/normal discontinuities. We integrate all three types of annotations into our supervised SAW loss $L_{S/NS}$.

For each constant shading region (with $N_c$ pixels), we compute a loss $L_{constant\text{-}shading}$ encouraging the variance of the predicted shading in the region to be zero:

$L_{constant\text{-}shading} = \frac{1}{N_c} \sum_{i=1}^{N_c} (\log S_i)^2 - \left( \frac{1}{N_c} \sum_{i=1}^{N_c} \log S_i \right)^2.$   (5.8)

SAW also provides individual point annotations at cast shadow boundaries. As noted in [185], these points are not localized precisely on shadow boundaries, and so we apply a morphological dilation with a radius of 5 pixels to the set of marked points before using them in training. This results in shadow boundary regions. We find that most shadow boundary annotations lie in regions of constant reflectance, which implies that for all pairs of shading pixels within a small neighborhood, their log difference should be approximately equal to the log difference of the image intensity. This is equivalent to encouraging the variance of $\log S_i - \log I_i$ within this small region to be zero [85]. Hence, we define the loss for each shadow boundary region (with $N_{sd}$ pixels) as:

$L_{shadow} = \frac{1}{N_{sd}} \sum_{i=1}^{N_{sd}} (\log S_i - \log I_i)^2 - \left( \frac{1}{N_{sd}} \sum_{i=1}^{N_{sd}} (\log S_i - \log I_i) \right)^2.$   (5.9)

Finally, SAW provides depth/normal discontinuities, which are also usually shading discontinuities. However, since we cannot derive the actual shading change for such discontinuities, we simply mask out such regions in our shading smoothness term $L_{ssmooth}$ (Eq. 5.11), i.e., we do not penalize shading changes in such regions. As above, we first dilate these annotated regions before use in training. Example predictions before and after adding SAW data to our training are shown in Figure 5.5.

Figure 5.5: Examples of predictions with and without SAW training data (columns: image, CGI (R), CGI (S), CGI+SAW (R), CGI+SAW (S)). Adding SAW training data can qualitatively improve reflectance and shading predictions. Note the pictures/TV highlighted in the decompositions in the first row, and the improved assignment of texture to the reflectance channel for the paintings and sofa in the second row.

5.4.2 Smoothness losses

To further constrain the decompositions for real images in IIW/SAW, following classical intrinsic image algorithms we add reflectance smoothness ($L_{rsmooth}$) and shading smoothness ($L_{ssmooth}$) terms. For reflectance, we use a multi-scale $\ell_1$ smoothness term to encourage reflectance predictions to be piecewise constant:

$L_{rsmooth} = \sum_{l=1}^{L} \frac{1}{N_l} \sum_{i=1}^{N_l} \sum_{j \in \mathcal{N}(l,i)} v_{l,i,j} \, \left\| \log R_{l,i} - \log R_{l,j} \right\|_1$   (5.10)

where $\mathcal{N}(l, i)$ denotes the 8-connected neighborhood of the pixel at position $i$ and scale $l$. The reflectance weight is $v_{l,i,j} = \exp\left( -\frac{1}{2} (f_{l,i} - f_{l,j})^{\top} \Sigma^{-1} (f_{l,i} - f_{l,j}) \right)$, and the feature vector $f_{l,i}$ is defined as $[\, p_{l,i}, I_{l,i}, c^1_{l,i}, c^2_{l,i} \,]$, where $p_{l,i}$ and $I_{l,i}$ are the spatial position and image intensity respectively, and $c^1_{l,i}$ and $c^2_{l,i}$ are the first two elements of chromaticity. $\Sigma$ is a covariance matrix defining the distance between two feature vectors.
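Both SAW region losses above (Eqs. 5.8 and 5.9) are simply variance penalties over an annotated region. A minimal PyTorch sketch, with tensor conventions as assumptions:

```python
import torch

def constant_shading_loss(log_S, region_mask):
    """Eq. 5.8: encourage zero variance of predicted log shading within an
    annotated constant-shading region. log_S: (H, W); region_mask: (H, W)
    binary mask (assumed non-empty).
    """
    vals = log_S[region_mask.bool()]
    return (vals ** 2).mean() - vals.mean() ** 2  # variance of log S

def shadow_boundary_loss(log_S, log_I, region_mask):
    """Eq. 5.9: within a (dilated) shadow-boundary region of constant
    reflectance, log S - log I should be constant, so penalize its variance."""
    vals = (log_S - log_I)[region_mask.bool()]
    return (vals ** 2).mean() - vals.mean() ** 2
```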
We also include a densely-connected $\ell_2$ shading smoothness term, which can be evaluated in time linear in the number of pixels $N$ using bilateral embeddings [17, 205]:

$L_{ssmooth} = \frac{1}{2N} \sum_{i}^{N} \sum_{j}^{N} \hat{W}_{i,j} (\log S_i - \log S_j)^2 \approx \frac{1}{N} s^{\top} \left( I - N_b S_b^{\top} \bar{B}_b S_b N_b \right) s$   (5.11)

where $\hat{W}$ is a bistochastic weight matrix derived from $W$, with $W_{i,j} = \exp\left( -\frac{1}{2} \frac{\| p_i - p_j \|^2}{\sigma_p^2} \right)$. We refer readers to [17, 205] for a detailed derivation. As shown in our experiments, adding such smoothness terms for real data can yield better generalization.

5.4.3 Reconstruction loss

Finally, for each training image in each dataset, we add a loss expressing the constraint that the reflectance and shading should reconstruct the original image:

$L_{reconstruct} = \frac{1}{N} \sum_{i=1}^{N} (I_i - R_i S_i)^2.$   (5.12)

5.4.4 Network architecture

Our network architecture is illustrated in Figure 5.1. We use a variant of the "U-Net" architecture [205, 153]. Our network has one encoder and two decoders with skip connections; the two decoders output log reflectance and log shading, respectively. Each layer of the encoder mainly consists of a 4 × 4 stride-2 convolutional layer followed by batch normalization [151] and leaky ReLU [132]. For the two decoders, each layer is composed of a 4 × 4 deconvolutional layer followed by batch normalization and ReLU, and a 1 × 1 convolutional layer is appended to the final layer of each decoder.

5.5 Evaluation

We conduct experiments on two datasets of real-world scenes, IIW [26] and SAW [185] (using test data unseen during training), and compare our method with several state-of-the-art intrinsic images algorithms. Additionally, we evaluate the generalization of our CGI dataset by evaluating on the MIT Intrinsic Images benchmark [115].

Network training details. We implement our method in PyTorch [272]. For all three datasets, we perform data augmentation through random flips, resizing, and crops. For all evaluations, we train our network from scratch using the Adam [177] optimizer, with initial learning rate 0.0005 and mini-batch size 16. We refer readers to the supplementary material for the detailed hyperparameter settings.

5.5.1 Evaluation on IIW

We follow the train/test split for IIW provided by [253], also used in [417]. We also conduct several ablation studies using different loss configurations. Quantitative comparisons of Weighted Human Disagreement Rate (WHDR) between our method and other optimization- and learning-based methods are shown in Table 5.2.

Method                    Training set  WHDR
Retinex-Color [115]       -             26.9%
Garces et al. [99]        -             24.8%
Zhao et al. [409]         -             23.8%
Bell et al. [26]          -             20.6%
Zhou et al. [417]         IIW           19.9%
Bi et al. [32]            -             17.7%
Nestmeyer et al. [254]    IIW           19.5%
Nestmeyer et al. [254]*   IIW           17.7%
DI [252]                  Sintel        37.3%
Shi et al. [313]          ShapeNet      59.4%
Ours (log, L_siMSE)       CGI           22.7%
Ours (w/o L_grad)         CGI           19.7%
Ours (w/o L_ord)          CGI           19.9%
Ours (w/o L_rsmooth)      All           16.1%
Ours                      SUNCG         26.1%
Ours†                     CGI           18.4%
Ours                      CGI           17.8%
Ours                      CGI+IIW(O)    17.5%
Ours                      CGI+IIW(A)    16.2%
Ours                      All           15.5%
Ours*                     All           14.8%

Table 5.2: Numerical results on the IIW test set. Lower is better for WHDR. The "Training set" column specifies the training data used by each learning-based method; "-" indicates an optimization-based method. IIW(O) indicates original IIW annotations and IIW(A) indicates augmented IIW comparisons. "All" indicates CGI+IIW(A)+SAW. † indicates the network was validated on CGI; others were validated on IIW. * indicates CNN predictions post-processed with a guided filter [254].
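For reference, WHDR measures the confidence-weighted fraction of human ordinal judgments that a predicted reflectance disagrees with. A minimal sketch, assuming the standard IIW convention with a relative threshold δ = 0.10; the judgment encoding here is illustrative rather than IIW's exact release format:

```python
import numpy as np

def whdr(R, judgments, delta=0.10):
    """Weighted Human Disagreement Rate.

    R: (H, W) predicted reflectance, linear intensity.
    judgments: iterable of ((ix, iy), (jx, jy), darker, weight), where
      darker is 0 (roughly equal), 1 (point i darker), or 2 (point j darker).
    """
    total, wrong = 0.0, 0.0
    for (ix, iy), (jx, jy), darker, w in judgments:
        ri, rj = float(R[iy, ix]), float(R[jy, jx])
        if ri / max(rj, 1e-10) > 1.0 + delta:
            pred = 2          # i is brighter, so point j is darker
        elif rj / max(ri, 1e-10) > 1.0 + delta:
            pred = 1          # j is brighter, so point i is darker
        else:
            pred = 0          # roughly equal reflectance
        total += w
        wrong += w * (pred != darker)
    return wrong / max(total, 1e-10)
```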
Comparing direct CNN predictions, our CGI-trained model is significantly better than the best learning-based method [254], and similar to [32], even though [254] was trained directly on IIW. Additionally, running the post-processing from [254] on the results of the CGI-trained model achieves a further performance boost. Table 5.2 also shows that models trained on SUNCG (i.e., PBRS), Sintel, MIT Intrinsics, or ShapeNet generalize poorly to IIW, likely due to the lower quality of the training data (SUNCG/PBRS) or the larger domain gap with respect to images of real-world scenes, compared to CGI. The comparison to SUNCG suggests the key importance of our rendering decisions.

We also evaluate networks trained jointly using CGI and real imagery from IIW. As in [417], we augment the pairwise IIW judgments by globally exploiting their transitivity and symmetry. The right part of Table 5.2 demonstrates that including IIW training data leads to further improvements in performance, as does also including SAW training data. Table 5.2 also shows various ablations on variants of our method, such as evaluating losses in the log domain and removing terms from the loss functions. Finally, we test a network trained on only IIW/SAW data (and not CGI), or trained on CGI and fine-tuned on IIW/SAW. Although such a network achieves ∼19% WHDR, we find the decompositions qualitatively unsatisfactory: the sparsity of the training data causes these networks to produce degenerate decompositions, especially for shading images.

5.5.2 Evaluation on SAW

To evaluate our shading predictions, we test our models on the SAW [185] test set, utilizing the error metric introduced in [205]. We also propose a new, more challenging error metric for SAW evaluation. In particular, we found that many of the constant-shading regions annotated in SAW also have smooth image intensity (e.g., textureless walls), making their shading easy to predict.

Method                    Training set      AP% (unweighted)  AP% (challenge)
Retinex-Color [115]       -                 91.93             85.26
Garces et al. [99]        -                 96.89             92.39
Zhao et al. [409]         -                 97.11             89.72
Bell et al. [26]          -                 97.37             92.18
Zhou et al. [417]         IIW               96.24             86.34
Nestmeyer et al. [254]    IIW               97.26             89.94
Nestmeyer et al. [254]*   IIW               96.85             88.64
DI [252]                  Sintel+MIT        95.04             86.08
Shi et al. [313]          ShapeNet          86.62             81.30
Ours (log, L_siMSE)       CGI               97.73             93.03
Ours (w/o L_grad)         CGI               98.15             93.74
Ours (w/o L_ssmooth)      CGI+IIW(A)+SAW    98.60             94.87
Ours                      SUNCG             96.56             87.09
Ours†                     CGI               98.16             93.21
Ours                      CGI               98.39             94.05
Ours                      CGI+IIW(A)        98.56             94.69
Ours                      CGI+IIW(A)+SAW    99.11             97.93

Table 5.3: Quantitative results on the SAW test set. Higher is better for AP%. The second column is described in Table 5.2. The third and fourth columns show performance on the unweighted SAW benchmark and our more challenging gradient-weighted benchmark, respectively.

Figure 5.6: Precision-Recall (PR) curves for shading images on the SAW test set. Left: PR curves generated using the unweighted SAW error metric of [205]. Right: curves generated using our more challenging gradient-weighted metric. (Methods compared: CGI, CGI+IIW, CGI+IIW+SAW, ShapeNet [Shi et al. 2017], Sintel+MIT [Narihira et al. 2015], [Bell et al. 2014], Retinex-Color [Grosse et al. 2009], [Garces et al. 2012], [Zhao et al. 2012], [Zhou et al. 2015].)
Our proposed metric downweights such regions as follows. For each annotated region of constant shading, we compute the average image gradient magnitude over the region. During evaluation, when we add the pixels belonging to a region of constant shading into the confusion matrices, we multiply the number of pixels by this average gradient. This proposed metric leads to more distinguishable performance differences between methods, because regions with rich textures contribute more to the error than under the unweighted metric.

Figure 5.6 and Table 5.3 show precision-recall (PR) curves and average precision (AP) on the SAW test set with both the unweighted [205] and our proposed challenge error metrics. As with IIW, networks trained solely on our CGI data achieve state-of-the-art performance, even without using SAW training data. Adding real IIW data improves the AP in terms of both error metrics. Finally, the last column of Table 5.3 shows that integrating SAW training data significantly improves the performance of shading predictions, suggesting the effectiveness of our proposed losses for SAW sparse annotations.

Note that the previous state-of-the-art algorithms on IIW (e.g., Zhou et al. [417] and Nestmeyer et al. [254]) tend to overfit to reflectance, hurting the accuracy of shading predictions. This is especially evident in terms of our proposed challenge error metric. In contrast, our method achieves state-of-the-art results on both reflectance and shading predictions, in terms of all error metrics. Note also that models trained on the original SUNCG, Sintel, MIT intrinsics, or ShapeNet datasets perform poorly on the SAW test set, indicating the much improved generalization to real scenes of our CGI dataset.

Qualitative results on IIW/SAW. Figure 5.7 shows qualitative comparisons between our network trained on all three datasets and two other state-of-the-art intrinsic images algorithms (Bell et al. [26] and Zhou et al. [417]) on images from the IIW/SAW test sets. In general, our decompositions show significant improvements. In particular, our network is better at avoiding attributing surface texture to the shading channel (for instance, the checkerboard patterns evident in the first two rows, and the complex textures in the last four rows) while still predicting accurate reflectance (such as the mini-sofa in the images of the third row). In contrast, the other two methods often fail to handle such difficult settings. In particular, [417] tends to overfit to reflectance predictions, and their shading estimates strongly resemble the original image intensity. However, our method still makes mistakes, such as the non-uniform reflectance prediction for the chair in the fifth row, as well as residual textures and shadows in the shading and reflectance channels.

Figure 5.7: Qualitative comparisons on the IIW/SAW test sets (columns: image, Bell et al. (R), Bell et al. (S), Zhou et al. (R), Zhou et al. (S), Ours (R), Ours (S)). Our predictions show significant improvements compared to state-of-the-art algorithms (Bell et al. [26] and Zhou et al. [417]). In particular, our predicted shading channels include significantly less surface texture in several challenging settings.

                                     MSE              LMSE             DSSIM
Method             Training set      refl.   shading  refl.   shading  refl.   shading
SIRFS [19]         MIT               0.0147  0.0083   0.0416  0.0168   0.1238  0.0985
DI [252]           Sintel+MIT        0.0277  0.0154   0.0585  0.0295   0.1526  0.1328
Shi et al. [313]   ShapeNet          0.0468  0.0194   0.0752  0.0318   0.1825  0.1667
Shi et al. [313]*  ShapeNet+MIT      0.0278  0.0126   0.0503  0.0240   0.1465  0.1200
Ours               CGI               0.0221  0.0186   0.0349  0.0259   0.1739  0.1652
Ours*              CGI+MIT           0.0167  0.0127   0.0319  0.0211   0.1287  0.1376

Table 5.4: Quantitative results on the MIT intrinsics test set. For all error metrics, lower is better. The second column shows the dataset used for training. * indicates models fine-tuned on MIT.
Figure 5.8: Qualitative comparisons on the MIT intrinsics test set (columns: image, GT, SIRFS, DI, Shi et al., Shi et al.*, Ours, Ours*). Odd rows: reflectance predictions. Even rows: shading predictions. * marks predictions fine-tuned on MIT.

5.5.3 Evaluation on MIT intrinsic images

For the sake of completeness, we also test the ability of our CGI-trained networks to generalize to the MIT Intrinsic Images dataset [115]. In contrast to IIW/SAW, the MIT dataset contains 20 real objects with 11 different illumination conditions. We follow the same train/test split as Barron et al. [19], and, as in the work of Shi et al. [313], we directly apply our CGI-trained networks to the MIT test set; we additionally test fine-tuning them on the MIT training set. We compare our models with several state-of-the-art learning-based methods using the same error metrics as [313].

Table 5.4 shows quantitative comparisons and Figure 5.8 shows qualitative results. Both show that our CGI-trained model outperforms ShapeNet-trained networks qualitatively and quantitatively, even though ShapeNet, like MIT, consists of images of rendered objects, while our dataset contains images of scenes. Moreover, our CGI-pretrained model also performs better than networks pretrained on ShapeNet and Sintel. These results further demonstrate the improved generalization ability of our CGI dataset compared to existing datasets. Note that SIRFS still achieves the best results, but as described in [252, 313], it is designed specifically for single objects and generalizes poorly to real scenes.

5.6 Discussion

We presented a new synthetic dataset for learning intrinsic images, and an end-to-end learning approach that learns better intrinsic image decompositions by leveraging datasets with different types of labels. Our evaluations illustrate the surprising effectiveness of our synthetic dataset on Internet photos of real-world scenes. We find that the details of rendering matter, and hypothesize that improved physically-based rendering may benefit other vision tasks, such as normal prediction and semantic segmentation [406].

CHAPTER 6
CROWDSAMPLING THE PLENOPTIC FUNCTION

6.1 Introduction

There is a thought experiment that goes something like this: Imagine a 'camera' with no optics or image sensor of any kind. Rather, it consists only of a box equipped with GPS, a radio for Internet access, a button for 'taking pictures', and a screen for displaying those pictures. When a user presses its button, the box searches the Internet for photos tagged with its current location, and from these selects a best match to display on the screen. This thought experiment is perhaps best understood in the context of popular tourist attractions, of which one can often find countless images posted online (Figure 6.1, second row). When pointed at such an attraction, one can imagine our box producing images very similar to those of a real camera, forcing us to consider whether an image we capture ourselves is meaningfully different from a near-identical one captured by strangers.
For many, the ensuing philosophical debate hinges on whether an image reflects the scene as they remember it. After all, appearance is not generally constant over time, even under a fixed geometry and viewpoint; in outdoor settings, for example, weather changes, shadows move, and day turns to night, all resulting in appearance changes that can be observed from a single view of the scene. This poses an interesting challenge to the field of image-based rendering: can we use crowdsourced imagery to synthesize arbitrary views of a scene with viewing conditions that change over time?

Without changing viewing conditions, this challenge would reduce to the more familiar problem of reconstructing a 4D light field $L(u, v, x, y)$ that describes all light in our scene [199]. When we add time to our problem, it turns into a 5D reconstruction over what Adelson and Bergen [3] call the plenoptic function ([3] describes the plenoptic function as 7D, but it can be reduced to a 4D color light field supplemented by time by applying the later observations of [199]).

Figure 6.1: Crowdsampled plenoptic slices. Given a large number of tourist photos taken at different times of day, our system learns to construct a continuous set of light fields and to synthesize novel views capturing all-times-of-day scene appearance. (Rows: arrangement of captured images, reference view, and target light field over time; exemplar views from Internet photo collections; reconstructed reference light field under exemplar viewing conditions, i.e., plenoptic slices.)

In this paper, we propose a novel approach to neural image-based rendering from crowdsourced images that leverages the sparse structure of the plenoptic function to learn how scene appearance changes over space and time in an unsupervised manner. Our approach takes unstructured Internet photos spanning some range of time-varying appearance in a scene and learns how to reconstruct a plenoptic slice, a representation of the light field that respects temporal structure in the plenoptic function when interpolated over time, for each of the viewing conditions captured in our input data. By designing our model to preserve the structure of real plenoptic functions, we force it to learn time-varying phenomena like the motion of shadows according to sun position. This lets us, for example, recover plenoptic slices for images taken at different times of day (Figure 6.1, bottom row) and interpolate between them to observe how shadows move as the day progresses (best seen in our supplemental video). In effect, we learn a representation of the scene that can produce high-quality views from a continuum of viewpoints and viewing conditions that vary with time.

Our work makes three key contributions: first, a representation, called a DeepMPI, for neural rendering that extends prior work on multiplane images (MPIs) [418] to model viewing conditions that vary with time; second, a method for training DeepMPIs on sparse, unstructured crowdsampled data that is unregistered in time; and third, a dataset of crowdsampled images taken from Internet photo collections, along with details on how it was collected and registered. Compared with previous work, our approach inherits the advantages of recent methods based on MPIs [418, 335, 240, 69, 90], including the ability to produce high-quality novel views of complex scenes in real time, and the view consistency that arises from a 3D scene representation (in contrast to neural rendering approaches that decode a separate view for each desired viewpoint).
To these advantages we add the key ability to synthesize and interpolate continuous, photo-realistic, time-varying changes in appearance. We compare our approach both quantitatively and qualitatively to recent neural rendering methods, such as Neural Rerendering in the Wild [238], and show that our method produces superior results.

6.2 Related Work

The study of image-based rendering is motivated by a simple question: how do we use a finite set of images to reconstruct an infinite set of views? Different branches of research have explored this question from different angles and with different assumptions. Here we outline the space of approaches, highlighting work most closely related to our own.

Novel view synthesis. Novel view synthesis has traditionally been approached through either explicit estimation of scene geometry and color [133, 426, 58], or using coarser estimates of geometry to guide interpolation between captured views [47, 75, 329]. Light field rendering [199, 114, 55] pushes the latter strategy to an extreme by using dense structured sampling of the light field to make reconstruction guarantees independent of specific scene geometry. Subsequent works [198, 314, 73, 356, 274, 315] have leveraged observations on the structure of light fields to build on this approach. However, most IBR algorithms are designed to model static appearance, making them ill-suited for our problem.

Recently, deep learning techniques have been applied to this problem. Several works [348, 134] rely on global meshes to guide view synthesis. However, such methods rely heavily on the accuracy of 3D models, and often fail to model complex scene components such as translucent and thin objects. Other works predict appearance flow [419], depth probabilities [91, 391], or RGBD light fields [164, 336]. However, many of these methods independently synthesize appearance for each view, leading to inconsistent renderings across views.

Our approach builds on the use of multiplane images (MPIs) [418] for novel view synthesis. Several recent methods have shown that MPIs are an effective and learnable representation for light fields [335, 240, 69, 90]. We build on this representation by introducing the DeepMPI, which further captures viewing-condition-dependent appearance. We are also inspired by recent work that poses view synthesis as decoding features from a learned latent space [326, 219, 327, 348, 66, 89]. However, such work has been limited to synthetic environments or objects captured in controlled settings and is difficult to apply to crowdsampled images.

Appearance modeling. Several works have modeled the time-varying appearance of outdoor scenes using physically-motivated approaches [309, 129, 189] or by combining data-driven methods and dense geometry [100, 277, 400, 232]. Additionally, Martin-Brualla et al. [229, 228] reconstruct time-lapses of urban scenes from Internet photos. However, their method relies on timestamps, and models appearance changes at much coarser granularity (scene dynamics across years). The recent work of Meshry et al. [238] is probably closest to our own. They model appearance changes across varying times of day by learning an appearance embedding. However, their method relies heavily on dense multi-view stereo geometry, and tends to produce temporal artifacts under complex appearance changes. In contrast, our approach is capable of rendering a more continuous range of photo-realistic views across diverse appearances, without relying on dense input geometry.
Deep image synthesis. Our work is also related to the problems of image-to-image translation [62, 154, 367, 271], multi-modal image-to-image translation [422, 423, 143, 195], and style transfer [102, 353, 142, 312]. Recently, Generative Adversarial Networks (GANs) [112, 227, 117] have successfully produced photo-realistic imagery, enabling a variety of applications in deep image synthesis [381, 300, 169, 168, 194, 368]. However, there has been comparatively little investigation of 3D scene representations for deep image synthesis. Our method demonstrates the ability to learn a generative 3D scene representation and produce high-quality novel views of complex scenes.

6.3 Approach

Given a set $\mathcal{I} = \{I_1, I_2, ..., I_n\}$ of crowdsampled photos with corresponding camera viewpoints $\mathcal{C} = \{c^1, c^2, ..., c^n\}$ captured in a common scene, we formulate our problem as the reconstruction of plenoptic slices (local light fields parameterized by an appearance descriptor) around some reference view $r$, conditioned on each of the scene appearances captured in $\mathcal{I}$ (see Figure 6.1 for a geometric sketch of this setup). We present our approach in three parts: first, we describe how the input images $\mathcal{I}$ are collected and registered (Section 6.3.1); then we discuss our representation of the plenoptic function, which extends multiplane images (MPIs) to model appearance changes over time (Section 6.3.2); and finally we describe how to train this representation on our crowdsampled data (Sections 6.3.3 and 6.3.4).

Note on notation: Throughout the paper, we use superscripts to denote camera viewpoints and subscripts to denote image or voxel indices.

6.3.1 Collecting Crowdsampled Data

We selected a number of popular tourist sites and downloaded ∼50K photos from Flickr for each site. For each scene, we must then register these photos by solving for a camera pose and intrinsic parameters for each image. As running structure from motion (SfM) from scratch on such quantities of images is very expensive, we instead started with an existing SfM reconstruction of each site from the MegaDepth dataset [207], and performed camera relocalization to efficiently register each new image against the existing reconstruction [304].

For each landmark, we then identified a reference viewpoint $r$ to center our reconstruction, using a canonical view selection algorithm similar to that of Simon et al. [323] to find viewpoints with a high density of nearby views. We then select all images captured from within a sphere centered at $r$ for use in our method, randomly splitting the set gathered from each landmark into training and test data (see the sketch following Figure 6.2). We manually set the field of view of the reference viewpoint so that it has good coverage of the scene.

We found that the camera parameters estimated from relocalization are sometimes inaccurate, so we reapply global SfM and bundle adjustment to the smaller set of selected images near each scene's reference view to re-estimate these images' camera parameters. We used this data pipeline to gather and register photos for eight locations, and will release this data to the research community. Figure 6.2 shows final SfM reconstructions for three of these landmarks.

Figure 6.2: Registered photo collections ((a) Trevi Fountain, (b) The Pantheon, (c) Sacre Coeur). Example SfM reconstructions of clusters of Internet photos sharing similar viewpoints, labeled as red dots.
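The selection-and-split step described above is simple to express. A minimal NumPy sketch, where the sphere radius, random seed, and function name are our own illustrative choices (the 85:15 split ratio is given in Section 6.4):

```python
import numpy as np

def select_nearby_images(cam_centers, ref_center, radius, train_frac=0.85, seed=0):
    """Select images captured within a sphere around the reference view and
    split them into training and test sets.

    cam_centers: (N, 3) array of camera positions from the SfM reconstruction.
    ref_center:  (3,) position of the reference viewpoint r.
    """
    dists = np.linalg.norm(cam_centers - ref_center[None, :], axis=1)
    selected = np.flatnonzero(dists < radius)      # indices inside the sphere
    rng = np.random.default_rng(seed)
    rng.shuffle(selected)
    n_train = int(train_frac * len(selected))
    return selected[:n_train], selected[n_train:]  # train ids, test ids
```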
6.3.2 The DeepMPI Scene Representation

We base our representation on the multiplane image (MPI) format [346, 418], which represents light fields locally as a stack of fronto-parallel planar RGBα layers arranged at varying distances from the camera, akin to a stack of transparencies. Novel views are rendered from an MPI by warping the layers into a new view, then performing an over operation to composite the warped layers into a rendered image. Individual RGBα elements ("voxels") of an MPI are indexed by $(x, y)$ position and plane depth $d$.

While MPIs have been remarkably effective for reconstructing fixed light fields from sparse views [90], they do not encode any information about how viewing conditions may vary with time. Furthermore, even if we were given a regular MPI corresponding to the viewing conditions of each of our input images, directly interpolating between these MPIs would still fail to capture temporal structure in the plenoptic function. For example, interpolating between morning and afternoon MPIs would cause shadows cast by the sun to appear in duplicate when, in reality, a single shadow moved over time. This observation highlights the distinction between what we call a light field and what we call a plenoptic slice: we use the latter to describe a reparameterization of the light field that is better suited for interpolation over time.

Inspired by DeepVoxels [326], we introduce DeepMPIs to help learn this reparameterization. DeepMPIs augment standard RGBα MPIs by appending a learnable latent feature vector at each MPI voxel (see Figure 6.4). For a given scene, we position a DeepMPI at the reference viewpoint $r$, and denote this reference DeepMPI as $D^r = (B^r, \alpha^r, F^r)$. Each voxel of $D^r$ at spatial location and depth $p = (x, y, d)$ consists of a base RGB color $B^r_p$, an alpha weight $\alpha^r_p$, and a latent feature vector $F^r_p$. We set the number of DeepMPI depth planes to 64, with uniform sampling in disparity space, and we adopt the method of Zhou et al. [418] to set the depths of the near and far planes of the DeepMPI.

In our Appendix we relate the design of this representation and its training to priors on the sparse structure of the plenoptic function. At a high level, the α planes encode visibility information, which we expect to remain constant even as lighting and other viewing conditions change with time. The latent feature planes $F^r$ are trained to capture correlations between different viewing conditions that arise from, for example, limited variation in material properties and correlation among surface normals within the scene.

A plenoptic slice then consists of a DeepMPI and some exemplar image $I_k$. We can convert this to a standard RGBα MPI representing appearance under the specific conditions captured in $I_k$ by using a decoder that is trained jointly with our DeepMPI, which we describe in Section 6.3.4.

To compute a DeepMPI from a collection of registered images, we use a two-stage process: first, we estimate the base color and α planes (Section 6.3.3); then we optimize the latent features $F^r$ jointly with our neural rendering network (Section 6.3.4) to enable controllable, varying appearance.
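As a concrete picture of the representation, a DeepMPI can be stored as three tensors. A minimal PyTorch sketch; the spatial resolution and zero initializer are illustrative assumptions (the text specifies 64 planes, and Section 6.4 specifies 8-channel latent features):

```python
from dataclasses import dataclass
import torch

@dataclass
class DeepMPI:
    """Reference DeepMPI D^r = (B^r, alpha^r, F^r), with D = 64 fronto-
    parallel planes sampled uniformly in disparity."""
    base_color: torch.Tensor  # (D, 3, H, W)  base RGB planes B^r
    alpha: torch.Tensor       # (D, 1, H, W)  alpha planes, in [0, 1]
    features: torch.Tensor    # (D, 8, H, W)  learnable latent features F^r

def make_deepmpi(H, W, num_planes=64, feat_dim=8):
    # In practice B^r is initialized from the mean plane sweep volume and
    # alpha/features are optimized in stages (Secs. 6.3.3 and 6.3.4).
    return DeepMPI(
        base_color=torch.zeros(num_planes, 3, H, W),
        alpha=torch.full((num_planes, 1, H, W), 0.5),
        features=torch.zeros(num_planes, feat_dim, H, W),
    )
```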
6.3.3 Stage 1: Optimizing DeepMPI Color and α Planes

In the first stage of our method, we optimize the base color planes $B^r$ and alpha planes $\alpha^r$ in our DeepMPI as if it were a standard RGBα MPI. One simple approach would be to jointly optimize $B^r$ and $\alpha^r$ from scratch so as to minimize a reconstruction loss over all images (i.e., the difference between a known image and an MPI-predicted image from that viewpoint, averaged over all input images). However, as described in [90], such a method exhibits slow convergence and can be prone to local minima. In addition, compared to [90], our setting is more challenging because Internet photos exhibit diversity in camera parameters and viewing conditions.

Instead, we propose a simple yet effective approach to estimating $B^r$ and $\alpha^r$ given a set of posed input views. We start by creating a mean RGB plane sweep volume (PSV) at the reference viewpoint by reprojecting every image to the reference viewpoint via each depth plane, then averaging all reprojected images at each depth plane. We initialize the base color planes $B^r$ to this mean RGB PSV. Keeping these color planes fixed, we optimize the alpha planes $\alpha^r$ to minimize reconstruction losses over the training photos. Specifically, given a photo $I_k$ at viewpoint $c^k$, we project both $B^r$ and $\alpha^r$ to $c^k$, then apply the over operation from back to front to render a base color image $\hat{B}^k$:

$\hat{B}^k = \mathcal{O}\left( \mathcal{W}^k(B^r), \mathcal{W}^k(\alpha^r) \right)$,   (6.1)

where $\mathcal{O}$ is the over operation and $\mathcal{W}^k$ is the warping operation from the reference viewpoint $r$ to the target viewpoint $c^k$. We compare the rendered base color image $\hat{B}^k$ and $I_k$ using a reconstruction loss consisting of a pixel-wise $\ell_1$ loss and a multi-scale gradient consistency loss [207, 203]. We observe that the gradient consistency loss leads to higher rendering quality and faster convergence.

Since the mean RGB PSV cannot accurately model scene content that is occluded in the reference view, after optimizing $\alpha^r$ with fixed $B^r$, we unfreeze $B^r$ and jointly optimize $B^r$ and $\alpha^r$ using the reconstruction loss described above. We observe that this two-phase training method leads to more accurate estimates of $\alpha^r$ than the alternative of optimizing $B^r$ and $\alpha^r$ together from scratch. Figure 6.3 shows examples of input viewpoints and rendered base color images, as well as a comparison of pseudo-depths derived from alpha planes $\alpha^r$ computed by our two-phase training method and by the baseline. Once $B^r$ and $\alpha^r$ are estimated, they are fixed for the subsequent stage of training, described below.

Figure 6.3: Renderings of base color and alpha. From left to right: (a) original photos at target viewpoint $c^k$, (b) our estimated base color at $c^k$, (c) pseudo-depth computed from the RGBα MPI at $c^k$ using our two-phase approach, (d) pseudo-depth from the baseline. For depth maps, red = close and blue = far.

Figure 6.4: Learning framework. Our method builds a reference DeepMPI $D^r$, consisting of base color, alpha, and latent feature components organized into planar layers. A rendering network $G$ takes a DeepMPI projected to a target viewpoint $c^k$, and predicts corresponding RGB color layers. The appearance of these layers is modulated by an appearance vector $z_s$ produced by encoder $E$. The over operation $\mathcal{O}$ is applied to the resulting RGBα MPI to render a view. We jointly train the encoder $E$, rendering network $G$, and latent features $F^r$ in the DeepMPI by comparing a rendered view with an original exemplar image $I_k = I_s$. During inference, given an exemplar photo $I_s$, we can synthesize novel views close to the reference viewpoint, while also preserving the exemplar's appearance.
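The over operation $\mathcal{O}$ used throughout this chapter composites planes from back to front. A minimal PyTorch sketch, assuming plane 0 is nearest the camera:

```python
import torch

def over_composite(rgb, alpha):
    """Back-to-front 'over' compositing of MPI planes (the operator O in
    Eq. 6.1). rgb: (D, 3, H, W); alpha: (D, 1, H, W), values in [0, 1].
    """
    out = torch.zeros_like(rgb[0])
    for d in reversed(range(rgb.shape[0])):   # farthest plane first
        out = alpha[d] * rgb[d] + (1.0 - alpha[d]) * out
    return out  # (3, H, W) composited image
```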
6.3.4 Stage 2: Learning How Appearance Changes with Time

Our method's second stage optimizes the latent features $F^r$ in our DeepMPI, together with an appearance encoder $E$ and rendering network $G$, to capture and render time-varying appearance. Our learning framework is summarized in Figure 6.4.

Appearance encoder. To model appearance variation, we devise a method wherein an encoder $E$ learns to map an exemplar image $I_s$ and an auxiliary deep buffer $\Phi^r_s$ to a latent appearance vector $z_s$. Prior work, such as Meshry et al. [238], represents such variation by learning an appearance vector from the exemplar image and a deep buffer containing semantic and depth information. However, their deep buffer is aligned with the viewpoint of the exemplar image. This makes the encoding of exemplar data view-dependent when, under fixed conditions, the information it reflects (e.g., sun direction) should be largely view-independent. In contrast, we utilize a deep buffer aligned with the reference viewpoint. In particular, our encoder $E$ computes a latent appearance vector $z_s$:

$z_s = E(I_s, \Phi^r_s)$,   (6.2)

where $I_s$ is an exemplar image and $\Phi^r_s$ is a reference viewpoint-aligned deep buffer containing (1) a rectified RGB image over-composited from a PSV that reprojects exemplar $I_s$ to the reference viewpoint via the depth planes of the reference DeepMPI, (2) a flattened base color image over-composited from the base color layers $B^r$, and (3) a flattened latent feature map at the reference viewpoint over-composited from the DeepMPI features $F^r$. Such a deep buffer allows $E$ to learn complex appearance by aligning the illumination information in the exemplar image with the shared scene intrinsic properties encoded in the reference DeepMPI. Without such alignment, it is difficult for $E$ to consistently establish appearance correspondence across different viewpoints. Column (d) of Figure 6.5 shows examples of rendered images without the use of such a deep buffer; one can see that the deep buffer guides the model to capture complex illumination effects, such as the realistic shadows highlighted in the first row. Moreover, integrating the base color and latent feature map at the reference viewpoint into $\Phi^r_s$, and adding $I_s$ as an input to $E$, helps the model extrapolate appearance outside the field of view of the exemplar image, as shown in the last row of Figure 6.6.

Neural renderer. A plenoptic slice is now represented by the reference DeepMPI $D^r$ and an appearance vector $z_s$. Given these inputs, our neural renderer $G$ predicts the corresponding RGB color planes. We could either predict these RGB planes at the reference viewpoint, or after first warping the DeepMPI to the target viewpoint. We choose the latter because it simplifies efficient implementation, as noted below. Let $D^k$ denote the reference DeepMPI $D^r$ after warping into target viewpoint $c^k$. Given $D^k$ and $z_s$, $G$ predicts the RGB color planes $C^k_s$ of a standard RGBα MPI at target viewpoint $c^k$:

$C^k_s = G(D^k, z_s).$   (6.3)

In particular, $G$ takes in each layer of $D^k$ independently and predicts a corresponding RGB layer whose appearance is controlled by $z_s$. A rendered RGB image with the appearance of $I_s$ at viewpoint $c^k$ can then be obtained using the over operation with the precomputed alpha weights $\alpha^k$ in $D^k$:

$\hat{I}^k_s = \mathcal{O}(C^k_s, \alpha^k).$   (6.4)

As shown in Figure 6.4, during training we set the exemplar image $I_s = I_k$, i.e., we aim to reconstruct image $I_k$ at viewpoint $c^k$. At inference, $I_s$ is not necessarily $I_k$.

Our rendering network $G$ is a U-Net variant with an encoder-decoder architecture. Prior methods [238, 423] embed $z$ in the bottleneck or input of $G$. Instead, we use Adaptive Instance Normalization (AdaIN) layers [142] whose parameters are dynamically generated from $z$ via an MLP. AdaIN has been shown to be effective in capturing both global and spatially varying appearance of exemplar images. We find that AdaIN not only helps model natural scene appearance, but also stabilizes training. Column (b) of Figure 6.5 shows examples of our rendered images without AdaIN; one can see that the model using AdaIN preserves more faithful scene appearance, including the style and color of exemplar images.

In practice, feeding a full-resolution DeepMPI into $G$ and performing backpropagation is very memory intensive. Hence, during training, we operate on random 256 × 256 crops of training images, and only the necessary portion of $D^r$ is warped to $c^k$ and fed to $G$. At test time, any size input can be used.

Figure 6.5: Comparisons of images reconstructed with different configurations of our method (columns: (a) GT, (b) w/o AdaIN, (c) w/o $F^r$, (d) w/o $E(\Phi^r_s)$, (e) ours). The images rendered from our full approach (e) are more similar to the ground truth images (a) than other configurations. In particular, the images rendered from the models without AdaIN (b) or the DeepMPI (c) are less realistic, and the model that does not feed the deep buffer $\Phi^r_s$ to the encoder (d) fails to capture accurate scene appearance, as indicated in the highlighted regions.
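For concreteness, an AdaIN layer whose scale and shift are generated from the appearance vector $z$ might look as follows. The layer sizes and the (1 + gamma) parameterization are illustrative assumptions, not the exact renderer code:

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: normalize each feature channel, then
    apply a per-channel scale and shift predicted from z by a small MLP."""
    def __init__(self, num_channels, z_dim=16):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.mlp = nn.Linear(z_dim, 2 * num_channels)  # predicts gamma, beta

    def forward(self, x, z):
        # x: (B, C, H, W) features inside the renderer G; z: (B, z_dim)
        gamma, beta = self.mlp(z).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)      # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1.0 + gamma) * self.norm(x) + beta
```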
Prior methods [238, 423] embed z in the bottleneck or input of G. Instead, we use Adaptive Instance Normalization (AdaIN) layers [142] whose parameters are dynamically generated from z via an MLP. AdaIN has been shown to be effective in capturing both global and spatially varying appearance of exemplar images. We find that AdaIN not only helps model natural scene appearance, but also stabilizes training. Column (b) of Figure 6.5 shows examples of our rendered images without AdaIN; one can see the model using AdaIN preserves more faithful scene appearance including the style and color of exemplar images. In practice, feeding a full-resolution DeepMPI into G and performing back- propagation is very memory intensive. Hence, during training, we operate on random 256× 256 crops of training images, and only the necessary portion of Dr is warped to ck and fed to G. At test time, any size input can be used. 114 Our Rendered View (a) GT (b) w/o AdaIN (c) w/o F r (d) w/o E(Φrs) (e) Ours Figure 6.5: Comparisons of images reconstructed with different configurations of our method. The images rendered from our full approach (e) are more similar to the ground truth images (a) than other configurations. In particular, the images rendered from the models without AdaIN (b) or the DeepMPI (c) are less realistic, and the model that does not feed the deep buffer Φrs to the encoder (d) fails to capture accurate scene appearance, as indicated in the highlighted regions. Losses. To train G and E, we compute losses between output views and ground-truth exemplar views. Our training loss is composed of three terms: L = LVGG + wGANLGAN + wstyleLstyle, (6.5) where LVGG, LGAN, and Lstyle denote VGG perceptual loss, adversarial loss, and style loss. For LVGG, we adopt the formulation of [418, 62]; LGAN is computed from multi-scale discriminators [367] with an objective similar to LSGAN [227]. To further enforce that the appearance of rendered images matches that of exemplar images, our style loss Lstyle compares l1 differences between Gram matrices constructed from VGG features at different layers. We empirically observe Lstyle can guide our model to correctly capture the appearance of exemplar images, especially for rare photos such as those taken at sunset. 115 6.4 Experiments We conduct extensive experiments to validate our proposed approach on our Internet photo dataset. We first compare with two baseline methods both quantitatively and quali- tatively on the tasks of view synthesis, appearance transfer and appearance interpolation. We also present an ablation study to examine the impact of different configurations of our model. Finally, we perform a user study whose results demonstrate the quality of our synthesized novel views. Data and implementation. We evaluate our approach on five of our reconstructed scenes, which contain on average 2,064 images. For each scene, images are randomly split into training and test sets with a 85:15 ratio. We train a separate model for each scene. To mask out transient objects such as people and cars during training and evaluation, we adopt state-of-the-art semantic and instance segmentation algorithms [59, 131] to create binary object masks. We set the dimension of the latent appearance vector to z 16s ∈ R , and that of our latent DeepMPI features to F r ∈ R8p . We refer readers to the Appendix for scene statistics, network architectures, and other implementation details. Baselines. 
Baselines. We compare our approach to two state-of-the-art multi-modal image-to-image translation methods, adapted to our task: MUNIT [143] and Neural Rerendering in the Wild (NRW) [238]. To compare to MUNIT, we adapt their network $G$ to predict an RGB image at the target viewpoint from a corresponding base color input, and train with a bidirectional reconstruction loss. For NRW, both $E$ and $G$ take as input base color, per-frame depth derived from the DeepMPI, and semantic segmentation at the target viewpoint; $G$ then predicts a corresponding RGB image conditioned on the appearance vector extracted by $E$. We follow the same staged training strategy and use the same losses as in [238].

              Trevi Fountain     Sacre Coeur        The Pantheon       Top of the Rock    Piazza Navona
Method        l1    LPIPS PSNR   l1    LPIPS PSNR   l1    LPIPS PSNR   l1    LPIPS PSNR   l1    LPIPS PSNR
MUNIT [143]   0.768 2.62  20.1   0.740 2.08  20.2   0.560 1.51  21.4   0.876 3.68  18.2   0.984 2.80  17.4
NRW [238]     0.779 2.07  20.0   0.808 1.90  19.6   0.592 1.35  21.1   0.802 2.76  19.3   1.050 2.64  17.1
w/o 2-phase   0.651 1.68  21.0   0.695 1.61  20.8   0.515 1.12  21.9   0.694 2.19  20.4   1.010 2.52  17.4
w/o AdaIN     0.780 1.87  19.8   0.801 1.89  19.6   0.609 1.30  20.9   0.773 2.58  19.3   1.150 2.97  17.1
w/o F^r       0.712 1.74  20.5   0.737 1.78  20.2   0.556 1.25  21.5   0.720 2.47  19.9   1.045 2.62  17.0
w/o E(Φ^r_s)  0.670 1.70  20.9   0.715 1.66  20.5   0.549 1.16  21.5   0.703 2.24  20.0   1.017 2.52  17.2
Ours (full)   0.618 1.56  21.8   0.676 1.57  21.0   0.495 1.08  22.5   0.642 2.48  20.7   0.933 2.32  17.6

Table 6.1: Quantitative comparisons on our test set. Lower is better for l1 and LPIPS and higher is better for PSNR. l1 errors are scaled by 10 for ease of presentation.

Error metrics. Similar to [238], we report test-image reconstruction errors using three error metrics: $\ell_1$ error, peak signal-to-noise ratio (PSNR), and perceptual similarity (via LPIPS [405]). Prior work has found the LPIPS metric to be better correlated with human visual perception than other metrics.

Quantitative comparison. For fair comparison, we train and evaluate the baselines using the same data and hyperparameter settings as our method. Table 6.1 shows results of quantitative comparisons on our test set. Our proposed approach outperforms the two baseline methods by a large margin in terms of $\ell_1$ and PSNR, and is significantly better in terms of LPIPS, indicating that our method achieves higher rendering quality and realism.

Ablation study. We perform an ablation study to analyze the effect of individual system components. In particular, we replace four components with simpler configurations: (1) using a train-from-scratch baseline to estimate alpha, as described in Section 6.3.3 (w/o 2-phase); (2) including $z$ as an input to $G$ rather than using AdaIN (w/o AdaIN); (3) removing latent features from the DeepMPI (w/o $F^r$); and (4) encoding $z$ only from the exemplar image and not additionally from the deep buffer (w/o $E(\Phi^r_s)$). Quantitative results are reported in Table 6.1. Latent DeepMPI features, as well as the use of AdaIN in our neural renderer, yield significant improvements, and lead to better rendering quality for thin structures and attached shadows, as highlighted in Figure 6.5. Encoding the reference deep buffer also yields rendered images that better match the exemplar image.

Figure 6.6: Appearance transfer comparison. From left to right: (a) exemplar images used to extract appearance vectors, (b) predictions from MUNIT [143], (c) predictions from NRW [238], (d) predictions from our method.
Compared to the baselines, our rendered images are more photo-realistic and more faithful to the appearance of the exemplar images. Please zoom in to the highlighted regions for better visual comparisons.

Rendering with appearance transfer. Figure 6.6 shows qualitative comparisons between our method and the two baselines on our test set in terms of rendering quality and appearance transferability (i.e., how well the model can transfer the illumination and appearance of an exemplar image to a target viewpoint). We demonstrate compelling results in challenging cases such as sunset, which is a rare condition in the input photos. Compared to MUNIT, our rendered images are more realistic and exhibit fewer artifacts. For example, our rendered images successfully model specularities on glass windows, details on running water and droplets, cast shadows, and directional lighting effects, as shown in the highlighted regions in Figure 6.6. Our approach can also generate complex highlights and cast shadows from the sun. Compared with NRW, our rendered images are more faithful to the illumination in the exemplar image (e.g., for sunset appearance). Moreover, our approach can extrapolate appearance beyond the field of view of the exemplar image, as shown in the last row of Figure 6.6.

Appearance interpolation. A key advantage of our method is the ability to interpolate between plenoptic slices in the latent appearance space. We conduct qualitative comparisons between our approach and NRW on appearance interpolation. We choose two images to define the start and end appearance, and linearly interpolate their latent vectors to produce in-between appearances. Figure 6.7 shows a comparison of interpolation results. In the first two rows of the figure, we observe that our method can simulate the progression of surfaces exposed to sunlight as the sun moves, while NRW fails to produce this effect. In the last row, our approach recovers the gradual motion of shadows throughout the day, while shadows in the NRW results tend to fade less naturally during interpolation.

Figure 6.7: Appearance interpolation. The left- and rightmost exemplar images indicate start and end appearance. Intermediate images are generated by linearly interpolating the latent vectors from the two images. Odd rows show interpolation results from NRW [238], and even rows from our method. Moving shadows are indicated in the highlighted regions.

4D photos. Figure 6.8 shows an application of our method to generating animated 4D photos by animating the 3D viewpoint and simultaneously interpolating between latent appearance features. Our results achieve convincing changes across a variety of times of day and lighting conditions.

Figure 6.8: 4D photos. We demonstrate an application of creating 4D photos by performing spatio-temporal interpolation in which both camera viewpoint and scene illumination change simultaneously.

User study. We ran a user study using 24 random sets of videos with camera movements and synthesized images from 5 different scenes. Each video is a sequence of novel views generated by our method, NRW [238], or MUNIT [143]. To quantify the performance of appearance transfer, we also show comparisons of results generated from different exemplar images selected from our test set. We invited 46 participants and asked them to rank the results of the three approaches.
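Appearance interpolation reduces to a lerp in the latent appearance space followed by rendering. A minimal sketch reusing the over_composite helper from the Section 6.3.3 sketch; all interfaces here are illustrative assumptions rather than the exact training code:

```python
import torch

def interpolate_appearance(E, G, D_k, alpha_k, exemplars, buffers, steps=8):
    """Render a sequence that linearly interpolates latent appearance between
    two exemplar photos. E and G follow Eqs. 6.2-6.4; D_k and alpha_k are the
    DeepMPI (and its alpha planes) warped to the target viewpoint c^k.
    """
    z_a = E(exemplars[0], buffers[0])   # appearance of the start exemplar
    z_b = E(exemplars[1], buffers[1])   # appearance of the end exemplar
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        z_t = (1.0 - t) * z_a + t * z_b           # lerp in appearance space
        color_planes = G(D_k, z_t)                # Eq. 6.3: RGB planes at c^k
        frames.append(over_composite(color_planes, alpha_k))  # Eq. 6.4
    return frames
```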
88% of the time, participants responded that the videos produced by our system are the most temporally coherent. 82% of the time, they responded that the results from our method best reproduce the details of structure and illumination one would expect of a real-world scene. 77% of the time, they responded that the results from our method are the most faithful to the corresponding exemplar.

6.5 Discussion

Limitations. Our method inherits limitations from MPIs. For example, MPIs fail to generalize to viewpoints that are not well-sampled, or that are far from the reference view of the MPI (see Figure 6.9(a-b)). In addition, our model can sometimes fail to model cast shadows from occluders outside of the reference field of view, as shown in Figure 6.9(c) and (d). Despite these limitations, we believe our work represents a significant advance towards photo-realistic capture and rendering of the world from crowd photography.

Figure 6.9: Limitations ((a) insufficient views, (b) MPI limits, (c) exemplar for (d), (d) missing shadow). Some failure cases include: (a) input photo collections that do not span the full range of desired viewpoints, or (b) intrinsic limitations of MPIs leading to poor extrapolation to large camera motions. In addition, as shown in (c) (an exemplar image with a strong shadow) and (d) (the resulting rendering), our method can fail to model strong cast shadows produced by occluders outside the reference field of view.

Conclusion. We presented a method for synthesizing novel views of scenes under time-varying appearance from Internet photos. We proposed a new DeepMPI representation and a method for optimizing and decoding DeepMPIs conditioned on the viewing conditions present in different photos. Our method can synthesize plenoptic slices that can be interpolated to recover local regions of the full plenoptic function. In the future, we envision enabling even larger changes in viewpoint and illumination, including 4D walkthroughs of large-scale scenes.

CHAPTER 7
NEURAL SCENE FLOW FIELDS FOR SPACE-TIME VIEW SYNTHESIS OF DYNAMIC SCENES

7.1 Introduction

The topic of novel view synthesis has recently seen impressive progress due to the use of neural networks to learn representations that are well suited for view synthesis tasks. Most prior approaches in this domain assume that the scene is static, or that it is observed from multiple synchronized input views. However, these restrictions are violated by most videos shared on the Internet today, which frequently feature scenes with diverse dynamic content (e.g., humans, animals, vehicles) recorded by a single camera.

We present a new approach for novel view and time synthesis of dynamic scenes from monocular video input with known (or derivable) camera poses. This problem is highly ill-posed, since there can be multiple scene configurations that lead to the same observed image sequences. In addition, using multi-view constraints for moving objects is challenging, as doing so requires knowing the dense 3D motion of all scene points (i.e., the "scene flow"). In this work, we propose to represent a dynamic scene as a continuous function of both space and time, whose output consists of not only reflectance and density, but also 3D scene motion. Similar to prior work, we parameterize this function with a deep neural network (a multi-layer perceptron, MLP), and perform rendering using volume tracing [241].
We optimize the weights of this MLP using a scene flow fields warping loss that enforces that our scene representation is temporally consistent with the input views. Crucially, because we model dense scene flow fields in 3D, our function can represent the sharp motion discontinuities that arise when projecting the scene into image space, even with simple low-level 3D smoothness priors. Further, dense scene flow fields also enable us to interpolate along changes in both space and time. To the best of our knowledge, our approach is the first to achieve novel view and time synthesis of dynamic scenes captured with a monocular camera.

As the problem is highly underconstrained, we introduce several components that improve rendering quality over a baseline solution. Specifically, we introduce a disocclusion confidence measure to handle the inherent ambiguities of scene flow near 3D disocclusions. We also show how to use data-driven priors to avoid local minima during optimization, and describe how to effectively combine a static scene representation with a dynamic one, which lets us render views with higher quality by leveraging multi-view constraints in rigid regions.

In summary, our key contributions include: (1) a neural representation for space-time view synthesis of dynamic scenes that we call Neural Scene Flow Fields, which has the capacity to model 3D scene dynamics, and (2) a method for optimizing Neural Scene Flow Fields on monocular video by leveraging multi-view constraints in both rigid and non-rigid regions, allowing us to synthesize and interpolate both view and time simultaneously.

7.2 Related Work

Our approach is motivated by a large body of work in the areas of novel view synthesis, dynamic scene reconstruction, and video understanding.

Novel view synthesis. Many methods propose first building an explicit 3D scene geometry, such as point clouds or meshes, and rendering this geometry from novel views [47, 58, 75, 133, 179, 329]. Light field rendering methods, on the other hand, synthesize novel views using implicit soft geometry estimates derived from densely sampled images [55, 114, 199]. Numerous other works improve the rendering quality of light fields by exploiting their special structure [73, 274, 314, 356]. Yet another promising 3D representation is multiplane images (MPIs), which have been shown to model complex scene appearance [45, 46, 69, 90, 240, 335].

Recently, deep learning methods have shown promising results by learning a representation that is suited to novel view synthesis. Such methods have learned additional deep features that live on top of reconstructed meshes [134, 289, 348] or dense depth maps [91, 391]. Alternatively, pure voxel-based implicit scene representations have become popular due to their simplicity and CNN-friendly structure [66, 89, 219, 326, 327, 348]. Our method is based on a recent variant of these approaches that represents a scene as a neural radiance field (NeRF) [241], which models the appearance and geometry of a scene implicitly by a continuous function, represented with an MLP. While the above methods have shown impressive view synthesis results, they all assume a static scene with fixed appearance over time, and hence cannot model temporal changes or dynamic scenes.

Another class of methods synthesizes novel views from a single RGB image. These methods typically work by predicting depth maps [203, 262], sometimes with additional learned features [375], or a layered scene representation [316, 351] to fill in content in disocclusions.
While such methods, if trained on appropriate data, can be applied to dynamic scenes, they operate only on a per-frame (instantaneous) basis, and cannot leverage repeated observations across multiple views or be used to synthesize novel times.

Novel time synthesis. Most approaches for interpolating between video frames work in 2D image space, by directly predicting kernels that blend two images [259, 260, 261], or by modeling optical flow and warping frames or features [15, 161, 257, 258]. More recently, Lu et al. [221] show retiming effects of people by using a layered representation. These approaches generate high-quality frame interpolation results, but operate in 2D and cannot be used to synthesize novel views in space.

Space-time view synthesis. There are two main reasons that scenes change appearance across time. The first is illumination change; prior approaches have proposed to render novel views of a single object with plausible relighting [31, 33, 34], or to model time-varying appearance from Internet photo collections [208, 230, 238]. However, these methods operate on static scenes and treat moving objects as outliers. Second, appearance change can happen due to 3D scene motion. Most prior work in this domain [14, 28, 338, 425] requires multi-view, time-synchronized videos as input, and has limited ability to model complicated scene geometry. Most closely related to ours, Yoon et al. [399] propose to combine single-view depth and depth from multi-view stereo to render novel views by performing explicit depth-based 3D warping. However, this method has several drawbacks: it relies on human-annotated foreground masks, requires cumbersome preprocessing and pretraining, and tends to produce artifacts in disocclusions. Instead, we show that our model can be trained end-to-end, produces much more realistic results, and is able to represent complicated scene structure and view-dependent effects along with natural degrees of motion.

Dynamic scene reconstruction. Most successful non-rigid reconstruction systems either require RGBD data as input [40, 79, 150, 255, 396, 427], or can only reconstruct sparse geometry [270, 360, 324, 411]. A few prior monocular methods propose using strong hand-crafted priors to decompose dynamic scenes into piecewise-rigid parts [186, 282, 299]. The recent work of Luo et al. [222] estimates temporally consistent depth maps of scenes with small object motion by optimizing the weights of a single-image depth prediction network, but we show that this approach fails to model large and complex 3D motions. Additional work has aimed to predict per-pixel scene flow of dynamic scenes from either monocular or RGBD sequences [43, 145, 160, 223, 242, 340].

7.3 Approach

We build upon prior work for static scenes [241], to which we add the notion of time, and estimate 3D motion by explicitly modeling forward and backward scene flow as dense 3D vector fields. In this section, we first describe this time-variant (dynamic) scene representation (Sec. 7.3.1) and the method for effectively optimizing it on the input views (Sec. 7.3.2). We then discuss how to improve rendering quality by adding an explicit time-invariant (static) scene representation, optimized jointly with the dynamic one by combining both during rendering (Sec. 7.3.3). Finally, we describe how to achieve space-time interpolation of dynamic scenes through our trained representation (Sec. 7.3.4).

Background: static scene rendering.
Neural Radiance Fields (NeRFs) [241] represent a static scene as a radiance field defined over a bounded 3D volume. This radiance field, denoted F_Θ, is defined by a set of parameters Θ that are optimized to reconstruct the input views. In NeRF, F_Θ is a multi-layer perceptron (MLP) that takes as input a position (x) and viewing direction (d), and produces as output a volumetric density (σ) and RGB color (c):

(\mathbf{c}, \sigma) = F_\Theta(\mathbf{x}, \mathbf{d})    (7.1)

To render the color of an image pixel, NeRF approximates a volume rendering integral. Let r be the camera ray emitted from the center of projection through a pixel on the image plane. The expected color Ĉ of that pixel is then given by:

\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt, \quad \text{where } T(t) = \exp\left( -\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds \right).    (7.2)

Intuitively, T(t) corresponds to the accumulated transparency along that ray. The loss is then the difference between the reconstructed color Ĉ and the ground truth color C corresponding to the pixel that the ray r originated from:

\mathcal{L}_{\text{static}} = \sum_{\mathbf{r}} \| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \|_2^2.    (7.3)

7.3.1 Neural scene flow fields for dynamic scenes

To capture scene dynamics, we extend the static scenario described in Eq. 7.1 by including time in the domain and explicitly modeling 3D motion as dense scene flow fields. For a given 3D point x and time i, the model predicts not just reflectance and opacity, but also forward and backward 3D scene flow F_i = (f_{i→i+1}, f_{i→i−1}), which denote 3D offset vectors that point to the position of x at times i + 1 and i − 1 respectively. Note that we make the simplifying assumption that movement between observed time instances is linear. To handle disocclusions in 3D space, we also predict disocclusion weights W_i = (w_{i→i+1}, w_{i→i−1}) (described in Sec. 7.3.2). Our dynamic model is thus defined as:

(\mathbf{c}_i, \sigma_i, \mathcal{F}_i, \mathcal{W}_i) = F_\Theta^{dy}(\mathbf{x}, \mathbf{d}, i).    (7.4)

Figure 7.1: Scene flow fields warping. To render a frame at time i, we perform volume tracing along a ray r_i with the RGBσ at time i, giving us the pixel color Ĉ_i(r_i) (left). To warp the scene from time j to i, we offset each step along r_i using the scene flow f_{i→j} and volume trace with the associated color and opacity (c_j, σ_j) (right).

Note that for convenience, we use the subscript i to indicate a value at a specific time i.

7.3.2 Optimization

Temporal photometric consistency. The key new loss we introduce enforces that the scene at time i should be consistent with the scene at neighboring times j ∈ N(i), when accounting for motion that occurs due to 3D scene flow. To do this, we volume render the scene at time i from 1) the perspective of the camera at time i and 2) with the scene warped from j to i, so as to undo any motion that occurred between i and j. As shown in Fig. 7.1 (right), we achieve this by warping each 3D sampled point location x_i along a ray r_i during volume tracing using the predicted scene flow fields F_i to look up the RGB color c_j and opacity σ_j from the neighboring time j. This yields a rendered image, denoted Ĉ_{j→i}, of the scene at time j with both camera and scene motion warped to time i:

\hat{C}_{j \to i}(\mathbf{r}_i) = \int_{t_n}^{t_f} T_j(t)\, \sigma_j(\mathbf{r}_{i \to j}(t))\, \mathbf{c}_j(\mathbf{r}_{i \to j}(t), \mathbf{d}_i)\, dt, \quad \text{where } \mathbf{r}_{i \to j}(t) = \mathbf{r}_i(t) + \mathbf{f}_{i \to j}(\mathbf{r}_i(t)).    (7.5)

We minimize the mean squared error (MSE) between each warped rendered view and the ground truth view:

\mathcal{L}_{\text{pho}} = \sum_{\mathbf{r}_i} \sum_{j \in \mathcal{N}(i)} \| \hat{C}_{j \to i}(\mathbf{r}_i) - C_i(\mathbf{r}_i) \|_2^2.    (7.6)
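As a concrete illustration, the following is a minimal PyTorch sketch of how the warped rendering of Eq. 7.5 and the loss of Eq. 7.6 can be discretized under the standard NeRF quadrature. The MLP interface follows the earlier sketch and is an assumption for illustration; sample counts, shapes, and the absence of hierarchical sampling are all simplifications of the actual pipeline.

```python
import torch

def volume_render(rgb, sigma, t_vals):
    """Discretized Eq. 7.2: rgb (N, S, 3), sigma (N, S), t_vals (N, S) sample depths."""
    delta = t_vals[:, 1:] - t_vals[:, :-1]                      # step sizes along each ray
    delta = torch.cat([delta, 1e10 * torch.ones_like(delta[:, :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * delta)                     # per-sample opacity
    trans = torch.cumprod(                                      # exclusive cumprod: T(t)
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1
    )[:, :-1]
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(dim=1)                # expected color per ray

def warped_photometric_loss(mlp, pts_i, dirs, t_vals, i, j, gt_rgb_i):
    """Render \\hat{C}_{j->i} (Eq. 7.5) and compare to the frame at time i (Eq. 7.6)."""
    ti = i * torch.ones_like(pts_i[..., :1])
    tj = j * torch.ones_like(pts_i[..., :1])
    # Scene flow f_{i->j} predicted at the samples x_i along rays of frame i.
    _, _, f_fwd, f_bwd = mlp(pts_i, dirs, ti)
    pts_j = pts_i + (f_fwd if j > i else f_bwd)                 # r_{i->j}(t) = r_i(t) + f_{i->j}
    rgb_j, sigma_j, _, _ = mlp(pts_j, dirs, tj)                 # look up (c_j, sigma_j) at time j
    c_warped = volume_render(rgb_j, sigma_j, t_vals)
    return ((c_warped - gt_rgb_i) ** 2).sum(-1).mean()          # MSE against the input view
```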
An important caveat is that this loss is not valid in 3D disocclusion regions caused by motion. Analogous to the situation in 2D dense optical flow [235], there is no correct scene flow when a 3D location becomes occluded or disoccluded between frames. These regions are especially important as they occur at the boundaries of moving objects (see Fig. 7.2 for an illustration).

Figure 7.2: Scene flow disocclusion ambiguity. In this 2D orthographic example, a single blue object translates to the right by one pixel from the frame at time i to the frame at time j. Here, the correct scene flow at the point labeled a, i.e., f_{i→j}(a), points one unit to the right; however, for the scene flow f_{i→j}(c) (and similarly f_{j→i}(a)), there can be multiple answers. If f_{i→j}(c) = 0, the scene flow would incorrectly point to the foreground in the next frame, and if f_{i→j}(c) = 1, the scene flow would point to the freespace location d in the next frame.

To mitigate errors due to this ambiguity, we predict two extra continuous disocclusion weight fields w_{i→i+1} and w_{i→i−1} ∈ [0, 1], corresponding to f_{i→i+1} and f_{i→i−1} respectively. These weights serve as an unsupervised confidence measure of where the temporal photoconsistency loss should be applied; ideally, they should be low at disocclusions and close to 1 everywhere else. We apply these weights by volume rendering the weight along the ray r_i with the opacity from time j, and multiplying the accumulated weight into the loss at each 2D pixel:

\hat{W}_{j \to i}(\mathbf{r}_i) = \int_{t_n}^{t_f} T_j(t)\, \sigma_j(\mathbf{r}_{i \to j}(t))\, w_{i \to j}(\mathbf{r}_i(t))\, dt.    (7.7)

We avoid the trivial solution in which all predicted weights are zero by adding an ℓ1 regularization that encourages the predicted weights to be close to one, giving us a new weighted loss:

\mathcal{L}_{\text{pho}} = \sum_{\mathbf{r}_i} \sum_{j \in \mathcal{N}(i)} \hat{W}_{j \to i}(\mathbf{r}_i)\, \| \hat{C}_{j \to i}(\mathbf{r}_i) - C_i(\mathbf{r}_i) \|_2^2 + \beta_w \sum_{\mathbf{x}_i} \| w_{i \to j}(\mathbf{x}_i) - 1 \|_1,    (7.8)

where β_w is a regularization weight, which we set to 0.1 in all our experiments. We use N(i) = {i, i ± 1, i ± 2}, and chain scene flow and disocclusion weights for the i ± 2 case. Note that when j = i, no scene flow warping or disocclusion weights are involved (f_{i→j} = 0, Ŵ_{j→i}(r_i) = 1), meaning that Ĉ_{i→i}(r_i) = Ĉ_i(r_i), as in Fig. 7.1 (left). Comparing Fig. 7.3(e) and Fig. 7.3(d), we can see that adding this disocclusion weight improves rendering quality near motion boundaries.

Figure 7.3: Qualitative ablations. Results of our full method with different loss components removed. The odd rows show zoomed-in rendered color and the even rows show the corresponding pseudo depth. Each component reduces the overall quality in different ways.

Scene flow priors. To regularize the predicted scene flow fields, we add a 3D scene flow cycle consistency term encouraging that, at every sampled 3D point x_i, the predicted forward scene flow f_{i→j} is consistent with the backward scene flow f_{j→i} at the corresponding location sampled at time j (i.e., at position x_{i→j} = x_i + f_{i→j}). Note that this cycle consistency is also only valid outside 3D disocclusion regions, so we use the same predicted disocclusion weights to modulate this term, giving us:

\mathcal{L}_{\text{cyc}} = \sum_{\mathbf{x}_i} \sum_{j \in \mathcal{N}(i)} w_{i \to j}\, \| \mathbf{f}_{i \to j}(\mathbf{x}_i) + \mathbf{f}_{j \to i}(\mathbf{x}_{i \to j}) \|_1.    (7.9)

We additionally add low-level regularizations L_reg on the predicted scene flow fields (see the sketch below). First, following prior work [255], we enforce scene flow spatial smoothness by minimizing the ℓ1 difference between scene flows sampled at neighboring 3D positions along each ray. Second, we enforce scene flow temporal smoothness by encouraging 3D point trajectories to be piecewise linear [360]. Finally, we encourage scene flow to be small in most places [357] by applying an ℓ1 regularization term, since motion is isolated to the dynamic objects. All terms are weighted equally in all experiments shown in this paper.
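The following is a hedged sketch of these priors: the weighted cycle consistency of Eq. 7.9 and the three equally weighted L_reg terms. The tensor shapes, and the assumption that the backward flow has already been gathered at the displaced points (f_bwd_warped), are illustrative simplifications.

```python
import torch

def scene_flow_priors(f_fwd, f_bwd, f_bwd_warped, w_fwd):
    """f_fwd, f_bwd: (N, S, 3) flows at samples x_i along each ray;
    f_bwd_warped: backward flow queried at x_i + f_fwd; w_fwd: (N, S) weights."""
    # Cycle consistency (Eq. 7.9): the forward flow should cancel the backward
    # flow at the displaced location, modulated by the disocclusion weight.
    l_cyc = (w_fwd * (f_fwd + f_bwd_warped).abs().sum(-1)).mean()
    # Spatial smoothness: neighboring samples along a ray get similar flow.
    l_spatial = (f_fwd[:, 1:] - f_fwd[:, :-1]).abs().mean()
    # Temporal smoothness: piecewise-linear trajectories, i.e. the forward
    # flow should roughly mirror the negated backward flow.
    l_temporal = (f_fwd + f_bwd).abs().mean()
    # Minimal scene flow: motion should be near zero in most places.
    l_minimal = f_fwd.abs().mean() + f_bwd.abs().mean()
    return l_cyc, l_spatial + l_temporal + l_minimal  # (L_cyc, L_reg), terms weighted equally
```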
Please see the supplementary material for complete descriptions.

Data-driven priors. Since reconstruction of dynamic scenes with a monocular camera is highly ill-posed, the above losses can on occasion converge to sub-optimal local minima when randomly initialized. Therefore, we introduce two data-driven losses, a geometric consistency prior and a single-view depth prior: L_data = L_geo + β_z L_z. We set β_z = 2 in all our experiments.

The geometric consistency prior helps the model build more accurate correspondence associations between adjacent frames. In particular, it minimizes the reprojection error of scene-flow-displaced 3D points with respect to 2D optical flow, which we compute using FlowNet2 [146]. Suppose p_i is a 2D pixel position at time i. The corresponding 2D pixel location in the neighboring frame at time j, displaced through the 2D optical flow u_{i→j}, can be computed as p_{i→j} = p_i + u_{i→j}. To estimate the expected 2D point location p̂_{i→j} at time j displaced by the predicted scene flow fields, we first compute the expected scene flow F̂_{i→j}(r_i) and the expected 3D point location X̂_i(r_i) of the ray r_i through volume rendering. p̂_{i→j} is then computed by perspective projection of the expected 3D point location displaced by the scene flow (i.e., X̂_i(r_i) + F̂_{i→j}(r_i)) into the viewpoint corresponding to the frame at time j. The geometric consistency is computed as the ℓ1 difference between p̂_{i→j} and p_{i→j}:

\mathcal{L}_{\text{geo}} = \sum_{\mathbf{r}_i} \sum_{j \in \{i \pm 1\}} \| \hat{\mathbf{p}}_{i \to j}(\mathbf{r}_i) - \mathbf{p}_{i \to j}(\mathbf{r}_i) \|_1.    (7.10)

We also add a single-view depth prior that encourages the expected termination depth Ẑ_i computed along each ray to be close to the depth Z_i predicted by a pre-trained single-view depth network [281]. As single-view depth predictions are defined only up to an unknown scale and shift, we utilize a robust scale-shift-invariant loss [281]:

\mathcal{L}_z = \sum_{\mathbf{r}_i} \| \hat{Z}_i^*(\mathbf{r}_i) - Z_i^*(\mathbf{r}_i) \|_1,    (7.11)

where * is a whitening operation that normalizes the depth to have zero mean and unit scale (a sketch of this loss appears below).

From Fig. 7.3(b), we see that adding the data-driven priors helps the model learn correct scene geometry, especially for dynamic regions. However, as both of these data-driven priors are noisy (they can rely on inaccurate or incorrect predictions), we use them for initialization only, and linearly decay the weight of L_data to zero during training over a fixed number of iterations.
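As an illustration, here is a minimal sketch of the scale-shift-invariant depth loss of Eq. 7.11, assuming the whitening uses the mean for the shift and the mean absolute deviation for the scale; [281] uses robust estimators in a similar spirit, so the exact statistics chosen here are an assumption.

```python
import torch

def whiten(z, eps=1e-8):
    """The * operator in Eq. 7.11: normalize depth to zero mean and unit scale."""
    shift = z.mean()
    scale = (z - shift).abs().mean().clamp(min=eps)
    return (z - shift) / scale

def depth_prior_loss(z_rendered, z_single_view):
    """L1 between the whitened expected termination depth and the whitened
    single-view depth prediction (shapes: (N,) per-ray depths); averaged over
    the batch here, whereas Eq. 7.11 sums over rays."""
    return (whiten(z_rendered) - whiten(z_single_view)).abs().mean()
```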
7.3.3 Integrating a static scene representation

The method described so far already outperforms the state of the art, as shown in Tab. 7.1. However, unlike NeRF, our warping-based temporal loss can only be used within a local temporal neighborhood N(i), as dynamic components typically undergo too much deformation to reliably infer correspondence over larger temporal gaps. Rigid regions, however, should be consistent, and should leverage observations from all frames. Therefore, we propose to combine our dynamic (time-dependent) scene representation with a static (time-independent) one, and require that, when combined, the resulting volume-traced images match the input frames. We model each representation with its own MLP, where the dynamic scene component is represented with Eq. 7.4, and the static one is represented as a variant of Eq. 7.1:

(\mathbf{c}, \sigma, v) = F_\Theta^{st}(\mathbf{x}, \mathbf{d}),    (7.12)

where v is an unsupervised 3D blending weight field that linearly blends the RGBσ from the static and dynamic scene representations along each ray. Intuitively, v should assign a low weight to the dynamic representation in rigid regions, as these can be rendered in higher fidelity by the static representation, while assigning a low weight to the static representation in regions that are moving, as these can be better modeled by the dynamic representation. We found that adding the extra v leads to better results and more stable convergence than the configuration without v. The combined rendering equation is then written as:

\hat{C}_i^{cb}(\mathbf{r}_i) = \int_{t_n}^{t_f} T_i^{cb}(t)\, \sigma_i^{cb}(t)\, \mathbf{c}_i^{cb}(t)\, dt,    (7.13)

where σ_i^{cb}(t) c_i^{cb}(t) is a linear combination of the static and dynamic scene components, weighted by v(t):

\sigma_i^{cb}(t)\, \mathbf{c}_i^{cb}(t) = v(t)\, \mathbf{c}(t)\, \sigma(t) + (1 - v(t))\, \mathbf{c}_i(t)\, \sigma_i(t).    (7.14)

For clarity, we omit r_i in each prediction. We then train the combined scene representation by minimizing the MSE between Ĉ_i^{cb} and the corresponding input view:

\mathcal{L}_{cb} = \sum_{\mathbf{r}_i} \| \hat{C}_i^{cb}(\mathbf{r}_i) - C_i(\mathbf{r}_i) \|_2^2.    (7.15)

This loss is added to the previously defined losses on the dynamic representation, giving us the final combined loss:

\mathcal{L} = \mathcal{L}_{cb} + \mathcal{L}_{\text{pho}} + \beta_{\text{cyc}} \mathcal{L}_{\text{cyc}} + \beta_{\text{data}} \mathcal{L}_{\text{data}} + \beta_{\text{reg}} \mathcal{L}_{\text{reg}},    (7.16)

where the β coefficients weight each term. Fig. 7.4 shows separately rendered static and dynamic scene components, and Fig. 7.5 visually compares renderings with and without an integrated static scene representation.

Figure 7.4: Dynamic and static components. Our method learns static and dynamic components in the combined representation; from left to right: combined render, static only, dynamic only. Note that the person is almost still in the second example.

Figure 7.5: Static scene representation ablation. Adding a static scene representation yields higher-fidelity renderings, especially in static regions (a, c), when compared to the pure dynamic model (b).

7.3.4 Space-time view synthesis

To render novel views at a given time, we simply volume render each pixel using Eq. 7.5 (dynamic) or Eq. 7.13 (static+dynamic). However, we observe that while this approach produces good results at times corresponding to input views, the representation does not allow us to interpolate time-variant geometry at in-between times, leading instead to rendered results that look like linearly blended combinations of existing frames (Fig. 7.6).

Figure 7.6: Novel time synthesis. Rendering images by interpolating the time index (top) yields blending artifacts compared to our scene flow based rendering (bottom).

Instead, we render intermediate times by warping the scene based on the predicted scene flow. For efficient rendering, we propose a splatting-based plane-sweep volume tracing approach (a simplified sketch follows below). To render an image at an intermediate time i + δ_i, δ_i ∈ (0, 1), at a specified target viewpoint, we sweep over every ray emitted from the target viewpoint from front to back. At each sampled step t along the ray, we query point information through our models at both times i and i + 1, and displace all 3D points at time i by the scene flow, x_i + δ_i f_{i→i+1}(x_i), and similarly for time i + 1. We then splat the 3D displaced points onto a (c, α) accumulation buffer at the target viewpoint, and blend the splats from times i and i + 1 with linear weights 1 − δ_i and δ_i. The final rendered view is obtained by volume rendering the accumulation buffer (see the supplementary material for a diagram).
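For brevity, the sketch below replaces the scatter-style splatting with a gather-style (backward-warping) approximation: it advects radiance along the predicted flow toward time i + δ and blends the two bracketing time instances with linear weights. This conveys the idea but is not the buffer-based splatting used in the actual pipeline, and it assumes the flow is locally constant.

```python
import torch

def render_intermediate(mlp, pts, dirs, t_vals, i, delta):
    """Gather-style approximation of the splatting-based interpolation:
    a point reaching x at time i + delta approximately left
    x - delta * f_{i->i+1}(x) at time i (and analogously from time i+1)."""
    ti = i * torch.ones_like(pts[..., :1])
    tj = (i + 1) * torch.ones_like(pts[..., :1])
    _, _, f_fwd, _ = mlp(pts, dirs, ti)                 # f_{i->i+1} at the samples
    _, _, _, f_bwd = mlp(pts, dirs, tj)                 # f_{i+1->i} at the samples
    rgb_i, sig_i, _, _ = mlp(pts - delta * f_fwd, dirs, ti)
    rgb_j, sig_j, _, _ = mlp(pts - (1.0 - delta) * f_bwd, dirs, tj)
    # Blend the two bracketing times with linear weights (1 - delta, delta).
    rgb = (1.0 - delta) * rgb_i + delta * rgb_j
    sigma = (1.0 - delta) * sig_i + delta * sig_j
    return volume_render(rgb, sigma, t_vals)            # quadrature from the earlier sketch
```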
Figure 7.7: Qualitative comparisons on the Dynamic Scenes dataset. Compared with prior methods, our rendered images more closely match the ground truth and include fewer artifacts, as shown in the highlighted regions.

7.4 Experiments

Implementation details. We use COLMAP [304] to estimate camera intrinsics and extrinsics, and consider these fixed during optimization. As COLMAP assumes a static scene, we mask out features from regions associated with common classes of dynamic objects using off-the-shelf instance segmentation [131]. During training and testing, we sample 128 points along each ray and normalize the time indices to i ∈ [0, 1]. As with NeRF [241], we use positional encoding to transform the inputs, and parameterize scenes using normalized device coordinates. A separate model is trained for each scene using the Adam optimizer [177] with a learning rate of 0.0005. When integrating the static scene representation, we optimize the two networks simultaneously. Training a full model takes around two days per scene using two NVIDIA V100 GPUs, and rendering takes roughly 6 seconds per 512 × 288 frame. We refer readers to the supplemental material for our network architectures, hyperparameter settings, and other implementation details.

                          Dynamic Only                       Full
Methods              MV   SSIM (↑)  PSNR (↑)  LPIPS (↓)      SSIM (↑)  PSNR (↑)  LPIPS (↓)
SinSyn [375]         No   0.371     14.61     0.341          0.488     16.21     0.295
MPIs [351]           No   0.494     16.44     0.383          0.629     19.46     0.367
3D Ken Burns [262]   No   0.462     16.33     0.224          0.630     19.25     0.185
3D Photo [316]       No   0.486     16.73     0.217          0.614     19.29     0.215
NeRF [241]           Yes  0.532     16.98     0.314          0.893     24.90     0.098
Luo et al. [222]     Yes  0.530     16.97     0.207          0.746     21.37     0.141
Yoon et al. [399]    Yes  0.547     17.34     0.199          0.761     21.78     0.127
Ours (w/o static)    Yes  0.760     21.88     0.108          0.906     26.95     0.071
Ours (w/ static)     Yes  0.758     21.91     0.097          0.928     28.19     0.045

Table 7.1: Quantitative evaluation of novel view synthesis on the Dynamic Scenes dataset. MV indicates whether the approach makes use of multi-view information.

                          Dynamic Only                       Full
Methods                   SSIM (↑)  PSNR (↑)  LPIPS (↓)      SSIM (↑)  PSNR (↑)  LPIPS (↓)
NeRF [241]                0.522     16.74     0.328          0.862     24.29     0.113
[316] + [260]             0.490     16.97     0.216          0.616     19.43     0.217
[399] + [260]             0.498     16.85     0.201          0.748     21.55     0.134
Ours (w/o static)         0.720     21.51     0.149          0.875     26.35     0.090
Ours (w/ static)          0.724     21.58     0.143          0.892     27.38     0.066

Table 7.2: Quantitative evaluation of novel view and time synthesis. See Sec. 7.4.2 for a description of the baselines.

7.4.1 Baselines and error metrics

We compare our approach to state-of-the-art single-view and multi-view novel view synthesis algorithms. For single-view methods, we compare to MPIs [351] and SinSyn [375], which were trained on indoor real estate videos [418], and to 3D Photos [316] and 3D Ken Burns [262], which were trained mainly on images in the wild. Since these methods can only compute depth up to an unknown scale and shift, we align the predicted depths with the SfM sparse point clouds before rendering. For multi-view methods, we compare to a recent dynamic view synthesis method [399]. Since the authors do not provide source code, we reimplemented their approach based on the paper's description. We also compare to a video depth prediction method [222], performing novel view synthesis by rendering the point cloud into novel views while filling in disoccluded regions. Finally, we train a standard NeRF [241], with and without an added time input, on each dynamic scene.

We report the rendering quality of each approach with three standard error metrics: structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and perceptual similarity measured by LPIPS [405], computed over both the entire scene (Full) and dynamic regions only (Dynamic Only).
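As a hedged illustration of this masked evaluation protocol, the snippet below computes PSNR over the full frame and again restricted to dynamic regions via a binary mask; SSIM and LPIPS would be computed analogously with standard implementations (e.g., scikit-image and the lpips package).

```python
import numpy as np

def psnr(pred, gt, mask=None):
    """pred, gt: float images in [0, 1], shape (H, W, 3); mask: optional (H, W)
    binary mask selecting dynamic regions for the 'Dynamic Only' protocol."""
    if mask is not None:
        pred, gt = pred[mask.astype(bool)], gt[mask.astype(bool)]
    mse = float(np.mean((pred - gt) ** 2))
    return 10.0 * np.log10(1.0 / max(mse, 1e-10))
```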
7.4.2 Quantitative evaluation

We evaluate on the Nvidia Dynamic Scenes Dataset [399], which consists of 8 scenes with human and non-human motion recorded by 12 synchronized cameras. As in the original work [399], we simulate a moving monocular camera by extracting images sampled from each camera viewpoint at different time instances, and evaluate the result of view synthesis with respect to known held-out viewpoints and frames. For each scene, we extract 24 frames from the original videos for training and use the remaining 11 held-out images per time instance for evaluation.

Novel view synthesis. We first evaluate our approach and other baselines on the task of novel view synthesis (at the same time instances as the training sequences). The quantitative results are shown in Table 7.1. Our approach without the static scene representation (Ours w/o static) already significantly outperforms the other single-view and multi-view baselines, both in dynamic regions and on the entire scene. NeRF has the second best performance on the entire scene, but cannot model scene dynamics. Moreover, adding the static scene representation improves overall rendering quality by more than 30%, demonstrating the benefits of leveraging global multi-view information from rigid regions where possible.

Novel view and time synthesis. We also evaluate the task of novel view and time synthesis by extracting every other frame from the original Dynamic Scenes dataset videos for training, and evaluating on the held-out intermediate time instances at held-out camera viewpoints. Since we are not aware of prior monocular space-time view interpolation methods, we use two state-of-the-art view synthesis baselines [316, 399] to synthesize images at the testing camera viewpoints, followed by 2D frame interpolation [260] to render intermediate times, as well as NeRF evaluated directly at the novel space-time views. Table 7.2 shows that our method significantly outperforms all baselines, both in dynamic regions and on the entire scene.

Ablation study. We analyze the effect of each proposed system component on the task of novel view synthesis by removing (1) all added losses, which gives us NeRF extended to the temporal domain (NeRF (w/ time)); (2) the single-view depth prior (w/o L_z); (3) the geometric consistency prior (w/o L_geo); (4) the scene flow cycle consistency term (w/o L_cyc); (5) the scene flow regularization term (w/o L_reg); (6) the disocclusion weight fields (w/o W_i); and (7) the static representation (w/o static). The results, shown in Table 7.3, demonstrate the relative importance of each component, with the full system performing the best.

                      Dynamic Only                       Full
Methods               SSIM (↑)  PSNR (↑)  LPIPS (↓)      SSIM (↑)  PSNR (↑)  LPIPS (↓)
NeRF (w/ time)        0.630     18.89     0.159          0.875     24.33     0.081
w/o L_z               0.710     19.66     0.132          0.882     25.16     0.078
w/o L_geo             0.713     19.74     0.139          0.885     25.19     0.079
w/o L_cyc             0.721     20.26     0.121          0.889     26.08     0.076
w/o L_reg             0.740     21.22     0.110          0.892     26.27     0.074
w/o W_i               0.754     21.31     0.112          0.894     26.23     0.074
w/o static            0.760     21.88     0.108          0.906     26.95     0.071
Full (w/ static)      0.758     21.91     0.097          0.928     28.19     0.045

Table 7.3: Ablation study on the Dynamic Scenes dataset. See Sec. 7.4.2 for detailed descriptions of each of the ablations.

7.4.3 Qualitative evaluation

We provide qualitative comparisons on the Dynamic Scenes dataset (Fig. 7.7) and on monocular video clips collected in the wild from the Internet, featuring complex object motions such as jumping, running, or dancing with various occlusions (Fig. 7.8).
Figure 7.8: Qualitative comparisons on monocular video clips. Compared to the baselines, our approach more correctly synthesizes hidden content in disocclusions (shown in the last three rows) and locations with complex scene structure, such as the fence in the first row.

NeRF [241] correctly reconstructs most static regions, but produces ghosting in dynamic regions, since it treats all moving objects as view-dependent effects, leading to incorrect interpolation results. The state-of-the-art single-view method [316] tends to synthesize incorrect content at disocclusions, such as the bins and speaker in the last three rows of Fig. 7.8. In contrast, methods based on reconstructing explicit depth maps [222, 399] have difficulty modeling complex scene appearance and geometry, such as the thin structures in the third row of Fig. 7.7 and the first row of Fig. 7.8.

Figure 7.9: Limitations. Our method is unable to extrapolate content unseen in the training views (a), and has difficulty recovering high-frequency details if a video involves extreme object motions (b, c).

7.5 Discussion

Limitations. Monocular space-time view synthesis of dynamic scenes is very challenging, and we have only scratched the surface with our proposed method. In particular, there are several limitations to our approach. Similar to NeRF, training and rendering times are high, even at limited resolutions. Additionally, each scene has to be reconstructed from scratch, and our representation is unable to extrapolate content unseen in the training views (see Fig. 7.9(a)). Furthermore, we found that rendering quality degrades when either the length of the sequence is increased given the default number of model parameters (most of our training sequences are 1∼2 seconds long), or the amount of motion is extreme (see Fig. 7.9(b-c), where we train a model on a low-frame-rate video). Finally, our method can end up in a wrong local minimum if the object motion and camera motion are close to a degenerate case, e.g., collinear, as described in Park et al. [270].

Conclusion. We presented an approach for monocular novel view and time synthesis of complex dynamic scenes using Neural Scene Flow Fields, a new representation that implicitly models time-variant scene reflectance, geometry, and 3D motion. We have shown that our method can generate compelling space-time view synthesis results for scenes with natural in-the-wild scene motion. In the future, we hope that such methods can enable high-resolution views of dynamic scenes with larger scale and larger viewpoint changes.

CHAPTER 8
ETHICS IN DATA-DRIVEN COMPUTER VISION

8.1 Introduction

Deep learning has revolutionized almost all computer vision problems and has made significant progress in achieving automatic scene understanding of our physical world. The effectiveness of data-driven computer vision has also led to intense interest in technology transfer, ranging from autonomous vehicles and mixed reality to agriculture, transportation, health care, and education. Despite growing enthusiasm in this field over the past few years, there is also alarm about the ethical consequences of this rapid progress. For example, when we apply computer vision algorithms to large amounts of Internet visual data, how can we guarantee that sensitive or private information will be kept secret?
How can we ensure that these techniques will not create discrimination and inequality across gender and race? How do we guarantee that a model is trustworthy and transparent to humans while performing a task? To better understand these questions, in this chapter I focus on three major ethical implications of data-driven computer vision that are closely related to the topics presented in this thesis. Specifically, in Section 8.2 I explore privacy and security concerns raised by current computer vision and machine learning systems. In Section 8.3, fairness issues surrounding modern data-driven approaches are discussed in detail. In Section 8.4, concepts of interpretability that can be used to help build robust and safe vision systems are described. In each section, I provide motivation as well as potential opportunities and challenges we must face. I also categorize these ethical challenges based on different assumptions, and describe case studies that have attempted to resolve these issues. Lastly, in Section 8.5 I briefly discuss ethical aspects that are of concern in other fields such as social science, business, and law.

8.2 Privacy and Security

One of the most important ethical issues we must deal with when developing a data-driven computer vision system is privacy and security. On one hand, computer-vision-driven techniques help people enhance privacy and security in their daily-life activities [6]. For instance, FaceID technology has been shown to be a better and more secure authentication system for protecting people's privacy. On the other hand, these techniques also expose alarming privacy and security issues due to several factors, including inappropriate storage, tracking, and processing of personal data, and model vulnerability to adversarial attacks. These factors can cause security problems, in which attacks subvert the normal behavior of models by forcing them to make undesired predictions. They can also cause breaches of privacy. For instance, recent work has demonstrated that computer vision algorithms can passively recover sounds from a silent video captured with a consumer camera [74]; a press of a key on a mobile phone can also be recorded even if the victim is out of sight of the attacker [390]. Therefore, it is our responsibility to understand these potential problems with the aim of developing secure and privacy-preserving computer vision systems.

8.2.1 Security

Security implies that the system behaves normally and that its correctness, efficiency, and integrity hold, without being compromised by external adversarial attacks. Current security attacks in data-driven computer vision can be categorized according to the different phases in the life cycle of a model [12, 217, 119].

Figure 8.1: Facial expression manipulation. Computer vision has enabled control of facial expressions in arbitrary videos by using image and depth information. This has raised significant security concerns regarding misuse in fake news or propaganda. Figure adapted from Thies et al. [349].

The most prevalent type of security attack is the evasion attack, in which an estimated perturbation (i.e., noise) is added to input images to compromise performance during the inference stage [113, 345, 53, 268, 247]. Over the last few years, evasion attacks have quickly evolved from idealized white-box attacks to more practical black-box attacks, in which an attack can occur without access to the model and training data [267, 42, 60, 147, 148, 118].
Security breaches can also be achieved through poisoning attacks, in which attackers compromise model performance by adding malicious images to the training data, intentionally modifying the training data, or installing a backdoor [249, 393, 250, 183, 308, 420, 218, 65]. To defend against security attacks, a variety of techniques have been proposed. For example, the use of adversarial examples, robust statistics, or regularization in the case of adversarial training [383, 10, 332, 269, 224, 294, 243, 384, 279] is an effective way to increase a model's robustness. Detecting and rejecting poisoned examples (i.e., data sanitization) has also been introduced to defend against poisoning attacks [236, 239, 388, 265].

Image Forensics. Chapters 6 and 7 of this thesis exemplify rapid progress on image and view synthesis, and these methods demonstrate the power of generating or manipulating images or videos that are difficult to distinguish from real media. Despite such progress, their potential use in a wide variety of applications also raises significant security concerns. For example, several research methods and software products such as Face2Face [349] and DeepFakes technology [373] have shown that current computer vision and graphics algorithms are able to faithfully transfer and manipulate human facial expressions, or to replace one person's face with that of someone else (Figure 8.1). While these results are exciting from a technical standpoint, their significant security fallout has become apparent as such technologies have gone viral. At the individual and organizational levels, these techniques can enable cyber-criminals to engage in blackmail or financial fraud via impersonation [171]. Furthermore, there is evidence of their broader socio-political ramifications. For instance, a fabricated video made by the Gabon government showing the president's appearance triggered an attempted military coup in 2019 [41]. In addition, in an election campaign in India in 2020, a Delhi candidate used a similar technique to criticize the incumbent Delhi government in English in order to appeal to voter bases with different language backgrounds [70].

Beyond human faces, many neural rendering and generative modeling algorithms have emerged, demonstrating their potential for synthesizing and editing images of arbitrary objects or scenes in the wild [130, 76, 421, 350, 343, 169, 240, 241, 204]. For instance, video inpainting algorithms enable completion of corrupted regions of a video or removal of unwanted content from it. Despite their wide use in practice, they also raise significant security concerns, since these methods can be used in malicious ways, including for the spread of misinformation and fake news, financial fraud, and the introduction of counterfeit evidence in a court of law.

To resolve these potential security issues caused by image forgery techniques, significant efforts have been made by computer vision and graphics researchers. On the research end, special datasets and techniques have been introduced for image forensics. For instance, FaceForensics++ [295] is a large-scale dataset that helps researchers develop learning-based models that accurately detect facial forgeries. Several other works [144, 378, 363, 9, 364, 414] also demonstrate how to automatically detect and localize arbitrary image and video forgeries in order to prevent image synthesis and editing algorithms from being mishandled.
Social media platforms such as Twitter, YouTube, and Facebook have also already taken action to prevent misuse of image synthesis and manipulation [127]. In the United States, the states of Virginia, California, New York, and Texas have introduced legislation to combat misuse of manipulated images [184, 44]. Also, the Chinese government announced that, starting in 2020, the use of DeepFake-related technology without clear notice would be considered a crime [394].

8.2.2 Privacy

Preservation of privacy suggests that information from the models or the dataset that is considered confidential and sensitive cannot be traced or inferred. Privacy threats can be either direct or indirect, depending on whether the attackers have access to the original data. Direct privacy breaches are due mainly to unintentional handling of data on the service provider's side [67], lack of effective encryption mechanisms during transmission [266, 211, 182], or bypassing of authentication through backdoor attacks [371]. Indirect privacy breaches, on the other hand, can often be carried out via a membership inference attack, in which the adversary trains a surrogate model to speculate on whether an instance belongs to the training data [318, 397, 220]. For instance, Fredrikson et al. [93] showed that a facial image in the training set can be mostly recovered even if the attackers have access only to the model's predicted confidence score and the person's name.

Figure 8.2: Privacy-preserving image synthesis from 3D reconstruction. From left to right: original photo, synthesized image from a standard SfM reconstruction [304] through a technique from Pittaluga et al. [278], and synthesized image from a privacy-preserving SfM 3D reconstruction [105], which excludes sensitive visual information such as humans. Figure adapted from Geppert et al. [333].

To defend against privacy attacks, many strategies that incorporate knowledge from computer security, cryptography, and machine learning have been proposed. There are three major defense techniques widely used in data-driven computer vision. The first is homomorphic encryption (HE), which forces computation to be performed on the encrypted data without explicitly decrypting it [107, 136, 54]. The second approach is secure multi-party computation (SMC), in which multiple computing parties are involved but each party has access to only a portion of the entire set of private data [246, 215, 297]. Lastly, the notion of differential privacy (DP) was introduced to protect sensitive information from being inferred. DP defends against attacks by adding various kinds of randomized noise to different phases of a deep learning pipeline in order to increase its robustness to privacy attacks [1, 276, 266, 82, 163].

Privacy-preserving 3D vision. One concrete case study related to our discussion of privacy preservation involves the topics of visual 3D localization and reconstruction. Most systems proposed in this thesis depend on structure from motion (SfM) or simultaneous localization and mapping (SLAM) algorithms. SfM and SLAM are typical techniques for automatically estimating camera poses and for sensing the surrounding 3D environment from a set of 2D images. SfM and SLAM algorithms have been deployed in numerous real-world VR and AR applications with cloud or mobile services.
However, these applications usually require users to upload images to local platforms or to remote servers, which raises significant privacy issues regarding the potential disclosure of users' sensitive information. For instance, the images uploaded on the user side might reveal users' identities, and distributing such data is risky if confidential information can easily be obtained by criminals. Moreover, recent research has shown that even if the images are compressed into a latent feature represented by a low-dimensional vector, it is still possible to use them to reveal the essential contents of the scene. For instance, Pittaluga et al. [278] showed that images can be reconstructed from sparse SfM 3D point clouds of the corresponding scenes even if the original input images are discarded (center of Figure 8.2).

Fortunately, a number of methods have been developed to address these 3D privacy concerns. The goal is to preserve the original capabilities of persistent localization and image query from the environment while obscuring privacy-related structures and contents (image on the right side of Figure 8.2). In particular, recent work [333, 105] proposes hiding user information by transforming 2D or 3D image point features into randomly obfuscated 2D or 3D line features, as shown in Figure 8.3. Such a randomized oriented-line representation manages to hide most of the sensitive image contents while still implicitly including sufficient geometric information to enable robust and accurate localization, image query, and entire 3D mapping.

Figure 8.3: Privacy-preserving 3D representation. Instead of using points as a 3D representation, the use of randomized 3D lines to enable privacy-preserving localization has been proposed. Figure adapted from Speciale et al. [333].

8.3 Fairness and Bias

The second important question we must ask about a data-driven computer vision system is whether it ensures fairness. A model is fair if its outputs are independent of (i.e., have no correlation with) the inputs' attributes such as gender and skin tone. Why do we care about fairness in the age of deep learning? The reason is that data and model unfairness can harm individuals, especially minorities, and can exacerbate discrimination and inequality that already exist in our society. For instance, a recent report describes a job-recruiting platform based on data-driven models that tends to assign much higher rankings to men who are less qualified than to women who are more qualified in terms of skills [178, 190]. Vision-based applications such as image captioning and facial recognition have also been reported to have serious gender and ethnicity bias, where the systems perform much better on white men than on African American women [403, 48, 135] (Figure 8.4).

Figure 8.4: Unfairness in data-driven computer vision. State-of-the-art facial recognition systems all reveal gender and ethnicity bias in the model's predictions. These algorithms perform much better on light-skinned males than dark-skinned females. Figure adapted from Buolamwini et al. [48].

Technically speaking, unfairness suggests that a trained model tends to make biased predictions for certain groups of individuals with specific identities or attributes, while the predictions for some other groups can be less reliable or less accurate. One reason behind such unfairness is data. Most deep learning based approaches require a large amount of data in order to learn useful priors.
However, crowd-sourced data are usually unbalanced, skewed, tainted, or limited [16]. This can be due to external noise, inherent human bias, or sample-size disparity (i.e., disproportionately represented groups contribute less to the training set). Consequently, such intrinsic data bias can cause a model to downweight the importance of under-represented groups of individuals. For instance, fewer than 2 percent of the people pictured in ImageNet are above the age of 60 [81]. As a result, models trained on ImageNet can perform poorly on elderly people. Interestingly, recent work [408, 365] also shows that a deep neural network can implicitly learn correlations between targets and the underlying attributes of the individuals portrayed, and can amplify the stereotypes existing in our society, even if the training dataset is perfectly balanced. The authors show that some features are more critical, and more highly correlated with an individual's identity, than others for making predictions. As a result, it is easier for a deep neural network to memorize only these biased input features during training in order to obtain good accuracy, while ignoring possible sources of diversity.

In response, many metrics have been proposed to help measure and assess the fairness of a model [98, 358]. One typical categorization is based on individual fairness versus group fairness [83, 97, 29]. Individual fairness means that a model's outputs should be similar for pairs of individuals who have similar attributes. Group fairness, on the other hand, measures similarity for different groups separately, where each group includes individuals with similar sensitive attributes. In practice, group fairness is more widely adopted in current research and applications, and it can be further measured by demographic parity [36], equality of odds [124], and predictive quality parity [124].

To ensure model fairness, many techniques have been introduced recently, and most of them can be categorized based on when they are applied [27]. Pre-processing approaches focus on reducing model biases by improving the quality of the training dataset. This can be achieved by extending the training data with data augmentation, adding more diverse data sources, or adjusting the weights of the training samples. Masking or hiding relevant sensitive attributes of the inputs has also been shown to be useful for improving model fairness [366, 165]. In-processing strategies seek a solution that corrects the bias during training; most current solutions of this type are based on the idea of adding extra regularization terms by explicitly incorporating fairness metrics into model training [293, 7, 166]. Lastly, post-processing strategies are applied during the inference stage. For instance, the fairness metrics described above can be applied to calibrate the model during inference in order to reduce discrimination [123, 408].

Data bias in monocular depth prediction. The issue of data bias also appears in 3D vision. For example, Chapters 2 and 3 introduce two large-scale RGBD datasets that help
us achieve accurate monocular depth prediction, but more recent work [281] demonstrates that our trained models still exhibit bias due to a lack of diversity in the training datasets. For instance, in the MegaDepth dataset [207], most of the ground-truth depth is generated for outdoor buildings or statues, making the trained networks generalize poorly to indoor environments. On the other hand, the MannequinChallenge dataset was designed to yield more accurate estimates of the geometry of people, so the trained models might not generalize well to non-human objects such as animals or vehicles. Furthermore, most videos in the MannequinChallenge dataset include children and teenagers from North America. As a consequence, our trained model might not perform equally well on people of different ages and skin colors. This fairness issue is manifested in Figure 8.5. The current state-of-the-art depth prediction model alleviates this problem by training on more diverse training data from different sources, and it demonstrates less-biased depth prediction and better generalization to unseen settings.

Figure 8.5: Dataset bias in depth prediction. Qualitative comparison of the state-of-the-art depth prediction model [281] and the depth prediction models proposed in this thesis, which were trained on MegaDepth [207] and MannequinChallenge [203], using images from the Microsoft COCO dataset [210]. Figure adapted from Ranftl et al. [281].

8.4 Interpretability and Transparency

The third major issue we have to face is model interpretability, meaning that a model's reasoning should be understandable and analyzable by humans. It is clear that deep learning based approaches have dominated almost all the benchmarks in computer vision. However, if these data-driven methods work so well nowadays, why can't we simply trust the model when it makes a prediction? The reason is that the level of accuracy and error on the benchmarks provides an incomplete description of a model, as described by Doshi-Velez and Kim [78]. In order to gain better control and understanding of the computer vision systems they design, researchers and practitioners need to pay greater attention to interpretability.

Why is model interpretability important in real-world tasks? First, interpretability can build trust between humans and machines. For instance, in certain life-critical tasks such as medical diagnosis, the decision made by a model will be adopted only if its mechanism is transparent to humans. Furthermore, interpretability can also help humans identify underlying problems in order to address other issues such as unfairness and privacy, meaning that we can propose better solutions to defend against attacks once we have a better understanding of the inner workings of the models. Second, interpretability can enhance safety. The most typical example is the vision-based perception systems deployed in current self-driving vehicles. It is important that the object detection system in a self-driving car performs well for pedestrians and cyclists at all times of the day and night, in all types of weather, and in all seasons of the year, because failure to do so, even once, can have serious consequences. For instance, in a famous case of an Uber fatality that occurred in 2018, the vision system in the self-driving car did not detect a pedestrian at night and consequently did not issue warnings [372]. As a result, the Uber vehicle struck and killed a person. Thus, in order to guarantee safety in such life-critical applications, understanding when and why a method will fail is an important topic that researchers and developers should take into consideration.

Generally speaking, the interpretability of a model can be categorized into two main types, as proposed by Lipton et al. [212].
The first type is intrinsic transparency, meaning that users can walk through each step of reasoning in the system, and that the inner workings of each system component can be analyzed and understood before training. Most of the classical shallow learning models belong to this type. For example, people have the ability to understand each step of K-nearest neighbors, shallow decision trees, or support vector machine algorithms, and their mathematical properties can be analyzed even before training. The second type is post-hoc interpretability, meaning the model's mechanism is analyzed and understood by people after training. Most deep neural network models belong to this type, and numerous efforts have been made to analyze why deep neural networks succeed in many vision tasks. For example, a technique researchers often adopt is feature visualization [87, 264, 225, 325, 23], which is used to visualize learned CNN features by finding the inputs that maximize the hidden activations [264]. This method tells a story about what important information the network tries to learn from the training data. As an alternative, pixel attribution [401, 334, 307, 341] analyzes which part of the input image is responsible for a network making a certain prediction.

Uncertainty for interpretable inverse graphics. Last but not least, one particular case related to the topic of inverse graphics is the modeling of uncertainty. Many inverse graphics problems we discussed can be treated as a pixel-to-pixel transformation task, with the goal of predicting dense scene intrinsic properties from corresponding input RGB images. However, most of the proposed approaches do not explicitly capture model uncertainty, which prevents us from understanding model reliability in a transparent manner.

Figure 8.6: Uncertainty modeling in depth prediction. Uncertainty modeling can help in identifying confident regions during depth prediction, making the model robust to noise and potential attacks. From left to right: input image, ground-truth depth, depth prediction, estimated aleatoric uncertainty, and estimated epistemic uncertainty. Figure adapted from Kendall et al. [173].

In the classical machine learning and signal processing communities, uncertainty modeling is a well-studied problem [172], and has become a key building block in many applications, including flight navigation and aircraft landing. However, it is not straightforward to incorporate uncertainty modeling into current deep learning techniques, because of the extremely high dimensionality of the parameter space. In order to address this issue in different inverse graphics tasks, the Bayesian deep learning approach [173, 174] has recently been revived. In particular, the Bayesian deep learning approach introduces two types of uncertainty to help in understanding when and where the model is confident during prediction. The first is epistemic uncertainty, which captures the uncertainty due to the lack of sufficiency and diversity of the training data. This uncertainty is important for safety-critical scenarios where the training dataset is small, such as in medical diagnosis.
Figure 8.7: Uncertainty modeling in novel view synthesis. By modeling transient and sensitive objects in Internet photos as aleatoric uncertainty, we obtain novel view synthesis results with better rendering quality and privacy-preserving properties. From left to right: rendered static component, rendered transient component, composite rendered image, original photo, and estimated aleatoric uncertainty. Figure adapted from Martin-Brualla et al. [230].

The second is aleatoric uncertainty, which captures information that the model itself is not able to explain even if given enough training data. This uncertainty is important in situations where we have sufficient data or where real-time performance is required, such as in autonomous driving. Uncertainty modeling has recently been used in several inverse graphics tasks. For example, in the case of monocular depth prediction shown in Figure 8.6, uncertainty modeling can help people identify and understand where the model makes inaccurate predictions and what causes those errors. This is important for ensuring the safety and transparency of deep learning models in tasks that require a 3D understanding of the scene [173]. The idea of uncertainty modeling was also recently adopted in the task of novel view synthesis, shown in Figure 8.7, where the authors proposed modeling transient objects in Internet photos by incorporating aleatoric uncertainty. Since these transient objects usually carry private or sensitive information, they show that the uncertainty-modeling technique not only helps improve rendering quality but also ensures preservation of privacy [230].

8.5 Other Aspects

In this section, I discuss several other social implications that should be considered when developing a data-driven computer vision system. These topics are interdisciplinary and also play important roles in different aspects of our society.

8.5.1 Policy and Regulation

The first topic is the legal aspects surrounding deployment of data-driven systems. To promote the benefits of current techniques for every individual, and to properly manage the risks associated with these techniques, it is necessary to involve lawmakers in the development of vision systems. Recent work suggests that policy and regulation should be encoded before a system is deployed, in order to prevent foreseen problems, and should also be applied after the system is built, in order to prevent unforeseen problems [68]. As emphasized in several recent reports [337, 50, 376], legislative issues to which attention must be paid are as follows:

• Legal liability for possible negative outcomes on individuals by the system, and legal accountability for the use of personal data
• Proper regulation of potential misuse of data and models for cyber-security and cyber-privacy, as well as regulation of the safety and health of individuals
• Regulations on algorithmic discrimination and unlawful profiling by automated decision-making from data-driven systems
• Intellectual property and copyright laws applied to the ownership of emerging techniques and to the process of data collection

Fortunately, several countries have started to address these legal issues by establishing new policies for better system development. The most notable example is the General Data Protection Regulation (GDPR) [361], which is designed to protect the private information of European Union citizens as well as to ensure fairness and transparency while using personal data for research and product development. Other countries, such as China, have also announced plans to regulate and boost the healthy development of AI systems in the long run [197].

8.5.2 Employment and HCI

Automation and unemployment.
As a result of rapid progress in the development of data-driven computer vision, many industries have adopted these techniques in automated processes, and this has drastically improved efficiency and productivity. However, it has also raised concerns regarding harmful consequences for the workforce and employment, especially in the areas of transportation, financial services, and commerce. For instance, we can imagine autonomous Uber cars or driverless trucks taking over the roads in a couple of years. The fact that AI systems will take over many tasks that have traditionally been performed by humans has raised increasing concerns about massive unemployment and reductions in wages [80, 290]. On the other hand, as described by Petropoulos [275], technological innovations always affect employment in two ways: they can displace human labor, but they can also introduce new job opportunities and spark a demand for employees who are capable of dealing with the new technology. For example, deep learning research has created many job opportunities, such as the crowd-sourcing services provided through Amazon Mechanical Turk. Therefore, in order to adapt to this new era of automation by deep learning, different parties and individuals need better ways to accommodate the possible outcomes triggered by the recent deep learning revolution.

Human–computer collaboration. The importance of interaction and collaboration between humans and vision systems cannot be denied, since such collaboration enables their complementary strengths to be maximized [288]. On one hand, computer vision systems can assist humans by extending their abilities in daily life. For instance, vision-aided 3D SLAM systems have been used to assist visually impaired people in navigating unknown environments [137, 138]. On the other hand, humans can also assist computer vision systems. This is often related to the notion of human-in-the-loop, where the involvement of human intelligence can help improve model performance by increasing the quality of training data, and extra user guidance can lead to better decision-making on the part of machines. For instance, recent computer vision and graphics work has demonstrated that user guidance significantly helps deep learning models produce more desirable effects in a variety of image-editing applications, such as colorization [404], inpainting [410], segmentation [209], and content creation [271]. Thus, having a human-centered system will be a critical step in forming a virtuous cycle in human–AI relations that contributes to social welfare.

8.5.3 Control and Surveillance

As described earlier, collecting a large amount of data from diverse sources in the wild can help improve the performance of deep neural networks. Without proper regulation, however, this could lead to dystopian control of human liberty, freedom, and democracy. For example, facial-recognition systems have recently been criticized for their potential to bring about algorithmic authoritarian control and governance [248, 126]. As reported by the New York Times, millions of cameras have been set up in China as a governmental surveillance mechanism for criminal identification.
However, as pointed out by Hartzog [126], such invasive mass surveillance is intrinsically oppressive and will lead to a “Panopticon,” in which the civil liberties of individuals can be harmed, since people are more likely to follow the rules and act differently than they otherwise would if they are concerned that they may be under surveillance. Another typical consequence is the “surveillance capitalism” described by Shoshana Zuboff [429], in which personal-behavior data obtained from users of a product is incorporated into the prediction and shaping of future user behavior surrounding that product. This is usually due to massive surveillance underlying online services such as search engines and social media. For example, if you have recently browsed images of laptops, you have likely seen advertisements related to discounts on those laptops on different websites. In fact, this technique was used by Google, where personal data collected from users was exploited to grow the targeted product market [430]. Since surveillance capitalism has been widely used to translate personal information into predictions of future behavior for purposes of production and profit-making, its use to intervene in and steer the behavior of individuals in directions that favor capitalists has not gone unnoticed by the public. As warned by Zuboff [429], surveillance capitalism will encroach on the ability of humans to act autonomously and on the functioning of society as a democracy, and it will introduce new types of social inequality and injustice because of its unilateral nature and its asymmetric effects on different individuals. Thus, people should be aware of the potential damage caused by leakage of private data in terms of shaping our values, and governments are urged to enact legislation that protects human rights and democracy and prevents us from stepping into a system of totalitarianism.

8.5.4 Sustainability

Machine learning and computer vision techniques can also have a broad impact on sustainable development.

Social sustainable development. Data-driven vision systems have been shown to be useful for facilitating social equality. For example, computer vision models have recently been used as an aid in the identification of poverty and famine through satellite images [49, 11], as well as in the prediction of livelihood and socioeconomic attributes through street-view images [103, 196] and in the management and discovery of natural resources [392, 88]. However, they could have a negative effect on social cohesion. For instance, greatly increased use of data-driven media platforms could lead to the phenomenon of the “filter bubble” [431], meaning that social media can shape our views of the world in which we live, in that the algorithms could limit us to being exposed only to the texts, images, and voices that we would like to see, which could be different from what we should see [122]. That could in turn lead to sociopolitical polarization, increased social isolation, and the intensification of bias and hatred.

Economic sustainable development. Data-driven vision techniques also contribute to the development of greater efficiency in numerous areas, such as agricultural management, where computer vision techniques can be used for the analysis and prediction of yields of crops and fruits [121] and for early diagnosis of diseases in plants [245].
However, some recent reports [71] also show that data-driven techniques can potentially cause larger income gaps and increased inequality across countries because of unbalanced resource allocations. Furthermore, because of the limited transparency of current deep-learning-based approaches, the economic stability of products built on them cannot be guaranteed [30].

Environmental sustainable development. On the bright side, machine learning models can help support low-carbon energy systems with higher productivity and efficiency, and thus they are an important tool for addressing the problem of global climate warming [359]. We have also seen how these algorithms can improve the health of our ecosystem. For instance, some vision techniques are used to reduce environmental pollution by identifying the locations of oil spills in the oceans [175]. The trends of desertification and invasive species can also be analyzed and tracked by deep neural network models [244, 24, 86], which aid in policy-making on environmental protection and remediation. On the other hand, the high energy demands and expensive computational resources needed by many current data-driven studies and product designs lead to increasing demand for electricity and increasing waste of natural resources [162]. Therefore, designing more efficient and energy-preserving deep learning platforms is key to the sustainable development of our environment.

8.6 Discussion

In this chapter, I presented important ethical principles in modern data-driven computer vision. I discussed both the benefits and the potential problems that current techniques could bring to our society. In particular, I discussed the issues of privacy, security, fairness, and interpretability that we have to face in this age of the deep learning revolution. I presented different approaches that have been proposed to deal with these issues, pointing to concrete case studies. Furthermore, I discussed the broader financial, legal, and societal impacts of these techniques.

There are still many challenges we have to address in order to achieve ethical, fair, safe, and transparent data-driven computer vision systems. For example, we still don't have a unified and scalable framework for evaluating the interpretability of deep learning models, since current approaches often involve tedious human-in-the-loop inspections. Therefore, the development of an efficient interpretability approach that can extend to wide deployment at scale is an area that deserves further study. In addition, a deeper understanding of deep neural networks can also improve the effectiveness of strategies for defending against potential privacy and security attacks [149]. Researchers need to have a better understanding of the trade-offs among privacy, fairness, and efficiency, since previous work has demonstrated that decreasing biases and increasing security can compromise models' prediction performance [106]. Therefore, greater efforts should be devoted to building unbiased and privacy-preserving models that also achieve the desired prediction accuracy.

CHAPTER 9
CONCLUSION

In this thesis, I presented new approaches for addressing tasks in inverse graphics, with the goal of learning scene geometry, appearance, and dynamics. By using massive amounts of Internet visual data, my proposed approaches achieve state-of-the-art results in a variety of in-the-wild scenarios. In particular, in Chapters 2 and 3, I introduced new large-scale datasets derived from Internet photo collections and YouTube MannequinChallenge videos.
Based on these, I developed new models for learning better dense depth from a single RGB image or ordinary videos of dynamic scenes with moving people. In Chapters 4 and 5, I addressed the problem of intrinsic image decomposition, whose goal is to estimate scene surface material and illumination properties by decomposing an image into reflectance and shading maps. My work introduced two new datasets derived from Internet time-lapse videos and from synthetic images produced by a physically-based rendering engine. I proposed novel strategies for training deep neural networks on these datasets to obtain state-of-the-art decomposition results on photos of real-world scenes. In Chapters 6 and 7, I further demonstrated how to tackle the challenging problem of space-time view synthesis. Specifically, I showed that a large number of Internet photos can constitute a useful data source for training a DeepMPI scene representation that enables synthesizing photo-realistic novel views while modeling time-varying appearance changes of different landmarks around the world. Furthermore, scene dynamics from a monocular video can also be encoded into a neural scene flow fields representation that allows us to simultaneously perform novel view synthesis and create slow-motion effects. Lastly, in Chapter 8, I discussed several potential ethical problems that current deep learning and computer vision techniques can bring about. It served as a survey that reminds researchers and developers of the importance of their social and political influence, and guides people to develop more effective techniques to address these ethical issues.

Future directions. Although my work has made great advancements toward better understanding the space-time information of our physical world, a number of open challenges still remain to be solved. For example, I tend to solve each scene intrinsics estimation problem independently, and I also separate scene dynamics into two individual factors, namely illumination changes and object motions. But I believe the ultimate goal of inverse graphics is to bring scenes truly to life. In other words, I imagine a model that can jointly capture all the scene intrinsic properties and model both short-term and long-term temporal changes of our physical world. Thus, an interesting research direction is how to use both Internet photos and videos in deploying a unified system for holistic scene understanding and modeling from images in the wild.

APPENDIX A
CHAPTER 2 APPENDIX

A.1 Depth Map Refinement and Enhancement

In this section, we provide additional details for our depth map refinement and enhancement methods presented in Sections 3.2 and 3.3 of Chapter 2.

A.1.1 Modified MVS algorithm

Our modified MVS algorithm and semantic segmentation-based depth map filtering are summarized in Algorithm 1. Our algorithm first runs PatchMatch [35] using photometric consistency constraints, as implemented in COLMAP, to solve for an initial depth map $D^0$ (with some pixels whose depth could not be estimated marked as invalid). Next, $K$ iterations of PatchMatch using geometric consistency constraints are run. For each iteration $k$, we compare the depth values at each pixel before and after the update and keep the smaller (closer) of the two, to get an updated depth map $D^k$. After $K$ iterations of PatchMatch, we apply a median filter to $D^K$ and only keep depths whose values are stable, in that they are close to their median-filtered value.
Finally, we remove spurious depths from transient objects based on semantic segmentation, as described in Section 3.3 of Chapter 2. Regarding the parameters defined in Algorithm 1, we set $\tau_1 = \tau_2 = 1.15$ and $K = 3$. Two additional examples of depth maps with and without our refinements are shown in Figure A.1.

Algorithm 1 Depth Refinement and Semantic Cleaning
Input: Input image $I$, semantic segmentation map $L$ (divided into subregions $F$ (foreground), $B$ (background), and $S$ (sky)).
Output: Refined depth map $D$ for image $I$.
1: Run PatchMatch using photometric consistency constraints to solve for an initial depth estimate $D^0$. Pixels in $D^0$ without an assigned depth are instead assigned a NaN sentinel value.
2: for round $k = 1$ to $K$ do
3:   Run PatchMatch using geometric consistency constraints on $D^{k-1}$ to get updated depth estimate $D^k$.
4:   $R^k = D^k / D^{k-1}$ (element-wise)
5:   for each valid (non-NaN) pixel $p$ of $R^k$ do
6:     if $R^k_p > \tau_1$ then
7:       $D^k_p = D^{k-1}_p$
8:     else
9:       $D^k_p = D^k_p$
10: Apply a $5 \times 5$ median filter on $D^K$, storing the result in $\hat{D}^K$.
11: Filter (replace with NaN) unstable pixels of $D^K$ for which $\max(\hat{D}^K_p / D^K_p,\, D^K_p / \hat{D}^K_p) > \tau_2$.
12: for each connected component $C$ from $F$ do
13:   if the fraction of valid depths in $C$ is $> 50\%$ then
14:     keep depths in region $C$ from $D^K$.
15:   else
16:     remove all depths in region $C$ from $D^K$.
17: Filter out all depths in the sky region $S$.
18: Apply morphological erosion followed by a small connected components removal operation on $D^K$ to obtain the final depth map $D$.

A.1.2 Foreground and background classes

In this subsection, we provide details of the foreground object classes used to define the foreground mask $F$ for each image, and similarly the background object classes used to define the background mask $B$. These classes are subsets of the classes recognized by our semantic segmentation module, as described in Section 3.3 of Chapter 2.

Foreground classes. $F$ = {person, table, chair, seat, signboard, flower, book, bench, boat, bus, truck, streetlight, booth, poster, van, ship, fountain, bag, minibike, ball, animal, bicycle, sculpture, traffic light, bulletin board}

Background classes. $B$ = {building, house, skyscraper, hill, tower, waterfall, mountain}.

Figure A.1: Additional example comparisons between MVS depth maps with and without our proposed refinement/cleaning methods. (a) Input photo, (b) raw depth, (c) refined depth. Column (b) (before filtering): the plinth of the statue in the first row and the “Statue of Liberty” in the second row both show depth bleeding effects. Column (c) (after filtering): our refinement method corrects or removes such depth values.

A.1.3 Automatic ordinal depth labeling

In this subsection, we provide additional details for our automatic ordinal depth labeling method. Recall that $O$ (“Ordinal”) is the subset of photos that do not satisfy the “no selfies” criterion described in Chapter 2. Recall that the “no selfies” criterion rejects images $I$ for which $< 30\%$ of the pixels (ignoring the sky region $S$) consist of valid depth values; these rejected images are added to the set $O$. For each image $I \in O$, and given the foreground and background regions $F$ and $B$ in $I$ as defined above, we compute two regions, $F_{ord} \subseteq F$ and $B_{ord} \subseteq B$, such that all pixels in $F_{ord}$ are likely in front of all pixels in $B_{ord}$. In particular, we assign any connected component $C$ of $F$ to $F_{ord}$ if the area of $C$ is larger than 5% of the image. We assign a pixel $p \in B$ to $B_{ord}$ if it satisfies the following conditions:

1. $p$ belongs to the background region $B$,
2. the area of $p$'s connected component in $B$ is larger than 5% of the image, and

3. $p$ has a valid depth value that lies in the last quartile of the full range of depths for $I$.

Originally, we considered a more complex approach involving geometric reasoning (e.g., estimating where foreground objects touch the ground), but we found that the simple approach above works very well ($> 95\%$ accuracy in pairwise ordinal relationships), likely because natural photos tend to be composed in certain common ways. Additional examples of our automatic ordinal depth labels are shown in Figure A.2.

Figure A.2: Additional examples of automatic ordinal labeling. Blue mask: foreground ($F_{ord}$) derived from semantic segmentation. Red mask: background ($B_{ord}$) derived from reconstructed depth.

A.2 SfM Disagreement Rate (SDR)

In this section, we provide additional details for our SfM Disagreement Rate (SDR) error metric defined in Section 5.1 of Chapter 2. SDR is based on the rate of disagreement between a predicted depth map and the ordinal depth relationships derived from estimated ground truth SfM points. We use sparse SfM points for this purpose rather than dense MVS depths for two reasons: (1) we found that sparse SfM points can capture some structures not reconstructed by MVS (e.g., complex objects such as lampposts), and (2) we can select a robust subset of SfM points based on measures from SfM such as the number of observing cameras or the uncertainty of the estimated depth computed by bundle adjustment.

We define $\mathrm{SDR}(D, D^*)$, the ordinal disagreement rate between the predicted (non-log) depth map $D = \exp(L)$ and ground-truth SfM depths $D^*$, as:

$$\mathrm{SDR}(D, D^*) = \frac{1}{n} \sum_{i,j \in \mathcal{P}} \mathbb{1}\left( \mathrm{ord}(D_i, D_j) \neq \mathrm{ord}(D^*_i, D^*_j) \right) \tag{A.1}$$

where $\mathcal{P}$ is the set of pairs of pixels with available SfM depths to compare, $n$ is the total number of pairwise comparisons, and $\mathrm{ord}(\cdot, \cdot)$ is one of three depth relations (further-than, closer-than, and same-depth-as):

$$\mathrm{ord}(D_i, D_j) = \begin{cases} 1 & \text{if } D_i / D_j > 1 + \delta \\ -1 & \text{if } D_i / D_j < 1 - \delta \\ 0 & \text{if } 1 - \delta \le D_i / D_j \le 1 + \delta \end{cases} \tag{A.2}$$

In other words, SDR is the rate of disagreement between predicted and ground-truth depths in terms of pairwise depth orderings (a minimal sketch of this computation appears at the end of this section). Note that SDR is an unweighted measure for simplicity (all measurements count the same towards the cost), but we can also integrate depth uncertainty derived from bundle adjustment as a weight. We also define $\mathrm{SDR}_{=}$ and $\mathrm{SDR}_{\neq}$ as the disagreement rates over pairs with $\mathrm{ord}(D^*_i, D^*_j) = 0$ and $\mathrm{ord}(D^*_i, D^*_j) \neq 0$, respectively. In our experiments, we set $\delta = 0.1$ for tolerance to uncertainty in SfM points.

Because SDR is based on point pairs and hence takes $O(n^2)$ time to compute, for efficiency we subsample SfM points by splitting each image into $15 \times 15$ blocks and, for each block, randomly sampling an SfM point (if any exist). We then use these sampled points to create a clique of ordinal relations, where each edge connecting two features is augmented with the ordinal depth label. To obtain reliable sparse points, we only sample SfM points seen by $> 5$ cameras and with reprojection error $< 3$ pixels. Figure A.3 shows several examples of SfM points we sample for evaluating SDR.

Figure A.3: Examples of sampled SfM points. Red circles indicate sampled SfM points, with the radius indicating the estimated depth derived from SfM; small radius = small (close) depth, large radius = large (far) depth.
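As referenced above, the following is a minimal NumPy sketch of the SDR computation in Eqs. A.1–A.2, assuming the predicted and SfM depths have already been sampled at matched pixel locations. The function name and flat-array interface are illustrative assumptions, not the exact implementation used in this chapter.

```python
import numpy as np

def sdr(d_pred, d_gt, delta=0.1):
    """SfM Disagreement Rate over all pairs of sampled points (Eqs. A.1-A.2).

    d_pred, d_gt: 1D arrays of predicted and SfM depths at sampled pixels.
    Returns (SDR, SDR_eq, SDR_neq) following the definitions above.
    """
    def ord_rel(d):
        # Pairwise ratio matrix; ord = 1 (further), -1 (closer), 0 (same).
        r = d[:, None] / d[None, :]
        return np.where(r > 1 + delta, 1, np.where(r < 1 - delta, -1, 0))

    o_pred, o_gt = ord_rel(d_pred), ord_rel(d_gt)
    iu = np.triu_indices(len(d_pred), k=1)   # count each unordered pair once
    disagree = o_pred[iu] != o_gt[iu]
    eq, neq = (o_gt[iu] == 0), (o_gt[iu] != 0)
    sdr_all = disagree.mean()
    sdr_eq = disagree[eq].mean() if eq.any() else 0.0
    sdr_neq = disagree[neq].mean() if neq.any() else 0.0
    return sdr_all, sdr_eq, sdr_neq
```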
APPENDIX B
CHAPTER 3 APPENDIX

B.1 Derivations of depth from motion parallax

Here we provide detailed derivations of depth from motion parallax using the Plane+Parallax representation (Section 4.1). Recall that in Chapter 3, we define the relative camera pose as $R \in SO(3)$, $t \in \mathbb{R}^3$ from source image $I_s$ to reference image $I_r$, with common intrinsics matrix $K$. We denote the forward flow from $I_r$ to $I_s$ as $f_{fwd}$, and the backward flow from $I_s$ to $I_r$ as $f_{bwd}$. Let $\Pi$ denote a real or virtual planar surface, let $d'_\Pi$ denote the distance between the camera center of source image $I_s$ and the plane $\Pi$, and let $h$ be the distance between the 3D scene point corresponding to 2D pixel $p$ and $\Pi$. It can be shown (see the Appendix of [152] for full intermediate derivations) that

$$p = p_w + \frac{h\, t_z}{D_{pp}(p)\, d'_\Pi}\, p_w - \frac{h}{D_{pp}(p)\, d'_\Pi}\, K t \tag{B.1}$$

$$= p_w + \frac{h}{D_{pp}(p)\, d'_\Pi} \left( t_z p_w - K t \right) \tag{B.2}$$

where $D_{pp}(p)$ is the estimated depth at $p$ in the reference image $I_r$, $t_z$ is the third component of the translation vector $t$, and $p_w$ is the 2D image point in $I_r$ that results from warping the corresponding 2D pixel in $I_s$ (obtained via optical flow $f_{fwd}$) by a homography $A$:

$$p_w = \frac{A p'}{a_3^T p'}, \quad \text{where } A = K \left( R + \frac{t\, n'^T}{d'_\Pi} \right) K^{-1} \tag{B.3}$$

where $p' = p + f_{fwd}(p)$, $a_3^T$ is the third row of $A$, and $n'$ is the normal of plane $\Pi$ with respect to the camera of source image $I_s$. Note that the original paper [152] divides the P+P representation into two cases depending on whether $t_z = 0$, but we combine these two cases into the single equation shown in Equation B.2 by algebraic manipulation. Now, if we set the plane $\Pi$ at infinity, using L'Hôpital's rule, we can cancel out $h$ and $d'_\Pi$ and obtain the following equations:

$$p = p_w + \frac{t_z p_w - K t}{D_{pp}(p)}, \qquad D_{pp}(p) = \frac{\| t_z p_w - K t \|_2}{\| p - p_w \|_2}, \tag{B.4}$$

where $p_w = \frac{A' p'}{a_3'^T p'}$ and $A' = K R K^{-1}$.
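To make the plane-at-infinity case concrete, here is a minimal NumPy sketch that evaluates $D_{pp}(p)$ for a single pixel via Eq. B.4, under the pose and flow conventions defined above. The function name and input layout are illustrative assumptions rather than the implementation used in Chapter 3.

```python
import numpy as np

def depth_from_parallax(p, flow, K, R, t):
    """Recover depth from motion parallax with the plane at infinity (Eq. B.4).

    p:    homogeneous pixel (3,) in the reference image I_r.
    flow: forward optical flow (2,) at p, from I_r to I_s.
    K, R, t: shared intrinsics and relative pose, following the
    conventions defined in the derivation above.
    """
    # Warp p' = p + f_fwd(p) by the infinite homography A' = K R K^-1.
    A_inf = K @ R @ np.linalg.inv(K)
    p_prime = p + np.array([flow[0], flow[1], 0.0])
    pw = A_inf @ p_prime
    pw = pw / pw[2]   # divide by third component, i.e., a_3'^T p'
    # D_pp(p) = ||t_z * p_w - K t||_2 / ||p - p_w||_2 (2D image-plane norms).
    num = (t[2] * pw - K @ t)[:2]
    den = (p - pw)[:2]
    return np.linalg.norm(num) / np.linalg.norm(den)
```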
B.2 Derivation of error metrics

Recall that in Section 5 of Chapter 3 we define five different depth error metrics based on the scale-invariant RMSE (si-RMSE). Here we provide definitions of each error metric. Note that we can use algebraic manipulations similar to those proposed in [206] to evaluate all terms in time linear in the number of pixels. As in Chapter 3, we denote with $\hat{D}$ the predicted depth, and with $D_{gt}$ the ground truth depth. We define $R(p) = \log \hat{D}(p) - \log D_{gt}(p)$, i.e., the difference between computed and ground truth log-depth. We also denote human regions as $\mathcal{H}$ (with $N_h$ valid pixels), non-human (environment) regions as $\mathcal{E}$ (with $N_e$ valid pixels), and the full image region as $\mathcal{I} = \mathcal{H} \cup \mathcal{E}$ (with $N = N_e + N_h$ valid pixels). Our error metrics are defined as follows:

si-full measures the si-RMSE between all pairs of pixels, giving the overall accuracy across the entire image:

$$\text{si-full} = \frac{1}{N^2} \sum_{p \in \mathcal{I}} \sum_{q \in \mathcal{I}} (R(p) - R(q))^2 \tag{B.5}$$
$$= \frac{1}{N^2} \sum_{p \in \mathcal{I}} \sum_{q \in \mathcal{I}} \left( R(p)^2 + R(q)^2 - 2 R(p) R(q) \right) \tag{B.6}$$
$$= \frac{2}{N^2} \left( N \sum_{p \in \mathcal{I}} R(p)^2 - \sum_{p \in \mathcal{I}} R(p) \sum_{q \in \mathcal{I}} R(q) \right) \tag{B.7}$$
$$= \frac{2}{N} \sum_{p \in \mathcal{I}} R(p)^2 - \frac{2}{N^2} \left( \sum_{p \in \mathcal{I}} R(p) \right)^2 \tag{B.8}$$

si-env measures pairs of pixels in non-human regions $\mathcal{E}$, thus computing the accuracy of the depth in the environment:

$$\text{si-env} = \frac{1}{N_e^2} \sum_{p \in \mathcal{E}} \sum_{q \in \mathcal{E}} (R(p) - R(q))^2 \tag{B.9}$$
$$= \frac{2}{N_e^2} \left( N_e \sum_{p \in \mathcal{E}} R(p)^2 - \sum_{p \in \mathcal{E}} R(p) \sum_{q \in \mathcal{E}} R(q) \right) \tag{B.10}$$

si-hum measures pairs where one pixel lies in the human region $\mathcal{H}$ and one lies anywhere in the image, thus computing overall depth accuracy for the people in the scene:

$$\text{si-hum} = \frac{1}{N N_h} \sum_{p \in \mathcal{H}} \sum_{q \in \mathcal{I}} \left( R(p)^2 + R(q)^2 - 2 R(p) R(q) \right) \tag{B.11}$$
$$= \frac{1}{N N_h} \left( N \sum_{p \in \mathcal{H}} R(p)^2 + N_h \sum_{q \in \mathcal{I}} R(q)^2 - 2 \sum_{p \in \mathcal{H}} R(p) \sum_{q \in \mathcal{I}} R(q) \right) \tag{B.12}$$

si-hum can further be divided into the sum of two error measures:

si-intra measures si-RMSE within $\mathcal{H}$, or human accuracy independent of the environment:

$$\text{si-intra} = \frac{1}{N_h^2} \sum_{p \in \mathcal{H}} \sum_{q \in \mathcal{H}} (R(p) - R(q))^2 \tag{B.13}$$
$$= \frac{2}{N_h^2} \left( N_h \sum_{p \in \mathcal{H}} R(p)^2 - \sum_{p \in \mathcal{H}} R(p) \sum_{q \in \mathcal{H}} R(q) \right) \tag{B.14}$$

si-inter measures si-RMSE between pixels in $\mathcal{H}$ and in $\mathcal{E}$, or human accuracy w.r.t. the environment:

$$\text{si-inter} = \frac{1}{N_e N_h} \sum_{p \in \mathcal{H}} \sum_{q \in \mathcal{E}} \left( R(p)^2 + R(q)^2 - 2 R(p) R(q) \right) \tag{B.15}$$
$$= \frac{1}{N_e N_h} \left( N_e \sum_{p \in \mathcal{H}} R(p)^2 + N_h \sum_{q \in \mathcal{E}} R(q)^2 - 2 \sum_{p \in \mathcal{H}} R(p) \sum_{q \in \mathcal{E}} R(q) \right) \tag{B.16}$$
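To illustrate the payoff of these expansions, the following is a minimal NumPy sketch that evaluates si-full via Eq. B.8 in time linear in the number of valid pixels; the other four metrics follow the same pattern with the appropriate regions. The function name and flat-array inputs are illustrative assumptions.

```python
import numpy as np

def si_full(log_d_pred, log_d_gt):
    """si-full (Eq. B.8) in time linear in the number of valid pixels.

    log_d_pred, log_d_gt: 1D arrays of log predicted / ground-truth depth
    over valid pixels of the full image region I.
    """
    r = log_d_pred - log_d_gt   # R(p) = log D_hat(p) - log D_gt(p)
    n = r.size
    # (2/N) * sum R^2 - (2/N^2) * (sum R)^2, from expanding the pairwise sum.
    return (2.0 / n) * np.sum(r**2) - (2.0 / n**2) * np.sum(r)**2
```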
B.3 Network Architecture

Figure B.1: Network Architecture. Each block with a different color (id) in (a) indicates a convolutional layer. The block labeled H indicates a $3 \times 3$ convolutional layer, and all other blocks are implemented as a variant of an Inception module [344], as shown in (b). Parameters for each type of layer are shown in (c). We use bilinear interpolation to upsample features in the network. Figures modified from Chen et al. [63].

Our network architecture is a variant of the hourglass network proposed by Chen et al. [63], and is shown in Figure B.1. Specifically, our network has a standard encoder-decoder U-Net structure with matching input and output resolution, consisting of approximately 5M parameters. In addition, an Inception module [344] is used in each convolutional layer of the network. We replaced the nearest-neighbor upsampling layers with bilinear upsampling layers, which we found produced sharper depth maps while slightly improving overall accuracy.

APPENDIX C
CHAPTER 4 APPENDIX

C.1 Hyperparameter Settings

For all experiments, we set our hyperparameters as follows. For the overall energy function defined in Equation 3 of Chapter 4, we set $w_1 = 1$, $w_2 = 6$, and $w_3 = 2$. For Equation 8 describing the affinity between pixels, we define a covariance matrix $\Sigma$ between reflectance feature vectors $f_p$ and $f_q$ as follows: we set $\Sigma$ to be a diagonal matrix for simplicity, and define $\Sigma = \mathrm{diag}(0.1^2, 0.1^2, 0.1^2, 0.025^2, 0.025^2)$. Lastly, for Equations 11 and 12 relating to shading smoothness, we set $\lambda_{med} = 20$ and $\lambda_{med} = 4$.

C.2 All-Pairs Weighted Least Squares (APWLS)

In this section, we provide a detailed derivation of our proposed All-Pairs Weighted Least Squares computation (APWLS), as described in Section 5.5 of Chapter 4. Suppose that we have an image sequence with $m$ images, and each image has $n$ pixels. Now suppose each image $I^i$ is associated with two matrices $P^i$ and $Q^i$ and two predictions $X^i$ and $Y^i$. We can then write APWLS as

$$\text{APWLS} = \sum_{i=1}^{m} \sum_{j=1}^{m} \| P^i \otimes Q^j \otimes (X^i - Y^j) \|_F^2 \tag{C.1}$$
$$= \sum_{p=1}^{n} \sum_{i=1}^{m} \sum_{j=1}^{m} \left( P^i_p\, Q^j_p\, (X^i_p - Y^j_p) \right)^2 \tag{C.2}$$
$$= \sum_{p=1}^{n} \sum_{i=1}^{m} \sum_{j=1}^{m} (P^i_p)^2 (Q^j_p)^2 (X^i_p - Y^j_p)^2 \tag{C.3}$$
$$= \sum_{p=1}^{n} \sum_{i=1}^{m} (P^i_p)^2 \left( \sum_{j=1}^{m} (Q^j_p)^2 (X^i_p)^2 + \sum_{j=1}^{m} (Q^j_p)^2 (Y^j_p)^2 - 2 X^i_p \sum_{j=1}^{m} (Q^j_p)^2 Y^j_p \right) \tag{C.4}$$
$$= \mathbf{1}^\top \left( \Sigma_{Q^2} \otimes \Sigma_{P^2 X^2} + \Sigma_{P^2} \otimes \Sigma_{Q^2 Y^2} - 2\, \Sigma_{P^2 X} \otimes \Sigma_{Q^2 Y} \right) \mathbf{1} \tag{C.5}$$

where $\Sigma_{Q^2} = \sum_{i=1}^{m} Q^i \otimes Q^i$; $\Sigma_{P^2} = \sum_{i=1}^{m} P^i \otimes P^i$; $\Sigma_{P^2 X^2} = \sum_{i=1}^{m} P^i \otimes P^i \otimes X^i \otimes X^i$; $\Sigma_{Q^2 Y^2} = \sum_{i=1}^{m} Q^i \otimes Q^i \otimes Y^i \otimes Y^i$; $\Sigma_{P^2 X} = \sum_{i=1}^{m} P^i \otimes P^i \otimes X^i$; and $\Sigma_{Q^2 Y} = \sum_{i=1}^{m} Q^i \otimes Q^i \otimes Y^i$.

C.3 Additional details for SAW evaluation metrics

In this section, we reiterate the two improvements we made to the metric used to evaluate results on SAW annotations (described in Section 6.2 of Chapter 4) and provide more detailed explanations.

First, the original SAW error metric, as described by Kovacs et al. [185], is based on classifying a pixel $p$ as having smooth/nonsmooth shading based on the gradient magnitude of the predicted shading image, $\|\nabla S\|_2$, normalized to the range $[0, 1]$. Instead, we measure the gradient magnitude in the log domain. We do this because of the scale ambiguity inherent to shading and reflectance, and because it is possible to have very bright values in the shading channel (e.g., due to strong sunlight); in such cases, if we normalize shading to $[0, 1]$, then most of the resulting values will be close to 0. In contrast, computing the gradient magnitude of log shading, $\|\nabla \log S\|_2$, achieves scale invariance, resulting in fairer comparisons for all methods. As in [185], we sweep a threshold $\tau$ to create a precision-recall (PR) curve that captures how well each method captures smooth and non-smooth shading.

Second, Kovacs et al. [185] apply a $10 \times 10$ maximum filter to the shading gradient magnitude image before computing PR curves, because many shadow boundary annotations are not precisely localized. However, this maximum filter can result in degraded performance for smooth shading regions. Consider adding 1% salt-and-pepper noise to the shading estimate: applying a maximum filter to this noisy gradient magnitude image would make it seem as if there are large changes everywhere. Moreover, we found that several annotated smooth regions are close to the boundaries of shading changes caused by depth/normal discontinuities, and if we apply a maximum filter, we might integrate incorrect shading information from outside the annotated regions into our evaluation. Instead, we create two maps: the original $\|\nabla \log S\|_2$, and a $10 \times 10$ maximum-filtered version of $\|\nabla \log S\|_2$, which we denote $\|\nabla \log S\|_2^{max}$. We use $\|\nabla \log S\|_2$ to classify smooth shading annotations and $\|\nabla \log S\|_2^{max}$ to classify non-smooth annotations.
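A minimal sketch of how the two maps might be constructed is shown below, assuming a NumPy shading image and SciPy's maximum filter; the function name, the epsilon guard, and the use of finite differences via np.gradient are illustrative assumptions rather than the exact evaluation code.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def shading_gradient_maps(shading, eps=1e-6):
    """Build the two gradient-magnitude maps used by the modified SAW metric.

    Returns ||grad log S||_2 (for classifying smooth-shading annotations)
    and its 10x10 max-filtered version (for non-smooth annotations).
    """
    log_s = np.log(shading + eps)        # log domain for scale invariance
    gy, gx = np.gradient(log_s)
    grad_mag = np.sqrt(gx**2 + gy**2)
    grad_mag_max = maximum_filter(grad_mag, size=10)
    return grad_mag, grad_mag_max
```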
APPENDIX D
CHAPTER 5 APPENDIX

D.1 Additional details for training losses

D.1.1 Ordinal term for CGINTRINSICS

Recall $L_{CGI}$ (Equation 2 in Chapter 5), the loss defined for our CGINTRINSICS training data:

$$L_{CGI} = L_{sup} + \lambda_{ord} L_{ord} + \lambda_{rec} L_{reconstruct} \tag{D.1}$$

We now provide the full formula for $L_{ord}$, the ordinal loss term. In particular, for a given CGINTRINSICS training image and predicted reflectance $R$, we accumulate losses for each pair of pixels $(i, j)$ generated from a set of pixels $P$, where one pixel is sampled at random from oversegmented regions in that image:

$$L_{ord}(R) = \sum_{\substack{(i,j) \in P \times P \\ i \neq j}} f_{i,j}(R), \tag{D.2}$$

where

$$f_{i,j}(R) = \begin{cases} (\log R_i - \log R_j)^2, & -\tau_1 < P^*_{i,j} < \tau_1 \\ \left( \max(0,\, \tau_2 - \log R_i + \log R_j) \right)^2, & P^*_{i,j} > \tau_2 \\ \left( \max(0,\, \tau_2 - \log R_j + \log R_i) \right)^2, & P^*_{i,j} < -\tau_2 \\ 0, & \text{otherwise} \end{cases} \tag{D.3}$$

where $R^*$ is the rendered ground truth reflectance, $P^*_{i,j} = \log R^*_i - \log R^*_j$, $\tau_1 = \log(1.05)$, and $\tau_2 = \log(1.5)$. The intuition is that we categorize pairs of ground truth reflectances at pixels $(i, j)$ as having an “equal,” “greater than,” or “less than” relationship, and then add a penalty if the predicted reflectances at those pixels do not satisfy the same relationship.

D.1.2 Additional hyperparameter settings

In all experiments described in Chapter 5, we set $\lambda_{IIW} = \lambda_{SAW} = 2$, $\lambda_{ord} = \lambda_{rs} = \lambda_{S/NS} = 1$, $\lambda_{rec} = 2$, and $\lambda_{ss} = 4$. The number of image scales is $L = 4$. The margin in Equation 7 of Chapter 5 is $m = 0.425$. For simplicity, the covariance matrix $\Sigma$ defined in $L_{rsmooth}$ (Equation 10 in Chapter 5) is a diagonal matrix, defined as $\Sigma = \mathrm{diag}(\sigma_p^2, \sigma_p^2, \sigma_I^2, \sigma_c^2, \sigma_c^2)$, where $\sigma_p = 0.1$, $\sigma_I = 0.12$, $\sigma_c = 0.03$.

APPENDIX E
CHAPTER 6 APPENDIX

E.1 Priors on the Plenoptic Function

Our scene representation and approach to training are motivated by simple priors on structure in the plenoptic function. Given that our crowdsampling of a scene is unpredictable and unregistered in time, we focus primarily on periodic changes in reflectance and illumination; most notably, this includes how the appearance of a scene changes from day to night. For most scenes, this change is dominated by the motion of the sun. However, we also see scenes where, for example, visible lights turn off and on throughout the day (e.g., a cityscape or an attraction that lights up at night). Below we describe two priors that motivate the design of our representation and training.

E.1.1 Constant Visibility and Light Field Gradients

Even in non-Lambertian scenes with changing reflectance and illumination, we may expect the structure of visibility to remain relatively constant over time. If we consider slices of the plenoptic function at different times, we can think of this expectation in terms of gradients in the respective light fields. To see this, consider that every scene point corresponds to some 4D hyperplane in the light field. If the light transport function around our point is smooth, then we can expect that this hyperplane will be locally constant. Gradients then primarily occur at boundaries between hyperplanes, which include occlusion boundaries and edges in reflectance (e.g., surface texture) or illumination (e.g., shadows). The structure of visibility in a scene determines the adjacency of these hyperplanes, thereby limiting the set of gradients that can be introduced by changing reflectance or illumination. Our approach leverages this prior that different slices of the plenoptic function share visibility structure by fixing the alpha values of our DeepMPI. Recall that each voxel of an MPI can be interpreted as a floating semi-transparent surface point. This corresponds to a constant hyperplane in our reconstructed light field, just as an analogous real surface point would. Fixing the alphas determines the visibility of such points, and therefore the adjacency of their corresponding hyperplanes in the reconstructed light field.
E.1.2 Common Light Sources, Material Properties, and Normals

We can think of the light transport function around every point in our scene as mapping incoming light to outgoing light. The appearance of a point in a particular image is then a sample from this transport function. Without explicitly modeling the transport function, we can reason about correlations among the samples provided by different scene points and across different viewing conditions. For example, it is reasonable to expect that many visible points in a given scene will share the same material properties, and that the relationship between surface normals at different points will remain constant (this is true so long as surface geometry does not change). Furthermore, we can expect correlation due to different points being lit by the same source (often, the sun). We learn how to leverage these many sources of correlation by training our DeepMPI with feature vectors attached to each voxel. Intuitively, this creates a latent space where surface points with highly correlated appearance end up with similar feature vectors, preserving important correlations in our generated MPIs.

E.2 Scene Statistics

Table E.1 shows statistics for each scene, including the number of valid images, the field of view (FoV) of the reference DeepMPI, and the depths of the near and far MPI planes. We adopt the same method as that of Zhou et al. [418] to estimate the scale of each scene in order to set the near and far plane depths. All data, including original images, registered poses, and SfM reconstructions, will be released to the research community.

Table E.1: Scene statistics. We include (1) the total number of images, (2) the field of view (FoV) of the reference DeepMPI, and (3) the depths of the near and far MPI planes. The first five scenes are used for evaluation in Chapter 6.

Scene              # images   FoV (°)   (near/far) plane depth
Trevi Fountain       3453       70        1/4
Sacre Coeur          2112       65        1/20
The Pantheon         1917       65        1/25
Top of the Rock      2232       75        1/75
Piazza Navona         606       70        1/25
Mount Rushmore       3075       30        1/4
Lincoln Memorial     2582       45        1/4
Eiffel Tower         1999       65        1/20

E.3 Losses

E.3.1 Losses optimizing DeepMPI color and α planes

Recall that in Section 3.3 of Chapter 6, we compare the rendered base color image $\hat{B}^k$ and the real photo $I^k$ at the target viewpoint using a reconstruction loss $L_{recon}$. $L_{recon}$ consists of a pixel-wise $\ell_1$ loss and a multi-scale gradient consistency loss [203]:

$$L_{recon} = \| \hat{B}^k(0) - I^k(0) \|_{1,1} + w_{grad} \sum_{s=1}^{S} \| \nabla \hat{B}^k(s) - \nabla I^k(s) \|_{1,1}, \tag{E.1}$$

where $S$ is the number of scales we create for calculating the gradient consistency loss, and $I(s)$ is the image at scale $s$ (where $s = 0$ is equivalent to the original resolution). In our experiments, we set $S = 3$ and use nearest-neighbor downsampling to create image pyramids for both rendered and ground truth images.

E.3.2 Training Losses

Recall that in Section 3.4 of Chapter 6, to train the rendering network $G$, the appearance encoder $E$, and the latent features in the DeepMPI, we compute losses between output views and ground-truth exemplar views. Specifically, our training loss is composed of three terms:

$$L = L_{VGG} + w_{GAN} L_{GAN} + w_{style} L_{style}, \tag{E.2}$$

$L_{VGG}$ is a normalized VGG perceptual loss similar to that used in [418, 62]:

$$L_{VGG} = \sum_l w_l \| \phi_l(\hat{I}^k) - \phi_l(I^k) \|_1, \tag{E.3}$$

where $\phi_l(x)$ indicates the output of VGG layer $l \in \{$conv1_2, conv2_2, conv3_2, conv4_2, conv5_2$\}$ with input $x$, and the weight $w_l$ is proportional to the reciprocal of the number of neurons in the corresponding VGG layer.
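Stepping back to Eq. E.1 above, here is a minimal PyTorch sketch of the reconstruction loss, assuming NCHW tensors, nearest-neighbor pyramids as described, and finite differences for the image gradients; the function and parameter names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def recon_loss(b_pred, img, num_scales=3, w_grad=0.25):
    """Reconstruction loss of Eq. E.1: pixel-wise l1 at full resolution plus
    a multi-scale l1 gradient consistency term over nearest-neighbor pyramids.

    b_pred, img: (N, 3, H, W) rendered base color and ground-truth photo.
    """
    def grads(x):
        gx = x[..., :, 1:] - x[..., :, :-1]   # horizontal finite differences
        gy = x[..., 1:, :] - x[..., :-1, :]   # vertical finite differences
        return gx, gy

    loss = torch.mean(torch.abs(b_pred - img))
    for s in range(1, num_scales + 1):
        p = F.interpolate(b_pred, scale_factor=0.5**s, mode="nearest")
        g = F.interpolate(img, scale_factor=0.5**s, mode="nearest")
        for gp, gg in zip(grads(p), grads(g)):
            loss = loss + w_grad * torch.mean(torch.abs(gp - gg))
    return loss
```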
Furthermore, we add an adversarial loss $L_{GAN}$ to improve the realism of the rendered images. In particular, $L_{GAN}$ is computed from multi-scale discriminators [367] with an objective similar to LSGAN [227]:

$$L_{GAN} = L_{GAN}(D) + L_{GAN}(G), \tag{E.4}$$
$$L_{GAN}(D) = \mathbb{E}_{I^k \sim p(I)} \left[ \left( D(I^k) - 1 \right)^2 \right] + \mathbb{E}_{z \sim p_z(I^k)} \left[ D\!\left( G(D^k, z) \right)^2 \right], \tag{E.5}$$
$$L_{GAN}(G) = \mathbb{E}_{z \sim p_z(I^k)} \left[ \left( D\!\left( G(D^k, z) \right) - 1 \right)^2 \right], \tag{E.6}$$

where $L_{GAN}(D)$ is the loss for the discriminator, and $L_{GAN}(G)$ is the loss for our neural renderer. To further enforce that the appearance of the rendered images matches the appearance of the exemplar images, we add a style loss $L_{style}$, which compares the $\ell_{1,1}$ norm of the difference between Gram matrices constructed from VGG features at different layers:

$$L_{style} = \sum_l \| g(\phi_l(\hat{I}^k)) - g(\phi_l(I^k)) \|_{1,1}, \tag{E.7}$$

where $g(x)$ is the Gram matrix of a VGG feature $x$.

E.4 Network Architecture

Let $D$ and $C$ denote the number of depth planes and channels in our DeepMPI, and let $H^k$ and $W^k$ denote the height and width of the view at target viewpoint $c^k$.

Appearance encoder. Our appearance encoder consists of two encoders, denoted $E_1$ and $E_2$. $E_1$ takes as input a reference feature buffer $\Phi_{rs}$ with a fixed resolution of $512 \times 512$, and produces a latent feature vector $z_1 \in \mathbb{R}^{512}$. We adopt the encoder implemented by Park et al. [271] for $E_1$. $E_2$ takes as input an exemplar image $I^s$ with varying aspect ratios, and produces a latent feature vector $z_2 \in \mathbb{R}^{256}$. We adopt the encoder from Huang et al. [143] as $E_2$. The two latent vectors are then passed through a fully connected layer in order to produce the final latent appearance vector $z \in \mathbb{R}^{16}$ we describe in Chapter 6.

Figure E.1: Visual examples of reference base color images. These are over-composited from base color planes of the reference DeepMPI.

Neural renderer. We adopt a U-Net modified from Zhu et al. [423] as our neural rendering network. In summary, during training and evaluation, we feed the DeepMPI at the target viewpoint, with size $D \times C \times H^k \times W^k$, to the rendering network, and the network predicts RGB MPI planes with size $D \times 3 \times H^k \times W^k$. However, the rendering network operates on each depth slice of the DeepMPI at the target viewpoint independently (with size $C \times H^k \times W^k$), and predicts the corresponding RGB color image (with size $3 \times H^k \times W^k$). Therefore, our rendering network independently processes every depth slice of the DeepMPI, without considering interactions between them. Our rendering network consists of five convolutional layers in both the encoder and the decoder. Each layer of the encoder consists of a $3 \times 3$ stride-2 convolutional layer followed by Instance Normalization [352] and leaky ReLU. Each layer of the decoder consists of bilinear upsampling followed by a $3 \times 3$ convolutional layer. Adaptive Instance Normalization (AdaIN) layers [142] are embedded between the bilinear upsampling and the feature concatenation of skip connections.

Discriminator. We adopt the network architecture from Huang et al. [143] as the discriminator used for our GAN loss. In particular, the discriminator takes as input images at three scales, and predicts scores for each patch of the input image.

E.5 Training and Implementation

We implement our framework using PyTorch. In all our experiments, we empirically set the hyperparameters $w_{grad} = 0.25$, $w_{GAN} = 0.2$, and $w_{style} = 5$. We set the resolution of the reference DeepMPI to $784 \times 784$. In the first stage, we optimize the base color and α planes in the reference DeepMPI for 100 epochs in total (70 epochs in phase one, and 30 epochs in phase two) using a single Tesla T40 GPU.
We adopt the Adam optimizer [177] with an initial learning rate of $1 \times 10^{-3}$ for this optimization. In the second stage, we use 4 Tesla T40 GPUs to jointly train the rendering network $G$, the appearance encoder $E$, and the latent features $F^r$ in the DeepMPI for 50 epochs. During training, we adopt the Adam optimizer [177] and set a learning rate of $3 \times 10^{-4}$ for $E$, $G$, and $F^r$, and a learning rate of $1 \times 10^{-5}$ for the discriminator. In addition, since Internet photos have varying aspect ratios and orientations, we resize their width and height to multiples of 32, depending on each image's original aspect ratio. Due to GPU memory limits, during training we randomly crop a $256 \times 256$ patch from the resized images, and we only render a view corresponding to that patch. However, our method can render a full image with resolution up to $640 \times 480$ at inference time on a single GPU.

E.6 Visual Illustrations

Examples of mean RGB PSV and base color. Figure E.2 shows examples of the reference mean RGB color PSV at different depth layers from different scenes. These are used for initializing the base color planes, as described in Section 3.3 of Chapter 6. In addition, Figure E.1 shows estimated reference base colors, which are over-composited from the base color planes in the reference DeepMPI.

Figure E.2: Visual illustration of reference mean RGB PSV. Different images in each row indicate different depth planes of the plane sweep volume (PSV). The mean RGB images at different depth planes have different in-focus regions.

Examples of rectified RGB images. Figure E.3 shows visual examples of rectified RGB images in the feature buffer $\Phi_{rs}$ aligned with the reference viewpoint, as described in Section 3.4 of Chapter 6.

Figure E.3: Visual examples of rectified RGB images. The reference rectified images are geometrically stable and globally aligned up to disocclusion.

E.7 User Study

Tables E.2–E.4 show scores from 1104 votes (46 participants × 24 comparisons) for each of the following questions, respectively (where users are shown results from multiple algorithms to choose from):

Q1: “Which one looks most photo-realistic? e.g., which video best reproduces the details of geometry and illumination you would expect of a real-world scene?”

Table E.2: User study, share of votes on Q1.

Scene             MUNIT [143]   NRW [238]   Ours
Trevi Fountain        2%           12%       86%
Piazza Navona         6%           25%       69%
Top of the Rock       0%            8%       92%
Sacre Coeur           1%           14%       85%
The Pantheon          4%           18%       78%
Total                 3%           15%       82%

Q2: “Which one appears to be most consistent across viewpoints, with the least jitter or flicker across frames?”

Table E.3: User study, share of votes on Q2.

Scene             MUNIT [143]   NRW [238]   Ours
Trevi Fountain        1%            7%       93%
Piazza Navona         1%           28%       71%
Top of the Rock       0%            9%       91%
Sacre Coeur           1%           10%       89%
The Pantheon          0%            7%       93%
Total                 1%           11%       88%

Q3: “Which one is most faithful to the appearance of the source image? For instance, which image best resembles the illumination and shading on the building in the source image?”

Table E.4: User study, share of votes on Q3.

Scene             MUNIT [143]   NRW [238]   Ours
Trevi Fountain        4%           13%       83%
Piazza Navona         9%           30%       61%
Top of the Rock       0%            9%       91%
Sacre Coeur           3%           16%       81%
The Pantheon          4%           27%       69%
Total                 4%           19%       77%

The user study contains three sets of video comparisons and two sets of image comparisons randomly selected from each scene. Our method received the majority of votes on all questions across all five scenes.
APPENDIX F
CHAPTER 7 APPENDIX

F.1 Scene Flow Regularization Details

Recall that $L_{reg}$ is used as a regularization loss for the predicted scene flow fields, consisting of three terms with equal weights: $L_{reg} = L_{sp} + L_{temp} + L_{min}$, corresponding to spatial smoothness, temporal smoothness, and small scene flow.

Scene flow spatial smoothness [255] minimizes the weighted $\ell_1$ difference between scene flows sampled at neighboring 3D positions along each ray $\mathbf{r}_i$. In particular, the spatial smoothness term is written as:

$$L_{sp} = \sum_{\mathbf{x}_i} \sum_{\mathbf{y}_i \in \mathcal{N}(\mathbf{x}_i)} \sum_{j \in \{i \pm 1\}} w_{dist}(\mathbf{x}_i, \mathbf{y}_i)\, \| f_{i \rightarrow j}(\mathbf{x}_i) - f_{i \rightarrow j}(\mathbf{y}_i) \|_1, \tag{F.1}$$

where $\mathcal{N}(\mathbf{x}_i)$ is the set of neighboring points of $\mathbf{x}_i$ sampled along the ray $\mathbf{r}_i$, and the weights are computed from the Euclidean distance between the two points: $w_{dist}(\mathbf{x}, \mathbf{y}) = \exp(-2 \|\mathbf{x} - \mathbf{y}\|_2)$.

Scene flow temporal smoothness, inspired by Vo et al. [360], encourages 3D point trajectories to be piecewise linear. This is equivalent to minimizing the sum of the forward scene flow and backward scene flow from each sampled 3D point along the ray:

$$L_{temp} = \frac{1}{2} \sum_{\mathbf{x}_i} \| f_{i \rightarrow i+1}(\mathbf{x}_i) + f_{i \rightarrow i-1}(\mathbf{x}_i) \|_2^2 \tag{F.2}$$

Finally, we encourage the scene flow to be minimal in most of 3D space [357] by applying an $\ell_1$ regularization term to each predicted scene flow:

$$L_{min} = \sum_{\mathbf{x}_i} \sum_{j \in \{i \pm 1\}} \| f_{i \rightarrow j}(\mathbf{x}_i) \|_1 \tag{F.3}$$

F.2 Data-Driven Prior Details

Geometric consistency prior. Recall that the geometric consistency prior minimizes the reprojection error of scene-flow-displaced 3D points w.r.t. the derived 2D optical flow. Suppose $\mathbf{p}_i$ is a 2D pixel position at time $i$. The corresponding 2D pixel location in the neighboring frame at time $j$, displaced through the 2D optical flow $\mathbf{u}_{i \rightarrow j}$, can be computed as $\mathbf{p}_{i \rightarrow j} = \mathbf{p}_i + \mathbf{u}_{i \rightarrow j}$. To estimate the expected 2D point location $\hat{\mathbf{p}}_{i \rightarrow j}$ at time $j$ displaced by the predicted scene flow fields, we first compute the expected scene flow $\hat{\mathcal{F}}_{i \rightarrow j}(\mathbf{r}_i)$ and the expected 3D point location $\hat{\mathbf{X}}_i(\mathbf{r}_i)$ of the ray $\mathbf{r}_i$ through volume rendering:

$$\hat{\mathcal{F}}_{i \rightarrow j}(\mathbf{r}_i) = \int_{t_n}^{t_f} T_i(t)\, \sigma_i(\mathbf{r}_i(t))\, f_{i \rightarrow j}(\mathbf{r}_i(t))\, dt, \tag{F.4}$$
$$\hat{\mathbf{X}}_i(\mathbf{r}_i) = \int_{t_n}^{t_f} T_i(t)\, \sigma_i(\mathbf{r}_i(t))\, \mathbf{x}_i(\mathbf{r}_i(t))\, dt. \tag{F.5}$$

$\hat{\mathbf{p}}_{i \rightarrow j}$ is then computed by perspective projection of the expected 3D point location displaced by the scene flow (i.e., $\hat{\mathbf{X}}_i(\mathbf{r}_i) + \hat{\mathcal{F}}_{i \rightarrow j}(\mathbf{r}_i)$) into the viewpoint corresponding to the frame at time $j$:

$$\hat{\mathbf{p}}_{i \rightarrow j}(\mathbf{r}_i) = \pi\left( K \left( R^j \left( \hat{\mathbf{X}}_i(\mathbf{r}_i) + \hat{\mathcal{F}}_{i \rightarrow j}(\mathbf{r}_i) \right) + \mathbf{t}^j \right) \right), \tag{F.6}$$

where $(R^j, \mathbf{t}^j) \in SE(3)$ is the rigid body transformation that maps 3D points from the world coordinate system to the coordinate system of the frame at time $j$, $K$ is a camera intrinsics matrix shared among all the frames, and $\pi$ is the perspective division operation. The geometric consistency prior is applied by comparing the $\ell_1$ difference between $\hat{\mathbf{p}}_{i \rightarrow j}$ and $\mathbf{p}_{i \rightarrow j}$:

$$L_{geo} = \sum_{\mathbf{r}_i} \sum_{j \in \mathcal{N}(i)} \| \hat{\mathbf{p}}_{i \rightarrow j}(\mathbf{r}_i) - \mathbf{p}_{i \rightarrow j}(\mathbf{r}_i) \|_1. \tag{F.7}$$

Single-view depth prior. The single-view depth prior encourages the expected termination depth $\hat{Z}_i$ computed along each ray to be close to the depth $Z_i$ predicted by a pre-trained single-view depth network [281]. As single-view depth predictions are defined up to an unknown scale and shift, we utilize a robust scale-shift-invariant loss [281]:

$$L_z = \sum_{\mathbf{r}_i} \| \hat{Z}^*_i(\mathbf{r}_i) - Z^*_i(\mathbf{r}_i) \|_1 \tag{F.8}$$

We normalize the depths to have zero translation and unit scale using robust estimators:

$$Z^*(\mathbf{r}_i) = \frac{Z(\mathbf{r}_i) - \mathrm{shift}(Z)}{\mathrm{scale}(Z)}, \quad \mathrm{shift}(Z) = \mathrm{median}(Z), \quad \mathrm{scale}(Z) = \mathrm{mean}(|Z - \mathrm{shift}(Z)|). \tag{F.9}$$

Due to computational limits, we are not able to normalize the entire depth image during training, so we normalize the depth values using the shift and scale estimated from the currently sampled points in each training iteration.
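The robust normalization of Eq. F.9 is simple to implement; below is a minimal PyTorch sketch applied to the depths sampled in one training iteration. The function name and the small epsilon guard are illustrative assumptions.

```python
import torch

def normalize_depth(z):
    """Robust scale-and-shift normalization of Eq. F.9.

    z: 1D tensor of (expected or single-view-predicted) depths at the rays
    sampled in the current training iteration.
    """
    shift = torch.median(z)
    scale = torch.mean(torch.abs(z - shift))
    return (z - shift) / (scale + 1e-8)   # eps guards a degenerate scale
```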
Furthermore, since we reconstruct the entire scene in normalized device coordinate (NDC) space, and the MiDaS model [281] predicts disparity in Euclidean space with an unknown scale and shift, we can use the NDC ray space derivation from NeRF [241] to show that depth in NDC space is equal to negative disparity in Euclidean space up to scale and shift, so our single-view term is implemented as:

$$L_z = \sum_{\mathbf{r}_i} \left\| \hat{Z}^*_i(\mathbf{r}_i) + \left( \frac{1}{Z_i} \right)^{\!*}\!(\mathbf{r}_i) \right\|_1 \tag{F.10}$$

Figure F.1: Space-time view synthesis. We propose a 3D splatting-based approach to perform space-time interpolation at a specified target viewpoint (shown as a green camera) at an intermediate time $i + \delta_i$. Specifically, we sweep a plane over every ray $\mathbf{r}$ emitted from the specified target viewpoint, from front to back. At each sampled step $t$ along the ray, we query the color and density information $(\mathbf{c}, \alpha)$, and the scene flows at times $i$ and $i + 1$. We then displace the 3D points along the ray by the scaled scene flows $\delta_i f_{i \rightarrow i+1}$ and $(1 - \delta_i) f_{i+1 \rightarrow i}$, respectively (left). The 3D displaced points are then splatted from times $i$ and $i + 1$ onto a $(\mathbf{c}, \alpha)$ accumulation buffer at the target viewpoint, and the splats are blended with linear weights $1 - \delta_i$ and $\delta_i$ (middle). The final rendered view is obtained by volume rendering the accumulation buffer (right).

F.3 Space-Time Interpolation Visualization

In Sec. 3.4 of Chapter 7, we propose a splatting-based plane-sweep volume tracing approach to perform space-time interpolation, synthesizing novel views at novel viewpoints and in between input time indices. We show a visual illustration of this in Fig. F.1. In practice, we use the CUDA implementation of average splatting from Niklaus et al. [258] to efficiently perform forward splatting of the 3D points through the scene flow fields.

F.4 Volume Rendering Equation Approximation

Recall that in Sec. 3.3, the combined rendering equation is written as:

$$\hat{C}^{cb}_i(\mathbf{r}_i) = \int_{t_n}^{t_f} T^{cb}_i(t)\, \sigma^{cb}_i(t)\, \mathbf{c}^{cb}_i(t)\, dt, \tag{F.11}$$

where $\sigma^{cb}_i(t)\, \mathbf{c}^{cb}_i(t)$ is a linear combination of the static scene components $\mathbf{c}(\mathbf{r}_i(t), \mathbf{d}_i)\, \sigma(\mathbf{r}_i(t))$ and the dynamic scene components $\mathbf{c}_i(\mathbf{r}_i(t), \mathbf{d}_i)\, \sigma_i(\mathbf{r}_i(t))$, weighted by $v(\mathbf{r}_i(t))$:

$$\sigma^{cb}_i(t)\, \mathbf{c}^{cb}_i(t) = v(t)\, \mathbf{c}(t)\, \sigma(t) + (1 - v(t))\, \mathbf{c}_i(t)\, \sigma_i(t). \tag{F.12}$$

We approximate this combined rendering equation using the same quadrature approximation technique described in prior work [230, 241]. Suppose $\{t_l\}_{l=1}^{L}$ are the points sampled within the near and far bounds, and denote the distance between consecutive sampled points as $\delta_l = t_{l+1} - t_l$; the discrete approximation of Eq. F.11 is then written as:

$$\hat{C}^{cb}_i(\mathbf{r}_i) = \sum_{l=1}^{L} T^{cb}_i(t_l) \left( v(t_l)\, \alpha(\sigma(t_l) \delta_l)\, \mathbf{c}(t_l) + (1 - v(t_l))\, \alpha(\sigma_i(t_l) \delta_l)\, \mathbf{c}_i(t_l) \right),$$
$$\text{where } T^{cb}_i(t_l) = \exp\left( - \sum_{l'=1}^{l-1} \left( v(t_{l'})\, \sigma(t_{l'}) + (1 - v(t_{l'}))\, \sigma_i(t_{l'}) \right) \delta_{l'} \right) \text{ and } \alpha(x) = 1 - \exp(-x). \tag{F.13}$$
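For concreteness, here is a minimal PyTorch sketch of the discrete combined rendering in Eq. F.13 for a batch of rays, assuming the per-sample static and dynamic densities, colors, and blending weights have already been queried from the two MLPs; the function and argument names are illustrative assumptions.

```python
import torch

def render_combined(sigma_st, c_st, sigma_dy, c_dy, v, deltas):
    """Discrete combined volume rendering of Eq. F.13.

    sigma_st, sigma_dy: (R, L) static / dynamic densities along each ray.
    c_st, c_dy:         (R, L, 3) static / dynamic colors.
    v:                  (R, L) blending weights of the static component.
    deltas:             (R, L) distances between consecutive samples.
    """
    alpha_st = 1.0 - torch.exp(-sigma_st * deltas)   # alpha(sigma * delta)
    alpha_dy = 1.0 - torch.exp(-sigma_dy * deltas)
    # Combined transmittance T_i^cb(t_l): exp of minus the cumulative sum
    # of blended densities over the preceding samples l' < l.
    tau = (v * sigma_st + (1.0 - v) * sigma_dy) * deltas
    shifted = torch.cat([torch.zeros_like(tau[:, :1]), tau[:, :-1]], dim=1)
    trans = torch.exp(-torch.cumsum(shifted, dim=1))
    w_st = trans * (v * alpha_st)           # static per-sample weights
    w_dy = trans * ((1.0 - v) * alpha_dy)   # dynamic per-sample weights
    rgb = (w_st[..., None] * c_st + w_dy[..., None] * c_dy).sum(dim=1)
    return rgb
```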
F.5 Network Architecture

Our network architecture is a variant of the original NeRF, which adopts MLPs as a backbone. Our full model consists of two separate MLPs, corresponding to a static (time-independent) scene representation (Fig. F.2) and a dynamic (time-dependent) scene representation (Fig. F.3).

Figure F.2: Network architecture of the static (time-invariant) scene representation. Modified from the original NeRF architecture diagram. We predict an extra blending weight field $v$ from the intermediate features, along with the opacity $\sigma$.

Figure F.3: Network architecture of the dynamic (time-variant) scene representation. Modified from the original NeRF architecture diagram. We encode and input time indices $i$ into the MLP and predict time-dependent scene flow fields $\mathcal{F}_i$ and disocclusion weight fields $\mathcal{W}_i$ from the intermediate features, along with the opacity $\sigma_i$.

F.6 Implementation Details

Initialization. We denote the initialization stage as the first $1000N$ iterations of training, where $N$ is the number of training views. To warm up the optimization, during the initialization stage we compute all the temporal losses only in a temporal window of size 3, i.e., $j \in \{i, i \pm 1\}$, and switch to a temporal window of size 5, i.e., $j \in \mathcal{N}(i) = \{i, i \pm 1, i \pm 2\}$, after the initialization stage. Additionally, as both of the data-driven priors are noisy (in that they rely on inaccurate or incorrect predictions), we use them for initialization only, and linearly decay the weight of $L_{data}$ to zero during training over a fixed number of iterations. In particular, we linearly decrease the weight by a factor of 10 every $1000N$ iterations.

Hard mining sampling. Optionally, to sufficiently initialize the depth and scene flows of small, fast-moving objects such as the limbs of a person, we precompute a coarse binary motion segmentation mask for each frame, and sample an additional 512 points from the motion mask regions during the initialization stage. These additionally sampled points are added to the loss used in our dynamic (time-variant) scene representation. Similar to prior work [379], we compute the above coarse binary motion segmentation using a combination of physical and semantic estimates of rigidity. In particular, the physical mask marks pixels where the distance between the optical flow [146] at each pixel and its corresponding epipolar line from the neighboring frame at time $j \in \mathcal{N}(i)$ is greater than 1 pixel, and the semantic mask is computed using an off-the-shelf instance segmentation network [131] to label all pixels corresponding to possibly moving objects such as people and animals. Finally, we take the union of the two masks, followed by morphological dilation, to obtain the final binary mask. Note that this coarse motion segmentation is mainly used to slightly increase the number of samples for the data-driven priors during initialization, and does not need to be very accurate.

Hyperparameters and evaluation details. We implement our framework using PyTorch. We empirically set $\beta_{cyc} = 1$, $\beta_{data} = 0.04$, and $\beta_{reg} = 0.1$ in our experiments. When we evaluate the baselines, we resize their rendered images to the same size as our rendered images before performing the evaluation. In addition, recall that we evaluate all the methods quantitatively both on entire scenes and in dynamic regions only. In order to accurately determine which regions are moving, we compute ground truth dynamic masks for numerical evaluation (Dynamic Only, described in Chapter 7) from the multi-view videos through optical flow between two consecutive time instances at the same viewpoint, and segment out the dynamic regions where the flow magnitude is larger than one pixel.

BIBLIOGRAPHY

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.

[2] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. Trans. on Pattern Analysis and Machine Intelligence, 34(11), 2012.

[3] Edward H. Adelson and James R. Bergen.
The plenoptic function and the elements of early vision. In Computational Models of Visual Processing, pages 3–20. MIT Press, 1991.

[4] Edward H Adelson, James R Bergen, et al. The plenoptic function and the elements of early vision, volume 2. Vision and Modeling Group, Media Laboratory, Massachusetts Institute of . . . , 1991.

[5] Edward H Adelson and Alex P Pentland. The perception of shading and reflectance. Perception as Bayesian inference, pages 409–423, 1996.

[6] Sadia Afroz and Rachel Greenstadt. Phishzoo: Detecting phishing websites by looking at them. In 2011 IEEE Fifth International Conference on Semantic Computing, pages 368–375. IEEE, 2011.

[7] A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, and H. Wallach. A reductions approach to fair classification. ArXiv, abs/1803.02453, 2018.

[8] Sameer Agarwal, Noah Snavely, Ian Simon, Steven M. Seitz, and Richard Szeliski. Building Rome in a day. In Proc. Int. Conf. on Computer Vision (ICCV), 2009.

[9] Shruti Agarwal, Hany Farid, Tarek El-Gaaly, and Ser-Nam Lim. Detecting deepfake videos from appearance and behavior. In 2020 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–6. IEEE, 2020.

[10] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pages 274–283. PMLR, 2018.

[11] Kumar Ayush, Burak Uzkent, Marshall Burke, David Lobell, and Stefano Ermon. Efficient poverty mapping using deep reinforcement learning. arXiv preprint arXiv:2006.04224, 2020.

[12] Ho Bae, Jaehee Jang, Dahuin Jung, Hyemi Jang, Heonseok Ha, and Sungroh Yoon. Security and privacy issues in deep learning. arXiv preprint arXiv:1807.11655, 2018.

[13] Mohammad Haris Baig and Lorenzo Torresani. Coupled depth learning. In Proc. Winter Conf. on Computer Vision (WACV), 2016.

[14] Aayush Bansal, Minh Vo, Yaser Sheikh, Deva Ramanan, and Srinivasa Narasimhan. 4D visualization of dynamic events from unconstrained multi-view videos. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5366–5375, 2020.

[15] Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In Proc. Computer Vision and Pattern Recognition (CVPR), June 2019.

[16] Solon Barocas and Andrew D Selbst. Big data's disparate impact. Calif. L. Rev., 104:671, 2016.

[17] Jonathan T Barron, Andrew Adams, YiChang Shih, and Carlos Hernández. Fast bilateral-space stereo for synthetic defocus. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 4466–4474, 2015.

[18] Jonathan T Barron and Jitendra Malik. Intrinsic scene properties from a single RGB-D image. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 17–24, 2013.

[19] Jonathan T Barron and Jitendra Malik. Shape, illumination, and reflectance from shading. Trans. on Pattern Analysis and Machine Intelligence, 37(8):1670–1687, 2015.

[20] Jonathan Tilton Barron. Shapes, Paint, and Light. University of California, Berkeley, 2013.

[21] Tali Basha, Shai Avidan, Alexander Hornung, and Wojciech Matusik. Structure and motion from scene registration. In Proc. Computer Vision and Pattern Recognition (CVPR), 2012.

[22] Tali Basha, Yael Moses, and Nahum Kiryati. Multi-view scene flow estimation: A view centered variational approach. Int. J. of Computer Vision, 2013.

[23] David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Antonio Torralba.
[24] Haluk Bayram, Nikolaos Stefas, Kazim Selim Engin, and Volkan Isler. Tracking wildlife with multiple UAVs: System design, safety and field experiments. In 2017 International symposium on multi-robot and multi-agent systems (MRS), pages 97–103. IEEE, 2017.
[25] Shida Beigpour, Marc Serra, Joost van de Weijer, Robert Benavente, María Vanrell, Olivier Penacchio, and Dimitris Samaras. Intrinsic image evaluation on synthetic complex scenes. Int. Conf. on Image Processing, 2013.
[26] Sean Bell, Kavita Bala, and Noah Snavely. Intrinsic images in the wild. ACM Trans. Graphics, 33(4):159, 2014.
[27] R. Bellamy, K. Dey, M. Hind, Samuel C. Hoffman, S. Houde, K. Kannan, Pranay Lohia, J. Martino, Sameep Mehta, A. Mojsilovic, Seema Nagar, K. Ramamurthy, J. Richards, Diptikalyan Saha, P. Sattigeri, M. Singh, Kush R. Varshney, and Y. Zhang. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. ArXiv, abs/1810.01943, 2018.
[28] Mojtaba Bemana, Karol Myszkowski, Hans-Peter Seidel, and Tobias Ritschel. X-Fields: Implicit neural view-, light- and time-image interpolation. ACM Trans. Graphics, 39(6), 2020.
[29] Alex Beutel, J. Chen, T. Doshi, Hai Qian, Allison Woodruff, Christine Luu, Pierre Kreitmann, Jonathan Bischof, and Ed Huai hsin Chi. Putting fairness principles into practice: Challenges, metrics, and improvements. Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019.
[30] David Bholat. The impact of machine learning and AI on the UK economy - conference overview. Available at SSRN 3602563, 2020.
[31] S. Bi, Zexiang Xu, Pratul P. Srinivasan, Ben Mildenhall, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, D. Kriegman, and R. Ramamoorthi. Neural reflectance fields for appearance acquisition. ArXiv, abs/2008.03824, 2020.
[32] Sai Bi, Xiaoguang Han, and Yizhou Yu. An L1 image transform for edge-preserving smoothing and scene-level intrinsic decomposition. ACM Trans. Graph., 34:78:1–78:12, 2015.
[33] Sai Bi, Zexiang Xu, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David Kriegman, and Ravi Ramamoorthi. Deep reflectance volumes: Relightable reconstructions from multi-view photometric images. Proc. European Conf. on Computer Vision (ECCV), 2020.
[34] Sai Bi, Zexiang Xu, Kalyan Sunkavalli, David Kriegman, and Ravi Ramamoorthi. Deep 3d capture: Geometry and reflectance from sparse multi-view images. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5960–5969, 2020.
[35] Michael Bleyer, Christoph Rhemann, and Carsten Rother. Patchmatch stereo - stereo matching with slanted support windows. In Proc. British Machine Vision Conf. (BMVC), 2011.
[36] Anthony E Boardman. Another analysis of the EEOCC "four-fifths" rule. Management Science, 25(8):770–776, 1979.
[37] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter V. Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image. In Proc. European Conf. on Computer Vision (ECCV), pages 561–578, 2016.
[38] Nicolas Bonneel, Balazs Kovacs, Sylvain Paris, and Kavita Bala. Intrinsic decompositions for image editing. Computer Graphics Forum (Eurographics State of the Art Reports 2017), 36(2), 2017.
[39] Ivaylo Boyadzhiev, Sylvain Paris, and Kavita Bala. User-assisted image compositing for photographic lighting. ACM Trans. Graphics, 32:36:1–36:12, 2013.
[40] Aljaz Bozic, Michael Zollhofer, Christian Theobalt, and Matthias Niessner. DeepDeform: Learning non-rigid RGB-D reconstruction with semi-supervised data. In Proc. Computer Vision and Pattern Recognition (CVPR), June 2020.
[41] Ali Breland. The bizarre and terrifying case of the "deepfake" video that helped bring an African nation to the brink. Mother Jones, 2019.
[42] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017.
[43] Fabian Brickwedde, Steffen Abraham, and Rudolf Mester. Mono-SF: Multi-view geometry meets single-view depth for monocular scene flow estimation of dynamic traffic scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2780–2790, 2019.
[44] Nina Brown. Congress wants to solve deepfakes by 2020. https://slate.com/technology/congress-deepfake-regulation-230-2020.html, 2020.
[45] Michael Broxton, Jay Busch, Jason Dourgarian, Matthew DuVall, Daniel Erickson, Dan Evangelakos, John Flynn, Ryan Overbeck, Matt Whalen, and Paul Debevec. A low cost multi-camera array for panoramic light field video capture. In SIGGRAPH Asia 2019 Posters, SA '19, New York, NY, USA, 2019. Association for Computing Machinery.
[46] Michael Broxton, John Flynn, Ryan Overbeck, Daniel Erickson, Peter Hedman, Matthew Duvall, Jason Dourgarian, Jay Busch, Matt Whalen, and Paul Debevec. Immersive light field video with a layered mesh representation. ACM Trans. Graph., 39(4), July 2020.
[47] Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael Cohen. Unstructured lumigraph rendering. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 425–432, 2001.
[48] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pages 77–91. PMLR, 2018.
[49] Marshall Burke, Anne Driscoll, David Lobell, and Stefano Ermon. Using satellite imagery to understand and promote sustainable development. Technical report, National Bureau of Economic Research, 2020.
[50] TJ Burke and S Trazo. Emerging legal issues in an AI-driven world. Gowling WLG, 2018.
[51] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In Proc. European Conf. on Computer Vision (ECCV), 2012.
[52] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In Proc. European Conf. on Computer Vision (ECCV), pages 611–625, 2012.
[53] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE symposium on security and privacy (SP), pages 39–57. IEEE, 2017.
[54] Hervé Chabanne, Amaury de Wargny, Jonathan Milgram, Constance Morel, and Emmanuel Prouff. Privacy-preserving classification on deep neural network. IACR Cryptol. ePrint Arch., 2017:35, 2017.
[55] Jin-Xiang Chai, Xin Tong, Shing-Chow Chan, and Heung-Yeung Shum. Plenoptic sampling. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '00, pages 307–318, USA, 2000. ACM Press/Addison-Wesley Publishing Co.
[56] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. Int. Conf. on 3D Vision (3DV), pages 667–676, 2017.
[57] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
[58] Gaurav Chaurasia, Sylvain Duchene, Olga Sorkine-Hornung, and George Drettakis. Depth synthesis and local warps for plausible image-based navigation. ACM Trans. Graphics, 32(3):1–12, 2013.
[59] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[60] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM workshop on artificial intelligence and security, pages 15–26, 2017.
[61] Qifeng Chen and Vladlen Koltun. A simple model for intrinsic image decomposition with depth cues. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 241–248, 2013.
[62] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1511–1520, 2017.
[63] Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-image depth perception in the wild. In Neural Information Processing Systems, pages 730–738, 2016.
[64] Weifeng Chen, Donglai Xiang, and Jia Deng. Surface normals in the wild. Proc. Int. Conf. on Computer Vision (ICCV), pages 1557–1566, 2017.
[65] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.
[66] Zhang Chen, Anpei Chen, Guli Zhang, Chengyuan Wang, Yu Ji, Kiriakos N Kutulakos, and Jingyi Yu. A neural rendering framework for free-viewpoint relighting. arXiv preprint arXiv:1911.11530, 2019.
[67] Long Cheng, Fang Liu, and Danfeng Yao. Enterprise data breach: causes, challenges, prevention, and future directions. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(5):e1211, 2017.
[68] M Chessen. Encoded laws, policies, and virtues. Cornell Policy Review, 2018.
[69] Inchang Choi, Orazio Gallo, Alejandro Troccoli, Min H Kim, and Jan Kautz. Extreme view synthesis. In Proc. Int. Conf. on Computer Vision (ICCV), pages 7781–7790, 2019.
[70] N Christopher. We've just seen the first use of deepfakes in an Indian election campaign. Vice News, 2020.
[71] Iain M Cockburn, Rebecca Henderson, and Scott Stern. The impact of artificial intelligence on innovation. Technical report, National Bureau of Economic Research, 2018.
[72] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas A Funkhouser, and Matthias Niessner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5828–5839, 2017.
[73] Abe Davis, Marc Levoy, and Frédo Durand. Unstructured light fields. Comput. Graph. Forum, 31:305–314, 2012.
[74] Abe Davis, Michael Rubinstein, Neal Wadhwa, Gautham J Mysore, Fredo Durand, and William T Freeman. The visual microphone: Passive recovery of sound from video. ACM Trans. Graphics (SIGGRAPH), 2014.
[75] Paul E Debevec, Camillo J Taylor, and Jitendra Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 11–20, 1996.
[76] Tali Dekel, Michael Rubinstein, Ce Liu, and William T Freeman. On the effectiveness of visible watermarks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2146–2154, 2017.
[77] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[78] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv: Machine Learning, 2017.
[79] Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip L. Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts, Christoph Rhemann, David Kim, Jonathan Taylor, Pushmeet Kohli, Vladimir Tankovich, and Shahram Izadi. Fusion4D: Real-time performance capture of challenging scenes. ACM Trans. Graphics, 35:114:1–114:13, 2016.
[80] Kevin Drum. The AI revolution is coming—and it will take your job sooner than you think. https://www.motherjones.com/kevin-drum/2017/10/you-will-lose-your-job-to-a-robot-and-sooner-than-you-think-2/, 2018.
[81] Chris Dulhanty and A. Wong. Auditing ImageNet: Towards a model-driven framework for annotating demographic attributes of large-scale image datasets. ArXiv, abs/1905.01347, 2019.
[82] Cynthia Dwork. Differential privacy: A survey of results. In International conference on theory and applications of models of computation, pages 1–19. Springer, 2008.
[83] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226, 2012.
[84] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2650–2658, 2015.
[85] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems, pages 2366–2374, 2014.
[86] Selim Engin and Volkan Isler. Active localization of multiple targets from noisy relative measurements. In Algorithmic Foundations of Robotics XIV: Proceedings of the Fourteenth Workshop on the Algorithmic Foundations of Robotics 14, pages 398–413. Springer International Publishing, 2021.
[87] D. Erhan, Yoshua Bengio, Aaron C. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. Technical report, Université de Montréal, 2009.
[88] Stefano Ermon, Jon Conrad, Carla Gomes, and Bart Selman. Risk-sensitive policies for sustainable renewable resource allocation. In Twenty-Second International Joint Conference on Artificial Intelligence. Citeseer, 2011.
[89] SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta Garnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene representation and rendering. Science, 360(6394):1204–1210, 2018.
[90] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, Noah Snavely, and Richard Tucker. DeepView: View synthesis with learned gradient descent. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2367–2376, 2019.
[91] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. DeepStereo: Learning to predict new views from the world's imagery. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5515–5524, 2016.
[92] Jan-Michael Frahm, Pierre Fite Georgel, David Gallup, Tim Johnson, Rahul Raguram, Changchang Wu, Yi-Hung Jen, Enrique Dunn, Brian Clipp, and Svetlana Lazebnik. Building Rome on a cloudless day. In Proc. European Conf. on Computer Vision (ECCV), 2010.
[93] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1322–1333, 2015.
[94] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[95] Yasutaka Furukawa, Brian Curless, Steven M Seitz, and Richard Szeliski. Towards internet-scale multi-view stereo. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1434–1441, 2010.
[96] A Gaidon, Q Wang, Y Cabon, and E Vig. Virtual worlds as proxy for multi-object tracking analysis. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 4340–4349, 2016.
[97] Pratik Gajane. On formalizing fairness in prediction with machine learning. ArXiv, abs/1710.03184, 2017.
[98] Pratik Gajane and Mykola Pechenizkiy. On formalizing fairness in prediction with machine learning. arXiv preprint arXiv:1710.03184, 2017.
[99] Elena Garces, Adolfo Munoz, Jorge Lopez-Moreno, and Diego Gutierrez. Intrinsic images by clustering. Computer Graphics Forum (Proc. EGSR 2012), 31(4), 2012.
[100] Rahul Garg, Hao Du, Steven M Seitz, and Noah Snavely. The dimensionality of scene appearance. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1917–1924. IEEE, 2009.
[101] Ravi Garg, Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proc. European Conf. on Computer Vision (ECCV), pages 740–756, 2016.
[102] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2414–2423, 2016.
[103] Timnit Gebru, Jonathan Krause, Yilun Wang, Duyun Chen, Jia Deng, Erez Lieberman Aiden, and Li Fei-Fei. Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States. Proceedings of the National Academy of Sciences, 114(50):13108–13113, 2017.
[104] Andreas Geiger. Are we ready for autonomous driving? The KITTI Vision Benchmark Suite. In Proc. Computer Vision and Pattern Recognition (CVPR), 2012.
[105] Marcel Geppert, Viktor Larsson, Pablo Speciale, Johannes L Schönberger, and Marc Pollefeys. Privacy preserving structure-from-motion. In Proc. European Conf. on Computer Vision (ECCV), pages 333–350. Springer, 2020.
[106] Gabriel Ghinita. Understanding the privacy-efficiency trade-off in location based queries. In Proceedings of the SIGSPATIAL ACM GIS 2008 International Workshop on Security and Privacy in GIS and LBS, pages 1–5, 2008.
[107] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. In Proc. Int. Conf. on Machine Learning, pages 201–210. PMLR, 2016.
[108] Yotam I. Gingold, Ariel Shamir, and Daniel Cohen-Or. Micro perceptual human computation for visual tasks. ACM Trans. Graphics, 2012.
[109] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[110] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[111] Michael Goesele, Noah Snavely, Brian Curless, Hugues Hoppe, and Steven M Seitz. Multi-view stereo for community photo collections. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1–8, 2007.
[112] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Neural Information Processing Systems, pages 2672–2680, 2014.
[113] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[114] Steven J Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F Cohen. The lumigraph. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 43–54, 1996.
[115] Roger Grosse, Micah K Johnson, Edward H Adelson, and William T Freeman. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In Proc. Int. Conf. on Computer Vision (ICCV), 2009.
[116] Riza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense Human Pose Estimation In The Wild. Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[117] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Neural Information Processing Systems, pages 5767–5777, 2017.
[118] Chuan Guo, Jacob Gardner, Yurong You, Andrew Gordon Wilson, and Kilian Weinberger. Simple black-box adversarial attacks. In International Conference on Machine Learning, pages 2484–2493. PMLR, 2019.
[119] Trung Ha, Tran Khanh Dang, Hieu Le, and Tuan Anh Truong. Security and privacy issues in deep learning: a brief review. SN Computer Science, 1(5):1–15, 2020.
[120] Marc Habermann, Weipeng Xu, Michael Zollhoefer, Gerard Pons-Moll, and Christian Theobalt. LiveCap: Real-time human performance capture from monocular video. ACM Transactions on Graphics (TOG), 38(2):14, 2019.
[121] Nicolai Häni, Pravakar Roy, and Volkan Isler. Apple counting using convolutional neural networks. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2559–2565. IEEE, 2018.
[122] Mark Hansen, Meritxell Roca-Sales, Jonathan M Keegan, and George King. Artificial intelligence: Practice and implications for journalism. Tow Center for Digital Journalism, Columbia University, 2017.
[123] Moritz Hardt, E. Price, and Nathan Srebro. Equality of opportunity in supervised learning. In NIPS, 2016.
[124] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. arXiv preprint arXiv:1610.02413, 2016.
[125] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003.
[126] Woodrow Hartzog and Evan Selinger. Facial recognition is the perfect tool for oppression. Medium, 2018.
[127] D Harvey. Help us shape our approach to synthetic and manipulated media. Twitter, 2019.
[128] Daniel Hauagge, Scott Wehrwein, Kavita Bala, and Noah Snavely. Photometric ambient occlusion. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2515–2522, 2013.
[129] Daniel Cabrini Hauagge, Scott Wehrwein, Paul Upchurch, Kavita Bala, and Noah Snavely. Reasoning about photo collections using models of outdoor illumination. In Proc. British Machine Vision Conf. (BMVC), 2014.
[130] James Hays and Alexei A Efros. Scene completion using millions of photographs. ACM Trans. Graphics, 26(3):4–es, 2007.
[131] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2961–2969, 2017.
[132] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proc. Int. Conf. on Computer Vision (ICCV), 2015.
[133] Peter Hedman, Suhib Alsisan, Richard Szeliski, and Johannes Kopf. Casual 3d photography. ACM Trans. Graphics, 36:234:1–234:15, 2017.
[134] Peter Hedman, Julien Philip, True Price, Jan-Michael Frahm, George Drettakis, and Gabriel Brostow. Deep blending for free-viewpoint image-based rendering. ACM Trans. Graphics, 37(6):1–15, 2018.
[135] Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In Proceedings of the European Conference on Computer Vision (ECCV), pages 771–787, 2018.
[136] Ehsan Hesamifard, Hassan Takabi, and Mehdi Ghasemi. CryptoDL: Deep neural networks over encrypted data. arXiv preprint arXiv:1711.05189, 2017.
[137] Joel A Hesch and Stergios I Roumeliotis. An indoor localization aid for the visually impaired. In Proceedings 2007 IEEE International Conference on Robotics and Automation, pages 3545–3551. IEEE, 2007.
[138] Joel A Hesch and Stergios I Roumeliotis. Design and analysis of a portable indoor localization aid for the visually impaired. The International Journal of Robotics Research, 29(11):1400–1415, 2010.
[139] Derek Hoiem, Alexei A Efros, and Martial Hebert. Geometric context from a single image. In Proc. Int. Conf. on Computer Vision (ICCV), volume 1, pages 654–661, 2005.
[140] Ian P Howard. Seeing in depth, Vol. 1: Basic mechanisms. University of Toronto Press, 2002.
[141] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. DeepMVS: Learning multi-view stereopsis. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[142] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1501–1510, 2017.
[143] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proc. European Conf. on Computer Vision (ECCV), pages 172–189, 2018.
[144] Minyoung Huh, Andrew Liu, Andrew Owens, and Alexei A Efros. Fighting fake news: Image splice detection via learned self-consistency. In Proceedings of the European Conference on Computer Vision (ECCV), pages 101–117, 2018.
[145] Junhwa Hur and Stefan Roth. Self-supervised monocular scene flow estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 7396–7405, 2020.
[146] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2462–2470, 2017.
[147] Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning, pages 2137–2146. PMLR, 2018.
[148] Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Prior convictions: Black-box adversarial attacks with bandits and priors. arXiv preprint arXiv:1807.07978, 2018.
[149] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175, 2019.
[150] Matthias Innmann, Michael Zollhöfer, Matthias Niessner, Christian Theobalt, and Marc Stamminger. VolumeDeform: Real-time volumetric non-rigid reconstruction. In Proc. European Conf. on Computer Vision (ECCV), 2016.
[151] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. Int. Conf. on Machine Learning, pages 448–456, 2015.
[152] Michal Irani and Prabu Anandan. Parallax geometry of pairs of points for 3d scene analysis. In Proc. European Conf. on Computer Vision (ECCV), pages 17–30, 1996.
[153] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1125–1134, 2017.
[154] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1125–1134, 2017.
[155] Nathan Jacobs, Nathaniel Roman, and Robert Pless. Consistent temporal variations in many outdoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1–6, 2007.
[156] Wenzel Jakob. Mitsuba renderer, 2010. http://www.mitsuba-renderer.org.
[157] Michael Janner, Jiajun Wu, Tejas Kulkarni, Ilker Yildirim, and Joshua B Tenenbaum. Self-Supervised Intrinsic Image Decomposition. In Neural Information Processing Systems, 2017.
[158] Junho Jeon, Sunghyun Cho, Xin Tong, and Seungyong Lee. Intrinsic image decomposition using structure-texture separation and surface normals. In Proc. European Conf. on Computer Vision (ECCV), 2014.
[159] Hanqing Jiang, Haomin Liu, Ping Tan, Guofeng Zhang, and Hujun Bao. 3d reconstruction of dynamic scenes with multiple handheld cameras. In Proc. European Conf. on Computer Vision (ECCV), 2012.
[160] Huaizu Jiang, Deqing Sun, Varun Jampani, Zhaoyang Lv, Erik Learned-Miller, and Jan Kautz. SENSE: A shared encoder network for scene-flow estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 3195–3204, 2019.
[161] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik G. Learned-Miller, and Jan Kautz. Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 9000–9008, 2018.
[162] Nicola Jones. How to stop data centres from gobbling up the world's electricity. Nature, 561(7722):163–167, 2018.
[163] James Jordon, Jinsung Yoon, and Mihaela Van Der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. In International Conference on Learning Representations, 2018.
[164] Nima Khademi Kalantari, Ting-Chun Wang, and Ravi Ramamoorthi. Learning-based view synthesis for light field cameras. ACM Trans. Graphics, 35(6):1–10, 2016.
[165] F. Kamiran and T. Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33:1–33, 2011.
[166] Toshihiro Kamishima, S. Akaho, Hideki Asoh, and J. Sakuma. Fairness-aware classifier with prejudice remover regularizer. In ECML/PKDD, 2012.
[167] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[168] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[169] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 4401–4410, 2019.
[170] Kevin Karsch, Ce Liu, and Sing Bing Kang. Depth extraction from video using non-parametric sampling. In Proc. European Conf. on Computer Vision (ECCV), pages 775–788, 2012.
[171] Kaspersky. Deepfake and fake videos - how to protect yourself? https://usa.kaspersky.com/resource-center/threats, 2013.
[172] Steven M Kay. Fundamentals of statistical signal processing. Prentice Hall PTR, 1993.
[173] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017.
[174] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491, 2018.
[175] Iphigenia Keramitsoglou, Constantinos Cartalis, and Chris T Kiranoudis. Automatic identification of oil spills on satellite images. Environmental modelling & software, 21(5):640–652, 2006.
[176] Seungryong Kim, Kihong Park, Kwanghoon Sohn, and Stephen Lin. Unified depth prediction and intrinsic image decomposition from a single image via joint convolutional neural fields. In Proc. European Conf. on Computer Vision (ECCV), pages 143–159, 2016.
[177] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[178] Svetlana Kiritchenko and Saif M. Mohammad. Examining gender and race bias in two hundred sentiment analysis systems. In *SEM@NAACL-HLT, 2018.
[179] Felix Klose, Oliver Wang, Jean-Charles Bazin, Marcus Magnor, and Alexander Sorkine-Hornung. Sampling based scene-space video processing. ACM Transactions on Graphics (TOG), 34(4):67, 2015.
[180] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Trans. Graphics, 36(4), 2017.
[181] Philip A Knight, Daniel Ruiz, and Bora Uçar. A symmetry preserving algorithm for matrix scaling. SIAM Journal on Matrix Analysis and Applications, 35(3):931–955, 2014.
[182] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, et al. Spectre attacks: Exploiting speculative execution. In 2019 IEEE Symposium on Security and Privacy (SP), pages 1–19. IEEE, 2019.
[183] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pages 1885–1894. PMLR, 2017.
[184] Kirsten Korosec. 'Deepfake' revenge porn is now illegal in Virginia. https://techcrunch.com/2019/07/01/deepfake-revenge-porn-is-now-illegal-in-virginia/, 2019.
[185] Balazs Kovacs, Sean Bell, Noah Snavely, and Kavita Bala. Shading annotations in the wild. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 850–859, 2017.
[186] Suryansh Kumar, Yuchao Dai, and Hongdong Li. Monocular dense 3d reconstruction of a complex dynamic scene from two perspective frames. Proc. Int. Conf. on Computer Vision (ICCV), pages 4659–4667, 2017.
[187] Pierre-Yves Laffont and Jean-Charles Bazin. Intrinsic decomposition of image sequences from local temporal variations. In Proc. Int. Conf. on Computer Vision (ICCV), pages 433–441, 2015.
[188] Pierre-Yves Laffont, Adrien Bousseau, Sylvain Paris, Frédo Durand, and George Drettakis. Coherent intrinsic images from photo collections. In ACM Trans. Graphics (SIGGRAPH), 2012.
[189] Pierre-Yves Laffont, Adrien Bousseau, Sylvain Paris, Frédo Durand, and George Drettakis. Coherent intrinsic images from photo collections. ACM Trans. Graphics, 31:202:1–202:11, 2012.
[190] Preethi Lahoti, Krishna P Gummadi, and Gerhard Weikum. iFair: Learning individually fair data representations for algorithmic decision making. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1334–1345. IEEE, 2019.
[191] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In Int. Conf. on 3D Vision (3DV), pages 239–248, 2016.
[192] Edwin H Land and John J McCann. Lightness and retinex theory. JOSA, 61(1):1–11, 1971.
[193] Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J. Black, and Peter V. Gehler. Unite the people: Closing the loop between 3D and 2D human representations. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[194] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 4681–4690, 2017.
[195] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In Proc. European Conf. on Computer Vision (ECCV), pages 35–51, 2018.
[196] Jihyeon Janel Lee, Dylan Grosz, Sicheng Zheng, Burak Uzkent, Marshall Burke, David Lobell, and Stefano Ermon. Predicting livelihood indicators from crowdsourced street level images. arXiv preprint arXiv:2006.08661, 2020.
[197] K Lee and Paul Triolo. China's artificial intelligence revolution: Understanding Beijing's structural advantages. China Embraces AI, Eurasia Group, 2017.
[198] Anat Levin and Frédo Durand. Linear view synthesis using a dimensionality gap light field prior. Proc. Computer Vision and Pattern Recognition (CVPR), pages 1831–1838, 2010.
[199] Marc Levoy and Pat Hanrahan. Light field rendering. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 31–42, 1996.
[200] Bo Li, Chunhua Shen, Yuchao Dai, Anton van den Hengel, and Mingyi He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1119–1127, 2015.
[201] Yunpeng Li, Noah Snavely, Daniel P. Huttenlocher, and Pascal Fua. Worldwide pose estimation using 3D point clouds. In Large-Scale Visual Geo-Localization, pages 147–163. Springer, 2016.
[202] Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, and Noah Snavely. MannequinChallenge Dataset. https://google.github.io/mannequinchallenge/, 2019.
[203] Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, and William T Freeman. Learning the depths of moving people by watching frozen people. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 4521–4530, 2019.
[204] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. arXiv preprint arXiv:2011.13084, 2020.
[205] Zhengqi Li and Noah Snavely. Learning intrinsic image decomposition from watching the world. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[206] Zhengqi Li and Noah Snavely. Learning intrinsic image decomposition from watching the world. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[207] Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2041–2050, 2018.
[208] Zhengqi Li, Wenqi Xian, Abe Davis, and Noah Snavely. Crowdsampling the plenoptic function. In Proc. European Conf. on Computer Vision (ECCV), 2020.
[209] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Interactive image segmentation with latent diversity. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 577–585, 2018.
[210] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[211] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, et al. Meltdown: Reading kernel memory from user space. In 27th USENIX Security Symposium (USENIX Security 18), pages 973–990, 2018.
[212] Zachary Chase Lipton. The mythos of model interpretability. Queue, 16:31–57, 2018.
[213] Fayao Liu, Chunhua Shen, and Guosheng Lin. Deep convolutional neural fields for depth estimation from a single image. In Proc. Computer Vision and Pattern Recognition (CVPR), 2015.
[214] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields. Trans. on Pattern Analysis and Machine Intelligence, 38:2024–2039, 2016.
[215] Jian Liu, Mika Juuti, Yao Lu, and Nadarajah Asokan. Oblivious neural network predictions via MiniONN transformations. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 619–631, 2017.
[216] Miaomiao Liu, Mathieu Salzmann, and Xuming He. Discrete-continuous depth estimation from a single image. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 716–723, 2014.
[217] Ximeng Liu, Lehui Xie, Yaopeng Wang, Jian Zou, Jinbo Xiong, Zuobin Ying, and Athanasios V Vasilakos. Privacy and security issues in deep learning: A survey. IEEE Access, 2020.
[218] Yingqi Liu, Shiqing Ma, Yousra Aafer, W. Lee, Juan Zhai, Weihang Wang, and X. Zhang. Trojaning attack on neural networks. In NDSS, 2018.
[219] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Trans. Graphics, 38(4):65, 2019.
[220] Yunhui Long, Vincent Bindschaedler, and Carl A Gunter. Towards measuring membership privacy. arXiv preprint arXiv:1712.09136, 2017.
[221] Erika Lu, F. Cole, Tali Dekel, Weidi Xie, Andrew Zisserman, D. Salesin, W. Freeman, and M. Rubinstein. Layered neural rendering for retiming people in video. ACM Trans. Graphics (SIGGRAPH Asia), 2020.
[222] Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ACM Trans. Graph., 39(4), July 2020.
[223] Zhaoyang Lv, Kihwan Kim, Alejandro Troccoli, Deqing Sun, James M Rehg, and Jan Kautz. Learning rigidity in dynamic scenes with a moving camera for 3d motion field estimation. In Proc. European Conf. on Computer Vision (ECCV), pages 468–484, 2018.
[224] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[225] Aravindh Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5188–5196, 2015.
[226] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[227] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2794–2802, 2017.
[228] Ricardo Martin-Brualla, David Gallup, and Steven M Seitz. 3d time-lapse reconstruction from internet photos. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1332–1340, 2015.
[229] Ricardo Martin-Brualla, David Gallup, and Steven M Seitz. Time-lapse mining from internet photos. ACM Trans. Graphics, 34(4):1–8, 2015.
[230] Ricardo Martin-Brualla, N. Radwan, Mehdi S. M. Sajjadi, J. Barron, A. Dosovitskiy, and Daniel Duckworth. NeRF in the wild: Neural radiance fields for unconstrained photo collections. ArXiv, abs/2008.02268, 2020.
[231] Yasuyuki Matsushita, Stephen Lin, Sing Bing Kang, and Heung-Yeung Shum. Estimating intrinsic images from image sequences with biased illumination. In Proc. European Conf. on Computer Vision (ECCV), pages 274–286, 2004.
[232] Kevin Matzen and Noah Snavely. Scene chronology. In Proc. European Conf. on Computer Vision (ECCV), pages 615–630. Springer, 2014.
[233] Oier Mees, Andreas Eitel, and Wolfram Burgard. Choosing Smartly: Adaptive Multimodal Fusion for Object Detection in Changing Environments. In Int. Conf. on Intelligent Robots and Systems (IROS), 2016.
[234] Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera. ACM Trans. Graphics, 36:44:1–44:14, 2017.
[235] Simon Meister, Junhwa Hur, and Stefan Roth. UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. arXiv preprint arXiv:1711.07837, 2017.
[236] Dongyu Meng and Hao Chen. MagNet: A two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, pages 135–147, 2017.
[237] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proc. Computer Vision and Pattern Recognition (CVPR), 2015.
[238] Moustafa Meshry, Dan B. Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey, Noah Snavely, and Ricardo Martin-Brualla. Neural rerendering in the wild. Proc. Computer Vision and Pattern Recognition (CVPR), pages 6871–6880, 2019.
[239] Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017.
[240] Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Trans. Graphics, 38(4):1–14, 2019.
[241] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Proc. European Conf. on Computer Vision (ECCV), 2020.
[242] Himangi Mittal, Brian Okorn, and David Held. Just go with the flow: Self-supervised scene flow estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 11177–11185, 2020.
[243] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
[244] Abdolreza Mohamadi, Zahedeh Heidarizadi, Hadi Nourollahi, et al. Assessing the desertification trend using neural network classification and object-oriented techniques. J. Fac. Istanb. Univ, 66:683–690, 2016.
[245] Sharada P Mohanty, David P Hughes, and Marcel Salathé. Using deep learning for image-based plant disease detection. Frontiers in plant science, 7:1419, 2016.
[246] Payman Mohassel and Yupeng Zhang. SecureML: A system for scalable privacy-preserving machine learning. In 2017 IEEE Symposium on Security and Privacy (SP), pages 19–38. IEEE, 2017.
[247] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1765–1773, 2017.
[248] Paul Mozur. Inside China's dystopian dreams: AI, shame and lots of cameras. The New York Times, 8:2018, 2018.
[249] Luis Muñoz-González, Battista Biggio, Ambra Demontis, Andrea Paudice, Vasin Wongrassamee, Emil C Lupu, and Fabio Roli. Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 27–38, 2017.
[250] Luis Muñoz-González, Bjarne Pfitzner, Matteo Russo, Javier Carnerero-Cano, and Emil C Lupu. Poisoning attacks with generative adversarial nets. arXiv preprint arXiv:1906.07773, 2019.
[251] Raul Mur-Artal and Juan D Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
[252] Takuya Narihira, Michael Maire, and Stella X Yu. Direct intrinsics: Learning albedo-shading decomposition by convolutional regression. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2992–2992, 2015.
[253] Takuya Narihira, Michael Maire, and Stella X Yu. Learning lightness from human judgement on relative reflectance. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2965–2973, 2015.
[254] Thomas Nestmeyer and Peter V Gehler. Reflectance adaptive filtering improves intrinsic image estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[255] Richard A Newcombe, Dieter Fox, and Steven M Seitz. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proc. Computer Vision and Pattern Recognition (CVPR), 2015.
[256] Bingbing Ni, Gang Wang, and Pierre Moulin. RGBD-HuDaAct: A color-depth video database for human daily activity recognition. In Proc. ICCV Workshops, 2011.
[257] Simon Niklaus and Feng Liu. Context-aware synthesis for video frame interpolation. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1701–1710, 2018.
[258] Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5436–5445, 2020.
[259] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive convolution. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2270–2279, 2017.
[260] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In Proc. Int. Conf. on Computer Vision (ICCV), pages 261–270, 2017.
[261] Simon Niklaus, Long Mai, and Oliver Wang. Revisiting adaptive convolutions for video frame interpolation. arXiv preprint arXiv:2011.01280, 2020.
[262] Simon Niklaus, Long Mai, Jimei Yang, and F. Liu. 3d Ken Burns effect from a single image. ACM Trans. Graphics, 38:1–15, 2019.
[263] David Novotny, Diane Larlus, and Andrea Vedaldi. Learning 3d object categories by looking around them. Proc. Int. Conf. on Computer Vision (ICCV), pages 5218–5227, 2017.
[264] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. https://distill.pub/2017/feature-visualization.
[265] Tianyu Pang, Chao Du, Yinpeng Dong, and Jun Zhu. Towards robust detection of adversarial examples. arXiv preprint arXiv:1706.00633, 2017.
[266] Nicolas Papernot, Martín Abadi, Ulfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755, 2016.
[267] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.
[268] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European symposium on security and privacy (EuroS&P), pages 372–387. IEEE, 2016.
[269] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE symposium on security and privacy (SP), pages 582–597. IEEE, 2016.
[270] Hyun Soo Park, Takaaki Shiratori, Iain Matthews, and Yaser Sheikh. 3d reconstruction of a moving point from a series of 2d projections. In Proc. European Conf. on Computer Vision (ECCV), pages 158–171. Springer, 2010.
[271] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2337–2346, 2019.
[272] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
[273] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, and Kostas Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. Proc. Computer Vision and Pattern Recognition (CVPR), pages 7025–7034, 2017.
[274] Eric Penner and Li Zhang. Soft 3d reconstruction for view synthesis. ACM Trans. Graphics, 36(6):1–11, 2017.
[275] Georgios Petropoulos. The impact of artificial intelligence on employment. Praise for Work in the Digital Age, 119, 2018.
[276] NhatHai Phan, Yue Wang, Xintao Wu, and Dejing Dou. Differential privacy preservation for deep auto-encoders: an application of human behavior prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[277] Julien Philip, Michaël Gharbi, Tinghui Zhou, Alexei A Efros, and George Drettakis. Multi-view relighting using a geometry-aware network. ACM Trans. Graphics, 38(4):1–14, 2019.
[278] Francesco Pittaluga, Sanjeev J Koppal, Sing Bing Kang, and Sudipta N Sinha. Revealing scenes by inverting structure from motion reconstructions. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 145–154, 2019.
[279] Chongli Qin, James Martens, Sven Gowal, Dilip Krishnan, Krishnamurthy Dvijotham, Alhussein Fawzi, Soham De, Robert Stanforth, and Pushmeet Kohli. Adversarial robustness through local linearization. arXiv preprint arXiv:1907.02610, 2019.
[280] Ravi Ramamoorthi and Pat Hanrahan. A signal-processing framework for inverse rendering. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 117–128, 2001.
[281] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. Trans. on Pattern Analysis and Machine Intelligence, 2020.
[282] Rene Ranftl, Vibhav Vineet, Qifeng Chen, and Vladlen Koltun. Dense monocular depth estimation in complex dynamic scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
[283] Erik Reinhard, Michael Stark, Peter Shirley, and James Ferwerda. Photographic tone reproduction for digital images. In ACM Trans. Graphics (SIGGRAPH), 2002.
[284] Konstantinos Rematas, Ira Kemelmacher-Shlizerman, Brian Curless, and Steve Seitz. Soccer on your tabletop. In Proc. Computer Vision and Pattern Recognition (CVPR), June 2018.
[285] Christian Richardt, Hyeongwoo Kim, Levi Valgaerts, and Christian Theobalt. Dense wide-baseline scene flow from two handheld video cameras. In 2016 Fourth International Conference on 3D Vision (3DV), pages 276–285. IEEE, 2016.
[286] Stephan R Richter, Zeeshan Hayder, and Vladlen Koltun. Playing for benchmarks. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2232–2241, 2017.
[287] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In Proc. European Conf. on Computer Vision (ECCV), pages 102–118, 2016.
[288] Mark O Riedl. Human-centered artificial intelligence and machine learning. Human Behavior and Emerging Technologies, 1(1):33–36, 2019.
[289] Gernot Riegler and V. Koltun. Free view synthesis. Proc. European Conf. on Computer Vision (ECCV), 2020.
[290] Greg Robinson. Are we prepared for the rise of automation? https://info.aiim.org/aiim-blog/are-we-prepared-for-the-rise-of-automation, 2018.
[291] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Int. Conf. on Medical Image Computing and Computer-Assisted Intervention, pages 234–241, 2015.
[292] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. The SYNTHIA Dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 3234–3243, 2016.
[293] A. Ross, M. Hughes, and Finale Doshi-Velez. Right for the right reasons: Training differentiable models by constraining their explanations. In IJCAI, 2017.
[294] Andrew Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[295] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11, 2019.
[296] Carsten Rother, Martin Kiefel, Lumin Zhang, Bernhard Schölkopf, and Peter V Gehler. Recovering intrinsic images with a global sparsity prior on reflectance. In Neural Information Processing Systems, pages 765–773, 2011.
[297] Bita Darvish Rouhani, M Sadegh Riazi, and Farinaz Koushanfar. DeepSecure: Scalable provably-secure deep learning. In Proceedings of the 55th Annual Design Automation Conference, pages 1–6, 2018.
[298] Anirban Roy and Sinisa Todorovic. Monocular depth estimation using neural regression forest. In Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
[299] Chris Russell, Rui Yu, and Lourdes Agapito. Video pop-up: Monocular 3d reconstruction of dynamic scenes. In Proc. European Conf. on Computer Vision (ECCV), pages 583–598. Springer, 2014.
[300] Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. Scribbler: Controlling deep image synthesis with sketch and color. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5400–5409, 2017.
[301] Ashutosh Saxena, Sung H Chung, and Andrew Y Ng. Learning depth from single monocular images. In Neural Information Processing Systems, volume 18, pages 1–8, 2005.
[302] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3D: Learning 3D scene structure from a single still image. Trans. on Pattern Analysis and Machine Intelligence, 31(5), 2009.
[303] Carolin Schmitt, Simon Donné, Gernot Riegler, Vladlen Koltun, and Andreas Geiger. On joint estimation of pose, geometry and SVBRDF from a handheld scanner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3493–3503, 2020.
[304] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016.
[305] Johannes L. Schönberger, Filip Radenovic, Ondrej Chum, and Jan-Michael Frahm. From single image query to detailed 3D reconstruction. In Proc. Computer Vision and Pattern Recognition (CVPR), 2015.
[306] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In Proc. European Conf. on Computer Vision (ECCV), pages 501–518, 2016.
[307] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
[308] Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. Poison frogs! Targeted clean-label poisoning attacks on neural networks. arXiv preprint arXiv:1804.00792, 2018.
[309] Qi Shan, Riley Adams, Brian Curless, Yasutaka Furukawa, and Steven M. Seitz. The visual Turing test for scene reconstruction. Int. Conf. on 3D Vision (3DV), pages 25–32, 2013.
[310] Li Shen, Ping Tan, and Stephen Lin. Intrinsic image decomposition with non-local texture cues. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1–7, 2008.
[311] Li Shen and Chuohao Yeo. Intrinsic images decomposition using a local and global sparse representation of reflectance. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 697–704, 2011.
[312] Lu Sheng, Ziyi Lin, Jing Shao, and Xiaogang Wang. Avatar-Net: Multi-scale zero-shot style transfer by feature decoration. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2018.
[313] Jian Shi, Yue Dong, Hao Su, and Stella X Yu. Learning non-Lambertian object intrinsics across ShapeNet categories. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5844–5853, 2017.
[314] Lixin Shi, Haitham Hassanieh, Abe Davis, Dina Katabi, and Frédo Durand. Light field reconstruction using sparsity in the continuous Fourier domain. ACM Trans. Graphics, 34:12:1–12:13, 2014.
[315] Lixin Shi, Haitham Hassanieh, Abe Davis, Dina Katabi, and Fredo Durand. Light field reconstruction using sparsity in the continuous Fourier domain. ACM Trans. Graphics, 34(1), December 2015.
[316] Meng-Li Shih, Shih-Yang Su, Johannes Kopf, and Jia-Bin Huang. 3d photography using context-aware layered depth inpainting. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 8028–8038, 2020.
[317] Yichang Shih, Sylvain Paris, Frédo Durand, and William T Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG), 32(6):200, 2013.
[318] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE, 2017.
[319] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from Simulated and Unsupervised Images through Adversarial Training. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[320] Zhixin Shu, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman, and Dimitris Samaras. Neural face editing with intrinsic image disentangling. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5444–5453, 2017.
[321] Nathan Silberman and Rob Fergus. Indoor scene segmentation using a structured light sensor. In ICCV Workshops, 2011.
[322] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Proc. European Conf. on Computer Vision (ECCV), pages 746–760, 2012.
[323] Ian Simon, Noah Snavely, and Steven M Seitz. Scene summarization for online image collections. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1–8. IEEE, 2007.
[324] Tomas Simon, Jack Valmadre, Iain A. Matthews, and Yaser Sheikh. Kronecker-Markov Prior for Dynamic 3D Reconstruction. Trans. on Pattern Analysis and Machine Intelligence, 39:2201–2214, 2017.
[325] K. Simonyan, A. Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2014.
[326] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. DeepVoxels: Learning persistent 3d feature embeddings. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2437–2446, 2019.
[327] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In Neural Information Processing Systems, pages 1119–1130, 2019.
[328] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3D. In ACM Trans. Graphics (SIGGRAPH), 2006.
[329] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3D. In ACM Trans. Graphics (SIGGRAPH), 2006.
[330] Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[331] Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 190–198, 2017.
[332] Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766, 2017.
[333] Pablo Speciale, Johannes L. Schönberger, Sudipta N. Sinha, and Marc Pollefeys. Privacy preserving image queries for camera localization. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1486–1496, 2019.
[334] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
[335] Pratul P. Srinivasan, Richard Tucker, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng, and Noah Snavely. Pushing the boundaries of view extrapolation with multiplane images. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 175–184, 2019.
[336] Pratul P. Srinivasan, Tongzhou Wang, Ashwin Sreelal, Ravi Ramamoorthi, and Ren Ng. Learning to synthesize a 4D RGBD light field from a single image. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2243–2251, 2017.
[337] Mirjana Stankovic, Ravi Gupta, R. B. Andre, G. Myers, and Marco Nicoli. Exploring legal, ethical and policy implications of artificial intelligence. White paper of the Global Forum on Law, Justice and Development, 2017.
[338] Timo Stich, Christian Linz, Georgia Albuquerque, and Marcus Magnor. View and time interpolation in image space. In Computer Graphics Forum, volume 27, pages 1781–1787. Wiley Online Library, 2008.
[339] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 573–580, 2012.
[340] Deqing Sun, Erik B. Sudderth, and Hanspeter Pfister. Layered RGBD scene flow estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 548–556, 2015.
[341] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328. PMLR, 2017.
[342] Kalyan Sunkavalli, Wojciech Matusik, Hanspeter Pfister, and Szymon Rusinkiewicz. Factored time-lapse video. In ACM Transactions on Graphics (TOG), volume 26, page 101. ACM, 2007.
[343] Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics (ToG), 36(4):1–13, 2017.
[344] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
[345] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[346] Richard Szeliski and Polina Golland. Stereo matching with transparency and matting. Int. J. of Computer Vision, 1998.
[347] Dean Takahashi. How Pixar made Monsters University, its latest technological marvel. https://venturebeat.com/2013/04/24/the-making-of-pixars-latest-technological-marvel-monsters-university/, 2013.
[348] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: Image synthesis using neural textures. ACM Trans. Graphics, 38(4):1–12, 2019.
[349] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 2387–2395, 2016.
[350] Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, and Ming-Hsuan Yang. Deep image harmonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3789–3797, 2017.
[351] Richard Tucker and Noah Snavely. Single-view view synthesis with multiplane images. In Proc. Computer Vision and Pattern Recognition (CVPR), June 2020.
[352] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[353] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 6924–6932, 2017.
[354] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. DeMoN: Depth and motion network for learning monocular stereo. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[355] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. DeMoN: Depth and motion network for learning monocular stereo. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5622–5631, 2017.
[356] Suren Vagharshakyan, Robert Bregovic, and Atanas P. Gotchev. Light field reconstruction using shearlet transform. Trans. on Pattern Analysis and Machine Intelligence, 40:133–147, 2015.
[357] Jack Valmadre and Simon Lucey. General trajectory prior for non-rigid reconstruction. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1394–1401, 2012.
[358] Sahil Verma and Julia Rubin. Fairness definitions explained. In 2018 IEEE/ACM International Workshop on Software Fairness (FairWare), pages 1–7. IEEE, 2018.
[359] Ricardo Vinuesa, L. Fdez. de Arévalo, M. Luna, and H. Cachafeiro. Simulations and experiments of heat loss from a parabolic trough absorber tube over a range of pressures and gas compositions in the vacuum chamber. Journal of Renewable and Sustainable Energy, 8(2):023701, 2016.
[360] Minh Vo, Srinivasa Narasimhan, and Yaser Sheikh. Spatiotemporal bundle adjustment for dynamic 3D reconstruction. Proc. Computer Vision and Pattern Recognition (CVPR), pages 1710–1718, 2016.
[361] Paul Voigt and Axel Von dem Bussche. The EU General Data Protection Regulation (GDPR). A Practical Guide, 1st Ed., Cham: Springer International Publishing, 10:3152676, 2017.
[362] Chaoyang Wang, José Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[363] Sheng-Yu Wang, Oliver Wang, Andrew Owens, Richard Zhang, and Alexei A. Efros. Detecting Photoshopped faces by scripting Photoshop. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10072–10081, 2019.
[364] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8695–8704, 2020.
[365] Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5309–5318, 2019.
[366] Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5310–5319, 2019.
[367] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 8798–8807, 2018.
[368] Xiaolong Wang and Abhinav Gupta. Generative image modeling using style and structure adversarial networks. In Proc. European Conf. on Computer Vision (ECCV), pages 318–335. Springer, 2016.
[369] Andreas Wedel, Thomas Brox, Tobi Vaudrey, Clemens Rabe, Uwe Franke, and Daniel Cremers. Stereoscopic scene flow computation for 3D motion understanding. Int. J. of Computer Vision, 95(1):29–51, 2011.
[370] Yair Weiss. Deriving intrinsic images from image sequences. In Proc. Int. Conf. on Computer Vision (ICCV), volume 2, pages 68–75, 2001.
[371] Alma Whitten and J. Doug Tygar. Why Johnny can't encrypt: A usability evaluation of PGP 5.0. In USENIX Security Symposium, volume 348, pages 169–184, 1999.
[372] Wikipedia. Death of Elaine Herzberg. https://en.wikipedia.org/wiki/Death_of_Elaine_Herzberg, 2018.
[373] Wikipedia. Deepfake. https://en.wikipedia.org/wiki/Deepfake, 2018.
[374] Wikipedia. Mannequin Challenge. https://en.wikipedia.org/wiki/Mannequin_Challenge, 2018.
[375] Olivia Wiles, Georgia Gkioxari, Richard Szeliski, and Justin Johnson. SynSin: End-to-end view synthesis from a single image. Proc. Computer Vision and Pattern Recognition (CVPR), pages 7465–7475, 2020.
[376] Bernd W. Wirtz, Jan C. Weyerer, and Carolin Geyer. Artificial intelligence and the public sector—applications and challenges. International Journal of Public Administration, 42(7):596–615, 2019.
[377] Changchang Wu. Towards linear-time incremental structure from motion. In Int. Conf. on 3D Vision (3DV), 2013.
[378] Yue Wu, Wael AbdAlmageed, and Premkumar Natarajan. ManTra-Net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9543–9552, 2019.
[379] Jonas Wulff, Laura Sevilla-Lara, and Michael J. Black. Optical flow in mostly rigid scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 4671–4680, 2017.
[380] Ke Xian, Chunhua Shen, Zhiguo Cao, Hao Lu, Yang Xiao, Ruibo Li, and Zhenbo Luo. Monocular relative depth perception with web stereo data supervision. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[381] Wenqi Xian, Patsorn Sangkloy, Varun Agrawal, Amit Raj, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. TextureGAN: Controlling deep image synthesis with texture patches. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 8456–8465, 2018.
[382] Jianxiong Xiao, Andrew Owens, and Antonio Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1625–1632, 2013.
[383] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991, 2017.
[384] Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan L. Yuille, and Kaiming He. Feature denoising for improving adversarial robustness. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 501–509, 2019.
[385] Junyuan Xie, Ross B. Girshick, and Ali Farhadi. Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In Proc. European Conf. on Computer Vision (ECCV), 2016.
[386] Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[387] Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. Monocular depth estimation using multi-scale continuous CRFs as sequential deep networks. Trans. on Pattern Analysis and Machine Intelligence, 2018.
[388] Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.
[389] Weipeng Xu, Avishek Chatterjee, Michael Zollhöfer, Helge Rhodin, Dushyant Mehta, Hans-Peter Seidel, and Christian Theobalt. MonoPerfCap: Human performance capture from monocular video. ACM Transactions on Graphics (ToG), 37(2):27, 2018.
[390] Yi Xu, Jared Heinly, Andrew M. White, Fabian Monrose, and Jan-Michael Frahm. Seeing double: Reconstructing obscured typed input from repeated compromising reflections. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pages 1063–1074, 2013.
[391] Zexiang Xu, Sai Bi, Kalyan Sunkavalli, Sunil Hadap, Hao Su, and Ravi Ramamoorthi. Deep view synthesis from sparse photometric images. ACM Trans. Graphics, 38(4), 2019.
[392] Yexiang Xue, Stefano Ermon, Carla P. Gomes, and Bart Selman. Uncovering hidden structure through parallel problem decomposition for the set basis problem: Application to materials discovery. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[393] Chaofei Yang, Qing Wu, Hai Li, and Yiran Chen. Generative poisoning attack method against neural networks. arXiv preprint arXiv:1703.01340, 2017.
[394] Yingzhi Yang, B. Gog, and E. Gibbs. China seeks to root out fake news and deepfakes with new online content rules. https://www.reuters.com/article/us-china-technology/china-seeks-to-root-out-fake-news-and-deepfakes-with-new-online-content-rules-idUSKBN1Y30VU, 2019.
[395] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. Proc. European Conf. on Computer Vision (ECCV), pages 767–783, 2018.
[396] Mao Ye and Ruigang Yang. Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In Proc. Computer Vision and Pattern Recognition (CVPR), 2014.
[397] Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pages 268–282. IEEE, 2018.
[398] Zhichao Yin and Jianping Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[399] Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, and Jan Kautz. Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5336–5345, 2020.
[400] Ye Yu and William A. P. Smith. InverseRenderNet: Learning single image inverse rendering. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 3155–3164, 2019.
[401] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
[402] Chenxi Zhang, Jizhou Gao, Oliver Wang, Pierre Fite Georgel, Ruigang Yang, James Davis, Jan-Michael Frahm, and Marc Pollefeys. Personal photograph enhancement using internet photo collections. IEEE Trans. on Visualization and Computer Graphics, 2014.
[403] Maggie Zhang. Google Photos tags two African-Americans as gorillas through facial recognition software. Forbes, July 2015.
[404] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
[405] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018.
[406] Yinda Zhang, Shuran Song, Ersin Yumer, Manolis Savva, Joon-Young Lee, Hailin Jin, and Thomas Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5057–5065, 2017.
[407] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[408] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In EMNLP, 2017.
[409] Qi Zhao, Ping Tan, Qiang Dai, Li Shen, Enhua Wu, and Stephen Lin. A closed-form solution to Retinex with nonlocal texture constraints. Trans. on Pattern Analysis and Machine Intelligence, 34(7):1437–1444, 2012.
[410] Yinan Zhao, Brian Price, Scott Cohen, and Danna Gurari. Guided image inpainting: Replacing an image region by pulling content from another image. In Proc. Winter Conf. on Computer Vision (WACV), pages 1514–1523. IEEE, 2019.
[411] Enliang Zheng, Dinghuang Ji, Enrique Dunn, and Jan-Michael Frahm. Sparse dynamic 3D reconstruction from unsynchronized videos. Proc. Int. Conf. on Computer Vision (ICCV), pages 4435–4443, 2015.
[412] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[413] Huizhong Zhou, Benjamin Ummenhofer, and Thomas Brox. DeepTAM: Deep tracking and mapping. In Proc. European Conf. on Computer Vision (ECCV), 2018.
[414] Peng Zhou, Ning Yu, Zuxuan Wu, Larry S. Davis, Abhinav Shrivastava, and Ser-Nam Lim. Deep video inpainting detection. arXiv preprint arXiv:2101.11080, 2021.
[415] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[416] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[417] Tinghui Zhou, Philipp Krähenbühl, and Alexei A. Efros. Learning data-driven reflectance priors for intrinsic image decomposition. In Proc. Int. Conf. on Computer Vision (ICCV), pages 3469–3477, 2015.
[418] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. ACM Trans. Graphics, 37:1–12, 2018.
[419] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros. View synthesis by appearance flow. In Proc. European Conf. on Computer Vision (ECCV), pages 286–301. Springer, 2016.
[420] Chen Zhu, W. Ronny Huang, Hengduo Li, Gavin Taylor, Christoph Studer, and Tom Goldstein. Transferable clean-label poisoning attacks on deep neural nets. In International Conference on Machine Learning, pages 7614–7623. PMLR, 2019.
[421] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Learning a discriminative model for the perception of realism in composite images. In Proceedings of the IEEE International Conference on Computer Vision, pages 3943–3951, 2015.
[422] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2223–2232, 2017.
[423] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Neural Information Processing Systems, pages 465–476, 2017.
[424] Yu Zhu, Wenbin Chen, and Guodong Guo. Evaluating spatiotemporal interest point features for depth-based action recognition. Image and Vision Computing, 32(8):453–464, 2014.
[425] C. Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski. High-quality video view interpolation using a layered representation. ACM Trans. Graphics, 23:600–608, 2004.
[426] C. Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon A. J. Winder, and Richard Szeliski. High-quality video view interpolation using a layered representation. In SIGGRAPH 2004, 2004.
[427] Michael Zollhöfer, Matthias Nießner, Shahram Izadi, Christoph Rehmann, Christopher Zach, Matthew Fisher, Chenglei Wu, Andrew Fitzgibbon, Charles Loop, Christian Theobalt, et al. Real-time non-rigid reconstruction using an RGB-D camera. ACM Trans. Graphics, 33(4):156, 2014.
[428] Daniel Zoran, Phillip Isola, Dilip Krishnan, and William T. Freeman. Learning ordinal relationships for mid-level vision. In Proc. Int. Conf. on Computer Vision (ICCV), pages 388–396, 2015.
[429] Shoshana Zuboff. Big other: Surveillance capitalism and the prospects of an information civilization. Journal of Information Technology, 30(1):75–89, 2015.
[430] Shoshana Zuboff, Norma Möllers, David Murakami Wood, and David Lyon. Surveillance capitalism: An interview with Shoshana Zuboff. Surveillance & Society, 17(1/2):257–266, 2019.
[431] Frederik Zuiderveen Borgesius, Damian Trilling, Judith Möller, Balázs Bodó, Claes H. de Vreese, and Natali Helberger. Should we worry about filter bubbles? Internet Policy Review. Journal on Internet Regulation, 5(1), 2016.