PERCEPTION ENABLED PLANNING FOR
AUTONOMOUS SYSTEMS
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Yutao Han
August 2021
© 2021 Yutao Han
ALL RIGHTS RESERVED
PERCEPTION ENABLED PLANNING FOR AUTONOMOUS SYSTEMS
Yutao Han, Ph.D.
Cornell University 2021
As we move towards fully autonomous robotic systems, one of the key chal-
lenges is the integration of state-of-the-art perception techniques in machine
learning and computer vision with decision making for robots. While there
have been tremendous strides in the deep learning community, there are
still challenges in how to apply these technological advances to extend the
boundaries of autonomous planning.
In this thesis, I will discuss my research which attempts to bridge ad-
vances in computer vision to practical applications in robotics. In the first
part of this thesis, I will present a pedestrian motion model which combines
high level decision making with low level motion patterns. Next, I will
present my work on managing scene uncertainties with a focus on semantic
surface types. Next-Bext-View measurements are taken to iteratively de-
crease uncertainty for a multi-hypothesis planner. Then, I will present my
work on planning through occluded spaces. A GAN-based image inpaint-
ing network fills in unknown spaces for planning and I demonstrate the
efficacy of my method compared to the traditional paradigm for managing
occluded spaces. In the last part, I will discuss a planner which leverages
instance segmentation and ordinal information to extend the boundaries of
traditional metrical planners to long distance planning.
BIOGRAPHICAL SKETCH
Yutao is a researcher in robotics at Cornell University. He is interested in
the applications of computer vision and machine learning to enable robots
to make more intelligent decisions in a complex world. He is advised by
Professor Mark Campbell who is an expert in autonomy for robotics. Yutao
completed his Bachelors degree in Mechanical Engineering from the Uni-
versity of Illinois at Urbana-Champaign (UIUC).
iii
献给我的外公外婆。
iv
ACKNOWLEDGEMENTS
My research is funded by the ONR under the PERISCOPE MURI Grant
N00014-17-1-2699.
Thank you to my advisor Mark, and my other committee members KB and
Hadas.
Thank you to my many labmates and friends.
Thank you to my family.
Most importantly, thank you to everyone who has laughed with me, I
would not have finished my PhD without you.
v
TABLE OF CONTENTS
Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
1 Introduction 1
2 Pedestrian Motion Model Using Non-Parametric Trajectory Clus-
tering and Discrete Transition Points 10
2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 PEDESTRIAN MOTION MODEL . . . . . . . . . . . . . . . . . 15
2.4 MOTION & TRANSITION POINT MODELING . . . . . . . . 16
2.4.1 Transition Point Estimation . . . . . . . . . . . . . . . . 21
2.4.2 Iterative Clustering . . . . . . . . . . . . . . . . . . . . . 24
2.4.3 Transition Probabilities . . . . . . . . . . . . . . . . . . . 26
2.5 Online Prediction of Motion and Anomaly Detection . . . . . 30
2.5.1 Predictive Modeling . . . . . . . . . . . . . . . . . . . . 30
2.5.2 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . 31
2.6 VALIDATION OF THE MOTION MODEL . . . . . . . . . . . 31
2.6.1 Robustness of the Iterative Clustering Algorithm to
Anomalous Training Data . . . . . . . . . . . . . . . . . 32
2.6.2 Pedestrian Prediction Accuracy . . . . . . . . . . . . . . 35
2.6.3 Online Anomaly Detection . . . . . . . . . . . . . . . . 38
2.7 Model Complexity . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.8 TAKEAWAYS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3 DeepSemanticHPPC: Hypothesis-based Planning over Uncertain
Semantic Point Clouds 41
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.1 RRT-Based Non-Holonomic Planning over a Point
Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.2 Segmentation with Bayesian Neural Networks . . . . 47
3.3 Approach Overview . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 Technical Details . . . . . . . . . . . . . . . . . . . . . . . . . . 53
vi
3.4.1 Predicting Semantic Labels . . . . . . . . . . . . . . . . 53
3.4.2 Associating Semantic Labels to a Point Cloud . . . . . 54
3.4.3 RRT-Based Multi-hypothesis Planner . . . . . . . . . . 56
3.4.4 Next-Best-View (NBV) Planning . . . . . . . . . . . . . 57
3.5 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5.1 Validation Scenes and Overview . . . . . . . . . . . . . 60
3.5.2 NBV evaluation . . . . . . . . . . . . . . . . . . . . . . . 63
3.5.3 Path Safety Evaluation . . . . . . . . . . . . . . . . . . . 65
3.6 Takeaways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4 Planning Through Unknown Space by Imagining What Lies
Therein 68
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Technical Approach . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3.1 Image Inpainting . . . . . . . . . . . . . . . . . . . . . . 72
4.3.2 Path Planner . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.3 Planning with Lidar Data . . . . . . . . . . . . . . . . . 75
4.3.4 Planning with Stereo Camera Data . . . . . . . . . . . . 79
4.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . 81
4.4.1 Network Training . . . . . . . . . . . . . . . . . . . . . . 81
4.4.2 Qualitative Evaluation . . . . . . . . . . . . . . . . . . . 83
4.4.3 Quantitative Evaluation . . . . . . . . . . . . . . . . . . 84
4.5 Takeaways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5 Semantically Enabled Long Horizon Planning 90
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.1 Image Based Planning . . . . . . . . . . . . . . . . . . . 95
5.3.2 Instance Segmentation . . . . . . . . . . . . . . . . . . . 96
5.3.3 Long Distance Map . . . . . . . . . . . . . . . . . . . . . 99
5.3.4 Receding Horizon Planner . . . . . . . . . . . . . . . . . 103
5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4.1 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4.2 Real World Experiment . . . . . . . . . . . . . . . . . . 106
5.5 Takeaways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6 Summary and Conclusions 114
vii
Bibliography 116
viii
LIST OF TABLES
2.1 Transition Probabilities Between Transition Points . . . . . . . 29
2.2 % Error for Different Amounts of Anomalies . . . . . . . . . . 34
2.3 K-folds Cross Validation Results . . . . . . . . . . . . . . . . . 37
3.1 500 trials of path safety evaluation. The columns are the path
planning methods used: B1 is the planner based on [48], B2
is the variant of our framework which does not utilize any
NBVs, XN is DeepSemanticHPPC (ours) with X NBVs. The
rows are the metrics: Safe is the number of trials where a
safe path is selected, Unsafe is the number of trials where an
unsafe path is selected (lower is better), CS is the number of
trials where a safe path is confirmed, CN is the number of
trials where all multipaths are confirmed as unsafe (and no
path is selected). . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.1 Evaluation of inpainting for planning with the baseline and
ground truth (GT). Metrics are given as mean (std). Each
study has 50 trials with goal positions chosen randomly.
Smaller numbers are better. . . . . . . . . . . . . . . . . . . . . 87
5.1 Number of trials out of 1000, where the baseline took a
greater number of steps to reach the goal, less steps, and
an equal number of steps compared with the long horizon
planner. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
ix
LIST OF FIGURES
2.1 A motivating example for the motion model in this paper.
The local motion patterns are labeled m1,m2,m3 and shown
in purple, light blue, and green respectively. There are four
transition points annotated with blue bubbles. There is an
anomalous trajectory shown in red. Background image from
[8]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Algorithmic flow of iterative DPGP clustering and hypoth-
esis test transition point estimation (black) of the proposed
pedestrian model (blue) for both low level trajectories and
high level transitions. The model is then used for online in-
ference (green). . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 The  neighborhood of point {xki , yki }. The  neighborhood is
in blue, m is in green, {xkj i , yki } is the red dot, and tk is in red.
The percentage of trajectories in m j that pass through the 
neighborhood of each point in tk is used to weigh if the po-
sitional distribution of tk matches the positional distribution
of trajectories in m j . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Discovery of sub-trajectories for the case of merging motion
patterns. There are two motion patterns in red and blue. Ar-
rows in the top left show the directionality of the motion pat-
terns. Circles on the trajectories indicate where the trajectory
data ends. Potential sub-trajectory points in the red pattern
are indicated with red crosses. The data is from video 3 of
the DukeMTMC project [71]. . . . . . . . . . . . . . . . . . . . 25
2.5 Results of the iterative DPGP trajectory clustering algorithm
and transition point estimation on simulated data with 90
trajectories. The trajectories move in two directions, and the
top and bottom rows correspond to each of the directions
respectively. 2.5(a) is the initial clustering result before any
splitting of trajectories. 2.5(b) shows the iterative clustering
result after convergence. 2.5(c) shows the transition points
found with 95% confidence ellipses. The blue bubbles anno-
tating 2.5(c) (bottom) correspond to the states in Table I. The
numbers 1 − 6 in the blue bubbles correspond to the states
s1 − s6 respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 27
x
2.6 Simulated dataset of 90 trajectories with 30% added anoma-
lies. The original trajectories are in blue and the anomalous
trajectories are in red. The circles signify the ends of trajec-
tories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.7 Trajectories used from the Duke MTMC dataset, video num-
ber 8. The 191 trajectories used to train the offline model in
k-folds validation are shown in blue with circles signifying
the ends of trajectories. 40 anomalous trajectories used for
online anomaly detection analysis are shown in red. . . . . . 34
2.8 An example of the online predictive model. The observed
trajectory is shown in magenta. The current state, which
includes both transition points and motion patterns, is in
green. Possible future states are in shades of blue, scaled
by probability, with darker shades corresponding to higher
probability of transition, and improbable future states are
shaded in gray. In 2.8(a), the trajectory has just started out
so there are a wide range of possible future states. In 2.8(b),
the trajectory moves within the overlap of clusters, and the
probability of transitioning to some states increases with a
darker shade of blue, while other states are eliminated. In
2.8(c) the trajectory has only one possible future state to tran-
sition to, which is the confidence ellipse shaded in black, the
color corresponding to the highest transition probability of
one. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.9 Online anomaly detection. The trained model is shown in
gray. If the trajectory is found to be aligned with a learned
motion pattern, it is highlighted in green. If the sliding win-
dow is detected as an anomaly, it is highlighted in red. The
segment of the observed trajectory that is no longer in the
sliding window is highlighted in black. . . . . . . . . . . . . . 38
3.1 The DeepSemanticHPPC pipeline. (1) In the first stage, ini-
tial inputs generate a multi-hypothesis graph of possible
paths. (2) In the second stage, the uncertainty in the scene
is reduced and path costs are updated. (3) The second stage
is repeated for a set number of iterations. This is terminated
early if a safe path is confirmed or all paths are confirmed as
unsafe. (4) A path is selected. . . . . . . . . . . . . . . . . . . 42
xi
3.2 An example point cloud. (a) Image view of a portion of the
environment. (b) Point cloud colored with the most likely
class predicted from image (a) (bright green: “grass”; dark
green: “tree”; purple: “sidewalk”; dark grey: “road”; light
grey: no information available). All the classes except “tree”
belong to the set S . The region around the tree is actually
mulch/woodchips, which should be classified as “dirt” (be-
longing to U). (c) Point cloud colored to show safe (white),
unsafe (black), unclear regions R partitioned according to
most likely class (random colors), and points not associated
with any class (orange). . . . . . . . . . . . . . . . . . . . . . . 49
3.3 An example graph G = (V, A), with poses. Vertices with
c(v) = 0 are in green, while vertices with 0 < c(v) ≤ 1 are
in red (the darker, the closer to 1). The blue, green, and red
axes indicate the pose of the robot. . . . . . . . . . . . . . . . 57
3.4 The prebuilt Africa environment from Airsim used for test-
ing the multipath planner. . . . . . . . . . . . . . . . . . . . . 61
3.5 Left: Image from Cass Park in Ithaca. There are multiple dif-
ferent terrains in the scene including grass, mud, and water.
Right: Annotated safe (blue) and unsafe (red) regions for the
Cass Park point cloud. . . . . . . . . . . . . . . . . . . . . . . 62
3.6 Change in uncertainty of path vertices (y-axis) as the number
of NBV measurements increase (x-axis). Left: Mann Library.
Right: Cass Park. . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.1 from left to right. Current partial map represented as a point
cloud annotated with semantic labels (e.g. purple = road);
bird’s eye view of the annotated point cloud, where white
corresponds to unlabeled pixels whose semantic class will
be predicted via image inpainting; inpainted image; obstacle
map associated with the inpainted image (white = free space,
black = obstacles), shown along with obstacles discovered
online (red) and final path to the goal (blue) in one of our
simulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Left: The semantic input point cloud of an initial lidar scan.
Right: The ground truth completed semantic map. . . . . . . 77
xii
4.3 An example from the generated dataset of point clouds ren-
dered from a BEV. Left: Input semantically labeled point
cloud with occluded and unknown spaces in white. Middle:
Mask of image to be inpainted. Known pixels are black and
pixels to be filled are white. Note that grey (unobserved) pix-
els in the left image are labeled as known in the image mask
as they are not the targets for inpainting. Right: Ground
truth labeled semantic point cloud. . . . . . . . . . . . . . . . 78
4.4 A distribution of the semantic pixel labels of the training
dataset of the first ten sequences. Each semantic class is
shown in the same color as displayed in the dataset. Note
that the y-axis is log scale. . . . . . . . . . . . . . . . . . . . . . 79
4.5 Qualitative evaluation of the inpainting network perfor-
mance. Odd rows are the RGB semantic maps and even
rows are the obstacle maps generated from the semantic
BEV images. Columns from left to right: (1) lidar input, (2)
inpainted image, (3) stereo input, (4) inpainted image, (5)
ground truth (GT). For the obstacle maps, black pixels are
obstacles and white pixels are free space. . . . . . . . . . . . . 82
4.6 A comparison of a path planner trial on the lidar input obsta-
cle map, the inpainted obstacle map, and the GT map shown
in figure 4.5 (image 150). The path is in blue pixels. The start
is in the middle left and the goal is in the top right. Red
pixels are obstacles the robot observes online. Green pixels
show where the path overlaps itself. The light blue circles
show a gap in the lidar input map which is filled in on the
inpainted image. . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.1 Motivating scenario taken at the Mann Library at Cornell
University. The purple region represents where the robot
is able to get accurate depth information and perform plan-
ning using traditional metrical planning methods. The red
arrows represent far away obstacles which the robot must
plan around. The orange represents the region where the
robot is able to capture semantic but not depth information
in the scene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2 Example of generating a point cloud for an image based
planner. (a) Raw image of the environment. (b) A seman-
tically segmented image using [14]. (c) A 3D point cloud
projected using depth classification from [56]. . . . . . . . . . 97
xiii
5.3 An example of an RRT image space based planner. (a) Sam-
ple space with plan hypotheses. The start is in the bottom
center and the goal is in the top right. The start and goal are
both annotated as red spheres. RRT nodes are shown as blue
spheres and edges shown as lines. Blue lines are visible and
red lines are occluded. (b) The planner back-projected into
the image space. . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.4 A sample image from the manually labeled dataset for in-
stance segmentations. The white polygons represent indi-
vidual instances of trees. . . . . . . . . . . . . . . . . . . . . . . 99
5.5 Mask R-CNN detections and distance classifications. Tree
trunk instances in Om are shown in blue and instances in Oo
are shown in purple. . . . . . . . . . . . . . . . . . . . . . . . . 100
5.6 The map representation based on the detections show in Fig-
ure 5.5. The metric FOV is shown in light blue with the indi-
vidual tree trunk instances in dark blue. The ordinal FOV is
shown khaki, with individual tree trunk instances in purple.
The obstacles in the ordinal FOV are ordered based on their
rough depth estimates. . . . . . . . . . . . . . . . . . . . . . . . 102
5.7 The map used for simulations with manually annotated
polygons for obstacles in blue. . . . . . . . . . . . . . . . . . . 104
5.8 The map of Cass Park in Ithaca used for real world experi-
ments. The start and goal points are annotated on in light
blue. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.9 Real world experiment with the baseline planner. (1a-b)
show the first measurement and planned path in the sen-
sor FOV, (2a-b) show the second and so on. Images labeled
(a) show the raw image detection with the detected instance
mask projected on to them. Purple masks indicate the in-
stance is outside the metric range, and blue masks indicate
the instance is inside the metric range. Images labeled (b)
show the birds eye view FOV given the sensor location and
image detection. The metric FOV is in light blue and obsta-
cles in the metric range are blue. The global start location is
the green circle and the global goal location is the white cir-
cle. The planned path segment in the metric FOV is shown
in white. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
xiv
5.10 Real world experiment with the long horizon planner. (1a-b)
show the first measurement and planned path, (2a-b) show
the second and so on. Images labeled (a) show the raw im-
age detection with the detected instance mask projected on
to them. Purple masks indicate the instance is outside the
metric range, and blue masks indicate the instance is inside
the metric range. Images labeled (b) show the birds eye view
FOV given the sensor location and image detection. The
metric FOV is in light blue and the ordinal FOV is in khaki.
Obstacles in the metric range are blue and obstacles outside
of the metric range are purple. The global start location is
the green circle and the global goal location is the white cir-
cle. The planned path segment in the metric FOV is shown
in white and the path segment in the ordinal FOV is shown
in orange. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
xv
CHAPTER 1
INTRODUCTION
Motivation
Autonomous robots are an exciting field of research due to the benefits
they can bring to the daily lives of people. From self-driving cars to search
and rescue robots, autonomous robots provide numerous benefits including
saved commute time, increased safety, and more time for creative tasks. Ev-
ery segment of society will benefit from advancements in the field; billions
of dollars are currently being invested into bringing autonomous robots to
reality.
While purchasing C-3PO at a local Walmart may have seemed like a pipe
dream a decade ago, recent advancements in the robotics field including in
areas such as actuation, controls, planning, sensors, and computer vision
have made science fiction robots more and more realistic.
Nevertheless, there are still many challenges to overcome before C-3PO
does the dishes. Fully autonomous robots must be able to operate in com-
plex environments while making intelligent decisions. A lot of research
has been done in structured indoor environments, but there are still many
challenges to be addressed in unstructured, uncertain, and outdoor envi-
ronments. Youtube videos of a Tesla vehicle mistaking a bright moon with
a traffic light is evidence of that.
1
Background
One of the most important problems of autonomous robots is understand-
ing the world around the robot. In the world there exist both dynamic
agents (pedestrians, animals, cars, etc.) and a static scene (road, building,
tree, etc.). The robot must understand both dynamic and static elements of
the scene in order to operate in the world safely and reliably without caus-
ing potentially fatal accidents.
The most complex dynamic agent a robot will have to interact with in the
world are pedestrians. Pedestrians do not follow rigid rules like cars, and
are almost impossible to model using a physics based model. While there
exists literature using non-parametric machine learning methods to capture
low level pedestrian behaviors [24], [41], [15], these methods only capture
low level velocity clusters. In these methods gaussian processes (GP) are
fit to clusters of pedestrian trajectories. GPs are random processes where
joint distributions over random variables f (x) are fit to a normal distribution
such that:
p(f|X) ∼ N(f|µ,K) (1.1)
where µ is the mean and K is the kernel function representing the covari-
ance. K can be manipulated to generate different types of functions f .
These bayesian non-parametric models generally only fit to low level
2
velocity behavior such that x is a spatial point and f (x) is the velocity at that
spatial point. It is important to model the agents in the world beyond low
level velocity and clusters to gain more insight on the high level decisions
and behaviours of the agents in the world.
Another key aspect of understanding the world is the static world
around the robot. There has been a large body work done on modeling
and navigating the world in indoor environments [79, 39, 53, 73, 86], but
for more complex tasks such as self driving, robustness to outdoor environ-
ments is desired. Being able to leverage state of the art computer vision
technologies to understand the semantics and structure of the world and
the uncertainty in it is the key next step to robust autonomous systems in
complex outdoor environments.
Recent work on understanding uncertainties include the research from
[29] which uses stochastic dropout through a neural network to infer the
posterior distribution of the weights. During runtime, T stochastic passes
with dropout are run through the network to infer the posterior distribution
of the output:
1 ∑T
Eq(y*|x*)(y*) ≈ ŷ∗(x∗,Wt ,WtT 1 L) (1.2)
t=1
where q is the approximated distribution over the network output, {x∗,y∗}
are the network input and output respectively, T is the number of stochastic
passes through the network, and W represents the weights of L layers. This
3
is a convenient method which does not require significant network surgery
and can be applied to any neural network easily.
Other advances in the computer vision community include applying
generative adversarial networks (GANs) to image inpainting. GANs use
a generative model G and a discriminate model D to train the generator to
produce samples from a given training distribution. G attempts to gener-
ate samples of input data (in this case, images with missing regions) and D
attempts to classify the outputs of G as real or fake. These networks have
been able to generate impressive results in image inpainting tasks [95].
A significant challenge in understanding the static environment is deal-
ing with occluded spaces in the world. These are spaces that are blocked
off from sensor detections due to obstacles in the way. For example, a
house blocks a sensor from seeing what is behind the house, where there
could be many possibilities such as other houses, trees, or just empty space.
Occluded spaces are traditionally treated as free [73, 86]). The traditional
planning paradigm is to plan through unknown spaces as if they are unin-
habited, and then replan paths if it turns out obstacles actually occupy the
occluded space.
Lastly, there is the problem of dealing with limitations in sensors at far
ranges. Lidar can provide accurate depth at far distances but is too sparse
at those ranges for any classification. Vision based approaches can pro-
vide rich semantic information at far distances, but stereo depth error scales
quadratically with depth. Traditional approaches plan using metric based
4
maps of the world, based on accurate depth detections from sensors [87],
[2]. However, these approaches are limited by the aforementioned limita-
tions of lidar and image based sensors.
Key Research Questions
Given the previous work done in the robotics and computer vision commu-
nity, there are some key questions that need to be addressed.
How can we understand higher level behaviours of pedestrians in a
scene? While low level velocities from trajectories can provide insight on
pedestrian behavior, a more high level understanding of pedestrian deci-
sion making would be ideal in order to understand long term pedestrian
motion and potential waypoints.
How do we operate when the detections are uncertain because of light-
ing conditions or other sources of uncertainty? Bayesian neural networks
allow us to extract principled estimates of uncertainty, but how can we ap-
ply that to a practical robotics problem?
What should the robot do when parts of the world are occluded by ob-
stacles? While it is a reasonable heuristic to assume that unknown space is
free, it is intuitively not ideal and we should have a better understanding of
how to generate informed paths through unknown space.
What if obstacles are too far away to get accurate depth information
5
from? Is there a way to incorporate the close range metrical maps with
far away semantic detections?
Research Opportunities
Some key research opportunities informed by the open questions are:
• Developing a pedestrian model which is able to incorporate low level
velocities with higher level decision making, to gain more insight over
long term trajectory patterns.
• Applying network detection uncertainties to improve a robots under-
standing of the world; creating a more robust system for navigating in
uncertain environments.
• Leveraging GAN image inpainting to generate data-driven models in
occluded spaces of the world for robot navigation.
• Combining short distance metrical planners with long distance se-
mantic information for more informed planning.
Outline
This thesis addresses some of the key challenges in robotics by bridging the
gap between state of the art computer vision and machine learning tech-
niques, and the traditional robotics community.
6
Chapter 2 presents a pedestrian motion model that includes both low
level trajectory patterns, and high level discrete transitions. The inclusion
of both levels creates a more general predictive model, allowing for more
meaningful prediction and reasoning about pedestrian trajectories, as com-
pared to the current state of the art. The model uses an iterative clustering
algorithm with (1) Dirichlet Process Gaussian Processes to cluster trajecto-
ries into continuous motion patterns and (2) hypothesis testing to identify
discrete transitions in the data called transition points. The model iteratively
splits full trajectories into sub-trajectory clusters based on transition points,
where pedestrians make discrete decisions. State transition probabilities are
then learned over the transition points and trajectory clusters. The model
is for online prediction of motions, and detection of anomalous trajectories.
The proposed model is validated on the Duke MTMC dataset to demon-
strate identification of low level trajectory clusters and high level transi-
tions, and the ability to predict pedestrian motion and detect anomalies on-
line with high accuracy.
Chapter 3 presents a framework for managing uncertainties from per-
ception for planning. Planning in unstructured environments is challenging
– it relies on sensing, perception, scene reconstruction, and reasoning about
various uncertainties. A novel uncertainty-aware hypothesis-based plan-
ner for unstructured environments is proposed. The algorithmic pipeline
consists of: a deep Bayesian neural network which segments surfaces with
uncertainty estimates; a flexible point cloud scene representation; a next-
best-view planner which minimizes the uncertainty of scene semantics us-
7
ing sparse visual measurements; and a hypothesis-based path planner that
proposes multiple kinematically feasible paths with evolving safety confi-
dences given next-best-view measurements. and a hypothesis-based path
planner which develops multiple kinematically feasible paths for the un-
certain environment which evolve with incoming information. The pipeline
iteratively decreases semantic uncertainty along planned paths, filtering out
unsafe paths with high confidence. Experimental validations shows the
framework plans safe paths in real-world environments where existing path
planners typically fail.
Chapter 4 presents a novel framework for planning paths in maps con-
taining unknown spaces, such as from occlusions. The approach takes as
input a semantically-annotated point cloud, and leverages an image in-
painting neural network to generate a reasonable model of unknown space
as free or occupied. The validation campaign shows that it is possible to
greatly increase the performance of standard pathfinding algorithms which
adopt the general optimistic assumption of treating unknown space as free.
Chapter 5 presents a semantically aware long horizon planner which
leverages long distance semantic information in image detections to inform
planning with a short distance metric range based planner. By merging the
long distance semantic detections with a short distance planner, the planner
is able to alleviate the limitations of short distance metric based planners,
which are unable to take long distance semantic information into account.
Experiments in simulation and the real world demonstrate the ability of the
8
proposed planner at encoding semantic detections into a metric planner.
9
CHAPTER 2
PEDESTRIAN MOTION MODEL USING NON-PARAMETRIC
TRAJECTORY CLUSTERING AND DISCRETE TRANSITION POINTS
2.1 Motivation
Predictive models allow systems interested in pedestrian behavior to per-
form higher level inference and reasoning about scenes involving pedes-
trians. Models of how pedestrians navigate an environment are useful for
many robotics applications, including: surveillance robots [91] (e.g. under-
standing patterns of how people walk in a subway station could help detect
anomalies, such as a person running with a bag after a theft) and search
and rescue robots [62] (e.g. understanding movement patterns in a build-
ing on fire or during an earthquake can give information about where peo-
ple may be in the building). Insight can be obtained on pedestrian behaviors
and patterns by applying machine learning techniques to data of pedestrian
motion and learning predictive models much smaller in size than original
datasets.
Currently there are advanced and accurate systems for tracking multi-
ple targets with cameras within a scene [46], [55]. However, to gain more
insight into the scene and behaviors, patterns, motivations, anomalies etc.,
a more comprehensive model of pedestrian motions and behaviors is re-
quired. In this paper, a non-parametric model is developed to capture
10
the statistics of pedestrian trajectories and reason about pedestrian mo-
tion. Given the challenges of physics based modeling of pedestrian motion
and the availability of large datasets of pedestrian movements, data-driven
models are intuitively appropriate. Data-driven models do not make any
assumptions about pedestrian motion, such as optimizing trajectory dis-
tance or avoiding collisions with other pedestrians that may constrain the
model to produce unrealistic results. Instead, data-driven models have the
flexibility to fit to whatever the data suggests. Machine learning techniques
can be used to develop predictive models much smaller in size than origi-
nal datasets. In this paper, descriptive transition points are learned from the
data to represent high level behaviors where trajectory patterns branch or
merge together (for example, at a fork in the road). These transition points
are learned in conjunction with low level trajectory clusters in an iterative
algorithm to form a general model of pedestrian motion. By integrating the
two levels, the model can be used for many tasks, such as online clustering
of trajectories, anomaly detection, or reasoning about future movements
and goal locations. These high level understandings about human motion
patterns and behaviors are crucial for many robotics applications. For self-
driving cars, the model provides predictions of how and where pedestrians
will walk around cars, roads and intersections [34]. For surveillance, the
model gives locations for a patrol robot to intercept pedestrians. For per-
sonal robotics, the model provides predictions for collision avoidance in all
types of environments (museums, at home, etc.).
Fig. 2.1 provides a motivating example for the motion model in this
11
Figure 2.1: A motivating example for the motion model in this paper. The
local motion patterns are labeled m1,m2,m3 and shown in pur-
ple, light blue, and green respectively. There are four transition
points annotated with blue bubbles. There is an anomalous
trajectory shown in red. Background image from [8].
paper. The pedestrian movement patterns are decomposed into continuous
motion patterns and discrete transition points. Transition probabilities are
learned between the states, which are both motion patterns and transition
points, the pedestrian can be in. The model is able to perform high level
inference about pedestrian motion in the scene and also detect anomalous
pedestrian behaviors.
12
2.2 Related Work
Pedestrian motion models have been developed from trajectory data in sev-
eral ways. One approach involves using Bayesian non-parametric mod-
els to cluster trajectories, such as Gaussian Processes (GP) and Dirichlet-
Process Gaussian Processes (DPGP) [24], [41], [15]. A learned GP over a
cluster of trajectories is referred to as a motion pattern, which maps {x, y}
positions to a velocity flow field [41]. While these models have the flexibil-
ity to fit variable trajectory data and identify clusters of trajectories, they do
not capture higher level transitions, such as trajectories branching, or dis-
crete pedestrian decisions such as stopping at an intersection. Consider the
example of two clusters of trajectories with a transition point where trajec-
tories branch apart, such as at a fork in the road. There is an overlapping
region between the two clusters before the fork and non-overlapping re-
gions after the clusters branch apart. The motion patterns learned by the GP
clustering would not identify the common overlapping region referred to as
the sub-trajectory cluster nor the discrete location where the clusters branch
apart referred to as the transition point. Although these non-parametric
models capture coarse behaviors, higher level goals of transition points are
important for subsequent reasoning. Other models using Bayesian non-
parametrics have combined the temporal learning advantages of long short-
term memory (LSTM) networks with the flexibility of GPs [83]. Another
model uses LSTM by itself to learn movement and predict motion [1]. How-
ever, these models also do not capture higher level transitions correspond-
13
ing to changes in pedestrian behavior, and are only meant for short term
trajectory predictions.
Another approach uses discrete state transition models to learn trajec-
tory motion, such as Hidden Markov Models (HMM) [64], [40], [60], [49],
[88]. One of the challenges with using HMM is defining discrete latent states
that capture high level motion characteristics. In contrast, in this paper the
discrete states are intuitively defined as transition points, which are for-
mally inferred from low level trajectory data. Some HMM approaches use a
grid representation to define states. These models have trouble identifying
motion patterns and higher level behaviors among trajectory clusters. Grid
based approaches also can lose valuable information when real trajectories
do not fit cleanly into the grid space. Other models identify transition states
as locations where many trajectories converge in a gridded space [38]. In
these cases, information about trajectory motion patterns is lost and com-
plex transition points are not typically modeled. A more general approach
to defining and modeling transitions between general discrete states is de-
sired.
The work closest to this paper uses a dictionary learning algorithm to
discover local motion patterns which are similar to sub-trajectory clusters
[16]. This model cannot learn intuitive discrete transition points, such as
trajectories branching and does not consider multiple future transitions be-
tween subtrajectory clusters. There are many other approaches to modeling
movement patterns such as Vector Field k-Means [27], finding subtrajecto-
14
ries by clustering line segments [54], and finding frequent patterns through
error metric clustering [76]. These methods identify common patterns and
cluster trajectories accordingly. However, high level trajectory behavior at
transition points is not captured. A model that captures high level discrete
decisions is required for higher level reasoning.
Generally, data-driven trajectory models from the community have also
not considered anomaly detection online, which is addressed in the model
presented in this paper.
This paper provides the following contributions:
• A novel iterative probabilistic clustering algorithm, which finds low
level clusters of sub-trajectories using DPGP and discovers high level
transition points with hypothesis testing. The number of clusters and
transition points is not pre-specified.
• Online probabilistic prediction of both trajectories and novel high
level transition points.
• A formal way to discover anomalous trajectories online
2.3 PEDESTRIAN MOTION MODEL
Fig. 2.2 shows a block diagram of the proposed pedestrian model and algo-
rithmic flow. First, raw trajectory data are input into an iterative algorithm
15
with Dirichlet Process Gaussian Process (DPGP) clustering and hypothe-
sis testing, which produces trajectories clustered into sub-trajectory motion
patterns and estimated transition points. The clusters and transition points
are then used to find transition probabilities. Finally, in online application,
the motion patterns, transition points, and transition probabilities are used
for predictive modeling and anomaly detection.
Figure 2.2: Algorithmic flow of iterative DPGP clustering and hypothesis
test transition point estimation (black) of the proposed pedes-
trian model (blue) for both low level trajectories and high level
transitions. The model is then used for online inference (green).
2.4 MOTION & TRANSITION POINT MODELING
The pedestrian dataset is composed of the set of nt trajectories T =
{t1, ...tk, ...tnt}, where tk is an individual trajectory. Each trajectory tk is divided
into a sequence of nk points [{xk , yk}, ...{xk, ykp 1 1 i i }, ...{xkk , ykk }]T , tki = {xki , yki } repre-np np
sents the ith data point, and {x, y} represents positional data. The time inter-
16
val between two consecutive points i and i + 1 in a trajectory is constant. A
motion pattern is defined as a mapping from {x, y} to {ẋ, ẏ}, where {ẋ, ẏ} are
the time derivatives of the x and y positions respectively. The motion pat-
tern enables a distribution of velocity to be estimated given any 2-D point
in space. The set of motion patterns M is defined as {m1, ...m j, ...mnm}, where
m j is an individual motion pattern and nm is the total number of motion pat-
terns in M. Each motion pattern m j contains nmj trajectories, and there are
no shared trajectories between motion patterns. The goal of DPGP trajec-
tory clustering is then to divide the trajectory data T into nm representative
motion patterns.
The Dirichlet Process (DP) [85] is a nonparametric clustering process that
in this case discovers a mixture of GP models. DP allows for automatic dis-
covery of the number of clusters which grows as the data becomes more
complex. Gibbs sampling is used for the clustering process of DP. An ini-
tial number of clusters is manually chosen and the cluster assignments are
randomly initialized for the DP. In the clustering process, each trajectory tk
belongs to an existing motion pattern m j with prior probability
nmj
− (2.1)α + nt 1
and belongs to a new motion pattern with prior probability
α
α + nt −
(2.2)
1
17
where α is a concentration parameter.
The GP is used as a likelihood metric for the DP; GP [6] models motion
patterns by mapping the set {x, y} to the set {ẋ, ẏ}. The proposed GP uses a
radial basis function (RBF) kernel [41] to weight local data, defined as
(x − x′)2 ′ 2
Kx(x, y, x′, y′) = σ2x exp(− 2 −
(y − y )
2 ) + σ
2
nδ(x, y, x
′, y′) (2.3)
2ux 2uy
where Kx is the kernel function in the x direction, σx and σn are standard
deviations for the x direction and noise respectively, and ux and uy are length
scales in the x and y directions respectively. δ is the dirac delta function. A
separate kernel Ky is computed for the y direction and is defined similarly.
The kernel function represents the covariance between the set of points {x, y}
and {x′, y′}.
The predicted velocity of a motion pattern at a candidate new point
{x∗, y∗} is f ∗ = {ẋ∗, ẏ∗}with the distribution
f ∗ ∼ N(KT K−1∗ f ,K∗∗ − KT∗ K−1K∗) (2.4)
where K = {Kx,Ky}, K∗ = K(x, y, x∗, y∗), K∗∗ = K(x∗, y∗, x∗, y∗), and f represents
the training data {ẋ,ẏ} from the motion pattern.
The GP likelihood of a trajectory belonging to a motion pattern m j is
chosen as
∏nkp ∏nkp
l(tk,mk = m j|θ) = p(ẋk|tki ,m j, θ) · p(ẏk ki |t ,m j, θ) (2.5)
i=1 i=1
18
where θ are the GP hyperparameters of m j, and ẋki and ẏ
k
i correspond to the
ith point in tk.
An issue with using GPs as a likelihood metric is the planar shift prob-
lem [15]. In [15] a grid based approach is used instead of GPs as a likeli-
hood metric. However, a grid based approach loses the flexibility of GPs, so
in this paper the likelihood is adjusted to address the planar shift problem.
The GP only accounts for the differences in velocity distribution in (5), so
a weighting term to account for the positional distribution of trajectories is
added into the likelihood
∏nkp
l kw(t ,mk = m j|θ kj) = l(t ,mk = m k kj|θ j) · w(xi , yi ,m j, , β) (2.6)
i=1
( nm )β
w(xk, yk
j,
i i ,m j, , β) = 1 + (2.7)nmj
where  is a parameter for a neighborhood around {xk ki , yi }, β is a weighting
parameter, nmj, is the number of trajectories in m j that pass through the 
neighborhood surrounding {xk, yki i }, and w(xki , yki ,m j, , β) is a weighting term
that accounts for the positional distribution of tk with respect to m j.
Taking the log likelihood of lw gives
19
Figure 2.3: The  neighborhood of point {xk, yki i }. The  neighborhood is in
blue, m j is in green, {xk, yki i } is the red dot, and tk is in red. The
percentage of trajectories in m j that pass through the  neigh-
borhood of each point in tk is used to weigh if the positional
distribution of tk matches the positional distribution of trajec-
tories in m j
∑nk kp ∑np
log lw = log p(ẋ k|tki ,m , θ ) + log p(ẏ k kj j i |t ,m j, θ j)+
i=1 i=1 ∑ (2.8)nkp nmj,
β log(1 + )
n
i=1 mj
The positional weighting term w intuitively accounts for the percent of
trajectories of m j in the space near each point in tk, and weighs if the po-
20
sitional distribution of tk matches the positional distribution of trajectories
in m j. The parameter β is a tuning parameter for how much the positional
distribution is accounted for in the final likelihood.
The probability of a trajectory tk being assigned to an existing cluster m j
in the DPGP can be defined as
nmj
p(m k kk = m j|t ,m j, θ) ∝ − · lw(t ,mk = m j|θ) (2.9)α + nt 1
and the probability of being assigned to a new cluster is
∫
α
p(mk = mnm+1|tk,mnm+1, θ) ∝ − · lw(t
k,m = m
n 1 k nm+1
|θ)dθ (2.10)
α + t
The new motion pattern parameters, which are represented by raw trajecto-
ries, over which tk is evaluated for in equation (2.10) are randomly sampled.
Algorithm 1 provides pseudocode of the DPGP clustering process. In
practice, the algorithm generally converges quickly within five iterations.
2.4.1 Transition Point Estimation
Transition points are defined as spatial areas where motion patterns un-
dergo transition behavior, such as discrete branching and merging or dis-
crete changes in velocity. Transition points typically reflect discrete deci-
sions by the pedestrian when walking, such as branching or merging at a
21
fork in the road, or stopping at an intersection. Transition points separate
sub-trajectories when appropriate. Formally, we define a transition point as
a Gaussian distribution over spatial 2-D points where transition behavior
occurs.
Branching or merging is defined when sub-trajectories, or subsets of
Algorithm 1: DPGP Trajectory Clustering
Input: Trajectory dataset T
Output: Trajectories clustered into motion patterns M
Initialization : each trajectory tk is randomly grouped into one of nm clus-
ters, where nm is manually chosen as the initial number of trajectories
1: for iterations do
2: for i = 1 : nt do
3: for j = 1 : nm do
4: calculate p(m = m |titi j ,m j, θ)
5: end for
6: calculate p(mti = m inm+1|t ,mnm+1, θ)
7: assign ti to an existing cluster or start a new cluster based on the
highest calculated posterior.
8: end for
9: set nm = number of clusters
10: end for
11: return
22
the clustered trajectories, are shared. Given two motion patterns, the sub-
trajectory is found via a two-tailed hypothesis test. Consider a point q that
is part of a trajectory in motion pattern m1, and a second motion pattern
m2. It is desired to discover if m1 and m2 have a shared sub-trajectory, and
if q is a point in the sub-trajectory. The null hypothesis H0 and alternative
hypothesis HA are defined as
H0 : q = part of a shared sub-trajectory (2.11)
HA : q , part of a shared sub-trajectory (2.12)
The test statistic is calculated as
tq ∝ g(q, f q,2, f q,1, θ)w(q,m2, , β) (2.13)
g = p(q, f q,2| f q,1, θ) (2.14)
where the quantity tq is the test-statistic for q. f q,1 is the predicted distribu-
tion of velocity at q given the m1 GP, and f q,2 is the predicted distribution
of velocity at q given the m2 GP. The term w(q,m2, , β) is the same weight-
ing term from (2.7) If the p-value calculated from tq is less than the speci-
fied significance level σ, then H0 is rejected. In other words, given a single
{x, y} point q in m1, it is desired to find if q has a matching velocity and
positional distribution with m2. For each trajectory in m1, all points with
overlapping velocity and positional distributions with m2 are found, giving
potential sub-trajectories.
23
The branch or merge transition points are either at the beginning or
end of a sub-trajectory depending on whether the trajectories are joining
together or splitting apart. The points adjoining the sub-trajectories and
defined to be branch or merge points are fitted with a normal distribution.
Fig. 2.4 shows potential sub-trajectory points found for merging motion
patterns in the Duke MTMC dataset [71]. The sub-trajectory points found
intuitively appear in the region of overlap between the two motion patterns.
The spatial distribution over the location where the red pattern merges with
the blue pattern is a transition point.
2.4.2 Iterative Clustering
The iterative clustering process uses both DPGP clustering and sub-
trajectory discovery. The algorithm starts with raw trajectory data and no
cluster assignments. Next, 1) DPGP is performed to cluster the trajecto-
ries into representative motion patterns. 2) A pairwise comparison is per-
formed between sets of two motion patterns until a pair with a potential
sub-trajectory is discovered. 3) Once a sub-trajectory is discovered, the mo-
tion patterns are split based on where the sub-trajectory is into smaller local
motion patterns. 4) If no sub-trajectories are discovered then the algorithm
has converged. 5) If a sub-trajectory was found and split, return to step 1.
Algorithm 2 provides the pseudocode for the iterative clustering.
24
Figure 2.4: Discovery of sub-trajectories for the case of merging motion
patterns. There are two motion patterns in red and blue. Ar-
rows in the top left show the directionality of the motion pat-
terns. Circles on the trajectories indicate where the trajectory
data ends. Potential sub-trajectory points in the red pattern
are indicated with red crosses. The data is from video 3 of the
DukeMTMC project [71].
25
2.4.3 Transition Probabilities
Transition probabilities are learned over the transition points and motion
patterns in a transition probability matrix (TPM). Given the clustered data
from the iterative DPGP clustering, DBSCAN is first used to discover the
transition points. DBSCAN is a density based clustering algorithm [25].
The points at the start and ends of trajectories are inputs into DBSCAN and
Algorithm 2: Iterative Trajectory Clustering
Input: Trajectory dataset T
Output: Trajectories clustered into motion patterns of representative sub-
trajectories M
1: while true do
2: DPGP Clustering
3: find potential sub-trajectory between motion patterns m1 and m2 with
pairwise comparisons through hypothesis testing
4: if no sub-trajectories are found then
5: break
6: end if
7: split the trajectories in the compared pair of motion patterns into
smaller sub-trajectories corresponding to a new motion pattern
8: nm = nm + 1
9: end while
10: return
26
Figure 2.5: Results of the iterative DPGP trajectory clustering algorithm
and transition point estimation on simulated data with 90 tra-
jectories. The trajectories move in two directions, and the top
and bottom rows correspond to each of the directions respec-
tively. 2.5(a) is the initial clustering result before any splitting
of trajectories. 2.5(b) shows the iterative clustering result after
convergence. 2.5(c) shows the transition points found with 95%
confidence ellipses. The blue bubbles annotating 2.5(c) (bot-
tom) correspond to the states in Table I. The numbers 1 − 6 in
the blue bubbles correspond to the states s1 − s6 respectively.
the output is clusters of transition points. Normal distributions are fit over
the transition point clusters to find a spatial distribution.
The set of states S in the TPM model are defined as {s1, ...sh, ...sns}, where
sh is either a transition point or motion pattern and ns is the total number of
27
states. The TPM stores transition probabilities between the states in S . The
probability
di
p(s ji|s j) = (2.15)d j
is the i jth element of the TPM, d j is the number of elements or trajectories
in s j, and dij is the number of elements in s j that pass through si at a future
time step. The matrix contains probabilities for all possible transitions, short
term and long term.
Fig. 2.5 illustrates the iterative clustering results on a simulated data set
with 90 trajectories. The simulated dataset represents an intersection with
a total of six trajectory clusters that can be further divided into ten sub-
trajectory motion patterns and twelve transition points. The trajectories
move along the paths in two directions, which are processed simultane-
ously in the model. The initial DPGP clustering result is shown in fig. 2.5a.
6 motion patterns were initially found before splitting into sub-trajectories.
The circles represent the ends of trajectories. The trajectories move along
the paths in 2 directions. The top row and bottom row figs. correspond to
each of the directions respectively. Both directions are processed simulta-
neously and are separated in the fig. 2.5 only for clarity. Fig. 2.5b shows
the clustering result after running the iterative algorithm. 10 motion pat-
terns are found after iteratively splitting the initial 6 motion patterns into
sub-trajectory patterns. Fig. 2.5c shows the transition points discovered
with 95% confidence ellipses. The annotated blue bubbles in fig. 2.5c (bot-
28
Table 2.1: Transition Probabilities Between Transition Points
States s1 s2 s3 s4 s5 s6
s1 0 1 0 0.43 0.57 0.43
s2 0 0 0 0.45 0.55 0.45
s3 0 0 0 1 0 0.93
s4 0 0 0 0 0 1
s5 0 0 0 0 0 0
s6 0 0 0 0 0 0
tom) correspond to the states in Table I. The iterative clustering algorithm
successfully finds all ten motion patterns and all twelve transition points
after iteratively splitting the initial six patterns into sub-trajectory patterns.
The patterns and transition points match what intuition of the motion in the
scene should be.
Table 2.1 shows the TPM entries for the pedestrian states s1 − s6, which
are automatically extracted from the simulated data shown in Fig. 2.5. For
brevity, Table 2.1 only includes transition probabilities for the transition
points shown in Fig. 2.5(c) (bottom). The full TPM includes transitions
probabilities between all motion patterns and transition points. The val-
ues in Table 2.1 match intuition of what transition probabilities should be
between the states annotated in Fig. 2.5(c) (bottom).
29
2.5 Online Prediction of Motion and Anomaly Detection
Once the offline model has been trained, the model can be used for (1) online
predictive modeling and (2) online anomaly detection.
2.5.1 Predictive Modeling
Online predictive modeling makes use of both low level trajectory motion
patterns and high level transition points. Given a new observed trajectory
to, the algorithm attempts to cluster that trajectory with an existing motion
pattern. The model assigns to to the motion pattern mo with the highest
likelihood from (2.7). That is, to is assigned to mo satisfying
mo = arg maxm lw (2.16)j
where lw is the likelihood from (2.7). Because it is desired to cluster to with
local sub-trajectories, a sliding window is used in online clustering with
window size ws (in time steps of the observed trajectory data). As to is ob-
served, the last ws points within the trajectory are assigned to a learned
model motion pattern. Given the assigned motion pattern, the model as-
signs probabilities to future states S in the TPM.
Aside from the motion pattern assignment, to may also pass through the
spatial distribution of a transition point. If this occurs, the online model
30
continues clustering with the sliding window but gives precedence to the
predictive distribution of future states of the transition point.
2.5.2 Anomaly Detection
The same sliding window approach for predictive modeling is also used for
anomaly detection. The maximum likelihood in (2.16) is compared to a pre-
determined threshold lthresh to classify a sliding window trajectory segment
as anomalous. First, (2.16) finds mo which maximizes lw. Then, if lw > lthresh
the cluster assignment mo is kept for predictive modeling. If lw ≤ lthresh
the sliding window segment is classified as an anomaly with no predictive
states.
2.6 VALIDATION OF THE MOTION MODEL
For validation, the simulated dataset shown in Fig. 2.5 and the Duke MTMC
dataset [71] are used. The Duke dataset contains eight annotated videos
taken at 60 fps on the Duke University campus. Annotations include man-
ually labeled trajectories of pedestrians moving around a scene. Given the
walking speed of the pedestrians, the data is down-sampled to 2 fps without
losing any clarity in trajectories or motion patterns. The {x, y}measurements
defined in a global frame given by the Duke dataset are used as inputs to
the model.
31
In section 2.6.1, the robustness of the learned model to anomalous data
is evaluated using the simulated dataset in Fig. 2.5 due to the ability to
control the parameters of the simulation. The accuracy of online prediction
and the anomaly detection accuracy are evaluated using the Duke dataset,
given the challenges of an unstructured environment and not-ideal sensors,
in sections 2.6.2 and 2.6.3 respectively.
2.6.1 Robustness of the Iterative Clustering Algorithm to
Anomalous Training Data
To evaluate the robustness of the learned model, varying amounts of
anomalous trajectories are added to the simulated dataset (see Fig. 2.6).
Anomalous trajectories are similar to adding noise to the data for learning.
The amounts of anomalous trajectories added are in increments of 10% of
the size of the original dataset, until up to 30%. The resulting trajectory
cluster assignments are compared to the ground truth to find the percent
of trajectories that have incorrect cluster assignments. The results were av-
eraged over 10 trials for each increment of noisy data. Fig. 2.6 shows the
original simulated dataset of 90 trajectories with 30% anomalies added, to-
taling 117 trajectories.
Table 2.2 shows the iterative clustering algorithm is robust to anomalies
with an error of 9.1% with a training dataset of 30% anomalies when com-
pared with the ground truth. The ground truth takes labels from the fully
32
Figure 2.6: Simulated dataset of 90 trajectories with 30% added anomalies.
The original trajectories are in blue and the anomalous trajec-
tories are in red. The circles signify the ends of trajectories.
trained model in Fig. 2.5 and includes additional labels for anomalies. The
algorithm is still able to extract the primary underlying motion patterns in
Fig. 2.5 and iteratively split the trajectory data accordingly, while classifying
the majority of the anomalies into their own singleton clusters. The num-
ber of transition points found did not change with the added anomalies.
The presence of outliers did not have a significant effect on the underlying
33
distribution of motion patterns discovered by the algorithm.
Table 2.2: % Error for Different Amounts of Anomalies
Anomalies (%) 10 20 30
Error (%) 4.6 6.5 9.1
Figure 2.7: Trajectories used from the Duke MTMC dataset, video number
8. The 191 trajectories used to train the offline model in k-folds
validation are shown in blue with circles signifying the ends of
trajectories. 40 anomalous trajectories used for online anomaly
detection analysis are shown in red.
34
Figure 2.8: An example of the online predictive model. The observed tra-
jectory is shown in magenta. The current state, which includes
both transition points and motion patterns, is in green. Pos-
sible future states are in shades of blue, scaled by probabil-
ity, with darker shades corresponding to higher probability of
transition, and improbable future states are shaded in gray. In
2.8(a), the trajectory has just started out so there are a wide
range of possible future states. In 2.8(b), the trajectory moves
within the overlap of clusters, and the probability of transition-
ing to some states increases with a darker shade of blue, while
other states are eliminated. In 2.8(c) the trajectory has only one
possible future state to transition to, which is the confidence
ellipse shaded in black, the color corresponding to the highest
transition probability of one.
2.6.2 Pedestrian Prediction Accuracy
For analysis of pedestrian prediction accuracy, k-fold cross validation with
k = 10 was used. The model was evaluated on 191 trajectories extracted
from the Duke MTMC dataset, video number 8. For each fold, 90% of the
191 trajectories are used for training the model and the remaining 10% are
used to simulate online tracker readings of new trajectories; the trajectories
are shown in blue in Fig. 2.7. To evaluate the model, two metrics are in-
35
troduced. 1) Prediction Accuracy: the percent of time in the total observed
trajectory that the final state the trajectory will reach is part of the distribu-
tion of predicted states and 2) Transition Prediction Time (TPT): the number
of time steps after leaving a transition point needed for the model to accu-
rately predict the next transition point. The sliding window size ws = 6 was
manually chosen to balance resolution and long term characteristics. Pre-
diction accuracy of the proposed model is compared with a constant veloc-
ity prediction model as a baseline where the average velocity of the sliding
window is extrapolated to predict future locations of the pedestrian. TPT
is not applicable to a constant velocity prediction model, since high level
transitional description of trajectories is the proposed models novelty and
not available in the competing model.
The metrics were evaluated across the 10 folds and the results are shown
in Table 2.3. A statistical t-test is performed to test the significance of the
difference between predictor accuracies. The proposed model performed
significantly better than the constant velocity baseline (p < 0.01), yielding
the correct final state in the predicted distribution at 95% of the time on
average, compared with 39% for constant velocity predictions. For TPT,
the proposed model finds the next transition point within 1.1s on average,
i.e. 2.2 time steps at 0.5s measurement intervals, which is fast considering
the speed of pedestrians. The algorithm is able to forecast long term final
locations of new observed trajectories and can converge on future states fast
enough for online prediction.
36
Table 2.3: K-folds Cross Validation Results
Prediction Prediction
TPT(s):
Accuracy (%): Accuracy(%):
Metrics Proposed
Proposed Constant
Model
Model Velocity
Mean 95 39 1.10
Max 99 51 1.50
Min 89 31 0.85
Fig. 2.8 shows an example of the online predictive model when observ-
ing a new trajectory; the model is shown in one-directional flow of trajec-
tories for brevity. Note the trajectory patterns all move in two directions.
The observed trajectory is shown in magenta. The current assigned state
is shown in green. Possible future states are shown in blue with darker
shades indicating a higher probability of transition, and improbable future
states are shown in grey. In Fig. 2.8(a), the newly observed trajectory can
potentially transition to most future states. In Fig. 2.8(b), as the person
moves within the first transition point, the transition probabilities toward
the remaining future states are increased, and such future states are shown
in a darker shade of blue than in Fig. 2.8(a), while other improbable states
are eliminated. In Fig. 2.8(c), the observed trajectory can only transition to
one future state, which includes a transition point where trajectories leave
the scene. The model results match intuition and accurately predict the dis-
tribution of future states of the observed trajectory.
37
Figure 2.9: Online anomaly detection. The trained model is shown in gray.
If the trajectory is found to be aligned with a learned motion
pattern, it is highlighted in green. If the sliding window is de-
tected as an anomaly, it is highlighted in red. The segment of
the observed trajectory that is no longer in the sliding window
is highlighted in black.
2.6.3 Online Anomaly Detection
Each of the 10 folds from the split k-folds data is evaluated on accuracy of
detecting anomalous trajectories online. The same 40 anomalies, which are
separate from the split k-folds data, shown in red in Fig. 2.7 are evaluated
on each training fold. The trained model is evaluated based on the ability
to detect sections of trajectories that exhibit anomalous behavior based on
human labels of anomalies.
The average correct anomaly detection rate was very high at 95%, with
38
a standard deviation of 3%. The accuracy is measured based on the per-
centage of anomalous trajectories where all anomalous behavior is cor-
rectly identified by the model. Fig. 2.9 shows a visualization of the on-
line anomaly detection. Fig. 2.9(a) shows an example where the trajectory
is initially assigned to a motion pattern. As the trajectory moves forward
and additional info is collected (Fig. 2.9(b)), the trajectory is labeled an
anomaly once it diverges from learned motion patterns. In summary, the
online anomaly detection is able to both detect (1) whether a trajectory is
an anomaly and (2) which parts of the trajectory are anomalous with high
accuracy.
2.7 Model Complexity
The most computationally expensive part of the model is trajectory cluster-
ing with the GPs which is O(N3) due to inversion of matrices. Cholesky de-
composition is used to speed up inversion and reduces complexity to O( N
3
3 ).
The algorithm processes each new sliding window online in an average of
0.63s and is feasible to implement online. Experiments are done in MAT-
LAB on a machine with the specifications: Intel(R) Xeon (R) CPU E5-1630
v3 @ 3.70GHz, 3701 Mhz, 4 Cores, 8 Logical Processors, 40 GB RAM, 64-bit
Windows OS.
39
2.8 TAKEAWAYS
A probabilistic, two level pedestrian model is developed that captures low
level motion patterns using an iterative Dirichlet Process Gaussian Process
clustering model, high level transition points through hypothesis testing,
and transition probabilities between pedestrian states. Results show that
the model captures high level discrete behavior such as discrete pedestrian
decisions at an intersection, while also retaining low level clusters of tra-
jectories. The iterative model is able to learn both a high level pedestrian
decision model in conjunction with low level continuous motion patterns.
That is, high level transition point estimation is influenced by the low level
motion pattern clustering, and vice versa. Given a new observed trajectory,
the model can both quickly generate a distribution over future states and
also detect anomalous behavior. The novel motion model is a potentially
powerful tool for many applications, such as mobile robot path planning
to intercept a person of interest at a future time or identifying anomalous
pedestrian behavior at an intersection for autonomous driving.
40
CHAPTER 3
DEEPSEMANTICHPPC: HYPOTHESIS-BASED PLANNING OVER
UNCERTAIN SEMANTIC POINT CLOUDS
3.1 Motivation
Path planning for complex outdoor environments is challenging due to the
unstructured nature of environments that do not fall neatly into discretized
space. Moreover, different terrain surface types can be difficult to detect
with traditional sensing modalities. In indoor environments, a grid space
representation with lidar sensors is sufficient [79, 39, 53]. Outdoor environ-
ments exhibit complex geometries and surface types, which are difficult–
if not impossible–to differentiate using just lidar data. Therefore, a more
flexible scene representation, surface classification using computer vision
techniques, and reasoning about scene uncertainties are necessary.
Previous work for outdoor planning has focused on classifying ter-
rain and surface roughness using SVM classifiers [89, 58], neural networks
[13, 22], and various other computer vision techniques [59, 28]. While
these techniques can differentiate between simple terrain types, they do not
model the inherent uncertainties and ambiguities in complex scenes which
make it difficult to differentiate between terrain types (e.g., an offroad robot
driving through a patch of grass with small rocks). Many current outdoor
planning approaches still rely on grid maps which do not model the com-
41
Figure 3.1: The DeepSemanticHPPC pipeline. (1) In the first stage, initial
inputs generate a multi-hypothesis graph of possible paths. (2)
In the second stage, the uncertainty in the scene is reduced and
path costs are updated. (3) The second stage is repeated for a
set number of iterations. This is terminated early if a safe path
is confirmed or all paths are confirmed as unsafe. (4) A path is
selected.
plex geometry of an outdoor scene (e.g., a field with irregular bumps and
rocks) [65, 26]. Recent work models outdoor maps for planning with a point
cloud [48], which is more flexible and suitable for unstructured scenes; how-
ever, [48] uses traditional lidar sensing which cannot differentiate between
different surface types as broadly as computer vision.
In this paper we present DeepSemanticHPPC (Deep Semantic
Hypothesis-based Planner over Point Clouds), a novel algorithmic pipeline
42
for planning over uncertain semantic point clouds, which leverages a
Bayesian neural network (BNN) [29, 44] to extract principled estimates of
segmentation uncertainty. This allows our framework to reason about am-
biguous terrain as well as robustly handle false positive detections by tak-
ing additional measurements to reduce semantic uncertainty in the scene.
However, each measurement is costly due to the computationally expen-
sive nature of Bayesian neural networks operating on a robotic platform
with limited computing power. Our planner hence employs next-best-view
(NBV) techniques [7, 37, 21] to optimize for new measurements. DeepSe-
manticHPPC includes:
• the employment of a deep Bayesian neural network [29, 44] to obtain
surface and obstacle semantics with uncertainty estimates for unstruc-
tured outdoor environments;
• a flexible point cloud scene representation;
• a next-best-view planner which minimizes the uncertainty of terrain
semantics using sparse visual measurements;
• a hypothesis-based path planner (extending [48]) that proposes mul-
tiple kinematically feasible paths with evolving safety confidences
given the NBV measurements.
Experimental results with real environments show that our pipeline
plans safe paths in real-world environments where existing path planners
typically fail. Fig. 3.1 illustrates DeepSemanticHPPC. In the first stage,
43
a multi-hypothesis planner generates multiple hypotheses of possible safe
paths given a scene belief. In the second stage, a NBV function calculates
NBV poses and associated rewards. These poses and rewards are input to
a NBV selection block which selects the best feasible NBV. A BNN extracts
semantic segmentations and associated uncertainties from the NBV mea-
surement, which are used to generate a new scene belief. The new scene
belief reduces hypothesis uncertainty, and the second stage is repeated for
a set number of iterations. Finally, a safe path (hypothesis) with high con-
fidence is selected; if all paths are classified as unsafe, then no path is se-
lected. The algorithm terminates once a path is confirmed safe or all paths
are confirmed unsafe.
3.2 Background
3.2.1 RRT-Based Non-Holonomic Planning over a Point
Cloud
We build upon an existing rapidly-exploring random tree (RRT) [51] based
planner for finding kinematically feasible trajectories over non-planar point
cloud environments [48]. A core contribution of [48] is computing 6D poses
and trajectories on point cloud surface representations that are compliant
with the robot’s physical (e.g. maximum sustainable roll and pitch angles)
44
and kinematic constraints. Please refer to [48] for full details.
6D robot poses are expressed by transformation matrices belonging to
the Special Euclidean Group SE(3). A matrix TMR specifies the position and
orientation of a robot-fixed coordinate frame R expressed in a given refer-
ence map frame M. In particular, a pose expressed in coordinate frame A
can be stated in terms of the position and orientation of another frame B
w.r.t. A as

RAB tABT = 

AB 
ᵀ  ∈ SE(3),0 1
where RAB is a rotation matrix belonging to the Special Orthogonal Group
SO(3) and t 3AB ∈ R is a translation vector. Robot poses can then be denoted
by matrices TMR, which specify the position and orientation of a robot-fixed
coordinate frame R expressed in a given reference map frame M.
[48] considers the following planning problem: given start and goal
poses TMS, TMG, and a point cloudM = {mi} with mi ∈ R3, compute a con-
necting trajectory π : R>0 → SE(3). The trajectory has to satisfy a number of
constraints up to a given degree of approximation: contact with the terrain
surface, static traversability (e.g., bounded roll and pitch angles), and kine-
matic constraints – including bounded continuous curvature. Trajectories
are represented as piecewise continuous functions in the 6D space of robot
poses, and are specified by a sequence of nodes π̂ = [N k], where each N k
is a tuple (T k k kMRk , τ ,w , κ ). Here, TMRk is a 6D pose attached to the terrain
surface, τk ∈ [0, 1] is the associated static traversability value, wk is a pa-
45
rameter vector specifying a short planar trajectory segment connecting TMRk
to the next pose in the sequence, and κk is the curvature at the beginning
of the trajectory segment. wk specifies a trajectory segment as a cubic cur-
vature polynomial [63] evolving along the planar patch defined by the xy
plane of the coordinate frame R attached to TMRk . The end point of such a
trajectory segment gives the subsequent pose TMRk+1 through a projection on
the terrain surface via f : (M,TMR) 7→ TMR; f queriesM for the K nearest-
neighbors of the end point, which can be thought of as the points the robot
will lie on at TMRk+1 (K depends on the size of the robot and point cloud den-
sity). We use φ(N k+1) to denote such points. The actual start and goal poses
are also obtained by applying f to TMS and TMG.
Leveraging the above trajectory representation, [48] proposes to define
a small set of motion primitives (short trajectory segments) and use them to
grow two RRTs1, one from the start pose and one from the goal pose, and
iteratively try to connect them. Each new pose is associated with a nodeN k,
which is accepted in the tree only if τk > 0. Connections between close poses
belonging to different trees can be attempted by generating short trajectory
segments on demand. [48] also proposes a technique to derive a better tra-
jectory (in terms of smoothness and distance) starting from an initial one.
This second optimization stage is not explicitly considered in this work, be-
cause it is easily generalized and applied to all “safe” trajectories according
to our method.
1A standard approach in RRT-based planning [50].
46
3.2.2 Segmentation with Bayesian Neural Networks
Although segmentation networks for 3D point clouds exist [69, 70, 99, 67,
90], 3D data repositories are focused on object recognition and part seg-
mentation (e.g., [11]) or only contain a small number of scenes [32]. In con-
trast, existing large-scale image segmentation datasets [45, 10, 98, 4] contain
varied surfaces and obstacles in diverse real-world outdoor scenes. There-
fore, we leverage a state-of-the-art image segmentation network [14] (Sec-
tion 3.4.1), and update the point cloud environment from per-pixel image
segmentations (Section 3.4.2). Furthermore, we augment the network to es-
timate output uncertainty, allowing uncertainty in surface predictions to be
embedded in the point cloud. Uncertainty in surface type is used to guide
path-safety evaluation.
At inference time, forward passes with active dropout layers can be
interpreted as an approximation of the posterior distribution of model
weights of a neural network [29, 44]. The uncertainty of predictions are
computed by taking the sample standard deviation across multiple forward
passes. For each pixel (i, j)X of image X, the mean softmax vector over T for-
ward passes is:
1 ∑T
p(i, j)X s(i, j)= Xt (y|X) (3.1)T
t=1
where s(i, j)X (y|X) ∈ RC is the softmax output of the network. The correspond-
ing uncertainty vector σ(i, j)X on p(i, j)X is:
47
√√∑T ( (i, j)X )s (y|X) − p(i, j) 2Xt
(i, j)X t=1σ = − (3.2)T 1
In our framework, image X corresponds to a view with known camera
parameters. p(i, j)X and σ(i, j)X are combined with existing measurements for
each point m ∈ M that maps to pixel (i, j)X (Section 3.4.2).
3.3 Approach Overview
The planner presented in Section 3.2.1 does not leverage critical semantic
information about terrain types. In this work, we assume an initial point
cloud is given in the form M = {ei}, where each element ei is a tuple
(mi,pi, σi). Here, mi ∈ R3 as before; pi = (pi , pi , . . . , pi1 2 C) is a vector specifying
the probabilities that the point belongs to each one of the C possible seman-
tic classes (gravel, water, etc.); σi = (σi1, σ
i i
2, . . . , σC) is a vector specifiying
the uncertainties of pi, as discussed in Section 3.2.2. Points are initialized
with uniform semantic probabilities and maximum uncertainties. Updat-
ing the semantic point cloud is discussed in Section 3.4.2. Fig. 3.2(b) shows
a pointcloud obtained in a real (ambiguous) environment, where each point
mi is associated with the color corresponding to the class label j whose pij is
maximum.
We assume the semantic classes have been partitioned into two sets: the
safe set S (e.g., gravel, grass) and the unsafe set U (e.g., water, snow). For
48
(a)
(b)
(c)
Figure 3.2: An example point cloud. (a) Image view of a portion of the en-
vironment. (b) Point cloud colored with the most likely class
predicted from image (a) (bright green: “grass”; dark green:
“tree”; purple: “sidewalk”; dark grey: “road”; light grey: no in-
formation available). All the classes except “tree” belong to the
set S . The region around the tree is actually mulch/woodchips,
which should be classified as “dirt” (belonging to U). (c) Point
cloud colored to show safe (white), unsafe (black), unclear re-
gions R partitioned according to most likely class (random col-
ors), and points not associated with any class (orange).
49
∑ ∑
each point mi,√the points√are defined as pi i i i iS = j∈S p j, pU = 1 − pS = j∈U p j,
and σi
∑
= min( σi2
∑ i2
j∈S j , j∈U σ j ). Each point is then classified as:
• safe if piS − wσσi ≥ θs;
• unsafe if pi iU − wσσ ≥ θu;
• unclear otherwise.
Intuitively, this implies that points are safe/unsafe given high probability
(piS , p
i
U) and low uncertainty (σ
i), and unclear otherwise. wσ, θs, θu are de-
fined by the mission planner (with 1−θs < θu). We useMsafe,Munsafe,Munclear
to denote the partition of M obtained from the above classification. Note
that a point is labeled safe (unsafe) even with large uncertainty on piS (p
i
U),
provided there is small uncertainty on piU (p
i
S ). This is captured by the min
in the definition of σi. For example, the network may be uncertain between
gravel and grass (both safe), but it is sure that the point is neither water nor
snow (both unsafe).
Consider now a trajectory π̂ = [N k], and recall that φ(N k) denotes the
set of points on which the robot lies when at pose TMRk . Depending on the
semantic information initially available, it might be very difficult –if not
impossible– to immediately find a trajectory whose node points φ(N k) all
belong toMsafe. DeepSemanticHPPC works in two stages:
1) Compute a set of candidate paths traversing different unclear regions,
and
50
2) Reduce the uncertainty of such paths by taking new views in the prox-
imity of the robot’s starting position of the most promising path.
We relax the path planning problem in a natural way – instead of reach-
ing a specific goal pose, we require the robot to reach a goal pose region
G defined around TMG. Then, the points in Munclear are organized into a
set R of unclear regions. To build the set R, we use the following two-stage
clustering process: first, DBSCAN [25] performs a large-scale clustering of
the points inMunclear, obtaining a set of large unclear regions R̂. Then, the
points of each r̂ ∈ R̂ are further partitioned according to their most likely
class (treating the points not associated with any prediction as belonging to
a special class); DBSCAN is called again on each partition. Fig. 3.2(c) shows
the result of this process on our example with θs = 0.9, θu = 0.3,wσ = 3.
The remaining task is to compute a set of candidate paths from TMS
to G. Section 3.4.3 presents a variant of a standard RRT algorithm which
constructs multiple hypothesis for the safest path, traversing different un-
clear regions. These paths are stored in the directed graph G = (V, A),
where each v ∈ V is associated with a potential trajectory node N v and
each a ∈ A represents the existence of a short trajectory segment con-
necting∑two poses2. With a slight∑abuse of notation, we define pvS =1 i v 1 i
|φ(TMRv )| i∈φ(T v )
pS and σ = | (T )| i∈φ(T v ) σ for each v ∈ V . The costMR φ MRv MR
for each no∑de is based on how far it is from satisfying our safety constraint:
p̄v = 1 i i| (Nv)| i∈φ(Nv) min(1,max(0, θs − pS + wσσ )) for for each v ∈ V . However,φ
2G can be thought as a generalization of the trajectory representation π̂ of Section 3.2.1.
51
if the node region sufficiently intersects with an unsafe region, the cost is
infinite; and if the node lies entirely in a safe region, the cost is zero.
Summarizing, c(v) is defined as:
0 if |Msafe ∩ φ(N
v)| = |φ(N v)|
c(v) =  v
∞ if |Munsafe ∩ φ(N )| ≥ φv (3.3)
p̄v otherwise,
where φv is a user-defined threshold. The first condition can be relaxed for
the nodes in close proximity of the starting pose. Although a vertex with
infinite cost is never obtained when the candidate paths are initially com-
puted, its cost might tend to infinity when additional views are taken dur-
ing the NBV stage (Section 3.4.4). A predefined number of NBV iterations
are run. At each iteration, the m most promising (shortest) paths accord-
ing to the above cost function are computed by a k-shortest paths algorithm
(we use Yen’s [94]). The associated vertices v such that 0 < c(v) < ∞ are
then considered in the function that computes the best additional view. The
value of m is also decided by the mission planner: m = 1 corresponds to an
aggressive setting, while m > 1 is preferred given a large temporal budget
for taking additional views.
52
3.4 Technical Details
3.4.1 Predicting Semantic Labels
Dataset
We curate a segmentation dataset for outdoor navigation in unstructured
environments from the existing large-scale COCO panoptic dataset [45].
First, images of unstructured outdoor scenes are selected using a Places365
[97] classifier. An image is kept if (a) the classifier’s top one (highest) pre-
diction is an unstructured outdoor category with > 50% probability, or
(b) two or more of the top five predictions are unstructured outdoor cat-
egories. Second, the 133 categories in COCO panoptic are merged: (a) all
outdoor terrains (e.g., grass, dirt, snow, pavement) are retained; (b) ob-
stacles are merged into four categories: fixed obstacles (e.g., buildings),
moving human-made obstacles (e.g., vehicles), humans, and animals; (c)
all indoor categories are removed. Our final dataset consists of 34K train-
ing images with 22 categories. Our navigation segmentation categories
(from COCO panoptic), and filtered list of COCO panoptic images are at:
https://deepsemantichppc.github.io
For our network architecture, we use DeepLabv3+[14] with Xcep-
tion65 [17] backbone augmented with dropout in the middle and exit flow
blocks for semantic segmentations. For details, refer to [14, 17]. ASPP atrous
53
rates are set to 6,12,18 for output stride 16 and 12,24,36 for output stride 8
as in [14]. Training uses output stride 16 while inference uses output stride
8.
Training Details
We initialize from network pretrained on ImageNet [?] + MSCOCO [?] +
Pascal VOC [?], and train for 160000 iterations using SGD with momentum
with batchsize 4 on our dataset. Initial learning rate 0.01, momentum 0.9,
and polynomial learning rate decay with power 0.9 are used. Images are
resized to 513x513.
Inference. At inference time, 50 forward passes are used to predict seman-
tics and uncertainties (Eqs. (3.1)-(3.2)).
3.4.2 Associating Semantic Labels to a Point Cloud
Given a viewpoint with known pose and camera intrinsics, image segmen-
tation probabilities and uncertainties are mapped to the point cloud. To
map pixel (i, j)X:
1. Estimate depth map DX for view X.
2. Each pixel is backprojected to a single point corresponding to the cen-
ter of the projected pixel. This approximation does not hold for pixels
54
with very large depths, but typically produces good results. Backpro-
jected pixel point m̃(i, j)X is computed as:
m̃(i, j)X (i, j)= PX[DX I3×3|[0 0 1]T ]T K−1X [i j 1]T (3.4)
where [i j 1] are pixel coordinates in homogeneous coordinates, KX is
3 × 3 intrinsic matrix, D(i, j)X is depth at pixel (i, j), and PX is 4 × 4 pose
matrix for view X.
3. Each backprojected point is merged with its nearest neighbor mnn in
the point cloud M within threshold distance R. If a backprojected
point does not have a neighbor within the threshold, the point is dis-
carded
4. p(i, j)X and σ(i, j)X are merged with existing measurements for point mnn.
The combined measurement is the best linear unbiased estimator un-
der the simplifying assumption that the per-class predictions are in-
dependent.
Given a set of K measurements {(p(k),σ(k))}, the combined measure-
ment (p̃, σ̃) for class c is:
(
1 ∑ √K ∑K )
p̃ = w(k) p(k) , σ̃ (k) 2 (k) 2c Z c c
= (w ) (σ )
k=1 k=1
(k) −2 ∑
where w(k)c = ∑(σc )(k) , Z s.t. p̃c′ = 1 (3.5)(σc′ )−2′ c′∈Cc ∈C
55
3.4.3 RRT-Based Multi-hypothesis Planner
The algorithm to compute G = (V, A) starts by building an initial RRT from a
root vertex vs associated with the projected start pose TMS via a predefined
set of motion primitives. Instead of building two RRTs as is done in [48], we
build a single RRT from the start pose and bias the sampling to the point
that is closest to the projected ideal goal pose. Sampling is performed on
the points in M not lying in the forbidden points set F , which is initialized
as F ← Munsafe. Once the first path π̂ = [N v] is found, the algorithm exam-
ines which regions in R are traversed by the vertices of π̂ by checking their
intersections with each set of points φ(N v). Except for regions containing
points belonging to the K-nearest neighbors of TMS and TMG, all regions tra-
versed by π̂ are placed into the removal candidates set C. A heuristic is then
used to decide which region(s) should be removed from subsequent plan-
ning stages. In this work, we simply remove the largest region ĉ, and F is
updated as F ← F ∪ ĉ. When ĉ is removed, the RRT vertices lying on it, in-
cluding those of π̂, are removed from the RRT. The algorithm then proceeds
to optimize and find a new path to the goal, by continuously expanding the
RRT component containing vs. This time, however, the algorithm also tries
to connect (by computing ad hoc trajectory segments) new vertices to those
that were disconnected from the RRT due to the removal of ĉ. When a new
path is found, the process repeats. If at any iteration, the algorithm finds
a path only contained in the regions of TMS and TMG, it can be restarted
with a different random seed. The two graphs are merged at a later stage.
56
Figure 3.3: An example graph G = (V, A), with poses. Vertices with c(v) = 0
are in green, while vertices with 0 < c(v) ≤ 1 are in red (the
darker, the closer to 1). The blue, green, and red axes indicate
the pose of the robot.
Fig. 3.3 shows an example multi-hypothesis graph computed on the exam-
ple of Section 3.3 (Fig. 3.2).
If any path computed on the multipath graph G = (V, A) has zero cost,
the robot can start following that path since all the underlying points lie in
Msa f e. Otherwise, the robot enters the NBV stage described in section 3.4.4.
3.4.4 Next-Best-View (NBV) Planning
A set of n viable NBV poses Tviable = {TMV1 , . . . ,TMVn} is computed by grow-
ing a RRT starting from TMS. The candidate poses Tviable are a subsample of
57
the RRT vertices. All points lying withinMunclear are treated as unsafe, and
sampling is performed on the safe points lying within a given radius r from
TMS. The robot should not travel too far to take a new view; otherwise it is
more appropriate to follow the most promising path of G = (V, A).
A reward J(TMV j) is calculated for each candidate pose TMV j . The NBV
pose TNBV is selected by picking the pose with the highest value of J that
also allows a safe path back to a safe pose from which all the paths can be
followed.
J(TMV j) = βdD + βγγ + βvisNvis + βQQ, (3.6)
where D is a distance metric, γ is a viewing angle metric, Nvis is the num-
ber of visible vertices from TMV j , and Q is the average information gain
over visible vertices from TMV j . The weights βd, βγ, βvis, βQ sum to 1 and
D, γ, Nvis, and Q are normalized. Distance and change in viewing angle are
used based on the assumption that closeness and view diversity will reduce
vertex uncertainty. Nvis puts more weight on candidate poses which have
higher chances of reducing the uncertainty of multiple segments of the mul-
tipath graph G = (V, A). Q represents the expected reduction in uncertainty
in the graph vertices given TMV j . Each of these components are defined as
follows.
Begin by defining the set of vertices v ∈ VNBV, where VNBV is the set of
unclear vertices belonging to the m most promising paths (see the end of
58
Section 3.3). For each candidate pose TMV j ∈ Tviable, only vertices visible
from TMV j are considered in calculating the reward. Visible vertices are de-
fined as vertices which occupy greater than a predefined number of pixels
in the image plane rendering of the point cloud from TMV j . Visible vertices
are added to the set v ∈ Vvis, j, where Vvis, j ∈ VNBV.
To calculate D, the distances from TMV j to v ∈ Vvis, j are normalized. Since
a lower distance should correspond to a higher reward, we subtract the nor-
malized distances from 1. To calculate γ: the set of negative cosine distances
of the angle between TMS and TMV j to v ∈ Vvis, j are used. Nvis is the size of
Vvis, j.
The information gain metric Q(v) is calculated for each v ∈ VNBV. Q(v)
represents the expected reduction in uncertainty for each v ∈ VNBV, and is a
function of the visibility and uncertainty of the points lying in φ(N v).
The visibility I(v,TMV j) of a vertex v ∈ VNBV, given a candidate pose TMV j ,
is the pixel coverage of φ(N v) in the rendered image plane of TMV j . Per-
point bounding squares of size equal to half the point cloud resolution are
used to compute occlusions and pixel coverage for surface points. The num-
ber of pixels that φ(N v) occupies in the rendering is the predicted visibility
I(v,TMV j) of v at TMV j .
The uncertainty σ(v) of a vertex v ∈ VNBV is the average sum of the un-
certainties of the points in φ(N v):
59
1 ∑ ∑C
σ(v) j= |φ(N v)| σi (3.7)
i∈φ(Nv) j=1
The information gain metric Q(v) of vertex v ∈ VNBV can be formally written
as
Q(v) = αiI(v,TMV j) + ασσ(v), (3.8)
where αI , ασ are weights that sum to 1 and I(v,TMV j) and σ(v) are normal-
ized.
3.5 Validation
3.5.1 Validation Scenes and Overview
For validation, we use both a simulated scene and a real scene. The sim-
ulated scene is generated using a prebuilt environment from AirSim [75].
Fig. 3.4 shows a top down view of the Africa environment from AirSim
which includes five different semantic classes. The simulated scene is used
for testing the efficacy of the multipath planner in planning a diverse set of
paths on a point cloud map (Section 3.4.3). A point cloud map of a subset
of the Africa environment is generated and subsampled to 20cm resolution.
60
(a)
Figure 3.4: The prebuilt Africa environment from Airsim used for testing
the multipath planner.
We implement our full pipeline in the AirSim simulator [75]. How-
ever, preliminary experiments show that the BNN trained on the real-world
dataset (Section 3.4.1) performs poorly on the synthetic AirSim environ-
ment. The full pipeline with a simpler BNN trained on the AirSim envi-
ronment and corresponding experimental results are shown in the accom-
panying video.
For real world validation, we collect data of two different scenes using
the ZED stereo camera from Stereolabs. The first scene is next to the Mann
Library in Cornell University (Fig. 3.2(a)) and the second scene is at Cass
Park in Ithaca (Fig. 3.5). These scenes are selected due to their varying
61
Figure 3.5: Left: Image from Cass Park in Ithaca. There are multiple dif-
ferent terrains in the scene including grass, mud, and water.
Right: Annotated safe (blue) and unsafe (red) regions for the
Cass Park point cloud.
(but common) terrain types. The scenes are representative of common un-
structured outdoor environments. The ZED camera API is used to extract
depth maps and generate point cloud reconstructions of the scenes. Seman-
tic segmentations and uncertainties are acquired and mapped to the point
cloud using techniques described in Sections 3.4.1, 3.4.2. The videos used
for reconstruction are recorded at 60 fps with 720p resolution. The data
from real world scenes are used for evaluating uncertainty reduction us-
ing the NBV function and for evaluating the safety of the paths planned by
the full pipeline. For these scenes, we heuristically select a set of candidate
NBV poses instead of growing a RRT from TMS, and assume a path between
these poses and the start pose exists. Candidate NBV poses are selected to
be oriented in the general direction of the goal while covering a wide range
of the scene. Our method (and baseline methods) choose NBVs from the set
of candidate NBV poses.
62
3.5.2 NBV evaluation
To evaluate the performance of the NBV function, we examine the change
in uncertainty of the path vertices as the number of NBVs increases. The
complete NBV reward function (Eq. 3.6) is compared against: (a) random
selection, (b) geometry-only reward, and (c) uncertainty-only reward. For
the geometry-only NBV reward, we set {βQ} to zero, and for the uncertainty-
only NBV reward, we set {βd, βγ, βvis} to zero. The change in uncertainties
summed across all the classes and points for each path vertex averaged over
500 trials is shown in Fig. 3.6. In our experiments, the full NBV reward
weights are set as follows: {βd = 0.4, βγ = 0.05, βvis = 0.25, βQ = 0.3, αI =
0.5, ασ = 0.5} for the Mann Library scene, and {βd = 0.15, βγ = 0.05, βvis =
0.2, βQ = 0.6, αI = 0.3, ασ = 0.7} for the Cass Park scene. A higher weight
is assigned to uncertainty terms for the Cass scene because the boundaries
between surface types (e.g., water, mud, grass) are more ambiguous than in
the Mann scene.
For both scenes, the full NBV reward function consistently achieves the
lowest uncertainty with two or more NBVs (Fig. 3.6). This illustrates the
importance of both geometry and uncertainty terms in the reward function.
In the Mann scene, the baseline reward functions converge to a higher un-
certainty than the full reward function. Due to the small size of the scene
and the nature of all candidate NBVs being oriented towards the goal pose,
random selection performs quite well. It initially outperforms uncertainty-
only which does not take into account point visibility or viewpoint diver-
63
Figure 3.6: Change in uncertainty of path vertices (y-axis) as the number
of NBV measurements increase (x-axis). Left: Mann Library.
Right: Cass Park.
sity, while random sampling implicitly selects diverse views. In the Cass
scene, the baseline reward functions converge to a higher uncertainty than
the full reward function, except for uncertainty-only which converges to the
same point as the full reward function. The overall uncertainty in the BNN
predictions are much higher at Cass park, so heavily weighting the uncer-
tainty in the reward function performs well. In the Mann scene, geometry-
only outperforms uncertainty-only, whereas the opposite result holds for
the Cass scene. Mann has less inherent ambiguity so geometry terms are
more important, while Cass has more ambiguous regions so uncertainty
terms are more important.
64
3.5.3 Path Safety Evaluation
To evaluate the real world application of DeepSemanticHPPC, we study
the safety of selected paths. We annotate a point cloud (Fig. 3.5 right) with
MeshLab [18] into ground truth safe and unsafe regions. For Mann library
mulch (dirt) is labeled as unsafe, and for Cass Park mud (dirt) and wa-
ter are labeled as unsafe. Any path which contains a vertex that overlaps
with the unsafe region with over Nunsafe points is unsafe. We set Nunsafe = 4.
Two baselines are considered: (a) B1: planning without semantic informa-
tion (based on [48]) and (b) B2: planning with semantic information from
a single initial view without taking any NBV measurements to reduce path
uncertainty. We also study the performance of DeepSemanticHPPC as the
number of NBVs increase.
Table 3.1 shows the path safety results for the Mann Library and Cass
Park scenes over 500 trials. Start and goal poses are varied for each trial
within a one meter radius of reference poses. A path is confirmed safe (CS)
if it has cost zero. A graph G = (V, A) is confirmed unsafe (CN) if all paths
have infinite cost. Since both safe and unsafe terrain surfaces can be geomet-
rically similar, baseline B1 cannot reliably avoid unsafe semantic regions.
Baseline B2 performs significantly better than B1 because of the inclusion
of semantic surface types in the planner. However, because the semantic
segmentations can be incorrect, especially in regions with high uncertainty,
this planner still plans over unsafe terrain.
65
Mann B1 B2 1N 2N 3N 4N 5N
Safe % 0 23.4 81.6 79.8 81.4 83.6 86.0
Unsafe % 100 76.6 18.4 15.4 11.2 6.6 3.8
CS % N/A N/A 0 0 4.0 18.0 28.0
CN % N/A N/A 0 4.8 7.4 9.8 10.2
Cass B1 B2 1N 2N 3N 4N 5N
Safe % 13.2 33.4 54.0 57.4 57.4 59.0 59.2
Unsafe % 86.8 66.6 46.0 41.6 39.8 37.4 36.6
CS % N/A N/A 0 0 0 0 0
CN % N/A N/A 0 1.0 2.8 3.6 4.2
Table 3.1: 500 trials of path safety evaluation. The columns are the path
planning methods used: B1 is the planner based on [48], B2 is
the variant of our framework which does not utilize any NBVs,
XN is DeepSemanticHPPC (ours) with X NBVs. The rows are
the metrics: Safe is the number of trials where a safe path is
selected, Unsafe is the number of trials where an unsafe path
is selected (lower is better), CS is the number of trials where
a safe path is confirmed, CN is the number of trials where all
multipaths are confirmed as unsafe (and no path is selected).
Our DeepSemanticHPPC framework is significantly better than the two
baselines. NBVs allow the planner to discard unsafe paths as semantic un-
certainties decrease. With just one NBV, the percentage of safe paths taken
increases drastically from 23.4% to 81.6% (Mann) and increases from 33.4%
to 54.0% (Cass). As NBVs increase, the percentage of safe paths selected
generally increases while the percentage of unsafe paths selected decreases.
With 5 NBVs, 86% (59.2%) of paths selected for Mann (Cass) are safe, and
only 3.8% (36.6%) of paths selected for Mann (Cass) are unsafe. With 5
NBVs, uncertainty is sufficiently reduced so that in 10.2% (4.2%) of trials,
all multipaths are confirmed to be unsafe for Mann (Cass) and no path is
66
selected. The complexity of the Cass scene is reflected in these results.
3.6 Takeaways
In this paper we present DeepSemanticHPPC, a novel framework for plan-
ning in unstructured outdoor environments while accounting for uncertain
terrain types. Our experiments show DeepSemanticHPPC reduces seman-
tic uncertainty in planned paths and increases the safety of paths planned
in environments with unsafe terrains. We plan to implement DeepSeman-
ticHPPC on a robot for physical experiments. Other interesting directions
include exploring the ability to build the point cloud online and incorporat-
ing geometric uncertainties.
67
CHAPTER 4
PLANNING THROUGH UNKNOWN SPACE BY IMAGINING WHAT
LIES THEREIN
4.1 Motivation
Planning paths for robots in environments with unknown spaces, such as
from occlusions, is a challenging task due to the difficulty in modeling what
actually lies therein. The customary approach to tackling this challenge in
traditional grid-based representations is to assume any unknown cells as
free. Any discrepancy between such an optimistic assumption and the ac-
tual map developed while executing the path and collecting information is
used to trigger the computation of a new plan [47].
The very nature of this problem is online, making an approach based
on constant replanning reasonable. We humans typically use scene con-
text to reason about unknown spaces. To reduce running into dead ends,
humans predict possible realizations of unseen space and plan paths accord-
ingly, thereby reducing the number of replans. The goal of this work is to
endow robots with a way of reasoning about unknown spaces, as humans
do, in order to plan more efficient paths.
Our approach, sketched in Figure 4.1, takes as input a partial map in
the form of a point cloud with semantic annotations, which can be obtained
from a single lidar scan or a single stereo image pair. The semantic point
68
cloud is converted into a bird’s eye view (BEV) image, and the input into
an image inpainting neural network [95]. The network fills in empty pixels
with semantic labels in order to obtain a reasonable prediction of an obstacle
map for path planning.
Figure 4.1: from left to right. Current partial map represented as a point
cloud annotated with semantic labels (e.g. purple = road);
bird’s eye view of the annotated point cloud, where white cor-
responds to unlabeled pixels whose semantic class will be pre-
dicted via image inpainting; inpainted image; obstacle map as-
sociated with the inpainted image (white = free space, black =
obstacles), shown along with obstacles discovered online (red)
and final path to the goal (blue) in one of our simulations.
The main contributions of this paper are:
• a novel framework for modeling unknown/occluded spaces for path
planning using image inpainting to generate more informed maps
• the generation of a dataset for training an image inpainting network
suitable for planning in outdoor urban environments.
The proposed framework is compared against a typical planner which
treats all unknown space as free. Experimental results demonstrate the
69
novel framework greatly reduces (1) the lengths of final paths which are
traversed through occlusion heavy environments, (2) the number of replans
online, and (3) the number of times the planner replans before it discovers
no path is possible.
4.2 Related Works
Planning paths for mobile robots in the presence of a fully known, com-
pletely deterministic map is a well-understood problem in robotics. The
classical approach leverages a grid-based representation of the environ-
ment, and uses a graph-search algorithm to plan a distance-optimal path
to the goal. Among these, A* [35] is the most widely used due to its effi-
ciency. When the map is not fully known in advance, the customary ex-
pedient for adapting classical pathfinding algorithms is that of treating all
the unknown cells as free, and computing new plans as needed. New al-
gorithms were developed to handle the replanning phase more efficiently
(D* [81] and its variants [82, 47]); however, unknown space has always been
treated as free up to the present day (see, for example, [73, 86]).
Learning-based approaches for navigation in partially unknown envi-
ronments have started being explored only recently. Ref. [80] focuses on
partially mapped indoor environments represented as grids, and presents a
method for selecting the “frontier” (i.e. the boundary between free and un-
known space) to guide the robot in order to reach the goal. A utility function
70
for the frontiers is learned via a neural network, but can only be used in in-
door environments and with traditional (dense) partial grid maps. Ref. [33]
presents a next-best view planning approach that addresses unknown ter-
rain semantics; however, the geometric map of the environment is assumed
to be known in advance. This problem is also being actively investigated by
the deep reinforcement learning community [61, 84, 96, 42]. These works
give up explicit map representations in favor of other features learned from
experience. While these methods demonstrate promising initial results,
they only work on simple indoor environments [84, 42] or mazes [61, 96].
We argue that such a sophisticated approach is not actually required for en-
abling more robust navigation in realistic, partially unknown environments.
Pairing standard pathfinding techniques with reasonable inferences of the
unknown portions of the map provides a better approach to address this
problem.
In this work, we use an image inpainting network [95] to fill in a bird’s
eye view (BEV) of an initial sensor measurement of an urban environment.
The initial measurement has unobserved regions due to occlusions and sen-
sor resolution, which make robust planning difficult. By filling in these re-
gions with a data-driven inpainting network, a more informed map can be
used for planning. While there are other works demonstrating the use of
image inpainting to model urban environments [57, 66, 74], these papers
do not actually demonstrate their use in path planning tasks. In addition,
these works inpaint behind segmented foregound object classes, which can
be restrictive as occluded regions behind background classes can also be in-
71
formative. For example, a building may be labeled as a background class,
but it would be useful for a planner to reason about what is behind the
building.
4.3 Technical Approach
Figure 4.1 shows the flow chart of the proposed framework. Initially, a raw
image is taken. Depth data is acquired either through stereo depth or from
a lidar scan. Semantic segmentation [14] is then performed on the image or
lidar point cloud [3]. The depth and semantic labels are mapped to a point
cloud which is rendered from a BEV for use as a map for a planner. Due
to geometries in urban scenes, there will likely be unobserved pixels due to
occlusions or sensor resolution. These unobserved pixels are then filled in
using general adversarial network (GAN) based inpainting. The inpainted
output of the network is then used as a more informed map for a planner.
4.3.1 Image Inpainting
Image inpainting is a technique traditionally used for restoring old or dam-
aged pictures and paintings. The goal is to modify the images to be visually
realistic to a viewer and restore the original state of the image [23, 5]. Gen-
erally, the user labels regions of the image to be filled in, and the inpainting
algorithm automatically fills in those regions. Image inpainting techniques
72
include patch-based techniques, diffusion-based techniques, the use of con-
volutional neural networks (CNN), and the use of generative adversarial
networks (GAN) [95, 23]. GAN-based inpainting networks simultaneously
train a generative network to hallucinate images and a discriminative net-
work to evaluate the hallucinated images. GAN-based methods are used in
this paper due to their recent superior performance [95, 93].
The GAN-based network used in this paper improves on previous
work by introducing a contextual attention layer that utilizes distant image
features during training to improve inpainting performance by reducing
blurry textures and distortions [95]. Typical inpainting applications include
filling in missing parts of an outdoor scene or a persons face. In this work,
the goal is to fill in a bird’s eye view (BEV) of a semantic map such as the
one pictured in Figure 4.1 (middle left).
A set of C semantic classes are defined for the input and output images.
Each pixel is defined as pc(i, j), where i and j are the image coordinates
of the pixel and c is the semantic class of the pixel. In the input image the
pixel can additionally be part of the unobserved class due to sensor range or
occlusions. The goal of inpainting is to fill in the unobserved pixels (white)
with a color corresponding to one of the c semantic classes. The inpainted
image is then used as a more informed map for the path planner.
To use the inpainted image as a map for planning, the set of C seman-
tic classes are divided into subsets of obstacle classes Co and non-obstacle
classes Cn, where C = Co ∪ Cn and Co ∩ Cn = ∅. Pixels belonging to Co are
73
labeled as obstacles and pixels belonging to Cn are considered free space.
4.3.2 Path Planner
Any planner suitable for traditional grid-based planning can be used in
our framework; as the inpainting will improve the performance of each ap-
proach (for example, A* [35], Theta* [20], or D* [47]). A* [35] is used in the
experimental section of this paper given its extensive usage. The inpainted
image is converted into a 2D obstacle map where each pixel is a node. Pixels
belonging to Co (such as wall or vegetation) are labeled as obstacles (black),
and pixels belonging to Cn (such as sidewalk or road) are labeled as free
space (white). As the robot executes the path, the occupied and free spaces
are continuously recomputed (e.g. at a given frequency) based on the most
recent sensory information (and hence obstacle map). Local planning in-
corporating the robot’s kinematics (and possibly dynamics) can be easily
performed by setting a local goal along the computed path, and is out of
the scope of this work.
Kinematics and dynamics of the robot are not considered at this stage of
planning, as the impact of inpainting can be evaluated without this level of
detail. However, kinematics and dynamics can be easily taken into account
locally by iteratively defining a close cell along the planned path as a local
goal, and by using, for example, a sampling-based planner such as RRT [52]
or RRT* [43] to compute a feasible path to such local goal or its immediate
74
surroundings.
At every step the robot can transition to any of the six neighboring pixels
that are not an obstacle. Kinematics and dynamics are out of the scope of
this paper not considered for the planner in the experimental section.
4.3.3 Planning with Lidar Data
In the proposed model, a dataset is required which includes pairs of data
samples: an initial observation (Figure 4.2 (left)), and a filled in ground truth
semantic map (Figure 4.2 (right)). The ground truth map is used for network
training. Our framework (Figure 4.1) can be used with any lidar sensor that
creates a semantically segmented point cloud [68, 3]. The key challenge is
labeling ground truth filled in semantic maps for network training. Given
adequate training data, the framework can easily be applied to any envi-
ronment with unobserved space with any grid-based path planner.
While there are several datasets with semantically labeled images or
point clouds [31, 19], they do not provide the density from a BEV to gen-
erate ground truth semantic maps (Figure 4.2). Datasets which do provide
BEV labels [9] generally do not provide ground truth semantic measure-
ments as shown in figure 4.2.
75
Lidar Urban Dataset Generation
The recent SemanticKITTI dataset [3], provides the tools to generate a
dataset with pairs of data samples such as the one in Figure 4.2. More
specifically, for each of the 22 sequences in the KITTI odometry dataset [31],
SemanticKITTI conglomerates all the lidar scans together into a single point
cloud and manually labels 28 semantic classes. The point cloud is then vox-
elized and used for voxel-based semantic scene completion tasks. After ev-
ery five time steps in each sequence, a ground truth semantic scene comple-
tion voxel map is provided within a predetermined volume of 256×256×32
voxels (with voxel resolution of 0.2m) in front of the car. The SemanticKITTI
dataset also provides individual lidar scans within the same volume for
each time step in each sequence along with the manually labeled semantics.
The semantic lidar scans and completed voxel maps are shown in Figure
4.2 with voxels converted to 3D points. These images are rendered from
a BEV for the path planner applications. Semantic classes for moving ob-
jects and unidentifiable pixels are removed from the renderings as a static
environment is assumed.
The goal of inpainting is then to fill the empty spaces in Figure 4.2 (left)
with the semantics in Figure 4.2 (right); these unknown spaces are a func-
tion of occlusions or sensor resolution. To be compatible with semantic la-
bels generated from a color camera image, only points which would be seen
from a camera frame are kept. In order to do this with the SemanticKITTI
dataset, all 3D points in the scan and completed map are projected into the
76
Figure 4.2: Left: The semantic input point cloud of an initial lidar scan.
Right: The ground truth completed semantic map.
camera frame:
x = Tr ∗ P ∗ X (4.1)
where x is the coordinates of a point in the image frame, Tr is the trans-
formation from lidar coordinates to the rectified camera coordinates, P is
the projection matrix after image rectification, and X represents the coordi-
nates of a point in the lidar frame. All transformation matrices are provided
by the KITTI dataset [31]. Only points which are projected into the image
frame are kept. The points are then reprojected into lidar coordinates.
An example of a time step from the generated dataset is shown in Fig-
ure 4.3. Because the GAN-based inpainting network uses all white pixels as
pixels to be inpainted during training, when rendering the semantic point
clouds from a BEV, all unknown pixels are rendered as grey. The adversarial
network trains itself by blanking out sampled rectangles from the training
77
Figure 4.3: An example from the generated dataset of point clouds ren-
dered from a BEV. Left: Input semantically labeled point cloud
with occluded and unknown spaces in white. Middle: Mask
of image to be inpainted. Known pixels are black and pixels
to be filled are white. Note that grey (unobserved) pixels in
the left image are labeled as known in the image mask as they
are not the targets for inpainting. Right: Ground truth labeled
semantic point cloud.
image by setting the color in those rectangles as white. Therefore, in the
original training image there should not be an excessive amount of white
pixels as that would interfere with the network training. Figure 4.3 (left)
shows the initial input image to be filled via inpainting. White pixels are
required to be filled by the network. A convex hull is fit to the initial lidar
scan and unknown pixels inside that convex hull are colored white. Only
unknown pixels within the convex hull are designated for inpainting be-
cause the network is designed to inpaint unknown pixels using contextual
information. In general, the network empirically performs poorly when a
majority of pixels are designated for inpainting. Figure 4.3 (middle) shows
the input mask to the network, where black pixels are known and white
pixels are unknown. Figure 4.3 (right) is the ground truth filled in semantic
map which is used for the actual training of the network.
78
Figure 4.4: A distribution of the semantic pixel labels of the training
dataset of the first ten sequences. Each semantic class is shown
in the same color as displayed in the dataset. Note that the
y-axis is log scale.
SemanticKITTI provides the first 11 of the 22 labeled KITTI odometry se-
quences publicly. In this work, the first ten sequences are used for training,
and the eleventh sequence is used for testing. The fully processed dataset
(all 11 sequences) of images shown in Figure 4.3 consists of 4,649 total pairs
of training and input images. There are 241 images in the eleventh sequence
which are used for testing. Two semantic classes of bus and other-vehicle
did not have any corresponding labeled pixels. Figure 4.4 shows the dis-
tribution of semantic pixel labels in the training dataset of the first ten se-
quences.
4.3.4 Planning with Stereo Camera Data
The proposed novel framework can also use stereo camera data as input.
Given the SemanticKITTI dataset, lidar depth and ground truth semantic
labels are used for experimental purposes. However, the framework has
the flexibility to also use stereo depth and semantic segmentations from
79
any network. While the SemanticKITTI dataset contains ground truth for
lidar depth which is used for the quantitative evaluation, a qualitative eval-
uation of semantic map inpainting with only a stereo camera as a sensor is
performed to show the potential of the framework to be used with stereo
depth.
The pipeline for using stereo depth is the same as for using lidar depth
(Figure 4.1). Stereo results are expected to be worse than lidar due to
quadratically increasing error in disparity with respect to depth. However,
stereo cameras are frequently more convenient than lidar so it is important
to have the flexibility to use stereo cameras in the framework.
For evaluation with stereo depth, learned stereo disparity from PSM-
NET [12] is used due to its high performance on KITTI benchmarks. How-
ever, other stereo depth algorithms could be used without loss of generality.
To perform semantic segmentation, a DeepLabv3+ network [14] pretrained
on Cityscapes [19] is used. While the Cityscapes dataset is not necessarily
the same as the KITTI dataset, there is a large overlap of classes and the Se-
manticKITTI dataset uses a pretrained network from Cityscapes for image
segmentation.
80
4.4 Experimental Evaluation
Both qualitative and quantitative experimental evaluations are conducted
for lidar depth input. Only qualitative evaluations are conducted for stereo
depth input as there is no ground truth data for stereo depth. The out-
put of the inpainting network is qualitatively evaluated based on its visual
comparisons to the environment being reconstructed. Inpainting on input
images generated from lidar scans and stereo cameras are evaluated. Quan-
titative evaluation is conducted on the inpainting accuracy compared with
ground truth using the mean Intersection-over-Union (mIoU) [3]. The abil-
ity of the inpainting network to generate better informed maps for path
planning is quantitatively evaluated by comparing the (1) number of plan-
ner steps taken, (2) the number of times the planner replans, and (3) the
number of times the planner replans before discovering no path is possible.
4.4.1 Network Training
The GAN-based network is trained using output (filled in semantic
maps) ground truth images generated of the first ten sequences from Se-
manticKITTI dataset. There are a total of 4,408 images in the training dataset
(700×580 pixels each). The network is trained on a GeForce GTX 1080 Ti
GPU with 11 GB of memory. The network was trained for two and a half
days.
81
Figure 4.5: Qualitative evaluation of the inpainting network performance.
Odd rows are the RGB semantic maps and even rows are
the obstacle maps generated from the semantic BEV images.
Columns from left to right: (1) lidar input, (2) inpainted image,
(3) stereo input, (4) inpainted image, (5) ground truth (GT). For
the obstacle maps, black pixels are obstacles and white pixels
are free space.
82
4.4.2 Qualitative Evaluation
Figure 4.5 shows qualitative results for three images of different scenes (45,
150, 285) from the test set. Results are from the input BEV images from lidar
depth described in section 4.3.3 (Figure 4.5 columns 1, 2) and also input BEV
images from stereo depth described in section 4.3.4 (Figure 4.5 columns 3,
4). Figure 4.4 shows the color key for semantics in Figure 4.5. Note that
the input images are based only on a single sensor measurement, while the
ground truth is based on all measurements for each sequence. This is clear
for image 45 in Figure 4.5, where there is much less vegetation (dark green
pixels) in the top left of the image compared to the ground truth, because
the ground truth takes all past and future measurements into account.
For input images from lidar depth, the network reasonably fills in road
pixels (pink). For image 150, the network is able to fill in the sidewalk (pur-
ple) enclosed by vegetation (green) and fence (orange) in the center of the
image next to the road. For image 285, the network does not accurately
predict the sidewalk perpendicular to the road in the middle of the im-
age. However, it does a good job of filling in the bottom half of the image
with vegetation and buildings (yellow). Due to the small amount of avail-
able data for training, the network does not learn all the possible structures
based on the input images but reasonably fills in empty spaces occupied by
road, buildings, and vegetation.
For input images from stereo depth with semantic labels from the
83
deeplabv3+ network, the images are generated from a different distribution
than the trained inpainting network. Thus, it is expected that the network
will perform worse than on the lidar derived images. In image 45 in Figure
4.5, the network is unable to predict the road structure well and fills most of
the unobserved space as terrain. In images 150 and 285, the road structure is
cutoff early, but the network is able to predict the building structures in the
bottom center of the maps. Because the stereo depth inputs are from a dif-
ferent distribution than the lidar depth inputs, an exact comparison cannot
be made between the two as the network is trained on only ground truth
lidar data. However, the potential of the inpainting network to be applied
to a single stereo camera sensor is demonstrated.
4.4.3 Quantitative Evaluation
Quantitative evaluation is conducted on (1) network inpainting perfor-
mance compared to ground truth and (2) path planner performance using
inpainted images. While planner performance improvement using inpaint-
ing is the goal of this paper, inpainting network performance is provided as
a reference to the reader.
For evaluation of the inpainting performance of the network, the mean
Intersection-over-Union (mIoU) is used:
1 ∑C TPc (4.2)
C TP
c=1 c
+ FPc + FNc
84
where TPc is the number of true positives for each class c, FPc is the
number of false positives, FNc is the number of false negatives, and C is the
number of classes. Only pixels within the inpainting mask or convex hull
are used to calculate the mIoU.
The mIoU for the testing dataset is 13.10. As a comparison, the top per-
forming baseline for scene completion from SemanticKITTI had a mIoU of
17.70. Note that the SemanticKITTI scene completion task is 3D seman-
tic voxel completion with a dataset of 19,130 input and target pairs used for
training. The mIoU measured in this paper is for image inpainting with a to-
tal of 4,408 input and target pairs available for training. While the mIoU for
inpainting is relatively low, it should greatly increase given a larger dataset.
To evaluate the performance of the framework in creating better maps
for path planning, experiments are conducted with an A* planner. A custom
simulator is used, where the robot is assumed to have a 360◦ sensor within a
range of 30 pixels that can be used to update the initial lidar-based map, and
replans whenever the path is discovered to be crossing through an obstacle
(based on the ground truth map).
The obstacle maps shown in Figure 4.5 are used for planning. These
images were chosen to represent three levels of difficulty: easy (image 45),
medium (image 150), and hard (image 285). Difficulty level is decided based
on number of classes and structure of the scene. Image 285 is considered
more difficult than image 150 because image 285 has cars (light blue pixels),
more buildings (yellow pixels), and contains an intersection structure. For
85
the planner, each pixel of the image is considered a node and the euclidean
distance from the path node to the goal is used as the distance heuristic.
Images are resized to 25% of their original size (175×145 pixels) to reduce
computation time. The obstacle maps are evaluated for (1) the final number
of path steps taken from start to goal, (2) the number of times the planner
replans, and (3) the number of times the planner replans before determining
no final path is found.
Table 4.1 displays results for the quantitative path planner evaluation.
Trials were run and compared for the single lidar scan input image, the in-
painted image, and the ground truth (GT). For each of the three scenes (45,
150, 285), trials were conducted using random goal locations, and start lo-
cations either random or fixed near the sensor with a Gaussian distribution.
For each study, 50 trials were conducted, with the mean (std) given in Ta-
ble 4.1. The inpainted image outperforms the single lidar scan for every
experiment except for the easy image (45) with starting locations fixed near
the sensor. This is due to the simplicity of the scene, where the inpainting
image is not necessary to fill in any occluded spaces. The inpainted image
outperforms the baseline in path steps taken, number of times replanned,
and number of replans before discovering no path is available, for all other
experiments. Inpainting allows the planner to ”see” around corners and
behind obstacles to predict what types of semantic labels are likely to be
present in those unobserved spaces. This allows the planner to plan more
informed paths which are less likely to run into dead ends. This is evident
as not only do the number of path steps decrease when using the inpainted
86
Table 4.1: Evaluation of inpainting for planning with the baseline and ground
truth (GT). Metrics are given as mean (std). Each study has 50 trials
with goal positions chosen randomly. Smaller numbers are better.
Image 45 (easy), Fixed Start Input (baseline) Inpainted (ours) GT
Path Steps 70.8 (54.6) 75.3 (67.3) 64.9 (35.4)
Replans 3.5 (8.1) 3.2 (8.6) 0 (0)
Replans (no path found) 53.0 (1.0) 43.5 (1.5) 0 (0)
Image 45 (easy), Random Start Input Inpaint GT
Path Steps 118.6 (84.0) 113.5 (80.4) 91.1 (49.7)
Replans 11.04 (12.2) 9.7 (10.9) 0 (0)
Replans (no path found) 48.0 (0.0) 42.0 (0.0) 0 (0)
Image 150 (medium), Fixed Start Input Inpaint GT
Path Steps 173.9 (171.3) 105.6 (94.1) 86.4 (45.5)
Replans 10.4 (11.7) 6.1 (6.3) 0 (0)
Replans (no path found) 5.5 (1.6) 4.0 (0.9) 0 (0)
Image 150 (medium), Random Start Input Inpaint GT
Path Steps 125.8 (134.8) 121.7 (128.7) 95.5 (79.1)
Replans 10.1 (15.6) 7.7 (10.9) 0 (0)
Replans (no path found) 24.3 (18.8) 19.6 (13.5) 0 (0)
Image 285 (hard), Fixed Start Input Inpaint GT
Path Steps 149.5 (158.9) 95.3 (56.3) 91.6 (52.0)
Replans 9.2 (12.5) 5.2 (6.2) 0 (0)
Replans (no path found) 42.6 (8.0) 40.7 (10.6) 0 (0)
Image 285 (hard), Random Start Input Inpaint GT
Path Steps 169.3 (123.0) 140.6 (97.7) 124.4 (68.4)
Replans 15.1 (14.7) 7.5 (7.0) 0 (0)
Replans (no path found) 46.1 (21.2) 32.4 (12.1) 0 (0)
87
map, so do the number of times the planner needs to replan. By using previ-
ous experience (training data) to predict unknown spaces for path planning,
the proposed framework imitates human behavior and greatly improves
performance.
The ability to fill in occluded sections of vegetation or walls and build-
ings is incredibly helpful for path planning, as shown in Figure 4.6. There, a
trial on image 150 (medium difficulty) with start locations near the sensor is
shown. For the lidar input obstacle map, there is initially an empty space in
the middle of the map shown by the light blue circle. The planner attempts
to plan a path through the gap but eventually observes there is a wall; the
robot replans by continuing to move right and runs into a dead end. Even-
tually, the robot must plan all the way back around causing it to take many
(399) additional steps. The inpainted obstacle map (Figure 4.6) fills in that
gap and the robot eventually takes a path (129 steps) much closer to the
ground truth shortest path.
4.5 Takeaways
In this paper, a novel framework is presented to model unobserved and
occluded regions for planning in outdoor urban environments. The plan-
ner uses GAN-based inpainting to generate better informed obstacle maps
for planning compared to an initial sensor detection. Experiments validate
the ability of the novel framework to reduce the number of steps taken,
88
Figure 4.6: A comparison of a path planner trial on the lidar input obstacle
map, the inpainted obstacle map, and the GT map shown in
figure 4.5 (image 150). The path is in blue pixels. The start is in
the middle left and the goal is in the top right. Red pixels are
obstacles the robot observes online. Green pixels show where
the path overlaps itself. The light blue circles show a gap in the
lidar input map which is filled in on the inpainted image.
the number of times replanned, and the number of times replanned be-
fore discovering no possible path. The impact of the inpainting is higher
as the scene becomes more challenging with more occlusions, as it allows
the planner to predict obstacles and dead ends in unobserved spaces and
avoid them.
89
CHAPTER 5
SEMANTICALLY ENABLED LONG HORIZON PLANNING
5.1 Motivation
Long horizon planning in outdoor environments is a difficult challenge due
to limitations in sensor equipment. Lidar-based sensors are able to collect
accurate depth measurements but are unable to collect dense measurements
at far distances due to sensor resolution. Image based detectors are able to
provide semantic information at far distances but are unable to give accu-
rate depth at those ranges. In order to address the shortcomings of both
sensor paradigms, this chapter presents a semantically aware long-horizon
planner which uses receding horizon on a semantic long distance map gen-
erated using short distance metrical information and long distance ordinal
information. The proposed planner is novel due to planning long distance
paths in the long distance map, which combines both short distance metric
(metric range) and long distance semantic information (ordinal range with
rough depth information). The planner executes the segment of the path
which passes through the metric range and replans after taking a new mea-
surement.
Fig. 5.1 shows a motivating scenario for this work. The image is taken
from an outdoor scene with a ZED stereo camera. The purple represents a
region close to the robot where metric depth information is available (for
90
example, from a lidar sensor or stereo camera) in order to plan short dis-
tance paths such as over traditional occupancy grids. The red arrows point
to potential obstacles further off in the distance which the robot must plan
around. However, these obstacles may be too far away for a sensor to detect.
The orange region in the image represents the portion of the environment
where there is only semantic info (e.g. from a camera) but no range info.
In order to use information about obstacles in the scene, the robot must
be able to detect instance segmentations of the obstacles. For outdoor en-
vironments this includes trees and buildings which are not represented in
current state of the art datasets. This work also contributes an instance seg-
mentation dataset for tree trunks in outdoor environments which is used to
train a detector for real world experiments.
The main contributions of this work are:
• A semantically aware long horizon map for planning which uses
metrical information in close ranges and semantic information in far
ranges.
• A novel planning paradigm which plans a path in the combined metri-
cal map (short distance) and semantic map (long distance), and which
executes on the planned path in the metric range before replanning.
• An instance segmentation dataset for tree trunks to outdoor planning
in forested environments.
91
Figure 5.1: Motivating scenario taken at the Mann Library at Cornell Uni-
versity. The purple region represents where the robot is able
to get accurate depth information and perform planning using
traditional metrical planning methods. The red arrows repre-
sent far away obstacles which the robot must plan around. The
orange represents the region where the robot is able to capture
semantic but not depth information in the scene.
5.2 Related Works
92
Metric Based Planners
There has been substantial work done in the robotics community for met-
ric based planners which rely on accurate metric information from sensors
such as lidar [87], [2]. However these methods generally rely on the high
accuracy of lidar, which are unable to account for semantic information of
obstacles. Stereo cameras are able to capture both depth and semantics but
can only plan accurate paths in short distances.
Image Based Planners
Ref. [65] presents a planner which directly plans on the image space of a
detector. By planning directly on the image space, [65] is able to plan over
longer distances compared to metric based planners and are able to avoid
obstacles as long as they are visible in the image. However, the approach
uses a simple color based obstacle detector which maps pixels to obstacles
based on their color. A more sophisticated approach is desired, especially
with the advent of modern computer vision technology.
Ref. [72] proposes a semantically aware planner which uses semantic
segmentation information in order enable an unmanned aerial vehicle to
track a road. This work demonstrates the ability to bypass the use of a heavy
lidar sensor and is able to plan at long ranges using semantic information,
and outperforms traditional metric based planners. However, this approach
seeks only to track a visible road and does not plan in occluded spaces or in
93
off-road environments.
Ref. [78] proposes a sparse map representation which can be derived
from monocular vision. This technique estimates the high level structure of
the environment and is able to reason about the structure between bound-
aries in the environment. However, the approach does not plan in more
realistic outdoor environments.
Receding Horizon Methods
The receding horizon paradigm [30], has found a lot of success in the con-
trols and planning community due to its ability to create smooth trajectories
by accounting for future actions. Receding horizon methods plan actions
over multiple time steps, but only execute the first step in the sequence of
planned actions. At the next time step, the planner plans over multiple time
steps again, before again only executing the first step. This allows the plan-
ner to create smooth trajectories because long term information is encoded
into the action of the first time step each iteration.
Instance Segmentation
There has been a plethora of works in the computer vision community for
instance segmentation [36], [92]. These works have performed very well at
instance segmentation tasks. However, the datasets available in instance
94
segmentation for outdoor scenes [31], [19], generally only include fore-
ground objects such as cars, pedestrians, cyclists and do not include back-
ground obstacles such as buildings and trees which are relevant to long
horizon path planning.
5.3 Approach
First, we demonstrate a basic image space based planner which expands
upon [65]. We show the limitations of the basic image spaced based planner
map representation. Then, we introduce a semantically aware long horizon
map representation, which takes advantage of semantic instance informa-
tion with rough long range depth estimates.
5.3.1 Image Based Planning
Expanding on [65]. A semantically aware map representation of the envi-
ronment is extracted by projecting the image pixels into a 3D point cloud.
Ref. [14] is used to extract the semantic segmentation information from the
image and then depth from [56] is used to project the semantic pixels into
a point cloud. Note that these semantic labels do not include instance in-
formation. This map representation expands on the work done in [65] by
using a deep learning approach to extract semantic information for obstacle
detection, improving on the previous color based classification approach,
95
while also allowing for planning in occluded spaces. Fig. 5.2 shows the key
steps for extracting the 3D point cloud.
A Rapidly-exploring Random Tree (RRT) planner is then used to plan
a path over the ground plane points of the point cloud. The ground plane
points are labeled based on the semantic segmentation labels. Fig. 5.3 shows
the RRT planner. Fig. 5.3(a) show the nodes (spheres) and edges (lines) of
the RRT planner. To plan in occluded spaces on the point cloud, a ground
plane is fit to the point cloud using Principle Component Analysis (PCA)
and RRT nodes are projected onto the ground plane in regions where obsta-
cles where not observed.
While this expansion on [65] shows some promise for a image based
planner with semantic information, it is unable to reason about the different
instances of vegetation in the image, as all vegetation is segmented into the
same class. All the trees are labeled as vegetation and the planner is unable
to separate the cloud of vegetation points into multiple tree instances. In-
stance segmentations are necessary to enable more intelligent decision mak-
ing.
5.3.2 Instance Segmentation
In order to perform image segmentation and detect relevant background
images, a dataset is manually labeled using the Makesense AI tool [77]. The
Makesense tool allows the user to manually annotate polygons which rep-
96
(a)
(b)
(c)
Figure 5.2: Example of generating a point cloud for an image based plan-
ner. (a) Raw image of the environment. (b) A semantically seg-
mented image using [14]. (c) A 3D point cloud projected using
depth classification from [56].
97
(a)
(b)
Figure 5.3: An example of an RRT image space based planner. (a) Sample
space with plan hypotheses. The start is in the bottom center
and the goal is in the top right. The start and goal are both
annotated as red spheres. RRT nodes are shown as blue spheres
and edges shown as lines. Blue lines are visible and red lines
are occluded. (b) The planner back-projected into the image
space.
resent instance segmentations of objects in the scene. A total of 50 images of
a local park in Ithaca NY, are manually annotated to label tree trunks. Trees
are obstacles for planning in this scene
Fig. 5.4 shows an example image taken from the manually labeled in-
stance segmentation dataset with tree trunk labels. The white polygons
98
represent individual instance labels for trees.
Figure 5.4: A sample image from the manually labeled dataset for instance
segmentations. The white polygons represent individual in-
stances of trees.
A pre-trained model of Mask R-CNN [36] is fine-tuned with the labeled
dataset in order to detect instance segmentations of tree trunks which are
obstacles in the environment.
5.3.3 Long Distance Map
A long distance map is built using instance segmentations from [36] and
depth information extracted from the ZED2 stereolabs camera SDK. Each
image contains a set of n obstacle instances O = {o1, o2, ...on}. Each instance
99
is classified as a metric obstacle or ordinal obstacle based on their distance
from the camera. So there are two sets of obstacles: Om (metric) and Oo
(ordinal) where Om ∪ Oo = O, and Om ∩ Oo = ∅.
A predefined metric range threshold tmetric is used to classify obstacles as
metric or ordinal; obstacle instance masks with greater than 50% of pixels
inside tmetric are placed into Om, and all other obstacles instances placed into
Oo.
Figure 5.5 shows tree trunk detections classified into Om (blue) and Oo
(purple).
Figure 5.5: Mask R-CNN detections and distance classifications. Tree
trunk instances in Om are shown in blue and instances in Oo
are shown in purple.
To generate the map for planning, the field of view (FOV) is split into the
100
metric FOV and the ordinal FOV using the same tmetric. The metric FOV is
the region within tmetric and the ordinal FOV represents the region outside of
tmetric. Instance distances from the sensor for Om are calculated based on the
average distance of the instance mask pixels within tmetric. This is because
depths of edges of objects are extruded backwards and have poor depth
information. For tree trunks, a reasonable heuristic is to use the depths of
the instance mask that are inside the predefined tmetric, as the size of a tree
trunk is relatively small compared. Instance distance from the sensor for Oo
are calculated using the average depth of the instance pixels. For ordinal
instances, a rough estimate of the depth suffices for the map so taking the
average over all of the instance pixels is reasonable.
Obstacles in Om are directly projected into the metric FOV. Obstacles in
Oo are sorted by their rough stereo depth estimate and the projected into the
ordinal FOV with the following steps:
1. The instances in Oo are sorted by distance from the sensor.
2. The instances are projected into the ordinal FOV based on their sorted
order but rather than using the rough depth estimate, their depth are
separated by a predefined distance dord which is added onto tmetric. For
example, if tmetric = 5m, dord = 2.5m and there happened to be three
instances in Oo, the closest instance would be projected at 7.5m, the
middle instance would be projected at 10m, and the farthest instance
would be projected to 12.5m.
101
The long distance map includes both close metrical information and also
long distance ordinal information. By combining information from two
range spaces a representation is developed to enable planning over larger
ranges with metrical and semantic information. (Figure 5.1). Figure 5.6
shows the map generated based on the detections in Figure 5.5.
Figure 5.6: The map representation based on the detections show in Figure
5.5. The metric FOV is shown in light blue with the individual
tree trunk instances in dark blue. The ordinal FOV is shown
khaki, with individual tree trunk instances in purple. The ob-
stacles in the ordinal FOV are ordered based on their rough
depth estimates.
102
5.3.4 Receding Horizon Planner
The proposed planner is inspired by the work the community has done with
receding horizon planners. First, the field of view is project to a 2D birds eye
view (BOV) image space such as in Figure 5.6. Next, the closest free point
(image pixel) in the metric and ordinal FOVs to the global goal is found.
Then, an off-the-shelf planner such as A* is used to plan a path from the
sensor to that closest free point. However, the robot only traverses the seg-
ment of the path that is inside the metric FOV. This intuitively allows the
path planner to encode information from the ordinal FOV into the metric
FOV, while the robot only traverses the region where there is accurate met-
rical information. Once the robot has reached the end of the path segment
in the metric FOV, a new measurement is taken and the planner replans a
path. By combining both the metric and ordinal regions, the planner is able
to plan more informed short distance trajectories, based on long distance
information.
5.4 Results
The long horizon planner is validated both in simulation and in a real world
experiment.
103
5.4.1 Simulation
A simulator based on a manually annotated real world outdoor environ-
ment is used for simulation experiments of the long horizon planner. To
generate a map for the simulator, a Google Maps image (1028 x 709) of an
outdoor park in Ithaca (Cass Park), is manually annotated using Makesense
AI, with trees as obstacles. Figure 5.7 shows the simulation environment
with manually labeled instances of obstacles shown as blue polygons.
Figure 5.7: The map used for simulations with manually annotated poly-
gons for obstacles in blue.
Random start and goal locations are chosen for the simulator with the
104
only constraint being that the start and goal locations have to be at least 600
pixels apart. A map representation such as the one in Figure 5.6 is used
for the simulator with a metric range of 100 pixels and ordinal detections
of up to 250 pixels from the sensor. In other words, for the simulator, it is
assumed that it can detect accurate depth for a range of 100 pixels and it can
detect rough depth for a range of 250 pixels. For each new measurement,
the sensor is assumed to be pointing in the direction of the goal.
The baseline planner used for comparison to the semantically-aware
long horizon planner uses the metric FOV to plan. The baseline planner
finds the closest point to the goal in the metric FOV and plans a path to
that point. Then, the planner takes a new measurement from that point and
replans again until it reaches the goal.
In 1000 trials with random start and goal locations, where both the long
horizon planner and the baseline are able to find a path from the start to
the goal, the long horizon planner takes an average of 2.8% fewer steps to
reach the goal, with a standard deviation of 2.7%. Table 5.1 shows the num-
ber of trials where the baseline took a greater number of steps to reach the
goal, less number of steps, and an equal number of steps compared with the
long horizon planner. The baseline takes more steps than the long horizon
planner 97% of the cases. The improvement of the long horizon planner
on the baseline is extremely consistent as it almost always is a speed up.
This demonstrates the efficacy of encoding long distance information into
the planner, by attaching a qualitative ordinal map onto the metric range of
105
Greater Less Equal
970 28 2
Table 5.1: Number of trials out of 1000, where the baseline took a greater
number of steps to reach the goal, less steps, and an equal num-
ber of steps compared with the long horizon planner.
the sensor, at reducing the number of steps taken.
5.4.2 Real World Experiment
Real world experiments were completed in Cass Park in a region shown on
Figure 5.8. A ZED2 stereo camera is used as a sensor to collect raw images
and depth detections. Measurements are always taken with the sensor fac-
ing the goal point. The fine-tuned Mask R-CNN model is used to detect
instances of obstacles online. The metric range for the metric FOV (both
sensor and planner) is 5 meters.
Figures 5.9 and 5.10 show a comparison between the baseline and long
horizon planners. In Figure 5.9, the baseline planner detects the mask of
the tree trunk from far away but is unable to generate a plan that accounts
for the obstacle because the tree trunk is not inside the metric range 5.9 (1b,
2b, 3b). It is not until the sensor is very close to the tree trunk in Figure
5.9 (4a) that the tree is accounted for the planner (see Figure 5.9 (4b)). This
causes the planner to have to plan around the obstacle in close range, which
is inefficient in close quarters.
106
Figure 5.8: The map of Cass Park in Ithaca used for real world experi-
ments. The start and goal points are annotated on in light blue.
In Figure 5.10, the long horizon planner is able to detect the tree trunk
from far away (1b, 2b) and plans a path which avoids the tree trunk early
on, so that by the time the sensor reaches the tree trunk it is able to com-
pletely avoid the obstacle (3b). By accounting for the obstacle which was
outside of the metric planner range, and by using the proposed receding
horizon inspired planner, the planner is able to avoid the cost of avoiding
the obstacle at close range, such as in Figure 5.9. The baseline planner to-
taled 238 steps and the long horizon planner totaled 230 steps, representing
a 3.5% decrease in number of steps taken.
107
(1a)
(1b)
(2a)
108
(2b)
(3a)
(3b)
109
(4a)
(4b)
Figure 5.9: Real world experiment with the baseline planner. (1a-b) show
the first measurement and planned path in the sensor FOV, (2a-
b) show the second and so on. Images labeled (a) show the raw
image detection with the detected instance mask projected on
to them. Purple masks indicate the instance is outside the met-
ric range, and blue masks indicate the instance is inside the
metric range. Images labeled (b) show the birds eye view FOV
given the sensor location and image detection. The metric FOV
is in light blue and obstacles in the metric range are blue. The
global start location is the green circle and the global goal loca-
tion is the white circle. The planned path segment in the metric
FOV is shown in white.
110
(1a)
(1b)
(2a)
111
(2b)
(3a)
(3b)
112
Figure 5.10: Real world experiment with the long horizon planner. (1a-b)
show the first measurement and planned path, (2a-b) show
the second and so on. Images labeled (a) show the raw im-
age detection with the detected instance mask projected on to
them. Purple masks indicate the instance is outside the metric
range, and blue masks indicate the instance is inside the met-
ric range. Images labeled (b) show the birds eye view FOV
given the sensor location and image detection. The metric
FOV is in light blue and the ordinal FOV is in khaki. Obstacles
in the metric range are blue and obstacles outside of the metric
range are purple. The global start location is the green circle
and the global goal location is the white circle. The planned
path segment in the metric FOV is shown in white and the
path segment in the ordinal FOV is shown in orange.
5.5 Takeaways
This chapter presents a semantically-aware long horizon planner enabled
by a map representation of the environment which allows for long distance
planning in outdoor environments. Long distance ordinal information is
stitched onto a short distance metrical map for a receding horizon path
planner which is able to encode information of far away obstacles. The pro-
posed long range planner is able to bypass the limitations of depth sensors,
and utilize the rich semantic image information for long distance planning.
Experimental results in simulation and a real world scenario show that the
long horizon planner is able to reduce path length compared to a traditional
metric based path planner.
113
CHAPTER 6
SUMMARY AND CONCLUSIONS
Some of the key challenges in robotics moving forward are in making
intelligent decisions in complex unstructured environments which cannot
be cleanly discretized. My work focuses on how to use perception systems
to enable robots to interact and plan in unstructured environments.
Chapter 2 presents a model for pedestrian motion, which combines low
level velocity flow fields and high level decision points. The model is capa-
ble of predicting a distribution over future motion and performing online
anomaly detection.
Chapter 3 presents a framework for planning in uncertain outdoor envi-
ronments with different semantic surface types. A NBV planner iteratively
decreases uncertainty in a multiple hypothesis path planner until a safe path
is found or all paths are eliminated as unsafe.
Chapter 4 presents a framework to use a data-driven model to navigate
in occlusion heavy environments. The framework is intuitive and is ca-
pable of inferring semantic information of occluded regions in addition to
geometric information.
Chapter 5 develops and demonstrates a semantically aware long hori-
zon planner which allows for encoding of long distance detections in image
space into a short distance metric range based planner.
114
The main contributions of my research are:
• A novel pedestrian motion model which is capable of modeling low
level velocities and high level discrete transitions.
• A framework for utilizing the uncertainties from a segmentation net-
work for safe path planning in unstructured outdoor environments.
• A novel data-driven approach to planning in occluded environments
which outperforms traditional planners.
• A long horizon planner which bridges the gap between long distance
semantic detections in image space with a short distance metrical
planner.
Key areas of future research should focus on innovative approaches
to using techniques from the computer vision community to expand the
boundaries of robot capabilities in uncertain, unstructured, and outdoor
environments. These are challenging environments to operate in, and the
fusion of semantic information (from advances in computer vision) with
traditional metric robot planning will be the direction of my work. In par-
ticular, combining the semantically aware long horizon planner presented
in Chapter 5, with the inpainting based planning approach in Chapter 4,
is a potentially innovative approach to intelligent planning over long dis-
tances with occluded spaces. In addition, leveraging the uncertainties for
robust navigation with a method similar to the one from Chapter 3 would
make the planner robust from complex environmental conditions such as
weather or lighting.
115
BIBLIOGRAPHY
[1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Ro-
bicquet, Li Fei-Fei, and Silvio Savarese. Social lstm: Human trajectory
prediction in crowded spaces. In Proceedings of the Conference on Com-
puter Vision and Pattern Recognition (CVPR), 2016.
[2] B.L.E.A. Balasuriya, B.A.H. Chathuranga, B.H.M.D. Jayasundara,
N.R.A.C. Napagoda, S.P. Kumarawadu, D.P. Chandima, and A.G.B.P.
Jayasekara. Outdoor robot navigation using gmapping based slam al-
gorithm. In 2016 Moratuwa Engineering Research Conference (MERCon),
pages 403–408, 2016.
[3] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss,
and J. Gall. SemanticKITTI: A Dataset for Semantic Scene Understand-
ing of LiDAR Sequences. In Proc. of the IEEE/CVF International Conf. on
Computer Vision (ICCV), 2019.
[4] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. Material
recognition in the wild with the materials in context database. In Proc.
CVPR, pages 3479–3487, 2015.
[5] Marcelo Bertalmı́o, Guillermo Sapiro, Vicent Caselles, and C. Ballester.
Image inpainting. pages 417–424, 01 2000.
[6] C. Bishop. Pattern Recognition and Machine Learning. Springer Sci-
ence+Business Media, 2006.
[7] R. Border, J. D. Gammell, and P. Newman. Surface edge explorer (SEE):
Planning next best views directly from 3d observations. In Proc. ICRA,
pages 6116–6123, 2018.
[8] J. Brandt. At the intersection of business and technology. PinHawk-
Blog, 2015.
[9] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin
Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Os-
car Beijbom. nuscenes: A multimodal dataset for autonomous driving.
arXiv preprint arXiv:1903.11027, 2019.
116
[10] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing
and stuff classes in context. In Proc. CVPR, pages 1209–1218, 2018.
[11] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan,
Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song,
Hao Su, et al. Shapenet: An information-rich 3d model repository.
arXiv preprint arXiv:1512.03012, 2015.
[12] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching net-
work. In Proceedings of the IEEE Conference on Computer Vision and Pat-
tern Recognition, pages 5410–5418, 2018.
[13] R. Omar Chavez-Garcia, Jérôme Guzzi, Luca Maria Gambardella, and
Alessandro Giusti. Learning ground traversability from simulations.
IEEE RAL, 3:1695–1702, 2017.
[14] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff,
and Hartwig Adam. Encoder-decoder with atrous separable convolu-
tion for semantic image segmentation. In Proc. ECCV, pages 801–818,
2018.
[15] Y. Chen, M. Liu, S. Liu, J. Miller, and J. How. Predictive modeling of
pedestrian motion patterns with bayesian nonparametrics. In AIAA
Guidance, Navigation, and Control Conference, 2016.
[16] Yu Fan Chen, Miao Liu, and Jonathan P. How. Augmented dictio-
nary learning for motion prediction. In IEEE International Conference
on Robotics and Automation (ICRA), 2016.
[17] François Chollet. Xception: Deep learning with depthwise separable
convolutions. In Proc. CVPR, pages 1251–1258, 2017.
[18] Paolo Cignoni, Marco Callieri, Massimiliano Corsini, Matteo Dellepi-
ane, Fabio Ganovelli, and Guido Ranzuglia. Meshlab: an open-source
mesh processing tool. In Eurographics Italian chapter conference, vol-
ume 1, pages 129–136, 2008.
[19] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld,
Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and
117
Bernt Schiele. The cityscapes dataset for semantic urban scene under-
standing. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2016.
[20] Kenny Daniel, Alex Nash, Sven Koenig, and Ariel Felner. Theta*: Any-
angle path planning on grids. J Artif Intell Res, 39:533–579, 2010.
[21] Jonathan Daudelin and Mark Campbell. An adaptable, probabilistic,
next best view algorithm for reconstruction of unknown 3d objects.
IEEE RAL, 2(3):1540–1547, 2017.
[22] Jeffrey A. Delmerico, Alessandro Giusti, Elias Mueggler, Luca Maria
Gambardella, and Davide Scaramuzza. ”on-the-spot training” for ter-
rain classification in autonomous air-ground collaborative teams. In
Proc. ISER, 2016.
[23] Omar El Harrouss, Noor Almaadeed, Somaya Al-ma’adeed, and Y. Ak-
bari. Image inpainting: A review. Neural Processing Letters, 12 2019.
[24] D. Ellis, E. Sommerlade, and I. Reid. Modelling pedestrian trajectory
patterns with gaussian processes. In IEEE 12th International Conference
on Computer Vision Workshops, ICCV Workshops, 2009.
[25] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm
for discovering clusters in large spatial databases with noise. In KDD-
96 Proceedings, 1996.
[26] Dave Ferguson and Anthony Stentz. Using interpolation to improve
path planning: The field D* algorithm. J Field Robot, 23:79–101, 2006.
[27] N. Ferreira, J. Klosowski, C. Scheidegger, and C. Silva. Vector field
k-means: Clustering trajectories by fitting multiple vector fields. Com-
puter Graphics Forum, 2013.
[28] Paul Filitchkin. Visual Terrain Classification For Legged Robots. PhD the-
sis, University of California, Santa Barbara, 2011.
[29] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approxima-
118
tion: Representing model uncertainty in deep learning. In Proc. ICML,
2016.
[30] Carlos E. Garcı́a, David M. Prett, and Manfred Morari. Model predic-
tive control: Theory and practice—a survey. Automatica, 25(3):335–348,
1989.
[31] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for Autonomous
Driving? The KITTI Vision Benchmark Suite. In Proc. of the IEEE
Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3354–
3361, 2012.
[32] Timo Hackel, N. Savinov, L. Ladicky, J. Wegner, K. Schindler, and
M. Pollefeys. SEMANTIC3D.NET: A new large-scale point cloud clas-
sification benchmark. ISPRS Annals, 4(1):91–98, 2016.
[33] Yutao Han, Hubert Lin, Jacopo Banfi, Mark Campbell, and Kavita
Bala. DeepSemanticHPPC: Hypothesis-based planning over uncertain
semantic point clouds. In Proc. ICRA, 2020. To appear.
[34] Jason Hardy, Frank Havlak, and Mark Campbell. Multi-step prediction
of nonlinear gaussian process dynamics models with adaptive gaus-
sian mixtures. The International Journal of Robotics Research, 2015.
[35] Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for
the heuristic determination of minimum cost paths. IEEE T Syst Sci
Cyb, 4(2):100–107, 1968.
[36] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask
r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV),
pages 2980–2988, 2017.
[37] Benjamin Hepp, Debadeepta Dey, Sudipta N. Sinha, Ashish Kapoor,
Neel Joshi, and Otmar Hilliges. Learn-to-score: Efficient 3d scene ex-
ploration by predicting view utility. In Proc. ECCV, pages 437–452,
2018.
[38] T. Ikeda, Y. Chigodo, D. Rea, F. Zanlungo, M. Shiomi, and T. Kanda.
119
Modeling and prediction of pedestrian behavior based on the sub-goal
concept. In Robotics: Science and Systems, 2012.
[39] Alexander Ivanov and Mark E. Campbell. Uncertainty constrained
robotic exploration: An integrated exploration planner. IEEE T Contr
Syst T, 27:146–160, 2016.
[40] H. Jeung, H. Shen, and X. Zhou. Mining trajectory patterns using
hidden markov models. In Data Warehousing and Knowledge Discovery,
2007.
[41] J. Joshua, F. Doshi-Velez, A. Huang, and N. Roy. A bayesian non-
parametric approach to modeling motion patterns. Autonomous Robots,
2011.
[42] Gregory Kahn, Adam Villaflor, Bosen Ding, Pieter Abbeel, and Sergey
Levine. Self-supervised deep reinforcement learning with generalized
computation graphs for robot navigation. In Proc. ICRA, pages 5129–
5136, 2018.
[43] Sertac Karaman and Emilio Frazzoli. Sampling-based algorithms for
optimal motion planning. Inter J Robot Res, 30(7):846–894, 2011.
[44] Alex Kendall and Yarin Gal. What uncertainties do we need in
bayesian deep learning for computer vision? In Proc. NIPS, pages
5574–5584, 2017.
[45] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and
Piotr Dollár. Panoptic segmentation. In Proc. CVPR, pages 9404–9413,
2019.
[46] T. Klinger, F. Rottensteiner, and C. Heipke. Probabilistic multi-person
tracking using dynamic bayes networks. ISPRS Annals of Photogram-
metry, Remote Sensing and Spatial Information Sciences, 2015.
[47] Sven Koenig and Maxim Likhachev. D* lite. In Proc. AAAI, pages 476–
483, 2002.
120
[48] Philipp Krüsi, Paul Furgale, Michael Bosse, and Roland Siegwart.
Driving on point clouds: Motion planning, trajectory optimization,
and terrain assessment in generic nonplanar environments. J Field Rob,
34(5):940–984, 2017.
[49] T. Laursen, N. B. Pedersen, J. J. Nielsen, and T. K. Madsen. Hidden
markov model based mobility learning fo improving indoor tracking
of mobile users. In 9th Workshop on Positioning, Navigation and Commu-
nication, 2012.
[50] S. LaValle and J. Kuffner. Randomized kinodynamic planning. Int J
Robot Res, 20(5):378–400, 2001.
[51] Steven M. LaValle. Rapidly-exploring random trees: A new tool for
path planning. 1998.
[52] Steven M LaValle. Rapidly-exploring random trees: A new tool for
path planning. 1998.
[53] Steven M. LaValle. Planning Algorithms. Cambridge University Press,
New York, NY, USA, 2006.
[54] J. Lee, J. Han, and K. Whang. Trajectory clustering: A partition-and-
group framework. In Proceedings of the ACM SIGMOD International
Conference on Management of Data, 2007.
[55] Kuan-Hui Lee, J. Hwang, G. Okapal, and J. Pitton. Driving recorder
based on-road pedestrian tracking using visual slam and constrained
multiple-kernel. In 17th International IEEE Conference on Intelligent
Transportation Systems (ITSC), 2014.
[56] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view
depth prediction from internet photos. In Computer Vision and Pattern
Recognition (CVPR), 2018.
[57] C. Lu and G. Dubbelman. Semantic foreground inpainting from weak
supervision. IEEE RA-L, 5(2):1334–1341, 2020.
121
[58] Khairul Azmi Mahadhir, Shing Chiang Tan, Cheng Yee Low, Roman
Dumitrescu, Adam Tan Mohd Amin, and Ahmed Jaffar. Terrain clas-
sification for track-driven agricultural robots. In Proc. SYSINT, pages
775 – 782, 2014.
[59] R. Manduchi, A. Castano, A. Talukder, and L. Matthies. Obstacle de-
tection and terrain classification for autonomous off-road navigation.
Auton Robot, 18(1):81–102, 2005.
[60] W. Mathew, R. Raposo, and B. Martins. Predicting future locations with
hidden markov models. In Proceedings of the 2012 ACM Conference on
Ubiquitous Computing, 2012.
[61] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J
Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre,
Koray Kavukcuoglu, et al. Learning to navigate in complex environ-
ments. arXiv preprint arXiv:1611.03673, 2016.
[62] Robin R Murphy, Satoshi Tadokoro, Daniele Nardi, Adam Jacoff, Paolo
Fiorini, Howie Choset, and Aydan M Erkmen. Search and rescue
robotics. In Springer handbook of robotics. Springer, 2008.
[63] B. Nagy and A. Kelly. Trajectory generation for car-like robots using
cubic curvature polynomials. In Proc. FSR, 2001.
[64] J. J. Nielsen, N. Amiot, and T. K. Madsen. Directional hidden markov
model for indoor tracking of mobile users and realistic case study. In
European Wireless 2013; 19th European Wireless Conference, 2013.
[65] Michael W. Otte, Scott G. Richardson, Jane Mulligan, and Gregory Z.
Grudic. Path planning in image space for autonomous robot naviga-
tion in unstructured environments. J Field Robot, 26:212–240, 2009.
[66] P. Purkait, C. Zach, and I. Reid. Seeing behind things: Extending se-
mantic segmentation to occluded regions. In Proc. IROS, pages 1998–
2005, 2019.
[67] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas.
122
Frustum pointnets for 3d object detection from rgb-d data. In Proc.
CVPR, pages 918–927, 2018.
[68] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Point-
net: Deep learning on point sets for 3d classification and segmentation.
arXiv preprint arXiv:1612.00593, 2016.
[69] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas.
Pointnet: Deep learning on point sets for 3d classification and segmen-
tation. In Proc. CVPR, pages 77–85, 2017.
[70] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Point-
net++: Deep hierarchical feature learning on point sets in a metric
space. In Proc. NIPS, pages 5099–5108, 2017.
[71] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo
Tomasi. Performance measures and a data set for multi-target, multi-
camera tracking. In European Conference on Computer Vision workshop on
Benchmarking Multi-Target Tracking, 2016.
[72] Markus Ryll, J. Ware, J. Carter, and N. Roy. Semantic trajectory plan-
ning for long-distant unmanned aerial vehicle navigation in urban en-
vironments. 2020 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), pages 1551–1558, 2020.
[73] Markus Ryll, John Ware, John Carter, and Nick Roy. Efficient trajec-
tory planning for high speed flight in unknown environments. In Proc.
ICRA, pages 732–738, 2019.
[74] Samuel Schulter, Menghua Zhai, Nathan Jacobs, and Manmohan Kr-
ishna Chandraker. Learning to look around objects for top-view repre-
sentations of outdoor scenes. ArXiv, abs/1803.10870, 2018.
[75] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Air-
sim: High-fidelity visual and physical simulation for autonomous ve-
hicles. In Field and Service Robotics, 2017.
[76] A. Shaw and N. Gopalan. Finding frequent trajectories by clustering
123
and sequential pattern mining. Journal of Traffic and Transportation En-
gineering, 2014.
[77] Piotr Skalski. Make Sense. https://github.com/SkalskiP/make-
sense/, 2019.
[78] Gregory J. Stein, Christopher Bradley, Victoria Preston, and N. Roy. En-
abling topological planning with monocular vision. 2020 IEEE Interna-
tional Conference on Robotics and Automation (ICRA), pages 1667–1673,
2020.
[79] Gregory J. Stein, Christopher Bradley, and Nicholas Roy. Learning
over subgoals for efficient navigation of structured, unknown environ-
ments. In Proc. CORL, pages 213–222, 2018.
[80] Gregory J Stein, Christopher Bradley, and Nicholas Roy. Learning
over subgoals for efficient navigation of structured, unknown environ-
ments. In Proc. CORL, pages 213–222, 2018.
[81] Anthony Stentz. Optimal and efficient path planning for partially
known environments. In Proc. ICRA, 1994.
[82] Anthony Stentz. The focussed D* algorithm for real-time replanning.
In Proc. IJCAI, 1995.
[83] Hang Su, Jun Zhu, Yinpeng Dong, and Bo Zhang. Forecast the plau-
sible paths in crowd scenes. In Proceedings of the Twenty-Sixth Interna-
tional Joint Conference on Artificial Intelligence (IJCAI-17), 2017.
[84] Lei Tai, Giuseppe Paolo, and Ming Liu. Virtual-to-real deep reinforce-
ment learning: Continuous control of mobile robots for mapless navi-
gation. In Proc. IROS, pages 31–36, 2017.
[85] Y. W. Teh. Dirichlet processes. In Encyclopedia of Machine Learning.
Springer, 2010.
[86] Jesus Tordesillas, Brett T Lopez, and Jonathan P How. FASTER: Fast
and safe trajectory planner for flights in unknown environments. In
Proc. IROS, pages 1934–1940, 2019.
124
[87] Emmanouil Tsardoulias, Katerina Iliakopoulou, Andreas Kargakos,
and Loukas Petrou. A review of global path planning methods for oc-
cupancy grid maps regardless of obstacle density. Journal of Intelligent
Robotic Systems, 84, 12 2016.
[88] Dizan Vasquez, Thierry Fraichard, and Christian Laugier. Incremental
learning of statistical motion patterns with growing hidden markov
models. IEEE Transactions on Intelligent Transportation Systems, 2009.
[89] Krzysztof Walas. Terrain classification and negotiation with a walking
robot. J Intell Robot Syst, 78:401–423, 2015.
[90] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bron-
stein, and Justin M Solomon. Dynamic graph cnn for learning on point
clouds. arXiv preprint arXiv:1801.07829, 2018.
[91] Stefan Witwicki, Jose Carlos Castillo, Joao Messias, Jesus Capitan,
Francisco S Melo, Pedro U Lima, and Manuela Veloso. Autonomous
surveillance robots: A decision-making framework for networked
muiltiagent systems. IEEE Robotics & Automation Magazine, 2017.
[92] Yuxin Wu, Alexander Kirillov, Francisco Massa,
Wan-Yen Lo, and Ross Girshick. Detectron2.
https://github.com/facebookresearch/detectron2, 2019.
[93] R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson,
and M. N. Do. Semantic image inpainting with deep generative mod-
els. In 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 6882–6890, 2017.
[94] J. Yen. Finding the k shortest loopless paths in a network. Manage Sci,
17(11):712–716, 1971.
[95] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S
Huang. Generative image inpainting with contextual attention. arXiv
preprint arXiv:1801.07892, 2018.
[96] Jingwei Zhang, Jost Tobias Springenberg, Joschka Boedecker, and Wol-
fram Burgard. Deep reinforcement learning with successor features
125
for navigation across similar environments. In Proc. IROS, pages 2371–
2378, 2017.
[97] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio
Torralba. Places: A 10 million image database for scene recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[98] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and
Antonio Torralba. Scene parsing through ade20k dataset. In Proc.
CVPR, pages 633–641, 2017.
[99] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point
cloud based 3d object detection. In Proc. CVPR, pages 4490–4499, 2018.
126