PERCEPTION ENABLED PLANNING FOR AUTONOMOUS SYSTEMS A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Yutao Han August 2021 © 2021 Yutao Han ALL RIGHTS RESERVED PERCEPTION ENABLED PLANNING FOR AUTONOMOUS SYSTEMS Yutao Han, Ph.D. Cornell University 2021 As we move towards fully autonomous robotic systems, one of the key chal- lenges is the integration of state-of-the-art perception techniques in machine learning and computer vision with decision making for robots. While there have been tremendous strides in the deep learning community, there are still challenges in how to apply these technological advances to extend the boundaries of autonomous planning. In this thesis, I will discuss my research which attempts to bridge ad- vances in computer vision to practical applications in robotics. In the first part of this thesis, I will present a pedestrian motion model which combines high level decision making with low level motion patterns. Next, I will present my work on managing scene uncertainties with a focus on semantic surface types. Next-Bext-View measurements are taken to iteratively de- crease uncertainty for a multi-hypothesis planner. Then, I will present my work on planning through occluded spaces. A GAN-based image inpaint- ing network fills in unknown spaces for planning and I demonstrate the efficacy of my method compared to the traditional paradigm for managing occluded spaces. In the last part, I will discuss a planner which leverages instance segmentation and ordinal information to extend the boundaries of traditional metrical planners to long distance planning. BIOGRAPHICAL SKETCH Yutao is a researcher in robotics at Cornell University. He is interested in the applications of computer vision and machine learning to enable robots to make more intelligent decisions in a complex world. He is advised by Professor Mark Campbell who is an expert in autonomy for robotics. Yutao completed his Bachelors degree in Mechanical Engineering from the Uni- versity of Illinois at Urbana-Champaign (UIUC). iii 献给我的外公外婆。 iv ACKNOWLEDGEMENTS My research is funded by the ONR under the PERISCOPE MURI Grant N00014-17-1-2699. Thank you to my advisor Mark, and my other committee members KB and Hadas. Thank you to my many labmates and friends. Thank you to my family. Most importantly, thank you to everyone who has laughed with me, I would not have finished my PhD without you. v TABLE OF CONTENTS Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . v Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x 1 Introduction 1 2 Pedestrian Motion Model Using Non-Parametric Trajectory Clus- tering and Discrete Transition Points 10 2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3 PEDESTRIAN MOTION MODEL . . . . . . . . . . . . . . . . . 15 2.4 MOTION & TRANSITION POINT MODELING . . . . . . . . 16 2.4.1 Transition Point Estimation . . . . . . . . . . . . . . . . 21 2.4.2 Iterative Clustering . . . . . . . . . . . . . . . . . . . . . 24 2.4.3 Transition Probabilities . . . . . . . . . . . . . . . . . . . 26 2.5 Online Prediction of Motion and Anomaly Detection . . . . . 30 2.5.1 Predictive Modeling . . . . . . . . . . . . . . . . . . . . 30 2.5.2 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . 31 2.6 VALIDATION OF THE MOTION MODEL . . . . . . . . . . . 31 2.6.1 Robustness of the Iterative Clustering Algorithm to Anomalous Training Data . . . . . . . . . . . . . . . . . 32 2.6.2 Pedestrian Prediction Accuracy . . . . . . . . . . . . . . 35 2.6.3 Online Anomaly Detection . . . . . . . . . . . . . . . . 38 2.7 Model Complexity . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.8 TAKEAWAYS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3 DeepSemanticHPPC: Hypothesis-based Planning over Uncertain Semantic Point Clouds 41 3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.2.1 RRT-Based Non-Holonomic Planning over a Point Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.2.2 Segmentation with Bayesian Neural Networks . . . . 47 3.3 Approach Overview . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4 Technical Details . . . . . . . . . . . . . . . . . . . . . . . . . . 53 vi 3.4.1 Predicting Semantic Labels . . . . . . . . . . . . . . . . 53 3.4.2 Associating Semantic Labels to a Point Cloud . . . . . 54 3.4.3 RRT-Based Multi-hypothesis Planner . . . . . . . . . . 56 3.4.4 Next-Best-View (NBV) Planning . . . . . . . . . . . . . 57 3.5 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.5.1 Validation Scenes and Overview . . . . . . . . . . . . . 60 3.5.2 NBV evaluation . . . . . . . . . . . . . . . . . . . . . . . 63 3.5.3 Path Safety Evaluation . . . . . . . . . . . . . . . . . . . 65 3.6 Takeaways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4 Planning Through Unknown Space by Imagining What Lies Therein 68 4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3 Technical Approach . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.3.1 Image Inpainting . . . . . . . . . . . . . . . . . . . . . . 72 4.3.2 Path Planner . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.3.3 Planning with Lidar Data . . . . . . . . . . . . . . . . . 75 4.3.4 Planning with Stereo Camera Data . . . . . . . . . . . . 79 4.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . 81 4.4.1 Network Training . . . . . . . . . . . . . . . . . . . . . . 81 4.4.2 Qualitative Evaluation . . . . . . . . . . . . . . . . . . . 83 4.4.3 Quantitative Evaluation . . . . . . . . . . . . . . . . . . 84 4.5 Takeaways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5 Semantically Enabled Long Horizon Planning 90 5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.3.1 Image Based Planning . . . . . . . . . . . . . . . . . . . 95 5.3.2 Instance Segmentation . . . . . . . . . . . . . . . . . . . 96 5.3.3 Long Distance Map . . . . . . . . . . . . . . . . . . . . . 99 5.3.4 Receding Horizon Planner . . . . . . . . . . . . . . . . . 103 5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.4.1 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.4.2 Real World Experiment . . . . . . . . . . . . . . . . . . 106 5.5 Takeaways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6 Summary and Conclusions 114 vii Bibliography 116 viii LIST OF TABLES 2.1 Transition Probabilities Between Transition Points . . . . . . . 29 2.2 % Error for Different Amounts of Anomalies . . . . . . . . . . 34 2.3 K-folds Cross Validation Results . . . . . . . . . . . . . . . . . 37 3.1 500 trials of path safety evaluation. The columns are the path planning methods used: B1 is the planner based on [48], B2 is the variant of our framework which does not utilize any NBVs, XN is DeepSemanticHPPC (ours) with X NBVs. The rows are the metrics: Safe is the number of trials where a safe path is selected, Unsafe is the number of trials where an unsafe path is selected (lower is better), CS is the number of trials where a safe path is confirmed, CN is the number of trials where all multipaths are confirmed as unsafe (and no path is selected). . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.1 Evaluation of inpainting for planning with the baseline and ground truth (GT). Metrics are given as mean (std). Each study has 50 trials with goal positions chosen randomly. Smaller numbers are better. . . . . . . . . . . . . . . . . . . . . 87 5.1 Number of trials out of 1000, where the baseline took a greater number of steps to reach the goal, less steps, and an equal number of steps compared with the long horizon planner. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 ix LIST OF FIGURES 2.1 A motivating example for the motion model in this paper. The local motion patterns are labeled m1,m2,m3 and shown in purple, light blue, and green respectively. There are four transition points annotated with blue bubbles. There is an anomalous trajectory shown in red. Background image from [8]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2 Algorithmic flow of iterative DPGP clustering and hypoth- esis test transition point estimation (black) of the proposed pedestrian model (blue) for both low level trajectories and high level transitions. The model is then used for online in- ference (green). . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3 The  neighborhood of point {xki , yki }. The  neighborhood is in blue, m is in green, {xkj i , yki } is the red dot, and tk is in red. The percentage of trajectories in m j that pass through the  neighborhood of each point in tk is used to weigh if the po- sitional distribution of tk matches the positional distribution of trajectories in m j . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.4 Discovery of sub-trajectories for the case of merging motion patterns. There are two motion patterns in red and blue. Ar- rows in the top left show the directionality of the motion pat- terns. Circles on the trajectories indicate where the trajectory data ends. Potential sub-trajectory points in the red pattern are indicated with red crosses. The data is from video 3 of the DukeMTMC project [71]. . . . . . . . . . . . . . . . . . . . 25 2.5 Results of the iterative DPGP trajectory clustering algorithm and transition point estimation on simulated data with 90 trajectories. The trajectories move in two directions, and the top and bottom rows correspond to each of the directions respectively. 2.5(a) is the initial clustering result before any splitting of trajectories. 2.5(b) shows the iterative clustering result after convergence. 2.5(c) shows the transition points found with 95% confidence ellipses. The blue bubbles anno- tating 2.5(c) (bottom) correspond to the states in Table I. The numbers 1 − 6 in the blue bubbles correspond to the states s1 − s6 respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 27 x 2.6 Simulated dataset of 90 trajectories with 30% added anoma- lies. The original trajectories are in blue and the anomalous trajectories are in red. The circles signify the ends of trajec- tories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.7 Trajectories used from the Duke MTMC dataset, video num- ber 8. The 191 trajectories used to train the offline model in k-folds validation are shown in blue with circles signifying the ends of trajectories. 40 anomalous trajectories used for online anomaly detection analysis are shown in red. . . . . . 34 2.8 An example of the online predictive model. The observed trajectory is shown in magenta. The current state, which includes both transition points and motion patterns, is in green. Possible future states are in shades of blue, scaled by probability, with darker shades corresponding to higher probability of transition, and improbable future states are shaded in gray. In 2.8(a), the trajectory has just started out so there are a wide range of possible future states. In 2.8(b), the trajectory moves within the overlap of clusters, and the probability of transitioning to some states increases with a darker shade of blue, while other states are eliminated. In 2.8(c) the trajectory has only one possible future state to tran- sition to, which is the confidence ellipse shaded in black, the color corresponding to the highest transition probability of one. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.9 Online anomaly detection. The trained model is shown in gray. If the trajectory is found to be aligned with a learned motion pattern, it is highlighted in green. If the sliding win- dow is detected as an anomaly, it is highlighted in red. The segment of the observed trajectory that is no longer in the sliding window is highlighted in black. . . . . . . . . . . . . . 38 3.1 The DeepSemanticHPPC pipeline. (1) In the first stage, ini- tial inputs generate a multi-hypothesis graph of possible paths. (2) In the second stage, the uncertainty in the scene is reduced and path costs are updated. (3) The second stage is repeated for a set number of iterations. This is terminated early if a safe path is confirmed or all paths are confirmed as unsafe. (4) A path is selected. . . . . . . . . . . . . . . . . . . 42 xi 3.2 An example point cloud. (a) Image view of a portion of the environment. (b) Point cloud colored with the most likely class predicted from image (a) (bright green: “grass”; dark green: “tree”; purple: “sidewalk”; dark grey: “road”; light grey: no information available). All the classes except “tree” belong to the set S . The region around the tree is actually mulch/woodchips, which should be classified as “dirt” (be- longing to U). (c) Point cloud colored to show safe (white), unsafe (black), unclear regions R partitioned according to most likely class (random colors), and points not associated with any class (orange). . . . . . . . . . . . . . . . . . . . . . . 49 3.3 An example graph G = (V, A), with poses. Vertices with c(v) = 0 are in green, while vertices with 0 < c(v) ≤ 1 are in red (the darker, the closer to 1). The blue, green, and red axes indicate the pose of the robot. . . . . . . . . . . . . . . . 57 3.4 The prebuilt Africa environment from Airsim used for test- ing the multipath planner. . . . . . . . . . . . . . . . . . . . . 61 3.5 Left: Image from Cass Park in Ithaca. There are multiple dif- ferent terrains in the scene including grass, mud, and water. Right: Annotated safe (blue) and unsafe (red) regions for the Cass Park point cloud. . . . . . . . . . . . . . . . . . . . . . . 62 3.6 Change in uncertainty of path vertices (y-axis) as the number of NBV measurements increase (x-axis). Left: Mann Library. Right: Cass Park. . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.1 from left to right. Current partial map represented as a point cloud annotated with semantic labels (e.g. purple = road); bird’s eye view of the annotated point cloud, where white corresponds to unlabeled pixels whose semantic class will be predicted via image inpainting; inpainted image; obstacle map associated with the inpainted image (white = free space, black = obstacles), shown along with obstacles discovered online (red) and final path to the goal (blue) in one of our simulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.2 Left: The semantic input point cloud of an initial lidar scan. Right: The ground truth completed semantic map. . . . . . . 77 xii 4.3 An example from the generated dataset of point clouds ren- dered from a BEV. Left: Input semantically labeled point cloud with occluded and unknown spaces in white. Middle: Mask of image to be inpainted. Known pixels are black and pixels to be filled are white. Note that grey (unobserved) pix- els in the left image are labeled as known in the image mask as they are not the targets for inpainting. Right: Ground truth labeled semantic point cloud. . . . . . . . . . . . . . . . 78 4.4 A distribution of the semantic pixel labels of the training dataset of the first ten sequences. Each semantic class is shown in the same color as displayed in the dataset. Note that the y-axis is log scale. . . . . . . . . . . . . . . . . . . . . . 79 4.5 Qualitative evaluation of the inpainting network perfor- mance. Odd rows are the RGB semantic maps and even rows are the obstacle maps generated from the semantic BEV images. Columns from left to right: (1) lidar input, (2) inpainted image, (3) stereo input, (4) inpainted image, (5) ground truth (GT). For the obstacle maps, black pixels are obstacles and white pixels are free space. . . . . . . . . . . . . 82 4.6 A comparison of a path planner trial on the lidar input obsta- cle map, the inpainted obstacle map, and the GT map shown in figure 4.5 (image 150). The path is in blue pixels. The start is in the middle left and the goal is in the top right. Red pixels are obstacles the robot observes online. Green pixels show where the path overlaps itself. The light blue circles show a gap in the lidar input map which is filled in on the inpainted image. . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.1 Motivating scenario taken at the Mann Library at Cornell University. The purple region represents where the robot is able to get accurate depth information and perform plan- ning using traditional metrical planning methods. The red arrows represent far away obstacles which the robot must plan around. The orange represents the region where the robot is able to capture semantic but not depth information in the scene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.2 Example of generating a point cloud for an image based planner. (a) Raw image of the environment. (b) A seman- tically segmented image using [14]. (c) A 3D point cloud projected using depth classification from [56]. . . . . . . . . . 97 xiii 5.3 An example of an RRT image space based planner. (a) Sam- ple space with plan hypotheses. The start is in the bottom center and the goal is in the top right. The start and goal are both annotated as red spheres. RRT nodes are shown as blue spheres and edges shown as lines. Blue lines are visible and red lines are occluded. (b) The planner back-projected into the image space. . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.4 A sample image from the manually labeled dataset for in- stance segmentations. The white polygons represent indi- vidual instances of trees. . . . . . . . . . . . . . . . . . . . . . . 99 5.5 Mask R-CNN detections and distance classifications. Tree trunk instances in Om are shown in blue and instances in Oo are shown in purple. . . . . . . . . . . . . . . . . . . . . . . . . 100 5.6 The map representation based on the detections show in Fig- ure 5.5. The metric FOV is shown in light blue with the indi- vidual tree trunk instances in dark blue. The ordinal FOV is shown khaki, with individual tree trunk instances in purple. The obstacles in the ordinal FOV are ordered based on their rough depth estimates. . . . . . . . . . . . . . . . . . . . . . . . 102 5.7 The map used for simulations with manually annotated polygons for obstacles in blue. . . . . . . . . . . . . . . . . . . 104 5.8 The map of Cass Park in Ithaca used for real world experi- ments. The start and goal points are annotated on in light blue. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 5.9 Real world experiment with the baseline planner. (1a-b) show the first measurement and planned path in the sen- sor FOV, (2a-b) show the second and so on. Images labeled (a) show the raw image detection with the detected instance mask projected on to them. Purple masks indicate the in- stance is outside the metric range, and blue masks indicate the instance is inside the metric range. Images labeled (b) show the birds eye view FOV given the sensor location and image detection. The metric FOV is in light blue and obsta- cles in the metric range are blue. The global start location is the green circle and the global goal location is the white cir- cle. The planned path segment in the metric FOV is shown in white. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 xiv 5.10 Real world experiment with the long horizon planner. (1a-b) show the first measurement and planned path, (2a-b) show the second and so on. Images labeled (a) show the raw im- age detection with the detected instance mask projected on to them. Purple masks indicate the instance is outside the metric range, and blue masks indicate the instance is inside the metric range. Images labeled (b) show the birds eye view FOV given the sensor location and image detection. The metric FOV is in light blue and the ordinal FOV is in khaki. Obstacles in the metric range are blue and obstacles outside of the metric range are purple. The global start location is the green circle and the global goal location is the white cir- cle. The planned path segment in the metric FOV is shown in white and the path segment in the ordinal FOV is shown in orange. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 xv CHAPTER 1 INTRODUCTION Motivation Autonomous robots are an exciting field of research due to the benefits they can bring to the daily lives of people. From self-driving cars to search and rescue robots, autonomous robots provide numerous benefits including saved commute time, increased safety, and more time for creative tasks. Ev- ery segment of society will benefit from advancements in the field; billions of dollars are currently being invested into bringing autonomous robots to reality. While purchasing C-3PO at a local Walmart may have seemed like a pipe dream a decade ago, recent advancements in the robotics field including in areas such as actuation, controls, planning, sensors, and computer vision have made science fiction robots more and more realistic. Nevertheless, there are still many challenges to overcome before C-3PO does the dishes. Fully autonomous robots must be able to operate in com- plex environments while making intelligent decisions. A lot of research has been done in structured indoor environments, but there are still many challenges to be addressed in unstructured, uncertain, and outdoor envi- ronments. Youtube videos of a Tesla vehicle mistaking a bright moon with a traffic light is evidence of that. 1 Background One of the most important problems of autonomous robots is understand- ing the world around the robot. In the world there exist both dynamic agents (pedestrians, animals, cars, etc.) and a static scene (road, building, tree, etc.). The robot must understand both dynamic and static elements of the scene in order to operate in the world safely and reliably without caus- ing potentially fatal accidents. The most complex dynamic agent a robot will have to interact with in the world are pedestrians. Pedestrians do not follow rigid rules like cars, and are almost impossible to model using a physics based model. While there exists literature using non-parametric machine learning methods to capture low level pedestrian behaviors [24], [41], [15], these methods only capture low level velocity clusters. In these methods gaussian processes (GP) are fit to clusters of pedestrian trajectories. GPs are random processes where joint distributions over random variables f (x) are fit to a normal distribution such that: p(f|X) ∼ N(f|µ,K) (1.1) where µ is the mean and K is the kernel function representing the covari- ance. K can be manipulated to generate different types of functions f . These bayesian non-parametric models generally only fit to low level 2 velocity behavior such that x is a spatial point and f (x) is the velocity at that spatial point. It is important to model the agents in the world beyond low level velocity and clusters to gain more insight on the high level decisions and behaviours of the agents in the world. Another key aspect of understanding the world is the static world around the robot. There has been a large body work done on modeling and navigating the world in indoor environments [79, 39, 53, 73, 86], but for more complex tasks such as self driving, robustness to outdoor environ- ments is desired. Being able to leverage state of the art computer vision technologies to understand the semantics and structure of the world and the uncertainty in it is the key next step to robust autonomous systems in complex outdoor environments. Recent work on understanding uncertainties include the research from [29] which uses stochastic dropout through a neural network to infer the posterior distribution of the weights. During runtime, T stochastic passes with dropout are run through the network to infer the posterior distribution of the output: 1 ∑T Eq(y*|x*)(y*) ≈ ŷ∗(x∗,Wt ,WtT 1 L) (1.2) t=1 where q is the approximated distribution over the network output, {x∗,y∗} are the network input and output respectively, T is the number of stochastic passes through the network, and W represents the weights of L layers. This 3 is a convenient method which does not require significant network surgery and can be applied to any neural network easily. Other advances in the computer vision community include applying generative adversarial networks (GANs) to image inpainting. GANs use a generative model G and a discriminate model D to train the generator to produce samples from a given training distribution. G attempts to gener- ate samples of input data (in this case, images with missing regions) and D attempts to classify the outputs of G as real or fake. These networks have been able to generate impressive results in image inpainting tasks [95]. A significant challenge in understanding the static environment is deal- ing with occluded spaces in the world. These are spaces that are blocked off from sensor detections due to obstacles in the way. For example, a house blocks a sensor from seeing what is behind the house, where there could be many possibilities such as other houses, trees, or just empty space. Occluded spaces are traditionally treated as free [73, 86]). The traditional planning paradigm is to plan through unknown spaces as if they are unin- habited, and then replan paths if it turns out obstacles actually occupy the occluded space. Lastly, there is the problem of dealing with limitations in sensors at far ranges. Lidar can provide accurate depth at far distances but is too sparse at those ranges for any classification. Vision based approaches can pro- vide rich semantic information at far distances, but stereo depth error scales quadratically with depth. Traditional approaches plan using metric based 4 maps of the world, based on accurate depth detections from sensors [87], [2]. However, these approaches are limited by the aforementioned limita- tions of lidar and image based sensors. Key Research Questions Given the previous work done in the robotics and computer vision commu- nity, there are some key questions that need to be addressed. How can we understand higher level behaviours of pedestrians in a scene? While low level velocities from trajectories can provide insight on pedestrian behavior, a more high level understanding of pedestrian deci- sion making would be ideal in order to understand long term pedestrian motion and potential waypoints. How do we operate when the detections are uncertain because of light- ing conditions or other sources of uncertainty? Bayesian neural networks allow us to extract principled estimates of uncertainty, but how can we ap- ply that to a practical robotics problem? What should the robot do when parts of the world are occluded by ob- stacles? While it is a reasonable heuristic to assume that unknown space is free, it is intuitively not ideal and we should have a better understanding of how to generate informed paths through unknown space. What if obstacles are too far away to get accurate depth information 5 from? Is there a way to incorporate the close range metrical maps with far away semantic detections? Research Opportunities Some key research opportunities informed by the open questions are: • Developing a pedestrian model which is able to incorporate low level velocities with higher level decision making, to gain more insight over long term trajectory patterns. • Applying network detection uncertainties to improve a robots under- standing of the world; creating a more robust system for navigating in uncertain environments. • Leveraging GAN image inpainting to generate data-driven models in occluded spaces of the world for robot navigation. • Combining short distance metrical planners with long distance se- mantic information for more informed planning. Outline This thesis addresses some of the key challenges in robotics by bridging the gap between state of the art computer vision and machine learning tech- niques, and the traditional robotics community. 6 Chapter 2 presents a pedestrian motion model that includes both low level trajectory patterns, and high level discrete transitions. The inclusion of both levels creates a more general predictive model, allowing for more meaningful prediction and reasoning about pedestrian trajectories, as com- pared to the current state of the art. The model uses an iterative clustering algorithm with (1) Dirichlet Process Gaussian Processes to cluster trajecto- ries into continuous motion patterns and (2) hypothesis testing to identify discrete transitions in the data called transition points. The model iteratively splits full trajectories into sub-trajectory clusters based on transition points, where pedestrians make discrete decisions. State transition probabilities are then learned over the transition points and trajectory clusters. The model is for online prediction of motions, and detection of anomalous trajectories. The proposed model is validated on the Duke MTMC dataset to demon- strate identification of low level trajectory clusters and high level transi- tions, and the ability to predict pedestrian motion and detect anomalies on- line with high accuracy. Chapter 3 presents a framework for managing uncertainties from per- ception for planning. Planning in unstructured environments is challenging – it relies on sensing, perception, scene reconstruction, and reasoning about various uncertainties. A novel uncertainty-aware hypothesis-based plan- ner for unstructured environments is proposed. The algorithmic pipeline consists of: a deep Bayesian neural network which segments surfaces with uncertainty estimates; a flexible point cloud scene representation; a next- best-view planner which minimizes the uncertainty of scene semantics us- 7 ing sparse visual measurements; and a hypothesis-based path planner that proposes multiple kinematically feasible paths with evolving safety confi- dences given next-best-view measurements. and a hypothesis-based path planner which develops multiple kinematically feasible paths for the un- certain environment which evolve with incoming information. The pipeline iteratively decreases semantic uncertainty along planned paths, filtering out unsafe paths with high confidence. Experimental validations shows the framework plans safe paths in real-world environments where existing path planners typically fail. Chapter 4 presents a novel framework for planning paths in maps con- taining unknown spaces, such as from occlusions. The approach takes as input a semantically-annotated point cloud, and leverages an image in- painting neural network to generate a reasonable model of unknown space as free or occupied. The validation campaign shows that it is possible to greatly increase the performance of standard pathfinding algorithms which adopt the general optimistic assumption of treating unknown space as free. Chapter 5 presents a semantically aware long horizon planner which leverages long distance semantic information in image detections to inform planning with a short distance metric range based planner. By merging the long distance semantic detections with a short distance planner, the planner is able to alleviate the limitations of short distance metric based planners, which are unable to take long distance semantic information into account. Experiments in simulation and the real world demonstrate the ability of the 8 proposed planner at encoding semantic detections into a metric planner. 9 CHAPTER 2 PEDESTRIAN MOTION MODEL USING NON-PARAMETRIC TRAJECTORY CLUSTERING AND DISCRETE TRANSITION POINTS 2.1 Motivation Predictive models allow systems interested in pedestrian behavior to per- form higher level inference and reasoning about scenes involving pedes- trians. Models of how pedestrians navigate an environment are useful for many robotics applications, including: surveillance robots [91] (e.g. under- standing patterns of how people walk in a subway station could help detect anomalies, such as a person running with a bag after a theft) and search and rescue robots [62] (e.g. understanding movement patterns in a build- ing on fire or during an earthquake can give information about where peo- ple may be in the building). Insight can be obtained on pedestrian behaviors and patterns by applying machine learning techniques to data of pedestrian motion and learning predictive models much smaller in size than original datasets. Currently there are advanced and accurate systems for tracking multi- ple targets with cameras within a scene [46], [55]. However, to gain more insight into the scene and behaviors, patterns, motivations, anomalies etc., a more comprehensive model of pedestrian motions and behaviors is re- quired. In this paper, a non-parametric model is developed to capture 10 the statistics of pedestrian trajectories and reason about pedestrian mo- tion. Given the challenges of physics based modeling of pedestrian motion and the availability of large datasets of pedestrian movements, data-driven models are intuitively appropriate. Data-driven models do not make any assumptions about pedestrian motion, such as optimizing trajectory dis- tance or avoiding collisions with other pedestrians that may constrain the model to produce unrealistic results. Instead, data-driven models have the flexibility to fit to whatever the data suggests. Machine learning techniques can be used to develop predictive models much smaller in size than origi- nal datasets. In this paper, descriptive transition points are learned from the data to represent high level behaviors where trajectory patterns branch or merge together (for example, at a fork in the road). These transition points are learned in conjunction with low level trajectory clusters in an iterative algorithm to form a general model of pedestrian motion. By integrating the two levels, the model can be used for many tasks, such as online clustering of trajectories, anomaly detection, or reasoning about future movements and goal locations. These high level understandings about human motion patterns and behaviors are crucial for many robotics applications. For self- driving cars, the model provides predictions of how and where pedestrians will walk around cars, roads and intersections [34]. For surveillance, the model gives locations for a patrol robot to intercept pedestrians. For per- sonal robotics, the model provides predictions for collision avoidance in all types of environments (museums, at home, etc.). Fig. 2.1 provides a motivating example for the motion model in this 11 Figure 2.1: A motivating example for the motion model in this paper. The local motion patterns are labeled m1,m2,m3 and shown in pur- ple, light blue, and green respectively. There are four transition points annotated with blue bubbles. There is an anomalous trajectory shown in red. Background image from [8]. paper. The pedestrian movement patterns are decomposed into continuous motion patterns and discrete transition points. Transition probabilities are learned between the states, which are both motion patterns and transition points, the pedestrian can be in. The model is able to perform high level inference about pedestrian motion in the scene and also detect anomalous pedestrian behaviors. 12 2.2 Related Work Pedestrian motion models have been developed from trajectory data in sev- eral ways. One approach involves using Bayesian non-parametric mod- els to cluster trajectories, such as Gaussian Processes (GP) and Dirichlet- Process Gaussian Processes (DPGP) [24], [41], [15]. A learned GP over a cluster of trajectories is referred to as a motion pattern, which maps {x, y} positions to a velocity flow field [41]. While these models have the flexibil- ity to fit variable trajectory data and identify clusters of trajectories, they do not capture higher level transitions, such as trajectories branching, or dis- crete pedestrian decisions such as stopping at an intersection. Consider the example of two clusters of trajectories with a transition point where trajec- tories branch apart, such as at a fork in the road. There is an overlapping region between the two clusters before the fork and non-overlapping re- gions after the clusters branch apart. The motion patterns learned by the GP clustering would not identify the common overlapping region referred to as the sub-trajectory cluster nor the discrete location where the clusters branch apart referred to as the transition point. Although these non-parametric models capture coarse behaviors, higher level goals of transition points are important for subsequent reasoning. Other models using Bayesian non- parametrics have combined the temporal learning advantages of long short- term memory (LSTM) networks with the flexibility of GPs [83]. Another model uses LSTM by itself to learn movement and predict motion [1]. How- ever, these models also do not capture higher level transitions correspond- 13 ing to changes in pedestrian behavior, and are only meant for short term trajectory predictions. Another approach uses discrete state transition models to learn trajec- tory motion, such as Hidden Markov Models (HMM) [64], [40], [60], [49], [88]. One of the challenges with using HMM is defining discrete latent states that capture high level motion characteristics. In contrast, in this paper the discrete states are intuitively defined as transition points, which are for- mally inferred from low level trajectory data. Some HMM approaches use a grid representation to define states. These models have trouble identifying motion patterns and higher level behaviors among trajectory clusters. Grid based approaches also can lose valuable information when real trajectories do not fit cleanly into the grid space. Other models identify transition states as locations where many trajectories converge in a gridded space [38]. In these cases, information about trajectory motion patterns is lost and com- plex transition points are not typically modeled. A more general approach to defining and modeling transitions between general discrete states is de- sired. The work closest to this paper uses a dictionary learning algorithm to discover local motion patterns which are similar to sub-trajectory clusters [16]. This model cannot learn intuitive discrete transition points, such as trajectories branching and does not consider multiple future transitions be- tween subtrajectory clusters. There are many other approaches to modeling movement patterns such as Vector Field k-Means [27], finding subtrajecto- 14 ries by clustering line segments [54], and finding frequent patterns through error metric clustering [76]. These methods identify common patterns and cluster trajectories accordingly. However, high level trajectory behavior at transition points is not captured. A model that captures high level discrete decisions is required for higher level reasoning. Generally, data-driven trajectory models from the community have also not considered anomaly detection online, which is addressed in the model presented in this paper. This paper provides the following contributions: • A novel iterative probabilistic clustering algorithm, which finds low level clusters of sub-trajectories using DPGP and discovers high level transition points with hypothesis testing. The number of clusters and transition points is not pre-specified. • Online probabilistic prediction of both trajectories and novel high level transition points. • A formal way to discover anomalous trajectories online 2.3 PEDESTRIAN MOTION MODEL Fig. 2.2 shows a block diagram of the proposed pedestrian model and algo- rithmic flow. First, raw trajectory data are input into an iterative algorithm 15 with Dirichlet Process Gaussian Process (DPGP) clustering and hypothe- sis testing, which produces trajectories clustered into sub-trajectory motion patterns and estimated transition points. The clusters and transition points are then used to find transition probabilities. Finally, in online application, the motion patterns, transition points, and transition probabilities are used for predictive modeling and anomaly detection. Figure 2.2: Algorithmic flow of iterative DPGP clustering and hypothesis test transition point estimation (black) of the proposed pedes- trian model (blue) for both low level trajectories and high level transitions. The model is then used for online inference (green). 2.4 MOTION & TRANSITION POINT MODELING The pedestrian dataset is composed of the set of nt trajectories T = {t1, ...tk, ...tnt}, where tk is an individual trajectory. Each trajectory tk is divided into a sequence of nk points [{xk , yk}, ...{xk, ykp 1 1 i i }, ...{xkk , ykk }]T , tki = {xki , yki } repre-np np sents the ith data point, and {x, y} represents positional data. The time inter- 16 val between two consecutive points i and i + 1 in a trajectory is constant. A motion pattern is defined as a mapping from {x, y} to {ẋ, ẏ}, where {ẋ, ẏ} are the time derivatives of the x and y positions respectively. The motion pat- tern enables a distribution of velocity to be estimated given any 2-D point in space. The set of motion patterns M is defined as {m1, ...m j, ...mnm}, where m j is an individual motion pattern and nm is the total number of motion pat- terns in M. Each motion pattern m j contains nmj trajectories, and there are no shared trajectories between motion patterns. The goal of DPGP trajec- tory clustering is then to divide the trajectory data T into nm representative motion patterns. The Dirichlet Process (DP) [85] is a nonparametric clustering process that in this case discovers a mixture of GP models. DP allows for automatic dis- covery of the number of clusters which grows as the data becomes more complex. Gibbs sampling is used for the clustering process of DP. An ini- tial number of clusters is manually chosen and the cluster assignments are randomly initialized for the DP. In the clustering process, each trajectory tk belongs to an existing motion pattern m j with prior probability nmj − (2.1)α + nt 1 and belongs to a new motion pattern with prior probability α α + nt − (2.2) 1 17 where α is a concentration parameter. The GP is used as a likelihood metric for the DP; GP [6] models motion patterns by mapping the set {x, y} to the set {ẋ, ẏ}. The proposed GP uses a radial basis function (RBF) kernel [41] to weight local data, defined as (x − x′)2 ′ 2 Kx(x, y, x′, y′) = σ2x exp(− 2 − (y − y ) 2 ) + σ 2 nδ(x, y, x ′, y′) (2.3) 2ux 2uy where Kx is the kernel function in the x direction, σx and σn are standard deviations for the x direction and noise respectively, and ux and uy are length scales in the x and y directions respectively. δ is the dirac delta function. A separate kernel Ky is computed for the y direction and is defined similarly. The kernel function represents the covariance between the set of points {x, y} and {x′, y′}. The predicted velocity of a motion pattern at a candidate new point {x∗, y∗} is f ∗ = {ẋ∗, ẏ∗}with the distribution f ∗ ∼ N(KT K−1∗ f ,K∗∗ − KT∗ K−1K∗) (2.4) where K = {Kx,Ky}, K∗ = K(x, y, x∗, y∗), K∗∗ = K(x∗, y∗, x∗, y∗), and f represents the training data {ẋ,ẏ} from the motion pattern. The GP likelihood of a trajectory belonging to a motion pattern m j is chosen as ∏nkp ∏nkp l(tk,mk = m j|θ) = p(ẋk|tki ,m j, θ) · p(ẏk ki |t ,m j, θ) (2.5) i=1 i=1 18 where θ are the GP hyperparameters of m j, and ẋki and ẏ k i correspond to the ith point in tk. An issue with using GPs as a likelihood metric is the planar shift prob- lem [15]. In [15] a grid based approach is used instead of GPs as a likeli- hood metric. However, a grid based approach loses the flexibility of GPs, so in this paper the likelihood is adjusted to address the planar shift problem. The GP only accounts for the differences in velocity distribution in (5), so a weighting term to account for the positional distribution of trajectories is added into the likelihood ∏nkp l kw(t ,mk = m j|θ kj) = l(t ,mk = m k kj|θ j) · w(xi , yi ,m j, , β) (2.6) i=1 ( nm )β w(xk, yk j, i i ,m j, , β) = 1 + (2.7)nmj where  is a parameter for a neighborhood around {xk ki , yi }, β is a weighting parameter, nmj, is the number of trajectories in m j that pass through the  neighborhood surrounding {xk, yki i }, and w(xki , yki ,m j, , β) is a weighting term that accounts for the positional distribution of tk with respect to m j. Taking the log likelihood of lw gives 19 Figure 2.3: The  neighborhood of point {xk, yki i }. The  neighborhood is in blue, m j is in green, {xk, yki i } is the red dot, and tk is in red. The percentage of trajectories in m j that pass through the  neigh- borhood of each point in tk is used to weigh if the positional distribution of tk matches the positional distribution of trajec- tories in m j ∑nk kp ∑np log lw = log p(ẋ k|tki ,m , θ ) + log p(ẏ k kj j i |t ,m j, θ j)+ i=1 i=1 ∑ (2.8)nkp nmj, β log(1 + ) n i=1 mj The positional weighting term w intuitively accounts for the percent of trajectories of m j in the space near each point in tk, and weighs if the po- 20 sitional distribution of tk matches the positional distribution of trajectories in m j. The parameter β is a tuning parameter for how much the positional distribution is accounted for in the final likelihood. The probability of a trajectory tk being assigned to an existing cluster m j in the DPGP can be defined as nmj p(m k kk = m j|t ,m j, θ) ∝ − · lw(t ,mk = m j|θ) (2.9)α + nt 1 and the probability of being assigned to a new cluster is ∫ α p(mk = mnm+1|tk,mnm+1, θ) ∝ − · lw(t k,m = m n 1 k nm+1 |θ)dθ (2.10) α + t The new motion pattern parameters, which are represented by raw trajecto- ries, over which tk is evaluated for in equation (2.10) are randomly sampled. Algorithm 1 provides pseudocode of the DPGP clustering process. In practice, the algorithm generally converges quickly within five iterations. 2.4.1 Transition Point Estimation Transition points are defined as spatial areas where motion patterns un- dergo transition behavior, such as discrete branching and merging or dis- crete changes in velocity. Transition points typically reflect discrete deci- sions by the pedestrian when walking, such as branching or merging at a 21 fork in the road, or stopping at an intersection. Transition points separate sub-trajectories when appropriate. Formally, we define a transition point as a Gaussian distribution over spatial 2-D points where transition behavior occurs. Branching or merging is defined when sub-trajectories, or subsets of Algorithm 1: DPGP Trajectory Clustering Input: Trajectory dataset T Output: Trajectories clustered into motion patterns M Initialization : each trajectory tk is randomly grouped into one of nm clus- ters, where nm is manually chosen as the initial number of trajectories 1: for iterations do 2: for i = 1 : nt do 3: for j = 1 : nm do 4: calculate p(m = m |titi j ,m j, θ) 5: end for 6: calculate p(mti = m inm+1|t ,mnm+1, θ) 7: assign ti to an existing cluster or start a new cluster based on the highest calculated posterior. 8: end for 9: set nm = number of clusters 10: end for 11: return 22 the clustered trajectories, are shared. Given two motion patterns, the sub- trajectory is found via a two-tailed hypothesis test. Consider a point q that is part of a trajectory in motion pattern m1, and a second motion pattern m2. It is desired to discover if m1 and m2 have a shared sub-trajectory, and if q is a point in the sub-trajectory. The null hypothesis H0 and alternative hypothesis HA are defined as H0 : q = part of a shared sub-trajectory (2.11) HA : q , part of a shared sub-trajectory (2.12) The test statistic is calculated as tq ∝ g(q, f q,2, f q,1, θ)w(q,m2, , β) (2.13) g = p(q, f q,2| f q,1, θ) (2.14) where the quantity tq is the test-statistic for q. f q,1 is the predicted distribu- tion of velocity at q given the m1 GP, and f q,2 is the predicted distribution of velocity at q given the m2 GP. The term w(q,m2, , β) is the same weight- ing term from (2.7) If the p-value calculated from tq is less than the speci- fied significance level σ, then H0 is rejected. In other words, given a single {x, y} point q in m1, it is desired to find if q has a matching velocity and positional distribution with m2. For each trajectory in m1, all points with overlapping velocity and positional distributions with m2 are found, giving potential sub-trajectories. 23 The branch or merge transition points are either at the beginning or end of a sub-trajectory depending on whether the trajectories are joining together or splitting apart. The points adjoining the sub-trajectories and defined to be branch or merge points are fitted with a normal distribution. Fig. 2.4 shows potential sub-trajectory points found for merging motion patterns in the Duke MTMC dataset [71]. The sub-trajectory points found intuitively appear in the region of overlap between the two motion patterns. The spatial distribution over the location where the red pattern merges with the blue pattern is a transition point. 2.4.2 Iterative Clustering The iterative clustering process uses both DPGP clustering and sub- trajectory discovery. The algorithm starts with raw trajectory data and no cluster assignments. Next, 1) DPGP is performed to cluster the trajecto- ries into representative motion patterns. 2) A pairwise comparison is per- formed between sets of two motion patterns until a pair with a potential sub-trajectory is discovered. 3) Once a sub-trajectory is discovered, the mo- tion patterns are split based on where the sub-trajectory is into smaller local motion patterns. 4) If no sub-trajectories are discovered then the algorithm has converged. 5) If a sub-trajectory was found and split, return to step 1. Algorithm 2 provides the pseudocode for the iterative clustering. 24 Figure 2.4: Discovery of sub-trajectories for the case of merging motion patterns. There are two motion patterns in red and blue. Ar- rows in the top left show the directionality of the motion pat- terns. Circles on the trajectories indicate where the trajectory data ends. Potential sub-trajectory points in the red pattern are indicated with red crosses. The data is from video 3 of the DukeMTMC project [71]. 25 2.4.3 Transition Probabilities Transition probabilities are learned over the transition points and motion patterns in a transition probability matrix (TPM). Given the clustered data from the iterative DPGP clustering, DBSCAN is first used to discover the transition points. DBSCAN is a density based clustering algorithm [25]. The points at the start and ends of trajectories are inputs into DBSCAN and Algorithm 2: Iterative Trajectory Clustering Input: Trajectory dataset T Output: Trajectories clustered into motion patterns of representative sub- trajectories M 1: while true do 2: DPGP Clustering 3: find potential sub-trajectory between motion patterns m1 and m2 with pairwise comparisons through hypothesis testing 4: if no sub-trajectories are found then 5: break 6: end if 7: split the trajectories in the compared pair of motion patterns into smaller sub-trajectories corresponding to a new motion pattern 8: nm = nm + 1 9: end while 10: return 26 Figure 2.5: Results of the iterative DPGP trajectory clustering algorithm and transition point estimation on simulated data with 90 tra- jectories. The trajectories move in two directions, and the top and bottom rows correspond to each of the directions respec- tively. 2.5(a) is the initial clustering result before any splitting of trajectories. 2.5(b) shows the iterative clustering result after convergence. 2.5(c) shows the transition points found with 95% confidence ellipses. The blue bubbles annotating 2.5(c) (bot- tom) correspond to the states in Table I. The numbers 1 − 6 in the blue bubbles correspond to the states s1 − s6 respectively. the output is clusters of transition points. Normal distributions are fit over the transition point clusters to find a spatial distribution. The set of states S in the TPM model are defined as {s1, ...sh, ...sns}, where sh is either a transition point or motion pattern and ns is the total number of 27 states. The TPM stores transition probabilities between the states in S . The probability di p(s ji|s j) = (2.15)d j is the i jth element of the TPM, d j is the number of elements or trajectories in s j, and dij is the number of elements in s j that pass through si at a future time step. The matrix contains probabilities for all possible transitions, short term and long term. Fig. 2.5 illustrates the iterative clustering results on a simulated data set with 90 trajectories. The simulated dataset represents an intersection with a total of six trajectory clusters that can be further divided into ten sub- trajectory motion patterns and twelve transition points. The trajectories move along the paths in two directions, which are processed simultane- ously in the model. The initial DPGP clustering result is shown in fig. 2.5a. 6 motion patterns were initially found before splitting into sub-trajectories. The circles represent the ends of trajectories. The trajectories move along the paths in 2 directions. The top row and bottom row figs. correspond to each of the directions respectively. Both directions are processed simulta- neously and are separated in the fig. 2.5 only for clarity. Fig. 2.5b shows the clustering result after running the iterative algorithm. 10 motion pat- terns are found after iteratively splitting the initial 6 motion patterns into sub-trajectory patterns. Fig. 2.5c shows the transition points discovered with 95% confidence ellipses. The annotated blue bubbles in fig. 2.5c (bot- 28 Table 2.1: Transition Probabilities Between Transition Points States s1 s2 s3 s4 s5 s6 s1 0 1 0 0.43 0.57 0.43 s2 0 0 0 0.45 0.55 0.45 s3 0 0 0 1 0 0.93 s4 0 0 0 0 0 1 s5 0 0 0 0 0 0 s6 0 0 0 0 0 0 tom) correspond to the states in Table I. The iterative clustering algorithm successfully finds all ten motion patterns and all twelve transition points after iteratively splitting the initial six patterns into sub-trajectory patterns. The patterns and transition points match what intuition of the motion in the scene should be. Table 2.1 shows the TPM entries for the pedestrian states s1 − s6, which are automatically extracted from the simulated data shown in Fig. 2.5. For brevity, Table 2.1 only includes transition probabilities for the transition points shown in Fig. 2.5(c) (bottom). The full TPM includes transitions probabilities between all motion patterns and transition points. The val- ues in Table 2.1 match intuition of what transition probabilities should be between the states annotated in Fig. 2.5(c) (bottom). 29 2.5 Online Prediction of Motion and Anomaly Detection Once the offline model has been trained, the model can be used for (1) online predictive modeling and (2) online anomaly detection. 2.5.1 Predictive Modeling Online predictive modeling makes use of both low level trajectory motion patterns and high level transition points. Given a new observed trajectory to, the algorithm attempts to cluster that trajectory with an existing motion pattern. The model assigns to to the motion pattern mo with the highest likelihood from (2.7). That is, to is assigned to mo satisfying mo = arg maxm lw (2.16)j where lw is the likelihood from (2.7). Because it is desired to cluster to with local sub-trajectories, a sliding window is used in online clustering with window size ws (in time steps of the observed trajectory data). As to is ob- served, the last ws points within the trajectory are assigned to a learned model motion pattern. Given the assigned motion pattern, the model as- signs probabilities to future states S in the TPM. Aside from the motion pattern assignment, to may also pass through the spatial distribution of a transition point. If this occurs, the online model 30 continues clustering with the sliding window but gives precedence to the predictive distribution of future states of the transition point. 2.5.2 Anomaly Detection The same sliding window approach for predictive modeling is also used for anomaly detection. The maximum likelihood in (2.16) is compared to a pre- determined threshold lthresh to classify a sliding window trajectory segment as anomalous. First, (2.16) finds mo which maximizes lw. Then, if lw > lthresh the cluster assignment mo is kept for predictive modeling. If lw ≤ lthresh the sliding window segment is classified as an anomaly with no predictive states. 2.6 VALIDATION OF THE MOTION MODEL For validation, the simulated dataset shown in Fig. 2.5 and the Duke MTMC dataset [71] are used. The Duke dataset contains eight annotated videos taken at 60 fps on the Duke University campus. Annotations include man- ually labeled trajectories of pedestrians moving around a scene. Given the walking speed of the pedestrians, the data is down-sampled to 2 fps without losing any clarity in trajectories or motion patterns. The {x, y}measurements defined in a global frame given by the Duke dataset are used as inputs to the model. 31 In section 2.6.1, the robustness of the learned model to anomalous data is evaluated using the simulated dataset in Fig. 2.5 due to the ability to control the parameters of the simulation. The accuracy of online prediction and the anomaly detection accuracy are evaluated using the Duke dataset, given the challenges of an unstructured environment and not-ideal sensors, in sections 2.6.2 and 2.6.3 respectively. 2.6.1 Robustness of the Iterative Clustering Algorithm to Anomalous Training Data To evaluate the robustness of the learned model, varying amounts of anomalous trajectories are added to the simulated dataset (see Fig. 2.6). Anomalous trajectories are similar to adding noise to the data for learning. The amounts of anomalous trajectories added are in increments of 10% of the size of the original dataset, until up to 30%. The resulting trajectory cluster assignments are compared to the ground truth to find the percent of trajectories that have incorrect cluster assignments. The results were av- eraged over 10 trials for each increment of noisy data. Fig. 2.6 shows the original simulated dataset of 90 trajectories with 30% anomalies added, to- taling 117 trajectories. Table 2.2 shows the iterative clustering algorithm is robust to anomalies with an error of 9.1% with a training dataset of 30% anomalies when com- pared with the ground truth. The ground truth takes labels from the fully 32 Figure 2.6: Simulated dataset of 90 trajectories with 30% added anomalies. The original trajectories are in blue and the anomalous trajec- tories are in red. The circles signify the ends of trajectories. trained model in Fig. 2.5 and includes additional labels for anomalies. The algorithm is still able to extract the primary underlying motion patterns in Fig. 2.5 and iteratively split the trajectory data accordingly, while classifying the majority of the anomalies into their own singleton clusters. The num- ber of transition points found did not change with the added anomalies. The presence of outliers did not have a significant effect on the underlying 33 distribution of motion patterns discovered by the algorithm. Table 2.2: % Error for Different Amounts of Anomalies Anomalies (%) 10 20 30 Error (%) 4.6 6.5 9.1 Figure 2.7: Trajectories used from the Duke MTMC dataset, video number 8. The 191 trajectories used to train the offline model in k-folds validation are shown in blue with circles signifying the ends of trajectories. 40 anomalous trajectories used for online anomaly detection analysis are shown in red. 34 Figure 2.8: An example of the online predictive model. The observed tra- jectory is shown in magenta. The current state, which includes both transition points and motion patterns, is in green. Pos- sible future states are in shades of blue, scaled by probabil- ity, with darker shades corresponding to higher probability of transition, and improbable future states are shaded in gray. In 2.8(a), the trajectory has just started out so there are a wide range of possible future states. In 2.8(b), the trajectory moves within the overlap of clusters, and the probability of transition- ing to some states increases with a darker shade of blue, while other states are eliminated. In 2.8(c) the trajectory has only one possible future state to transition to, which is the confidence ellipse shaded in black, the color corresponding to the highest transition probability of one. 2.6.2 Pedestrian Prediction Accuracy For analysis of pedestrian prediction accuracy, k-fold cross validation with k = 10 was used. The model was evaluated on 191 trajectories extracted from the Duke MTMC dataset, video number 8. For each fold, 90% of the 191 trajectories are used for training the model and the remaining 10% are used to simulate online tracker readings of new trajectories; the trajectories are shown in blue in Fig. 2.7. To evaluate the model, two metrics are in- 35 troduced. 1) Prediction Accuracy: the percent of time in the total observed trajectory that the final state the trajectory will reach is part of the distribu- tion of predicted states and 2) Transition Prediction Time (TPT): the number of time steps after leaving a transition point needed for the model to accu- rately predict the next transition point. The sliding window size ws = 6 was manually chosen to balance resolution and long term characteristics. Pre- diction accuracy of the proposed model is compared with a constant veloc- ity prediction model as a baseline where the average velocity of the sliding window is extrapolated to predict future locations of the pedestrian. TPT is not applicable to a constant velocity prediction model, since high level transitional description of trajectories is the proposed models novelty and not available in the competing model. The metrics were evaluated across the 10 folds and the results are shown in Table 2.3. A statistical t-test is performed to test the significance of the difference between predictor accuracies. The proposed model performed significantly better than the constant velocity baseline (p < 0.01), yielding the correct final state in the predicted distribution at 95% of the time on average, compared with 39% for constant velocity predictions. For TPT, the proposed model finds the next transition point within 1.1s on average, i.e. 2.2 time steps at 0.5s measurement intervals, which is fast considering the speed of pedestrians. The algorithm is able to forecast long term final locations of new observed trajectories and can converge on future states fast enough for online prediction. 36 Table 2.3: K-folds Cross Validation Results Prediction Prediction TPT(s): Accuracy (%): Accuracy(%): Metrics Proposed Proposed Constant Model Model Velocity Mean 95 39 1.10 Max 99 51 1.50 Min 89 31 0.85 Fig. 2.8 shows an example of the online predictive model when observ- ing a new trajectory; the model is shown in one-directional flow of trajec- tories for brevity. Note the trajectory patterns all move in two directions. The observed trajectory is shown in magenta. The current assigned state is shown in green. Possible future states are shown in blue with darker shades indicating a higher probability of transition, and improbable future states are shown in grey. In Fig. 2.8(a), the newly observed trajectory can potentially transition to most future states. In Fig. 2.8(b), as the person moves within the first transition point, the transition probabilities toward the remaining future states are increased, and such future states are shown in a darker shade of blue than in Fig. 2.8(a), while other improbable states are eliminated. In Fig. 2.8(c), the observed trajectory can only transition to one future state, which includes a transition point where trajectories leave the scene. The model results match intuition and accurately predict the dis- tribution of future states of the observed trajectory. 37 Figure 2.9: Online anomaly detection. The trained model is shown in gray. If the trajectory is found to be aligned with a learned motion pattern, it is highlighted in green. If the sliding window is de- tected as an anomaly, it is highlighted in red. The segment of the observed trajectory that is no longer in the sliding window is highlighted in black. 2.6.3 Online Anomaly Detection Each of the 10 folds from the split k-folds data is evaluated on accuracy of detecting anomalous trajectories online. The same 40 anomalies, which are separate from the split k-folds data, shown in red in Fig. 2.7 are evaluated on each training fold. The trained model is evaluated based on the ability to detect sections of trajectories that exhibit anomalous behavior based on human labels of anomalies. The average correct anomaly detection rate was very high at 95%, with 38 a standard deviation of 3%. The accuracy is measured based on the per- centage of anomalous trajectories where all anomalous behavior is cor- rectly identified by the model. Fig. 2.9 shows a visualization of the on- line anomaly detection. Fig. 2.9(a) shows an example where the trajectory is initially assigned to a motion pattern. As the trajectory moves forward and additional info is collected (Fig. 2.9(b)), the trajectory is labeled an anomaly once it diverges from learned motion patterns. In summary, the online anomaly detection is able to both detect (1) whether a trajectory is an anomaly and (2) which parts of the trajectory are anomalous with high accuracy. 2.7 Model Complexity The most computationally expensive part of the model is trajectory cluster- ing with the GPs which is O(N3) due to inversion of matrices. Cholesky de- composition is used to speed up inversion and reduces complexity to O( N 3 3 ). The algorithm processes each new sliding window online in an average of 0.63s and is feasible to implement online. Experiments are done in MAT- LAB on a machine with the specifications: Intel(R) Xeon (R) CPU E5-1630 v3 @ 3.70GHz, 3701 Mhz, 4 Cores, 8 Logical Processors, 40 GB RAM, 64-bit Windows OS. 39 2.8 TAKEAWAYS A probabilistic, two level pedestrian model is developed that captures low level motion patterns using an iterative Dirichlet Process Gaussian Process clustering model, high level transition points through hypothesis testing, and transition probabilities between pedestrian states. Results show that the model captures high level discrete behavior such as discrete pedestrian decisions at an intersection, while also retaining low level clusters of tra- jectories. The iterative model is able to learn both a high level pedestrian decision model in conjunction with low level continuous motion patterns. That is, high level transition point estimation is influenced by the low level motion pattern clustering, and vice versa. Given a new observed trajectory, the model can both quickly generate a distribution over future states and also detect anomalous behavior. The novel motion model is a potentially powerful tool for many applications, such as mobile robot path planning to intercept a person of interest at a future time or identifying anomalous pedestrian behavior at an intersection for autonomous driving. 40 CHAPTER 3 DEEPSEMANTICHPPC: HYPOTHESIS-BASED PLANNING OVER UNCERTAIN SEMANTIC POINT CLOUDS 3.1 Motivation Path planning for complex outdoor environments is challenging due to the unstructured nature of environments that do not fall neatly into discretized space. Moreover, different terrain surface types can be difficult to detect with traditional sensing modalities. In indoor environments, a grid space representation with lidar sensors is sufficient [79, 39, 53]. Outdoor environ- ments exhibit complex geometries and surface types, which are difficult– if not impossible–to differentiate using just lidar data. Therefore, a more flexible scene representation, surface classification using computer vision techniques, and reasoning about scene uncertainties are necessary. Previous work for outdoor planning has focused on classifying ter- rain and surface roughness using SVM classifiers [89, 58], neural networks [13, 22], and various other computer vision techniques [59, 28]. While these techniques can differentiate between simple terrain types, they do not model the inherent uncertainties and ambiguities in complex scenes which make it difficult to differentiate between terrain types (e.g., an offroad robot driving through a patch of grass with small rocks). Many current outdoor planning approaches still rely on grid maps which do not model the com- 41 Figure 3.1: The DeepSemanticHPPC pipeline. (1) In the first stage, initial inputs generate a multi-hypothesis graph of possible paths. (2) In the second stage, the uncertainty in the scene is reduced and path costs are updated. (3) The second stage is repeated for a set number of iterations. This is terminated early if a safe path is confirmed or all paths are confirmed as unsafe. (4) A path is selected. plex geometry of an outdoor scene (e.g., a field with irregular bumps and rocks) [65, 26]. Recent work models outdoor maps for planning with a point cloud [48], which is more flexible and suitable for unstructured scenes; how- ever, [48] uses traditional lidar sensing which cannot differentiate between different surface types as broadly as computer vision. In this paper we present DeepSemanticHPPC (Deep Semantic Hypothesis-based Planner over Point Clouds), a novel algorithmic pipeline 42 for planning over uncertain semantic point clouds, which leverages a Bayesian neural network (BNN) [29, 44] to extract principled estimates of segmentation uncertainty. This allows our framework to reason about am- biguous terrain as well as robustly handle false positive detections by tak- ing additional measurements to reduce semantic uncertainty in the scene. However, each measurement is costly due to the computationally expen- sive nature of Bayesian neural networks operating on a robotic platform with limited computing power. Our planner hence employs next-best-view (NBV) techniques [7, 37, 21] to optimize for new measurements. DeepSe- manticHPPC includes: • the employment of a deep Bayesian neural network [29, 44] to obtain surface and obstacle semantics with uncertainty estimates for unstruc- tured outdoor environments; • a flexible point cloud scene representation; • a next-best-view planner which minimizes the uncertainty of terrain semantics using sparse visual measurements; • a hypothesis-based path planner (extending [48]) that proposes mul- tiple kinematically feasible paths with evolving safety confidences given the NBV measurements. Experimental results with real environments show that our pipeline plans safe paths in real-world environments where existing path planners typically fail. Fig. 3.1 illustrates DeepSemanticHPPC. In the first stage, 43 a multi-hypothesis planner generates multiple hypotheses of possible safe paths given a scene belief. In the second stage, a NBV function calculates NBV poses and associated rewards. These poses and rewards are input to a NBV selection block which selects the best feasible NBV. A BNN extracts semantic segmentations and associated uncertainties from the NBV mea- surement, which are used to generate a new scene belief. The new scene belief reduces hypothesis uncertainty, and the second stage is repeated for a set number of iterations. Finally, a safe path (hypothesis) with high con- fidence is selected; if all paths are classified as unsafe, then no path is se- lected. The algorithm terminates once a path is confirmed safe or all paths are confirmed unsafe. 3.2 Background 3.2.1 RRT-Based Non-Holonomic Planning over a Point Cloud We build upon an existing rapidly-exploring random tree (RRT) [51] based planner for finding kinematically feasible trajectories over non-planar point cloud environments [48]. A core contribution of [48] is computing 6D poses and trajectories on point cloud surface representations that are compliant with the robot’s physical (e.g. maximum sustainable roll and pitch angles) 44 and kinematic constraints. Please refer to [48] for full details. 6D robot poses are expressed by transformation matrices belonging to the Special Euclidean Group SE(3). A matrix TMR specifies the position and orientation of a robot-fixed coordinate frame R expressed in a given refer- ence map frame M. In particular, a pose expressed in coordinate frame A can be stated in terms of the position and orientation of another frame B w.r.t. A as  RAB tABT =   AB  ᵀ  ∈ SE(3),0 1 where RAB is a rotation matrix belonging to the Special Orthogonal Group SO(3) and t 3AB ∈ R is a translation vector. Robot poses can then be denoted by matrices TMR, which specify the position and orientation of a robot-fixed coordinate frame R expressed in a given reference map frame M. [48] considers the following planning problem: given start and goal poses TMS, TMG, and a point cloudM = {mi} with mi ∈ R3, compute a con- necting trajectory π : R>0 → SE(3). The trajectory has to satisfy a number of constraints up to a given degree of approximation: contact with the terrain surface, static traversability (e.g., bounded roll and pitch angles), and kine- matic constraints – including bounded continuous curvature. Trajectories are represented as piecewise continuous functions in the 6D space of robot poses, and are specified by a sequence of nodes π̂ = [N k], where each N k is a tuple (T k k kMRk , τ ,w , κ ). Here, TMRk is a 6D pose attached to the terrain surface, τk ∈ [0, 1] is the associated static traversability value, wk is a pa- 45 rameter vector specifying a short planar trajectory segment connecting TMRk to the next pose in the sequence, and κk is the curvature at the beginning of the trajectory segment. wk specifies a trajectory segment as a cubic cur- vature polynomial [63] evolving along the planar patch defined by the xy plane of the coordinate frame R attached to TMRk . The end point of such a trajectory segment gives the subsequent pose TMRk+1 through a projection on the terrain surface via f : (M,TMR) 7→ TMR; f queriesM for the K nearest- neighbors of the end point, which can be thought of as the points the robot will lie on at TMRk+1 (K depends on the size of the robot and point cloud den- sity). We use φ(N k+1) to denote such points. The actual start and goal poses are also obtained by applying f to TMS and TMG. Leveraging the above trajectory representation, [48] proposes to define a small set of motion primitives (short trajectory segments) and use them to grow two RRTs1, one from the start pose and one from the goal pose, and iteratively try to connect them. Each new pose is associated with a nodeN k, which is accepted in the tree only if τk > 0. Connections between close poses belonging to different trees can be attempted by generating short trajectory segments on demand. [48] also proposes a technique to derive a better tra- jectory (in terms of smoothness and distance) starting from an initial one. This second optimization stage is not explicitly considered in this work, be- cause it is easily generalized and applied to all “safe” trajectories according to our method. 1A standard approach in RRT-based planning [50]. 46 3.2.2 Segmentation with Bayesian Neural Networks Although segmentation networks for 3D point clouds exist [69, 70, 99, 67, 90], 3D data repositories are focused on object recognition and part seg- mentation (e.g., [11]) or only contain a small number of scenes [32]. In con- trast, existing large-scale image segmentation datasets [45, 10, 98, 4] contain varied surfaces and obstacles in diverse real-world outdoor scenes. There- fore, we leverage a state-of-the-art image segmentation network [14] (Sec- tion 3.4.1), and update the point cloud environment from per-pixel image segmentations (Section 3.4.2). Furthermore, we augment the network to es- timate output uncertainty, allowing uncertainty in surface predictions to be embedded in the point cloud. Uncertainty in surface type is used to guide path-safety evaluation. At inference time, forward passes with active dropout layers can be interpreted as an approximation of the posterior distribution of model weights of a neural network [29, 44]. The uncertainty of predictions are computed by taking the sample standard deviation across multiple forward passes. For each pixel (i, j)X of image X, the mean softmax vector over T for- ward passes is: 1 ∑T p(i, j)X s(i, j)= Xt (y|X) (3.1)T t=1 where s(i, j)X (y|X) ∈ RC is the softmax output of the network. The correspond- ing uncertainty vector σ(i, j)X on p(i, j)X is: 47 √√∑T ( (i, j)X )s (y|X) − p(i, j) 2Xt (i, j)X t=1σ = − (3.2)T 1 In our framework, image X corresponds to a view with known camera parameters. p(i, j)X and σ(i, j)X are combined with existing measurements for each point m ∈ M that maps to pixel (i, j)X (Section 3.4.2). 3.3 Approach Overview The planner presented in Section 3.2.1 does not leverage critical semantic information about terrain types. In this work, we assume an initial point cloud is given in the form M = {ei}, where each element ei is a tuple (mi,pi, σi). Here, mi ∈ R3 as before; pi = (pi , pi , . . . , pi1 2 C) is a vector specifying the probabilities that the point belongs to each one of the C possible seman- tic classes (gravel, water, etc.); σi = (σi1, σ i i 2, . . . , σC) is a vector specifiying the uncertainties of pi, as discussed in Section 3.2.2. Points are initialized with uniform semantic probabilities and maximum uncertainties. Updat- ing the semantic point cloud is discussed in Section 3.4.2. Fig. 3.2(b) shows a pointcloud obtained in a real (ambiguous) environment, where each point mi is associated with the color corresponding to the class label j whose pij is maximum. We assume the semantic classes have been partitioned into two sets: the safe set S (e.g., gravel, grass) and the unsafe set U (e.g., water, snow). For 48 (a) (b) (c) Figure 3.2: An example point cloud. (a) Image view of a portion of the en- vironment. (b) Point cloud colored with the most likely class predicted from image (a) (bright green: “grass”; dark green: “tree”; purple: “sidewalk”; dark grey: “road”; light grey: no in- formation available). All the classes except “tree” belong to the set S . The region around the tree is actually mulch/woodchips, which should be classified as “dirt” (belonging to U). (c) Point cloud colored to show safe (white), unsafe (black), unclear re- gions R partitioned according to most likely class (random col- ors), and points not associated with any class (orange). 49 ∑ ∑ each point mi,√the points√are defined as pi i i i iS = j∈S p j, pU = 1 − pS = j∈U p j, and σi ∑ = min( σi2 ∑ i2 j∈S j , j∈U σ j ). Each point is then classified as: • safe if piS − wσσi ≥ θs; • unsafe if pi iU − wσσ ≥ θu; • unclear otherwise. Intuitively, this implies that points are safe/unsafe given high probability (piS , p i U) and low uncertainty (σ i), and unclear otherwise. wσ, θs, θu are de- fined by the mission planner (with 1−θs < θu). We useMsafe,Munsafe,Munclear to denote the partition of M obtained from the above classification. Note that a point is labeled safe (unsafe) even with large uncertainty on piS (p i U), provided there is small uncertainty on piU (p i S ). This is captured by the min in the definition of σi. For example, the network may be uncertain between gravel and grass (both safe), but it is sure that the point is neither water nor snow (both unsafe). Consider now a trajectory π̂ = [N k], and recall that φ(N k) denotes the set of points on which the robot lies when at pose TMRk . Depending on the semantic information initially available, it might be very difficult –if not impossible– to immediately find a trajectory whose node points φ(N k) all belong toMsafe. DeepSemanticHPPC works in two stages: 1) Compute a set of candidate paths traversing different unclear regions, and 50 2) Reduce the uncertainty of such paths by taking new views in the prox- imity of the robot’s starting position of the most promising path. We relax the path planning problem in a natural way – instead of reach- ing a specific goal pose, we require the robot to reach a goal pose region G defined around TMG. Then, the points in Munclear are organized into a set R of unclear regions. To build the set R, we use the following two-stage clustering process: first, DBSCAN [25] performs a large-scale clustering of the points inMunclear, obtaining a set of large unclear regions R̂. Then, the points of each r̂ ∈ R̂ are further partitioned according to their most likely class (treating the points not associated with any prediction as belonging to a special class); DBSCAN is called again on each partition. Fig. 3.2(c) shows the result of this process on our example with θs = 0.9, θu = 0.3,wσ = 3. The remaining task is to compute a set of candidate paths from TMS to G. Section 3.4.3 presents a variant of a standard RRT algorithm which constructs multiple hypothesis for the safest path, traversing different un- clear regions. These paths are stored in the directed graph G = (V, A), where each v ∈ V is associated with a potential trajectory node N v and each a ∈ A represents the existence of a short trajectory segment con- necting∑two poses2. With a slight∑abuse of notation, we define pvS =1 i v 1 i |φ(TMRv )| i∈φ(T v ) pS and σ = | (T )| i∈φ(T v ) σ for each v ∈ V . The costMR φ MRv MR for each no∑de is based on how far it is from satisfying our safety constraint: p̄v = 1 i i| (Nv)| i∈φ(Nv) min(1,max(0, θs − pS + wσσ )) for for each v ∈ V . However,φ 2G can be thought as a generalization of the trajectory representation π̂ of Section 3.2.1. 51 if the node region sufficiently intersects with an unsafe region, the cost is infinite; and if the node lies entirely in a safe region, the cost is zero. Summarizing, c(v) is defined as: 0 if |Msafe ∩ φ(N v)| = |φ(N v)| c(v) =  v ∞ if |Munsafe ∩ φ(N )| ≥ φv (3.3) p̄v otherwise, where φv is a user-defined threshold. The first condition can be relaxed for the nodes in close proximity of the starting pose. Although a vertex with infinite cost is never obtained when the candidate paths are initially com- puted, its cost might tend to infinity when additional views are taken dur- ing the NBV stage (Section 3.4.4). A predefined number of NBV iterations are run. At each iteration, the m most promising (shortest) paths accord- ing to the above cost function are computed by a k-shortest paths algorithm (we use Yen’s [94]). The associated vertices v such that 0 < c(v) < ∞ are then considered in the function that computes the best additional view. The value of m is also decided by the mission planner: m = 1 corresponds to an aggressive setting, while m > 1 is preferred given a large temporal budget for taking additional views. 52 3.4 Technical Details 3.4.1 Predicting Semantic Labels Dataset We curate a segmentation dataset for outdoor navigation in unstructured environments from the existing large-scale COCO panoptic dataset [45]. First, images of unstructured outdoor scenes are selected using a Places365 [97] classifier. An image is kept if (a) the classifier’s top one (highest) pre- diction is an unstructured outdoor category with > 50% probability, or (b) two or more of the top five predictions are unstructured outdoor cat- egories. Second, the 133 categories in COCO panoptic are merged: (a) all outdoor terrains (e.g., grass, dirt, snow, pavement) are retained; (b) ob- stacles are merged into four categories: fixed obstacles (e.g., buildings), moving human-made obstacles (e.g., vehicles), humans, and animals; (c) all indoor categories are removed. Our final dataset consists of 34K train- ing images with 22 categories. Our navigation segmentation categories (from COCO panoptic), and filtered list of COCO panoptic images are at: https://deepsemantichppc.github.io For our network architecture, we use DeepLabv3+[14] with Xcep- tion65 [17] backbone augmented with dropout in the middle and exit flow blocks for semantic segmentations. For details, refer to [14, 17]. ASPP atrous 53 rates are set to 6,12,18 for output stride 16 and 12,24,36 for output stride 8 as in [14]. Training uses output stride 16 while inference uses output stride 8. Training Details We initialize from network pretrained on ImageNet [?] + MSCOCO [?] + Pascal VOC [?], and train for 160000 iterations using SGD with momentum with batchsize 4 on our dataset. Initial learning rate 0.01, momentum 0.9, and polynomial learning rate decay with power 0.9 are used. Images are resized to 513x513. Inference. At inference time, 50 forward passes are used to predict seman- tics and uncertainties (Eqs. (3.1)-(3.2)). 3.4.2 Associating Semantic Labels to a Point Cloud Given a viewpoint with known pose and camera intrinsics, image segmen- tation probabilities and uncertainties are mapped to the point cloud. To map pixel (i, j)X: 1. Estimate depth map DX for view X. 2. Each pixel is backprojected to a single point corresponding to the cen- ter of the projected pixel. This approximation does not hold for pixels 54 with very large depths, but typically produces good results. Backpro- jected pixel point m̃(i, j)X is computed as: m̃(i, j)X (i, j)= PX[DX I3×3|[0 0 1]T ]T K−1X [i j 1]T (3.4) where [i j 1] are pixel coordinates in homogeneous coordinates, KX is 3 × 3 intrinsic matrix, D(i, j)X is depth at pixel (i, j), and PX is 4 × 4 pose matrix for view X. 3. Each backprojected point is merged with its nearest neighbor mnn in the point cloud M within threshold distance R. If a backprojected point does not have a neighbor within the threshold, the point is dis- carded 4. p(i, j)X and σ(i, j)X are merged with existing measurements for point mnn. The combined measurement is the best linear unbiased estimator un- der the simplifying assumption that the per-class predictions are in- dependent. Given a set of K measurements {(p(k),σ(k))}, the combined measure- ment (p̃, σ̃) for class c is: ( 1 ∑ √K ∑K ) p̃ = w(k) p(k) , σ̃ (k) 2 (k) 2c Z c c = (w ) (σ ) k=1 k=1 (k) −2 ∑ where w(k)c = ∑(σc )(k) , Z s.t. p̃c′ = 1 (3.5)(σc′ )−2′ c′∈Cc ∈C 55 3.4.3 RRT-Based Multi-hypothesis Planner The algorithm to compute G = (V, A) starts by building an initial RRT from a root vertex vs associated with the projected start pose TMS via a predefined set of motion primitives. Instead of building two RRTs as is done in [48], we build a single RRT from the start pose and bias the sampling to the point that is closest to the projected ideal goal pose. Sampling is performed on the points in M not lying in the forbidden points set F , which is initialized as F ← Munsafe. Once the first path π̂ = [N v] is found, the algorithm exam- ines which regions in R are traversed by the vertices of π̂ by checking their intersections with each set of points φ(N v). Except for regions containing points belonging to the K-nearest neighbors of TMS and TMG, all regions tra- versed by π̂ are placed into the removal candidates set C. A heuristic is then used to decide which region(s) should be removed from subsequent plan- ning stages. In this work, we simply remove the largest region ĉ, and F is updated as F ← F ∪ ĉ. When ĉ is removed, the RRT vertices lying on it, in- cluding those of π̂, are removed from the RRT. The algorithm then proceeds to optimize and find a new path to the goal, by continuously expanding the RRT component containing vs. This time, however, the algorithm also tries to connect (by computing ad hoc trajectory segments) new vertices to those that were disconnected from the RRT due to the removal of ĉ. When a new path is found, the process repeats. If at any iteration, the algorithm finds a path only contained in the regions of TMS and TMG, it can be restarted with a different random seed. The two graphs are merged at a later stage. 56 Figure 3.3: An example graph G = (V, A), with poses. Vertices with c(v) = 0 are in green, while vertices with 0 < c(v) ≤ 1 are in red (the darker, the closer to 1). The blue, green, and red axes indicate the pose of the robot. Fig. 3.3 shows an example multi-hypothesis graph computed on the exam- ple of Section 3.3 (Fig. 3.2). If any path computed on the multipath graph G = (V, A) has zero cost, the robot can start following that path since all the underlying points lie in Msa f e. Otherwise, the robot enters the NBV stage described in section 3.4.4. 3.4.4 Next-Best-View (NBV) Planning A set of n viable NBV poses Tviable = {TMV1 , . . . ,TMVn} is computed by grow- ing a RRT starting from TMS. The candidate poses Tviable are a subsample of 57 the RRT vertices. All points lying withinMunclear are treated as unsafe, and sampling is performed on the safe points lying within a given radius r from TMS. The robot should not travel too far to take a new view; otherwise it is more appropriate to follow the most promising path of G = (V, A). A reward J(TMV j) is calculated for each candidate pose TMV j . The NBV pose TNBV is selected by picking the pose with the highest value of J that also allows a safe path back to a safe pose from which all the paths can be followed. J(TMV j) = βdD + βγγ + βvisNvis + βQQ, (3.6) where D is a distance metric, γ is a viewing angle metric, Nvis is the num- ber of visible vertices from TMV j , and Q is the average information gain over visible vertices from TMV j . The weights βd, βγ, βvis, βQ sum to 1 and D, γ, Nvis, and Q are normalized. Distance and change in viewing angle are used based on the assumption that closeness and view diversity will reduce vertex uncertainty. Nvis puts more weight on candidate poses which have higher chances of reducing the uncertainty of multiple segments of the mul- tipath graph G = (V, A). Q represents the expected reduction in uncertainty in the graph vertices given TMV j . Each of these components are defined as follows. Begin by defining the set of vertices v ∈ VNBV, where VNBV is the set of unclear vertices belonging to the m most promising paths (see the end of 58 Section 3.3). For each candidate pose TMV j ∈ Tviable, only vertices visible from TMV j are considered in calculating the reward. Visible vertices are de- fined as vertices which occupy greater than a predefined number of pixels in the image plane rendering of the point cloud from TMV j . Visible vertices are added to the set v ∈ Vvis, j, where Vvis, j ∈ VNBV. To calculate D, the distances from TMV j to v ∈ Vvis, j are normalized. Since a lower distance should correspond to a higher reward, we subtract the nor- malized distances from 1. To calculate γ: the set of negative cosine distances of the angle between TMS and TMV j to v ∈ Vvis, j are used. Nvis is the size of Vvis, j. The information gain metric Q(v) is calculated for each v ∈ VNBV. Q(v) represents the expected reduction in uncertainty for each v ∈ VNBV, and is a function of the visibility and uncertainty of the points lying in φ(N v). The visibility I(v,TMV j) of a vertex v ∈ VNBV, given a candidate pose TMV j , is the pixel coverage of φ(N v) in the rendered image plane of TMV j . Per- point bounding squares of size equal to half the point cloud resolution are used to compute occlusions and pixel coverage for surface points. The num- ber of pixels that φ(N v) occupies in the rendering is the predicted visibility I(v,TMV j) of v at TMV j . The uncertainty σ(v) of a vertex v ∈ VNBV is the average sum of the un- certainties of the points in φ(N v): 59 1 ∑ ∑C σ(v) j= |φ(N v)| σi (3.7) i∈φ(Nv) j=1 The information gain metric Q(v) of vertex v ∈ VNBV can be formally written as Q(v) = αiI(v,TMV j) + ασσ(v), (3.8) where αI , ασ are weights that sum to 1 and I(v,TMV j) and σ(v) are normal- ized. 3.5 Validation 3.5.1 Validation Scenes and Overview For validation, we use both a simulated scene and a real scene. The sim- ulated scene is generated using a prebuilt environment from AirSim [75]. Fig. 3.4 shows a top down view of the Africa environment from AirSim which includes five different semantic classes. The simulated scene is used for testing the efficacy of the multipath planner in planning a diverse set of paths on a point cloud map (Section 3.4.3). A point cloud map of a subset of the Africa environment is generated and subsampled to 20cm resolution. 60 (a) Figure 3.4: The prebuilt Africa environment from Airsim used for testing the multipath planner. We implement our full pipeline in the AirSim simulator [75]. How- ever, preliminary experiments show that the BNN trained on the real-world dataset (Section 3.4.1) performs poorly on the synthetic AirSim environ- ment. The full pipeline with a simpler BNN trained on the AirSim envi- ronment and corresponding experimental results are shown in the accom- panying video. For real world validation, we collect data of two different scenes using the ZED stereo camera from Stereolabs. The first scene is next to the Mann Library in Cornell University (Fig. 3.2(a)) and the second scene is at Cass Park in Ithaca (Fig. 3.5). These scenes are selected due to their varying 61 Figure 3.5: Left: Image from Cass Park in Ithaca. There are multiple dif- ferent terrains in the scene including grass, mud, and water. Right: Annotated safe (blue) and unsafe (red) regions for the Cass Park point cloud. (but common) terrain types. The scenes are representative of common un- structured outdoor environments. The ZED camera API is used to extract depth maps and generate point cloud reconstructions of the scenes. Seman- tic segmentations and uncertainties are acquired and mapped to the point cloud using techniques described in Sections 3.4.1, 3.4.2. The videos used for reconstruction are recorded at 60 fps with 720p resolution. The data from real world scenes are used for evaluating uncertainty reduction us- ing the NBV function and for evaluating the safety of the paths planned by the full pipeline. For these scenes, we heuristically select a set of candidate NBV poses instead of growing a RRT from TMS, and assume a path between these poses and the start pose exists. Candidate NBV poses are selected to be oriented in the general direction of the goal while covering a wide range of the scene. Our method (and baseline methods) choose NBVs from the set of candidate NBV poses. 62 3.5.2 NBV evaluation To evaluate the performance of the NBV function, we examine the change in uncertainty of the path vertices as the number of NBVs increases. The complete NBV reward function (Eq. 3.6) is compared against: (a) random selection, (b) geometry-only reward, and (c) uncertainty-only reward. For the geometry-only NBV reward, we set {βQ} to zero, and for the uncertainty- only NBV reward, we set {βd, βγ, βvis} to zero. The change in uncertainties summed across all the classes and points for each path vertex averaged over 500 trials is shown in Fig. 3.6. In our experiments, the full NBV reward weights are set as follows: {βd = 0.4, βγ = 0.05, βvis = 0.25, βQ = 0.3, αI = 0.5, ασ = 0.5} for the Mann Library scene, and {βd = 0.15, βγ = 0.05, βvis = 0.2, βQ = 0.6, αI = 0.3, ασ = 0.7} for the Cass Park scene. A higher weight is assigned to uncertainty terms for the Cass scene because the boundaries between surface types (e.g., water, mud, grass) are more ambiguous than in the Mann scene. For both scenes, the full NBV reward function consistently achieves the lowest uncertainty with two or more NBVs (Fig. 3.6). This illustrates the importance of both geometry and uncertainty terms in the reward function. In the Mann scene, the baseline reward functions converge to a higher un- certainty than the full reward function. Due to the small size of the scene and the nature of all candidate NBVs being oriented towards the goal pose, random selection performs quite well. It initially outperforms uncertainty- only which does not take into account point visibility or viewpoint diver- 63 Figure 3.6: Change in uncertainty of path vertices (y-axis) as the number of NBV measurements increase (x-axis). Left: Mann Library. Right: Cass Park. sity, while random sampling implicitly selects diverse views. In the Cass scene, the baseline reward functions converge to a higher uncertainty than the full reward function, except for uncertainty-only which converges to the same point as the full reward function. The overall uncertainty in the BNN predictions are much higher at Cass park, so heavily weighting the uncer- tainty in the reward function performs well. In the Mann scene, geometry- only outperforms uncertainty-only, whereas the opposite result holds for the Cass scene. Mann has less inherent ambiguity so geometry terms are more important, while Cass has more ambiguous regions so uncertainty terms are more important. 64 3.5.3 Path Safety Evaluation To evaluate the real world application of DeepSemanticHPPC, we study the safety of selected paths. We annotate a point cloud (Fig. 3.5 right) with MeshLab [18] into ground truth safe and unsafe regions. For Mann library mulch (dirt) is labeled as unsafe, and for Cass Park mud (dirt) and wa- ter are labeled as unsafe. Any path which contains a vertex that overlaps with the unsafe region with over Nunsafe points is unsafe. We set Nunsafe = 4. Two baselines are considered: (a) B1: planning without semantic informa- tion (based on [48]) and (b) B2: planning with semantic information from a single initial view without taking any NBV measurements to reduce path uncertainty. We also study the performance of DeepSemanticHPPC as the number of NBVs increase. Table 3.1 shows the path safety results for the Mann Library and Cass Park scenes over 500 trials. Start and goal poses are varied for each trial within a one meter radius of reference poses. A path is confirmed safe (CS) if it has cost zero. A graph G = (V, A) is confirmed unsafe (CN) if all paths have infinite cost. Since both safe and unsafe terrain surfaces can be geomet- rically similar, baseline B1 cannot reliably avoid unsafe semantic regions. Baseline B2 performs significantly better than B1 because of the inclusion of semantic surface types in the planner. However, because the semantic segmentations can be incorrect, especially in regions with high uncertainty, this planner still plans over unsafe terrain. 65 Mann B1 B2 1N 2N 3N 4N 5N Safe % 0 23.4 81.6 79.8 81.4 83.6 86.0 Unsafe % 100 76.6 18.4 15.4 11.2 6.6 3.8 CS % N/A N/A 0 0 4.0 18.0 28.0 CN % N/A N/A 0 4.8 7.4 9.8 10.2 Cass B1 B2 1N 2N 3N 4N 5N Safe % 13.2 33.4 54.0 57.4 57.4 59.0 59.2 Unsafe % 86.8 66.6 46.0 41.6 39.8 37.4 36.6 CS % N/A N/A 0 0 0 0 0 CN % N/A N/A 0 1.0 2.8 3.6 4.2 Table 3.1: 500 trials of path safety evaluation. The columns are the path planning methods used: B1 is the planner based on [48], B2 is the variant of our framework which does not utilize any NBVs, XN is DeepSemanticHPPC (ours) with X NBVs. The rows are the metrics: Safe is the number of trials where a safe path is selected, Unsafe is the number of trials where an unsafe path is selected (lower is better), CS is the number of trials where a safe path is confirmed, CN is the number of trials where all multipaths are confirmed as unsafe (and no path is selected). Our DeepSemanticHPPC framework is significantly better than the two baselines. NBVs allow the planner to discard unsafe paths as semantic un- certainties decrease. With just one NBV, the percentage of safe paths taken increases drastically from 23.4% to 81.6% (Mann) and increases from 33.4% to 54.0% (Cass). As NBVs increase, the percentage of safe paths selected generally increases while the percentage of unsafe paths selected decreases. With 5 NBVs, 86% (59.2%) of paths selected for Mann (Cass) are safe, and only 3.8% (36.6%) of paths selected for Mann (Cass) are unsafe. With 5 NBVs, uncertainty is sufficiently reduced so that in 10.2% (4.2%) of trials, all multipaths are confirmed to be unsafe for Mann (Cass) and no path is 66 selected. The complexity of the Cass scene is reflected in these results. 3.6 Takeaways In this paper we present DeepSemanticHPPC, a novel framework for plan- ning in unstructured outdoor environments while accounting for uncertain terrain types. Our experiments show DeepSemanticHPPC reduces seman- tic uncertainty in planned paths and increases the safety of paths planned in environments with unsafe terrains. We plan to implement DeepSeman- ticHPPC on a robot for physical experiments. Other interesting directions include exploring the ability to build the point cloud online and incorporat- ing geometric uncertainties. 67 CHAPTER 4 PLANNING THROUGH UNKNOWN SPACE BY IMAGINING WHAT LIES THEREIN 4.1 Motivation Planning paths for robots in environments with unknown spaces, such as from occlusions, is a challenging task due to the difficulty in modeling what actually lies therein. The customary approach to tackling this challenge in traditional grid-based representations is to assume any unknown cells as free. Any discrepancy between such an optimistic assumption and the ac- tual map developed while executing the path and collecting information is used to trigger the computation of a new plan [47]. The very nature of this problem is online, making an approach based on constant replanning reasonable. We humans typically use scene con- text to reason about unknown spaces. To reduce running into dead ends, humans predict possible realizations of unseen space and plan paths accord- ingly, thereby reducing the number of replans. The goal of this work is to endow robots with a way of reasoning about unknown spaces, as humans do, in order to plan more efficient paths. Our approach, sketched in Figure 4.1, takes as input a partial map in the form of a point cloud with semantic annotations, which can be obtained from a single lidar scan or a single stereo image pair. The semantic point 68 cloud is converted into a bird’s eye view (BEV) image, and the input into an image inpainting neural network [95]. The network fills in empty pixels with semantic labels in order to obtain a reasonable prediction of an obstacle map for path planning. Figure 4.1: from left to right. Current partial map represented as a point cloud annotated with semantic labels (e.g. purple = road); bird’s eye view of the annotated point cloud, where white cor- responds to unlabeled pixels whose semantic class will be pre- dicted via image inpainting; inpainted image; obstacle map as- sociated with the inpainted image (white = free space, black = obstacles), shown along with obstacles discovered online (red) and final path to the goal (blue) in one of our simulations. The main contributions of this paper are: • a novel framework for modeling unknown/occluded spaces for path planning using image inpainting to generate more informed maps • the generation of a dataset for training an image inpainting network suitable for planning in outdoor urban environments. The proposed framework is compared against a typical planner which treats all unknown space as free. Experimental results demonstrate the 69 novel framework greatly reduces (1) the lengths of final paths which are traversed through occlusion heavy environments, (2) the number of replans online, and (3) the number of times the planner replans before it discovers no path is possible. 4.2 Related Works Planning paths for mobile robots in the presence of a fully known, com- pletely deterministic map is a well-understood problem in robotics. The classical approach leverages a grid-based representation of the environ- ment, and uses a graph-search algorithm to plan a distance-optimal path to the goal. Among these, A* [35] is the most widely used due to its effi- ciency. When the map is not fully known in advance, the customary ex- pedient for adapting classical pathfinding algorithms is that of treating all the unknown cells as free, and computing new plans as needed. New al- gorithms were developed to handle the replanning phase more efficiently (D* [81] and its variants [82, 47]); however, unknown space has always been treated as free up to the present day (see, for example, [73, 86]). Learning-based approaches for navigation in partially unknown envi- ronments have started being explored only recently. Ref. [80] focuses on partially mapped indoor environments represented as grids, and presents a method for selecting the “frontier” (i.e. the boundary between free and un- known space) to guide the robot in order to reach the goal. A utility function 70 for the frontiers is learned via a neural network, but can only be used in in- door environments and with traditional (dense) partial grid maps. Ref. [33] presents a next-best view planning approach that addresses unknown ter- rain semantics; however, the geometric map of the environment is assumed to be known in advance. This problem is also being actively investigated by the deep reinforcement learning community [61, 84, 96, 42]. These works give up explicit map representations in favor of other features learned from experience. While these methods demonstrate promising initial results, they only work on simple indoor environments [84, 42] or mazes [61, 96]. We argue that such a sophisticated approach is not actually required for en- abling more robust navigation in realistic, partially unknown environments. Pairing standard pathfinding techniques with reasonable inferences of the unknown portions of the map provides a better approach to address this problem. In this work, we use an image inpainting network [95] to fill in a bird’s eye view (BEV) of an initial sensor measurement of an urban environment. The initial measurement has unobserved regions due to occlusions and sen- sor resolution, which make robust planning difficult. By filling in these re- gions with a data-driven inpainting network, a more informed map can be used for planning. While there are other works demonstrating the use of image inpainting to model urban environments [57, 66, 74], these papers do not actually demonstrate their use in path planning tasks. In addition, these works inpaint behind segmented foregound object classes, which can be restrictive as occluded regions behind background classes can also be in- 71 formative. For example, a building may be labeled as a background class, but it would be useful for a planner to reason about what is behind the building. 4.3 Technical Approach Figure 4.1 shows the flow chart of the proposed framework. Initially, a raw image is taken. Depth data is acquired either through stereo depth or from a lidar scan. Semantic segmentation [14] is then performed on the image or lidar point cloud [3]. The depth and semantic labels are mapped to a point cloud which is rendered from a BEV for use as a map for a planner. Due to geometries in urban scenes, there will likely be unobserved pixels due to occlusions or sensor resolution. These unobserved pixels are then filled in using general adversarial network (GAN) based inpainting. The inpainted output of the network is then used as a more informed map for a planner. 4.3.1 Image Inpainting Image inpainting is a technique traditionally used for restoring old or dam- aged pictures and paintings. The goal is to modify the images to be visually realistic to a viewer and restore the original state of the image [23, 5]. Gen- erally, the user labels regions of the image to be filled in, and the inpainting algorithm automatically fills in those regions. Image inpainting techniques 72 include patch-based techniques, diffusion-based techniques, the use of con- volutional neural networks (CNN), and the use of generative adversarial networks (GAN) [95, 23]. GAN-based inpainting networks simultaneously train a generative network to hallucinate images and a discriminative net- work to evaluate the hallucinated images. GAN-based methods are used in this paper due to their recent superior performance [95, 93]. The GAN-based network used in this paper improves on previous work by introducing a contextual attention layer that utilizes distant image features during training to improve inpainting performance by reducing blurry textures and distortions [95]. Typical inpainting applications include filling in missing parts of an outdoor scene or a persons face. In this work, the goal is to fill in a bird’s eye view (BEV) of a semantic map such as the one pictured in Figure 4.1 (middle left). A set of C semantic classes are defined for the input and output images. Each pixel is defined as pc(i, j), where i and j are the image coordinates of the pixel and c is the semantic class of the pixel. In the input image the pixel can additionally be part of the unobserved class due to sensor range or occlusions. The goal of inpainting is to fill in the unobserved pixels (white) with a color corresponding to one of the c semantic classes. The inpainted image is then used as a more informed map for the path planner. To use the inpainted image as a map for planning, the set of C seman- tic classes are divided into subsets of obstacle classes Co and non-obstacle classes Cn, where C = Co ∪ Cn and Co ∩ Cn = ∅. Pixels belonging to Co are 73 labeled as obstacles and pixels belonging to Cn are considered free space. 4.3.2 Path Planner Any planner suitable for traditional grid-based planning can be used in our framework; as the inpainting will improve the performance of each ap- proach (for example, A* [35], Theta* [20], or D* [47]). A* [35] is used in the experimental section of this paper given its extensive usage. The inpainted image is converted into a 2D obstacle map where each pixel is a node. Pixels belonging to Co (such as wall or vegetation) are labeled as obstacles (black), and pixels belonging to Cn (such as sidewalk or road) are labeled as free space (white). As the robot executes the path, the occupied and free spaces are continuously recomputed (e.g. at a given frequency) based on the most recent sensory information (and hence obstacle map). Local planning in- corporating the robot’s kinematics (and possibly dynamics) can be easily performed by setting a local goal along the computed path, and is out of the scope of this work. Kinematics and dynamics of the robot are not considered at this stage of planning, as the impact of inpainting can be evaluated without this level of detail. However, kinematics and dynamics can be easily taken into account locally by iteratively defining a close cell along the planned path as a local goal, and by using, for example, a sampling-based planner such as RRT [52] or RRT* [43] to compute a feasible path to such local goal or its immediate 74 surroundings. At every step the robot can transition to any of the six neighboring pixels that are not an obstacle. Kinematics and dynamics are out of the scope of this paper not considered for the planner in the experimental section. 4.3.3 Planning with Lidar Data In the proposed model, a dataset is required which includes pairs of data samples: an initial observation (Figure 4.2 (left)), and a filled in ground truth semantic map (Figure 4.2 (right)). The ground truth map is used for network training. Our framework (Figure 4.1) can be used with any lidar sensor that creates a semantically segmented point cloud [68, 3]. The key challenge is labeling ground truth filled in semantic maps for network training. Given adequate training data, the framework can easily be applied to any envi- ronment with unobserved space with any grid-based path planner. While there are several datasets with semantically labeled images or point clouds [31, 19], they do not provide the density from a BEV to gen- erate ground truth semantic maps (Figure 4.2). Datasets which do provide BEV labels [9] generally do not provide ground truth semantic measure- ments as shown in figure 4.2. 75 Lidar Urban Dataset Generation The recent SemanticKITTI dataset [3], provides the tools to generate a dataset with pairs of data samples such as the one in Figure 4.2. More specifically, for each of the 22 sequences in the KITTI odometry dataset [31], SemanticKITTI conglomerates all the lidar scans together into a single point cloud and manually labels 28 semantic classes. The point cloud is then vox- elized and used for voxel-based semantic scene completion tasks. After ev- ery five time steps in each sequence, a ground truth semantic scene comple- tion voxel map is provided within a predetermined volume of 256×256×32 voxels (with voxel resolution of 0.2m) in front of the car. The SemanticKITTI dataset also provides individual lidar scans within the same volume for each time step in each sequence along with the manually labeled semantics. The semantic lidar scans and completed voxel maps are shown in Figure 4.2 with voxels converted to 3D points. These images are rendered from a BEV for the path planner applications. Semantic classes for moving ob- jects and unidentifiable pixels are removed from the renderings as a static environment is assumed. The goal of inpainting is then to fill the empty spaces in Figure 4.2 (left) with the semantics in Figure 4.2 (right); these unknown spaces are a func- tion of occlusions or sensor resolution. To be compatible with semantic la- bels generated from a color camera image, only points which would be seen from a camera frame are kept. In order to do this with the SemanticKITTI dataset, all 3D points in the scan and completed map are projected into the 76 Figure 4.2: Left: The semantic input point cloud of an initial lidar scan. Right: The ground truth completed semantic map. camera frame: x = Tr ∗ P ∗ X (4.1) where x is the coordinates of a point in the image frame, Tr is the trans- formation from lidar coordinates to the rectified camera coordinates, P is the projection matrix after image rectification, and X represents the coordi- nates of a point in the lidar frame. All transformation matrices are provided by the KITTI dataset [31]. Only points which are projected into the image frame are kept. The points are then reprojected into lidar coordinates. An example of a time step from the generated dataset is shown in Fig- ure 4.3. Because the GAN-based inpainting network uses all white pixels as pixels to be inpainted during training, when rendering the semantic point clouds from a BEV, all unknown pixels are rendered as grey. The adversarial network trains itself by blanking out sampled rectangles from the training 77 Figure 4.3: An example from the generated dataset of point clouds ren- dered from a BEV. Left: Input semantically labeled point cloud with occluded and unknown spaces in white. Middle: Mask of image to be inpainted. Known pixels are black and pixels to be filled are white. Note that grey (unobserved) pixels in the left image are labeled as known in the image mask as they are not the targets for inpainting. Right: Ground truth labeled semantic point cloud. image by setting the color in those rectangles as white. Therefore, in the original training image there should not be an excessive amount of white pixels as that would interfere with the network training. Figure 4.3 (left) shows the initial input image to be filled via inpainting. White pixels are required to be filled by the network. A convex hull is fit to the initial lidar scan and unknown pixels inside that convex hull are colored white. Only unknown pixels within the convex hull are designated for inpainting be- cause the network is designed to inpaint unknown pixels using contextual information. In general, the network empirically performs poorly when a majority of pixels are designated for inpainting. Figure 4.3 (middle) shows the input mask to the network, where black pixels are known and white pixels are unknown. Figure 4.3 (right) is the ground truth filled in semantic map which is used for the actual training of the network. 78 Figure 4.4: A distribution of the semantic pixel labels of the training dataset of the first ten sequences. Each semantic class is shown in the same color as displayed in the dataset. Note that the y-axis is log scale. SemanticKITTI provides the first 11 of the 22 labeled KITTI odometry se- quences publicly. In this work, the first ten sequences are used for training, and the eleventh sequence is used for testing. The fully processed dataset (all 11 sequences) of images shown in Figure 4.3 consists of 4,649 total pairs of training and input images. There are 241 images in the eleventh sequence which are used for testing. Two semantic classes of bus and other-vehicle did not have any corresponding labeled pixels. Figure 4.4 shows the dis- tribution of semantic pixel labels in the training dataset of the first ten se- quences. 4.3.4 Planning with Stereo Camera Data The proposed novel framework can also use stereo camera data as input. Given the SemanticKITTI dataset, lidar depth and ground truth semantic labels are used for experimental purposes. However, the framework has the flexibility to also use stereo depth and semantic segmentations from 79 any network. While the SemanticKITTI dataset contains ground truth for lidar depth which is used for the quantitative evaluation, a qualitative eval- uation of semantic map inpainting with only a stereo camera as a sensor is performed to show the potential of the framework to be used with stereo depth. The pipeline for using stereo depth is the same as for using lidar depth (Figure 4.1). Stereo results are expected to be worse than lidar due to quadratically increasing error in disparity with respect to depth. However, stereo cameras are frequently more convenient than lidar so it is important to have the flexibility to use stereo cameras in the framework. For evaluation with stereo depth, learned stereo disparity from PSM- NET [12] is used due to its high performance on KITTI benchmarks. How- ever, other stereo depth algorithms could be used without loss of generality. To perform semantic segmentation, a DeepLabv3+ network [14] pretrained on Cityscapes [19] is used. While the Cityscapes dataset is not necessarily the same as the KITTI dataset, there is a large overlap of classes and the Se- manticKITTI dataset uses a pretrained network from Cityscapes for image segmentation. 80 4.4 Experimental Evaluation Both qualitative and quantitative experimental evaluations are conducted for lidar depth input. Only qualitative evaluations are conducted for stereo depth input as there is no ground truth data for stereo depth. The out- put of the inpainting network is qualitatively evaluated based on its visual comparisons to the environment being reconstructed. Inpainting on input images generated from lidar scans and stereo cameras are evaluated. Quan- titative evaluation is conducted on the inpainting accuracy compared with ground truth using the mean Intersection-over-Union (mIoU) [3]. The abil- ity of the inpainting network to generate better informed maps for path planning is quantitatively evaluated by comparing the (1) number of plan- ner steps taken, (2) the number of times the planner replans, and (3) the number of times the planner replans before discovering no path is possible. 4.4.1 Network Training The GAN-based network is trained using output (filled in semantic maps) ground truth images generated of the first ten sequences from Se- manticKITTI dataset. There are a total of 4,408 images in the training dataset (700×580 pixels each). The network is trained on a GeForce GTX 1080 Ti GPU with 11 GB of memory. The network was trained for two and a half days. 81 Figure 4.5: Qualitative evaluation of the inpainting network performance. Odd rows are the RGB semantic maps and even rows are the obstacle maps generated from the semantic BEV images. Columns from left to right: (1) lidar input, (2) inpainted image, (3) stereo input, (4) inpainted image, (5) ground truth (GT). For the obstacle maps, black pixels are obstacles and white pixels are free space. 82 4.4.2 Qualitative Evaluation Figure 4.5 shows qualitative results for three images of different scenes (45, 150, 285) from the test set. Results are from the input BEV images from lidar depth described in section 4.3.3 (Figure 4.5 columns 1, 2) and also input BEV images from stereo depth described in section 4.3.4 (Figure 4.5 columns 3, 4). Figure 4.4 shows the color key for semantics in Figure 4.5. Note that the input images are based only on a single sensor measurement, while the ground truth is based on all measurements for each sequence. This is clear for image 45 in Figure 4.5, where there is much less vegetation (dark green pixels) in the top left of the image compared to the ground truth, because the ground truth takes all past and future measurements into account. For input images from lidar depth, the network reasonably fills in road pixels (pink). For image 150, the network is able to fill in the sidewalk (pur- ple) enclosed by vegetation (green) and fence (orange) in the center of the image next to the road. For image 285, the network does not accurately predict the sidewalk perpendicular to the road in the middle of the im- age. However, it does a good job of filling in the bottom half of the image with vegetation and buildings (yellow). Due to the small amount of avail- able data for training, the network does not learn all the possible structures based on the input images but reasonably fills in empty spaces occupied by road, buildings, and vegetation. For input images from stereo depth with semantic labels from the 83 deeplabv3+ network, the images are generated from a different distribution than the trained inpainting network. Thus, it is expected that the network will perform worse than on the lidar derived images. In image 45 in Figure 4.5, the network is unable to predict the road structure well and fills most of the unobserved space as terrain. In images 150 and 285, the road structure is cutoff early, but the network is able to predict the building structures in the bottom center of the maps. Because the stereo depth inputs are from a dif- ferent distribution than the lidar depth inputs, an exact comparison cannot be made between the two as the network is trained on only ground truth lidar data. However, the potential of the inpainting network to be applied to a single stereo camera sensor is demonstrated. 4.4.3 Quantitative Evaluation Quantitative evaluation is conducted on (1) network inpainting perfor- mance compared to ground truth and (2) path planner performance using inpainted images. While planner performance improvement using inpaint- ing is the goal of this paper, inpainting network performance is provided as a reference to the reader. For evaluation of the inpainting performance of the network, the mean Intersection-over-Union (mIoU) is used: 1 ∑C TPc (4.2) C TP c=1 c + FPc + FNc 84 where TPc is the number of true positives for each class c, FPc is the number of false positives, FNc is the number of false negatives, and C is the number of classes. Only pixels within the inpainting mask or convex hull are used to calculate the mIoU. The mIoU for the testing dataset is 13.10. As a comparison, the top per- forming baseline for scene completion from SemanticKITTI had a mIoU of 17.70. Note that the SemanticKITTI scene completion task is 3D seman- tic voxel completion with a dataset of 19,130 input and target pairs used for training. The mIoU measured in this paper is for image inpainting with a to- tal of 4,408 input and target pairs available for training. While the mIoU for inpainting is relatively low, it should greatly increase given a larger dataset. To evaluate the performance of the framework in creating better maps for path planning, experiments are conducted with an A* planner. A custom simulator is used, where the robot is assumed to have a 360◦ sensor within a range of 30 pixels that can be used to update the initial lidar-based map, and replans whenever the path is discovered to be crossing through an obstacle (based on the ground truth map). The obstacle maps shown in Figure 4.5 are used for planning. These images were chosen to represent three levels of difficulty: easy (image 45), medium (image 150), and hard (image 285). Difficulty level is decided based on number of classes and structure of the scene. Image 285 is considered more difficult than image 150 because image 285 has cars (light blue pixels), more buildings (yellow pixels), and contains an intersection structure. For 85 the planner, each pixel of the image is considered a node and the euclidean distance from the path node to the goal is used as the distance heuristic. Images are resized to 25% of their original size (175×145 pixels) to reduce computation time. The obstacle maps are evaluated for (1) the final number of path steps taken from start to goal, (2) the number of times the planner replans, and (3) the number of times the planner replans before determining no final path is found. Table 4.1 displays results for the quantitative path planner evaluation. Trials were run and compared for the single lidar scan input image, the in- painted image, and the ground truth (GT). For each of the three scenes (45, 150, 285), trials were conducted using random goal locations, and start lo- cations either random or fixed near the sensor with a Gaussian distribution. For each study, 50 trials were conducted, with the mean (std) given in Ta- ble 4.1. The inpainted image outperforms the single lidar scan for every experiment except for the easy image (45) with starting locations fixed near the sensor. This is due to the simplicity of the scene, where the inpainting image is not necessary to fill in any occluded spaces. The inpainted image outperforms the baseline in path steps taken, number of times replanned, and number of replans before discovering no path is available, for all other experiments. Inpainting allows the planner to ”see” around corners and behind obstacles to predict what types of semantic labels are likely to be present in those unobserved spaces. This allows the planner to plan more informed paths which are less likely to run into dead ends. This is evident as not only do the number of path steps decrease when using the inpainted 86 Table 4.1: Evaluation of inpainting for planning with the baseline and ground truth (GT). Metrics are given as mean (std). Each study has 50 trials with goal positions chosen randomly. Smaller numbers are better. Image 45 (easy), Fixed Start Input (baseline) Inpainted (ours) GT Path Steps 70.8 (54.6) 75.3 (67.3) 64.9 (35.4) Replans 3.5 (8.1) 3.2 (8.6) 0 (0) Replans (no path found) 53.0 (1.0) 43.5 (1.5) 0 (0) Image 45 (easy), Random Start Input Inpaint GT Path Steps 118.6 (84.0) 113.5 (80.4) 91.1 (49.7) Replans 11.04 (12.2) 9.7 (10.9) 0 (0) Replans (no path found) 48.0 (0.0) 42.0 (0.0) 0 (0) Image 150 (medium), Fixed Start Input Inpaint GT Path Steps 173.9 (171.3) 105.6 (94.1) 86.4 (45.5) Replans 10.4 (11.7) 6.1 (6.3) 0 (0) Replans (no path found) 5.5 (1.6) 4.0 (0.9) 0 (0) Image 150 (medium), Random Start Input Inpaint GT Path Steps 125.8 (134.8) 121.7 (128.7) 95.5 (79.1) Replans 10.1 (15.6) 7.7 (10.9) 0 (0) Replans (no path found) 24.3 (18.8) 19.6 (13.5) 0 (0) Image 285 (hard), Fixed Start Input Inpaint GT Path Steps 149.5 (158.9) 95.3 (56.3) 91.6 (52.0) Replans 9.2 (12.5) 5.2 (6.2) 0 (0) Replans (no path found) 42.6 (8.0) 40.7 (10.6) 0 (0) Image 285 (hard), Random Start Input Inpaint GT Path Steps 169.3 (123.0) 140.6 (97.7) 124.4 (68.4) Replans 15.1 (14.7) 7.5 (7.0) 0 (0) Replans (no path found) 46.1 (21.2) 32.4 (12.1) 0 (0) 87 map, so do the number of times the planner needs to replan. By using previ- ous experience (training data) to predict unknown spaces for path planning, the proposed framework imitates human behavior and greatly improves performance. The ability to fill in occluded sections of vegetation or walls and build- ings is incredibly helpful for path planning, as shown in Figure 4.6. There, a trial on image 150 (medium difficulty) with start locations near the sensor is shown. For the lidar input obstacle map, there is initially an empty space in the middle of the map shown by the light blue circle. The planner attempts to plan a path through the gap but eventually observes there is a wall; the robot replans by continuing to move right and runs into a dead end. Even- tually, the robot must plan all the way back around causing it to take many (399) additional steps. The inpainted obstacle map (Figure 4.6) fills in that gap and the robot eventually takes a path (129 steps) much closer to the ground truth shortest path. 4.5 Takeaways In this paper, a novel framework is presented to model unobserved and occluded regions for planning in outdoor urban environments. The plan- ner uses GAN-based inpainting to generate better informed obstacle maps for planning compared to an initial sensor detection. Experiments validate the ability of the novel framework to reduce the number of steps taken, 88 Figure 4.6: A comparison of a path planner trial on the lidar input obstacle map, the inpainted obstacle map, and the GT map shown in figure 4.5 (image 150). The path is in blue pixels. The start is in the middle left and the goal is in the top right. Red pixels are obstacles the robot observes online. Green pixels show where the path overlaps itself. The light blue circles show a gap in the lidar input map which is filled in on the inpainted image. the number of times replanned, and the number of times replanned be- fore discovering no possible path. The impact of the inpainting is higher as the scene becomes more challenging with more occlusions, as it allows the planner to predict obstacles and dead ends in unobserved spaces and avoid them. 89 CHAPTER 5 SEMANTICALLY ENABLED LONG HORIZON PLANNING 5.1 Motivation Long horizon planning in outdoor environments is a difficult challenge due to limitations in sensor equipment. Lidar-based sensors are able to collect accurate depth measurements but are unable to collect dense measurements at far distances due to sensor resolution. Image based detectors are able to provide semantic information at far distances but are unable to give accu- rate depth at those ranges. In order to address the shortcomings of both sensor paradigms, this chapter presents a semantically aware long-horizon planner which uses receding horizon on a semantic long distance map gen- erated using short distance metrical information and long distance ordinal information. The proposed planner is novel due to planning long distance paths in the long distance map, which combines both short distance metric (metric range) and long distance semantic information (ordinal range with rough depth information). The planner executes the segment of the path which passes through the metric range and replans after taking a new mea- surement. Fig. 5.1 shows a motivating scenario for this work. The image is taken from an outdoor scene with a ZED stereo camera. The purple represents a region close to the robot where metric depth information is available (for 90 example, from a lidar sensor or stereo camera) in order to plan short dis- tance paths such as over traditional occupancy grids. The red arrows point to potential obstacles further off in the distance which the robot must plan around. However, these obstacles may be too far away for a sensor to detect. The orange region in the image represents the portion of the environment where there is only semantic info (e.g. from a camera) but no range info. In order to use information about obstacles in the scene, the robot must be able to detect instance segmentations of the obstacles. For outdoor en- vironments this includes trees and buildings which are not represented in current state of the art datasets. This work also contributes an instance seg- mentation dataset for tree trunks in outdoor environments which is used to train a detector for real world experiments. The main contributions of this work are: • A semantically aware long horizon map for planning which uses metrical information in close ranges and semantic information in far ranges. • A novel planning paradigm which plans a path in the combined metri- cal map (short distance) and semantic map (long distance), and which executes on the planned path in the metric range before replanning. • An instance segmentation dataset for tree trunks to outdoor planning in forested environments. 91 Figure 5.1: Motivating scenario taken at the Mann Library at Cornell Uni- versity. The purple region represents where the robot is able to get accurate depth information and perform planning using traditional metrical planning methods. The red arrows repre- sent far away obstacles which the robot must plan around. The orange represents the region where the robot is able to capture semantic but not depth information in the scene. 5.2 Related Works 92 Metric Based Planners There has been substantial work done in the robotics community for met- ric based planners which rely on accurate metric information from sensors such as lidar [87], [2]. However these methods generally rely on the high accuracy of lidar, which are unable to account for semantic information of obstacles. Stereo cameras are able to capture both depth and semantics but can only plan accurate paths in short distances. Image Based Planners Ref. [65] presents a planner which directly plans on the image space of a detector. By planning directly on the image space, [65] is able to plan over longer distances compared to metric based planners and are able to avoid obstacles as long as they are visible in the image. However, the approach uses a simple color based obstacle detector which maps pixels to obstacles based on their color. A more sophisticated approach is desired, especially with the advent of modern computer vision technology. Ref. [72] proposes a semantically aware planner which uses semantic segmentation information in order enable an unmanned aerial vehicle to track a road. This work demonstrates the ability to bypass the use of a heavy lidar sensor and is able to plan at long ranges using semantic information, and outperforms traditional metric based planners. However, this approach seeks only to track a visible road and does not plan in occluded spaces or in 93 off-road environments. Ref. [78] proposes a sparse map representation which can be derived from monocular vision. This technique estimates the high level structure of the environment and is able to reason about the structure between bound- aries in the environment. However, the approach does not plan in more realistic outdoor environments. Receding Horizon Methods The receding horizon paradigm [30], has found a lot of success in the con- trols and planning community due to its ability to create smooth trajectories by accounting for future actions. Receding horizon methods plan actions over multiple time steps, but only execute the first step in the sequence of planned actions. At the next time step, the planner plans over multiple time steps again, before again only executing the first step. This allows the plan- ner to create smooth trajectories because long term information is encoded into the action of the first time step each iteration. Instance Segmentation There has been a plethora of works in the computer vision community for instance segmentation [36], [92]. These works have performed very well at instance segmentation tasks. However, the datasets available in instance 94 segmentation for outdoor scenes [31], [19], generally only include fore- ground objects such as cars, pedestrians, cyclists and do not include back- ground obstacles such as buildings and trees which are relevant to long horizon path planning. 5.3 Approach First, we demonstrate a basic image space based planner which expands upon [65]. We show the limitations of the basic image spaced based planner map representation. Then, we introduce a semantically aware long horizon map representation, which takes advantage of semantic instance informa- tion with rough long range depth estimates. 5.3.1 Image Based Planning Expanding on [65]. A semantically aware map representation of the envi- ronment is extracted by projecting the image pixels into a 3D point cloud. Ref. [14] is used to extract the semantic segmentation information from the image and then depth from [56] is used to project the semantic pixels into a point cloud. Note that these semantic labels do not include instance in- formation. This map representation expands on the work done in [65] by using a deep learning approach to extract semantic information for obstacle detection, improving on the previous color based classification approach, 95 while also allowing for planning in occluded spaces. Fig. 5.2 shows the key steps for extracting the 3D point cloud. A Rapidly-exploring Random Tree (RRT) planner is then used to plan a path over the ground plane points of the point cloud. The ground plane points are labeled based on the semantic segmentation labels. Fig. 5.3 shows the RRT planner. Fig. 5.3(a) show the nodes (spheres) and edges (lines) of the RRT planner. To plan in occluded spaces on the point cloud, a ground plane is fit to the point cloud using Principle Component Analysis (PCA) and RRT nodes are projected onto the ground plane in regions where obsta- cles where not observed. While this expansion on [65] shows some promise for a image based planner with semantic information, it is unable to reason about the different instances of vegetation in the image, as all vegetation is segmented into the same class. All the trees are labeled as vegetation and the planner is unable to separate the cloud of vegetation points into multiple tree instances. In- stance segmentations are necessary to enable more intelligent decision mak- ing. 5.3.2 Instance Segmentation In order to perform image segmentation and detect relevant background images, a dataset is manually labeled using the Makesense AI tool [77]. The Makesense tool allows the user to manually annotate polygons which rep- 96 (a) (b) (c) Figure 5.2: Example of generating a point cloud for an image based plan- ner. (a) Raw image of the environment. (b) A semantically seg- mented image using [14]. (c) A 3D point cloud projected using depth classification from [56]. 97 (a) (b) Figure 5.3: An example of an RRT image space based planner. (a) Sample space with plan hypotheses. The start is in the bottom center and the goal is in the top right. The start and goal are both annotated as red spheres. RRT nodes are shown as blue spheres and edges shown as lines. Blue lines are visible and red lines are occluded. (b) The planner back-projected into the image space. resent instance segmentations of objects in the scene. A total of 50 images of a local park in Ithaca NY, are manually annotated to label tree trunks. Trees are obstacles for planning in this scene Fig. 5.4 shows an example image taken from the manually labeled in- stance segmentation dataset with tree trunk labels. The white polygons 98 represent individual instance labels for trees. Figure 5.4: A sample image from the manually labeled dataset for instance segmentations. The white polygons represent individual in- stances of trees. A pre-trained model of Mask R-CNN [36] is fine-tuned with the labeled dataset in order to detect instance segmentations of tree trunks which are obstacles in the environment. 5.3.3 Long Distance Map A long distance map is built using instance segmentations from [36] and depth information extracted from the ZED2 stereolabs camera SDK. Each image contains a set of n obstacle instances O = {o1, o2, ...on}. Each instance 99 is classified as a metric obstacle or ordinal obstacle based on their distance from the camera. So there are two sets of obstacles: Om (metric) and Oo (ordinal) where Om ∪ Oo = O, and Om ∩ Oo = ∅. A predefined metric range threshold tmetric is used to classify obstacles as metric or ordinal; obstacle instance masks with greater than 50% of pixels inside tmetric are placed into Om, and all other obstacles instances placed into Oo. Figure 5.5 shows tree trunk detections classified into Om (blue) and Oo (purple). Figure 5.5: Mask R-CNN detections and distance classifications. Tree trunk instances in Om are shown in blue and instances in Oo are shown in purple. To generate the map for planning, the field of view (FOV) is split into the 100 metric FOV and the ordinal FOV using the same tmetric. The metric FOV is the region within tmetric and the ordinal FOV represents the region outside of tmetric. Instance distances from the sensor for Om are calculated based on the average distance of the instance mask pixels within tmetric. This is because depths of edges of objects are extruded backwards and have poor depth information. For tree trunks, a reasonable heuristic is to use the depths of the instance mask that are inside the predefined tmetric, as the size of a tree trunk is relatively small compared. Instance distance from the sensor for Oo are calculated using the average depth of the instance pixels. For ordinal instances, a rough estimate of the depth suffices for the map so taking the average over all of the instance pixels is reasonable. Obstacles in Om are directly projected into the metric FOV. Obstacles in Oo are sorted by their rough stereo depth estimate and the projected into the ordinal FOV with the following steps: 1. The instances in Oo are sorted by distance from the sensor. 2. The instances are projected into the ordinal FOV based on their sorted order but rather than using the rough depth estimate, their depth are separated by a predefined distance dord which is added onto tmetric. For example, if tmetric = 5m, dord = 2.5m and there happened to be three instances in Oo, the closest instance would be projected at 7.5m, the middle instance would be projected at 10m, and the farthest instance would be projected to 12.5m. 101 The long distance map includes both close metrical information and also long distance ordinal information. By combining information from two range spaces a representation is developed to enable planning over larger ranges with metrical and semantic information. (Figure 5.1). Figure 5.6 shows the map generated based on the detections in Figure 5.5. Figure 5.6: The map representation based on the detections show in Figure 5.5. The metric FOV is shown in light blue with the individual tree trunk instances in dark blue. The ordinal FOV is shown khaki, with individual tree trunk instances in purple. The ob- stacles in the ordinal FOV are ordered based on their rough depth estimates. 102 5.3.4 Receding Horizon Planner The proposed planner is inspired by the work the community has done with receding horizon planners. First, the field of view is project to a 2D birds eye view (BOV) image space such as in Figure 5.6. Next, the closest free point (image pixel) in the metric and ordinal FOVs to the global goal is found. Then, an off-the-shelf planner such as A* is used to plan a path from the sensor to that closest free point. However, the robot only traverses the seg- ment of the path that is inside the metric FOV. This intuitively allows the path planner to encode information from the ordinal FOV into the metric FOV, while the robot only traverses the region where there is accurate met- rical information. Once the robot has reached the end of the path segment in the metric FOV, a new measurement is taken and the planner replans a path. By combining both the metric and ordinal regions, the planner is able to plan more informed short distance trajectories, based on long distance information. 5.4 Results The long horizon planner is validated both in simulation and in a real world experiment. 103 5.4.1 Simulation A simulator based on a manually annotated real world outdoor environ- ment is used for simulation experiments of the long horizon planner. To generate a map for the simulator, a Google Maps image (1028 x 709) of an outdoor park in Ithaca (Cass Park), is manually annotated using Makesense AI, with trees as obstacles. Figure 5.7 shows the simulation environment with manually labeled instances of obstacles shown as blue polygons. Figure 5.7: The map used for simulations with manually annotated poly- gons for obstacles in blue. Random start and goal locations are chosen for the simulator with the 104 only constraint being that the start and goal locations have to be at least 600 pixels apart. A map representation such as the one in Figure 5.6 is used for the simulator with a metric range of 100 pixels and ordinal detections of up to 250 pixels from the sensor. In other words, for the simulator, it is assumed that it can detect accurate depth for a range of 100 pixels and it can detect rough depth for a range of 250 pixels. For each new measurement, the sensor is assumed to be pointing in the direction of the goal. The baseline planner used for comparison to the semantically-aware long horizon planner uses the metric FOV to plan. The baseline planner finds the closest point to the goal in the metric FOV and plans a path to that point. Then, the planner takes a new measurement from that point and replans again until it reaches the goal. In 1000 trials with random start and goal locations, where both the long horizon planner and the baseline are able to find a path from the start to the goal, the long horizon planner takes an average of 2.8% fewer steps to reach the goal, with a standard deviation of 2.7%. Table 5.1 shows the num- ber of trials where the baseline took a greater number of steps to reach the goal, less number of steps, and an equal number of steps compared with the long horizon planner. The baseline takes more steps than the long horizon planner 97% of the cases. The improvement of the long horizon planner on the baseline is extremely consistent as it almost always is a speed up. This demonstrates the efficacy of encoding long distance information into the planner, by attaching a qualitative ordinal map onto the metric range of 105 Greater Less Equal 970 28 2 Table 5.1: Number of trials out of 1000, where the baseline took a greater number of steps to reach the goal, less steps, and an equal num- ber of steps compared with the long horizon planner. the sensor, at reducing the number of steps taken. 5.4.2 Real World Experiment Real world experiments were completed in Cass Park in a region shown on Figure 5.8. A ZED2 stereo camera is used as a sensor to collect raw images and depth detections. Measurements are always taken with the sensor fac- ing the goal point. The fine-tuned Mask R-CNN model is used to detect instances of obstacles online. The metric range for the metric FOV (both sensor and planner) is 5 meters. Figures 5.9 and 5.10 show a comparison between the baseline and long horizon planners. In Figure 5.9, the baseline planner detects the mask of the tree trunk from far away but is unable to generate a plan that accounts for the obstacle because the tree trunk is not inside the metric range 5.9 (1b, 2b, 3b). It is not until the sensor is very close to the tree trunk in Figure 5.9 (4a) that the tree is accounted for the planner (see Figure 5.9 (4b)). This causes the planner to have to plan around the obstacle in close range, which is inefficient in close quarters. 106 Figure 5.8: The map of Cass Park in Ithaca used for real world experi- ments. The start and goal points are annotated on in light blue. In Figure 5.10, the long horizon planner is able to detect the tree trunk from far away (1b, 2b) and plans a path which avoids the tree trunk early on, so that by the time the sensor reaches the tree trunk it is able to com- pletely avoid the obstacle (3b). By accounting for the obstacle which was outside of the metric planner range, and by using the proposed receding horizon inspired planner, the planner is able to avoid the cost of avoiding the obstacle at close range, such as in Figure 5.9. The baseline planner to- taled 238 steps and the long horizon planner totaled 230 steps, representing a 3.5% decrease in number of steps taken. 107 (1a) (1b) (2a) 108 (2b) (3a) (3b) 109 (4a) (4b) Figure 5.9: Real world experiment with the baseline planner. (1a-b) show the first measurement and planned path in the sensor FOV, (2a- b) show the second and so on. Images labeled (a) show the raw image detection with the detected instance mask projected on to them. Purple masks indicate the instance is outside the met- ric range, and blue masks indicate the instance is inside the metric range. Images labeled (b) show the birds eye view FOV given the sensor location and image detection. The metric FOV is in light blue and obstacles in the metric range are blue. The global start location is the green circle and the global goal loca- tion is the white circle. The planned path segment in the metric FOV is shown in white. 110 (1a) (1b) (2a) 111 (2b) (3a) (3b) 112 Figure 5.10: Real world experiment with the long horizon planner. (1a-b) show the first measurement and planned path, (2a-b) show the second and so on. Images labeled (a) show the raw im- age detection with the detected instance mask projected on to them. Purple masks indicate the instance is outside the metric range, and blue masks indicate the instance is inside the met- ric range. Images labeled (b) show the birds eye view FOV given the sensor location and image detection. The metric FOV is in light blue and the ordinal FOV is in khaki. Obstacles in the metric range are blue and obstacles outside of the metric range are purple. The global start location is the green circle and the global goal location is the white circle. The planned path segment in the metric FOV is shown in white and the path segment in the ordinal FOV is shown in orange. 5.5 Takeaways This chapter presents a semantically-aware long horizon planner enabled by a map representation of the environment which allows for long distance planning in outdoor environments. Long distance ordinal information is stitched onto a short distance metrical map for a receding horizon path planner which is able to encode information of far away obstacles. The pro- posed long range planner is able to bypass the limitations of depth sensors, and utilize the rich semantic image information for long distance planning. Experimental results in simulation and a real world scenario show that the long horizon planner is able to reduce path length compared to a traditional metric based path planner. 113 CHAPTER 6 SUMMARY AND CONCLUSIONS Some of the key challenges in robotics moving forward are in making intelligent decisions in complex unstructured environments which cannot be cleanly discretized. My work focuses on how to use perception systems to enable robots to interact and plan in unstructured environments. Chapter 2 presents a model for pedestrian motion, which combines low level velocity flow fields and high level decision points. The model is capa- ble of predicting a distribution over future motion and performing online anomaly detection. Chapter 3 presents a framework for planning in uncertain outdoor envi- ronments with different semantic surface types. A NBV planner iteratively decreases uncertainty in a multiple hypothesis path planner until a safe path is found or all paths are eliminated as unsafe. Chapter 4 presents a framework to use a data-driven model to navigate in occlusion heavy environments. The framework is intuitive and is ca- pable of inferring semantic information of occluded regions in addition to geometric information. Chapter 5 develops and demonstrates a semantically aware long hori- zon planner which allows for encoding of long distance detections in image space into a short distance metric range based planner. 114 The main contributions of my research are: • A novel pedestrian motion model which is capable of modeling low level velocities and high level discrete transitions. • A framework for utilizing the uncertainties from a segmentation net- work for safe path planning in unstructured outdoor environments. • A novel data-driven approach to planning in occluded environments which outperforms traditional planners. • A long horizon planner which bridges the gap between long distance semantic detections in image space with a short distance metrical planner. Key areas of future research should focus on innovative approaches to using techniques from the computer vision community to expand the boundaries of robot capabilities in uncertain, unstructured, and outdoor environments. These are challenging environments to operate in, and the fusion of semantic information (from advances in computer vision) with traditional metric robot planning will be the direction of my work. In par- ticular, combining the semantically aware long horizon planner presented in Chapter 5, with the inpainting based planning approach in Chapter 4, is a potentially innovative approach to intelligent planning over long dis- tances with occluded spaces. In addition, leveraging the uncertainties for robust navigation with a method similar to the one from Chapter 3 would make the planner robust from complex environmental conditions such as weather or lighting. 115 BIBLIOGRAPHY [1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Ro- bicquet, Li Fei-Fei, and Silvio Savarese. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the Conference on Com- puter Vision and Pattern Recognition (CVPR), 2016. [2] B.L.E.A. Balasuriya, B.A.H. Chathuranga, B.H.M.D. Jayasundara, N.R.A.C. Napagoda, S.P. Kumarawadu, D.P. Chandima, and A.G.B.P. Jayasekara. Outdoor robot navigation using gmapping based slam al- gorithm. In 2016 Moratuwa Engineering Research Conference (MERCon), pages 403–408, 2016. [3] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall. SemanticKITTI: A Dataset for Semantic Scene Understand- ing of LiDAR Sequences. In Proc. of the IEEE/CVF International Conf. on Computer Vision (ICCV), 2019. [4] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. Material recognition in the wild with the materials in context database. In Proc. CVPR, pages 3479–3487, 2015. [5] Marcelo Bertalmı́o, Guillermo Sapiro, Vicent Caselles, and C. Ballester. Image inpainting. pages 417–424, 01 2000. [6] C. Bishop. Pattern Recognition and Machine Learning. Springer Sci- ence+Business Media, 2006. [7] R. Border, J. D. Gammell, and P. Newman. Surface edge explorer (SEE): Planning next best views directly from 3d observations. In Proc. ICRA, pages 6116–6123, 2018. [8] J. Brandt. At the intersection of business and technology. PinHawk- Blog, 2015. [9] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Os- car Beijbom. nuscenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019. 116 [10] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In Proc. CVPR, pages 1209–1218, 2018. [11] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015. [12] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching net- work. In Proceedings of the IEEE Conference on Computer Vision and Pat- tern Recognition, pages 5410–5418, 2018. [13] R. Omar Chavez-Garcia, Jérôme Guzzi, Luca Maria Gambardella, and Alessandro Giusti. Learning ground traversability from simulations. IEEE RAL, 3:1695–1702, 2017. [14] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolu- tion for semantic image segmentation. In Proc. ECCV, pages 801–818, 2018. [15] Y. Chen, M. Liu, S. Liu, J. Miller, and J. How. Predictive modeling of pedestrian motion patterns with bayesian nonparametrics. In AIAA Guidance, Navigation, and Control Conference, 2016. [16] Yu Fan Chen, Miao Liu, and Jonathan P. How. Augmented dictio- nary learning for motion prediction. In IEEE International Conference on Robotics and Automation (ICRA), 2016. [17] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proc. CVPR, pages 1251–1258, 2017. [18] Paolo Cignoni, Marco Callieri, Massimiliano Corsini, Matteo Dellepi- ane, Fabio Ganovelli, and Guido Ranzuglia. Meshlab: an open-source mesh processing tool. In Eurographics Italian chapter conference, vol- ume 1, pages 129–136, 2008. [19] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and 117 Bernt Schiele. The cityscapes dataset for semantic urban scene under- standing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. [20] Kenny Daniel, Alex Nash, Sven Koenig, and Ariel Felner. Theta*: Any- angle path planning on grids. J Artif Intell Res, 39:533–579, 2010. [21] Jonathan Daudelin and Mark Campbell. An adaptable, probabilistic, next best view algorithm for reconstruction of unknown 3d objects. IEEE RAL, 2(3):1540–1547, 2017. [22] Jeffrey A. Delmerico, Alessandro Giusti, Elias Mueggler, Luca Maria Gambardella, and Davide Scaramuzza. ”on-the-spot training” for ter- rain classification in autonomous air-ground collaborative teams. In Proc. ISER, 2016. [23] Omar El Harrouss, Noor Almaadeed, Somaya Al-ma’adeed, and Y. Ak- bari. Image inpainting: A review. Neural Processing Letters, 12 2019. [24] D. Ellis, E. Sommerlade, and I. Reid. Modelling pedestrian trajectory patterns with gaussian processes. In IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, 2009. [25] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD- 96 Proceedings, 1996. [26] Dave Ferguson and Anthony Stentz. Using interpolation to improve path planning: The field D* algorithm. J Field Robot, 23:79–101, 2006. [27] N. Ferreira, J. Klosowski, C. Scheidegger, and C. Silva. Vector field k-means: Clustering trajectories by fitting multiple vector fields. Com- puter Graphics Forum, 2013. [28] Paul Filitchkin. Visual Terrain Classification For Legged Robots. PhD the- sis, University of California, Santa Barbara, 2011. [29] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approxima- 118 tion: Representing model uncertainty in deep learning. In Proc. ICML, 2016. [30] Carlos E. Garcı́a, David M. Prett, and Manfred Morari. Model predic- tive control: Theory and practice—a survey. Automatica, 25(3):335–348, 1989. [31] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3354– 3361, 2012. [32] Timo Hackel, N. Savinov, L. Ladicky, J. Wegner, K. Schindler, and M. Pollefeys. SEMANTIC3D.NET: A new large-scale point cloud clas- sification benchmark. ISPRS Annals, 4(1):91–98, 2016. [33] Yutao Han, Hubert Lin, Jacopo Banfi, Mark Campbell, and Kavita Bala. DeepSemanticHPPC: Hypothesis-based planning over uncertain semantic point clouds. In Proc. ICRA, 2020. To appear. [34] Jason Hardy, Frank Havlak, and Mark Campbell. Multi-step prediction of nonlinear gaussian process dynamics models with adaptive gaus- sian mixtures. The International Journal of Robotics Research, 2015. [35] Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE T Syst Sci Cyb, 4(2):100–107, 1968. [36] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017. [37] Benjamin Hepp, Debadeepta Dey, Sudipta N. Sinha, Ashish Kapoor, Neel Joshi, and Otmar Hilliges. Learn-to-score: Efficient 3d scene ex- ploration by predicting view utility. In Proc. ECCV, pages 437–452, 2018. [38] T. Ikeda, Y. Chigodo, D. Rea, F. Zanlungo, M. Shiomi, and T. Kanda. 119 Modeling and prediction of pedestrian behavior based on the sub-goal concept. In Robotics: Science and Systems, 2012. [39] Alexander Ivanov and Mark E. Campbell. Uncertainty constrained robotic exploration: An integrated exploration planner. IEEE T Contr Syst T, 27:146–160, 2016. [40] H. Jeung, H. Shen, and X. Zhou. Mining trajectory patterns using hidden markov models. In Data Warehousing and Knowledge Discovery, 2007. [41] J. Joshua, F. Doshi-Velez, A. Huang, and N. Roy. A bayesian non- parametric approach to modeling motion patterns. Autonomous Robots, 2011. [42] Gregory Kahn, Adam Villaflor, Bosen Ding, Pieter Abbeel, and Sergey Levine. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. In Proc. ICRA, pages 5129– 5136, 2018. [43] Sertac Karaman and Emilio Frazzoli. Sampling-based algorithms for optimal motion planning. Inter J Robot Res, 30(7):846–894, 2011. [44] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Proc. NIPS, pages 5574–5584, 2017. [45] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In Proc. CVPR, pages 9404–9413, 2019. [46] T. Klinger, F. Rottensteiner, and C. Heipke. Probabilistic multi-person tracking using dynamic bayes networks. ISPRS Annals of Photogram- metry, Remote Sensing and Spatial Information Sciences, 2015. [47] Sven Koenig and Maxim Likhachev. D* lite. In Proc. AAAI, pages 476– 483, 2002. 120 [48] Philipp Krüsi, Paul Furgale, Michael Bosse, and Roland Siegwart. Driving on point clouds: Motion planning, trajectory optimization, and terrain assessment in generic nonplanar environments. J Field Rob, 34(5):940–984, 2017. [49] T. Laursen, N. B. Pedersen, J. J. Nielsen, and T. K. Madsen. Hidden markov model based mobility learning fo improving indoor tracking of mobile users. In 9th Workshop on Positioning, Navigation and Commu- nication, 2012. [50] S. LaValle and J. Kuffner. Randomized kinodynamic planning. Int J Robot Res, 20(5):378–400, 2001. [51] Steven M. LaValle. Rapidly-exploring random trees: A new tool for path planning. 1998. [52] Steven M LaValle. Rapidly-exploring random trees: A new tool for path planning. 1998. [53] Steven M. LaValle. Planning Algorithms. Cambridge University Press, New York, NY, USA, 2006. [54] J. Lee, J. Han, and K. Whang. Trajectory clustering: A partition-and- group framework. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2007. [55] Kuan-Hui Lee, J. Hwang, G. Okapal, and J. Pitton. Driving recorder based on-road pedestrian tracking using visual slam and constrained multiple-kernel. In 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), 2014. [56] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Computer Vision and Pattern Recognition (CVPR), 2018. [57] C. Lu and G. Dubbelman. Semantic foreground inpainting from weak supervision. IEEE RA-L, 5(2):1334–1341, 2020. 121 [58] Khairul Azmi Mahadhir, Shing Chiang Tan, Cheng Yee Low, Roman Dumitrescu, Adam Tan Mohd Amin, and Ahmed Jaffar. Terrain clas- sification for track-driven agricultural robots. In Proc. SYSINT, pages 775 – 782, 2014. [59] R. Manduchi, A. Castano, A. Talukder, and L. Matthies. Obstacle de- tection and terrain classification for autonomous off-road navigation. Auton Robot, 18(1):81–102, 2005. [60] W. Mathew, R. Raposo, and B. Martins. Predicting future locations with hidden markov models. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, 2012. [61] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environ- ments. arXiv preprint arXiv:1611.03673, 2016. [62] Robin R Murphy, Satoshi Tadokoro, Daniele Nardi, Adam Jacoff, Paolo Fiorini, Howie Choset, and Aydan M Erkmen. Search and rescue robotics. In Springer handbook of robotics. Springer, 2008. [63] B. Nagy and A. Kelly. Trajectory generation for car-like robots using cubic curvature polynomials. In Proc. FSR, 2001. [64] J. J. Nielsen, N. Amiot, and T. K. Madsen. Directional hidden markov model for indoor tracking of mobile users and realistic case study. In European Wireless 2013; 19th European Wireless Conference, 2013. [65] Michael W. Otte, Scott G. Richardson, Jane Mulligan, and Gregory Z. Grudic. Path planning in image space for autonomous robot naviga- tion in unstructured environments. J Field Robot, 26:212–240, 2009. [66] P. Purkait, C. Zach, and I. Reid. Seeing behind things: Extending se- mantic segmentation to occluded regions. In Proc. IROS, pages 1998– 2005, 2019. [67] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. 122 Frustum pointnets for 3d object detection from rgb-d data. In Proc. CVPR, pages 918–927, 2018. [68] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Point- net: Deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593, 2016. [69] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmen- tation. In Proc. CVPR, pages 77–85, 2017. [70] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Point- net++: Deep hierarchical feature learning on point sets in a metric space. In Proc. NIPS, pages 5099–5108, 2017. [71] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi- camera tracking. In European Conference on Computer Vision workshop on Benchmarking Multi-Target Tracking, 2016. [72] Markus Ryll, J. Ware, J. Carter, and N. Roy. Semantic trajectory plan- ning for long-distant unmanned aerial vehicle navigation in urban en- vironments. 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1551–1558, 2020. [73] Markus Ryll, John Ware, John Carter, and Nick Roy. Efficient trajec- tory planning for high speed flight in unknown environments. In Proc. ICRA, pages 732–738, 2019. [74] Samuel Schulter, Menghua Zhai, Nathan Jacobs, and Manmohan Kr- ishna Chandraker. Learning to look around objects for top-view repre- sentations of outdoor scenes. ArXiv, abs/1803.10870, 2018. [75] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Air- sim: High-fidelity visual and physical simulation for autonomous ve- hicles. In Field and Service Robotics, 2017. [76] A. Shaw and N. Gopalan. Finding frequent trajectories by clustering 123 and sequential pattern mining. Journal of Traffic and Transportation En- gineering, 2014. [77] Piotr Skalski. Make Sense. https://github.com/SkalskiP/make- sense/, 2019. [78] Gregory J. Stein, Christopher Bradley, Victoria Preston, and N. Roy. En- abling topological planning with monocular vision. 2020 IEEE Interna- tional Conference on Robotics and Automation (ICRA), pages 1667–1673, 2020. [79] Gregory J. Stein, Christopher Bradley, and Nicholas Roy. Learning over subgoals for efficient navigation of structured, unknown environ- ments. In Proc. CORL, pages 213–222, 2018. [80] Gregory J Stein, Christopher Bradley, and Nicholas Roy. Learning over subgoals for efficient navigation of structured, unknown environ- ments. In Proc. CORL, pages 213–222, 2018. [81] Anthony Stentz. Optimal and efficient path planning for partially known environments. In Proc. ICRA, 1994. [82] Anthony Stentz. The focussed D* algorithm for real-time replanning. In Proc. IJCAI, 1995. [83] Hang Su, Jun Zhu, Yinpeng Dong, and Bo Zhang. Forecast the plau- sible paths in crowd scenes. In Proceedings of the Twenty-Sixth Interna- tional Joint Conference on Artificial Intelligence (IJCAI-17), 2017. [84] Lei Tai, Giuseppe Paolo, and Ming Liu. Virtual-to-real deep reinforce- ment learning: Continuous control of mobile robots for mapless navi- gation. In Proc. IROS, pages 31–36, 2017. [85] Y. W. Teh. Dirichlet processes. In Encyclopedia of Machine Learning. Springer, 2010. [86] Jesus Tordesillas, Brett T Lopez, and Jonathan P How. FASTER: Fast and safe trajectory planner for flights in unknown environments. In Proc. IROS, pages 1934–1940, 2019. 124 [87] Emmanouil Tsardoulias, Katerina Iliakopoulou, Andreas Kargakos, and Loukas Petrou. A review of global path planning methods for oc- cupancy grid maps regardless of obstacle density. Journal of Intelligent Robotic Systems, 84, 12 2016. [88] Dizan Vasquez, Thierry Fraichard, and Christian Laugier. Incremental learning of statistical motion patterns with growing hidden markov models. IEEE Transactions on Intelligent Transportation Systems, 2009. [89] Krzysztof Walas. Terrain classification and negotiation with a walking robot. J Intell Robot Syst, 78:401–423, 2015. [90] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bron- stein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018. [91] Stefan Witwicki, Jose Carlos Castillo, Joao Messias, Jesus Capitan, Francisco S Melo, Pedro U Lima, and Manuela Veloso. Autonomous surveillance robots: A decision-making framework for networked muiltiagent systems. IEEE Robotics & Automation Magazine, 2017. [92] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019. [93] R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with deep generative mod- els. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6882–6890, 2017. [94] J. Yen. Finding the k shortest loopless paths in a network. Manage Sci, 17(11):712–716, 1971. [95] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. arXiv preprint arXiv:1801.07892, 2018. [96] Jingwei Zhang, Jost Tobias Springenberg, Joschka Boedecker, and Wol- fram Burgard. Deep reinforcement learning with successor features 125 for navigation across similar environments. In Proc. IROS, pages 2371– 2378, 2017. [97] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. [98] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proc. CVPR, pages 633–641, 2017. [99] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proc. CVPR, pages 4490–4499, 2018. 126