DEEP LEARNING FOR ROBOTICS
A Dissertation Presented to the Faculty of the Graduate School
of Cornell University in Partial Fulﬁllment of the Requirements for the Degree of
Doctor of Philosophy
by Ian Nathaniel Lenz
May 2016

c 2016 Ian Nathaniel Lenz ALL RIGHTS RESERVED

DEEP LEARNING FOR ROBOTICS Ian Nathaniel Lenz, Ph.D. Cornell University 2016
Robotics faces many unique challenges as robotic platforms move out of the lab and into the real world. In particular, the huge amount of variety encountered in real-world environments is extremely challenging for existing robotic control algorithms to handle. This necessistates the use of machine learning algorithms, which are able to learn controls given data. However, most conventional learning algorithms require hand-designed parameterized models and features, which are infeasible to design for many robotic tasks. Deep learning algorithms are general non-linear models which are able to learn features directly from data, making them an excellent choice for such robotics applications. However, care must be taken to design deep learning algorithms and supporting systems appropriate for the task at hand. In this work, I describe two applications of deep learning algorithms and one application of hardware neural networks to diﬃcult robotics problems. The problems addressed are robotic grasping, food cutting, and aerial robot obstacle avoidance, but the algorithms presented are designed to be generalizable to related tasks.

BIOGRAPHICAL SKETCH Ian Lenz was born February 22, 1988 in Pullman, WA. He ﬁrst became interested in robotics through FIRST Robotics Team 639 at Ithaca High School in Ithaca, NY, acting as president for the 2006 season. He received a Bachelor of Science in Electrical and Computer Engineering with a minor in Robotics from Carnegie Mellon University in 2010.
iii

ACKNOWLEDGEMENTS
I’d like to thank my committee – Ashutosh Saxena, Ross Knepper, Noah Snaveley, and Rajit Manohar – for all their advice and help. I’d also like to thank all my collaborators, in particular Honglak Lee and Mevlana Gemici, and my lab-mates, in particular Jaeyong Sung, Ashesh Jain, Yun Jiang, and Hema Koppula. A special thanks also to the huge team that made our aerial robot work possible, Dharmendra Modha, Shyamal Chandra, Thomas Zimmerman, and Myron Flickner at IBM, and Dale Cassidy, Jerry Yeh, Jasdeep Hundal, and Brian Wojcik at Cornell.
A huge thanks also to all my wonderful teachers through the years. In particular, I want to thank Rosely Teukolsky for being the greatest AP CS teacher any student could ask for. I can’t overstate how much having such a strong foundation in programming and a wonderful teacher who encouraged interest has helped. Thanks also to all the advisors of the IHS Code Red Robotics team, including Scott Breigle, Mike Peters, Dave Buchner, Ian Krywe, Bill Brooks, Doug Fornell, and Roger Simpson. Code Red was my ﬁrst taste of “real” robotics, and wouldn’t have been possible without them.
On a more personal note, a huge thanks to my family, especially Mom and Eliot, who were a huge source of support throughout my PhD.
On a lighter note, I’d like to thank the folks that kept me fed through my PhD, especially the crews at Gorgers, the Red and White Cafe, and Mexeo, which is sorely missed. On an even ligher note, I’d like to thank Aaron Rodgers, since watching the Packers win has been more of a morale boost than it probably should’ve been.
iv

TABLE OF CONTENTS

Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

1 Introduction

1

1.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2 Deep Learning for Robotics . . . . . . . . . . . . . . . . . . . . . . 9

1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.3.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.3.2 Robotic Manipulation . . . . . . . . . . . . . . . . . . . . . 15

1.4 This Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Deep Learning for Detecting Robotic Grasps

18

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.2.1 Robotic Grasping . . . . . . . . . . . . . . . . . . . . . . . . 21

2.2.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.3 Deep Learning for Grasp Detection:

System and Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.3.1 Inference and Learning . . . . . . . . . . . . . . . . . . . . . 29

2.4 System Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.4.1 Data Pre-Processing . . . . . . . . . . . . . . . . . . . . . . 31

2.4.2 Preserving Aspect Ratio . . . . . . . . . . . . . . . . . . . . 32

2.5 Structured Regularization for Feature Learning . . . . . . . . . . . 34

2.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.6.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.6.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

2.6.3 Metrics for Detection . . . . . . . . . . . . . . . . . . . . . . 40

2.7 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 42

2.7.1 Deep Learning for Robotic Grasp Detection . . . . . . . . . 42

2.7.2 Multimodal Group Regularization . . . . . . . . . . . . . . . 46

2.7.3 Two-stage Detection System . . . . . . . . . . . . . . . . . . 47

2.8 Robotic Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 48

2.9 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . 57

2.10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3 DeepMPC: Learning Deep Latent Features for Model Predictive

Control

61

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

v

3.2.1 Robotic Control . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.2 Model Learning for Control . . . . . . . . . . . . . . . . . . 3.2.3 Policy Learning . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.4 Robotic Manipulation . . . . . . . . . . . . . . . . . . . . . 3.3 Problem Deﬁnition and System . . . . . . . . . . . . . . . . . . . . 3.3.1 Model-Predictive Control: Background . . . . . . . . . . . . 3.4 Modeling Time-Varying Non-Linear Dynamics with Deep Networks 3.4.1 Deep Learning - Background . . . . . . . . . . . . . . . . . . 3.4.2 DeepMPC Architecture . . . . . . . . . . . . . . . . . . . . . 3.5 Learning and Inference . . . . . . . . . . . . . . . . . . . . . . . . . 3.6 System Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.7 Real-time Robotic DeepMPC System . . . . . . . . . . . . . . . . . 3.8 Prediction Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 3.9 Robotic Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 3.9.1 Handling Variety . . . . . . . . . . . . . . . . . . . . . . . . 3.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

66 67 68 69 69 73 73 74 77 80 84 86 88 92 96 99

4 Low-Power Parallel Algorithms for Single Image based Obstacle

Avoidance in Aerial Robots

102

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

4.3 Neuromorphic Hardware . . . . . . . . . . . . . . . . . . . . . . . . 106

4.4 Obstacle Classiﬁcation . . . . . . . . . . . . . . . . . . . . . . . . . 107

4.4.1 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

4.4.2 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

4.4.3 Visual Features . . . . . . . . . . . . . . . . . . . . . . . . . 112

4.4.4 Spatially-varying models and multiple models . . . . . . . . 112

4.4.5 Tuning for Power Consumption . . . . . . . . . . . . . . . . 113

4.5 Obstacle Avoidance Manuevers . . . . . . . . . . . . . . . . . . . . 114

4.6 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 118

4.6.1 Oﬄine Learning Experiments . . . . . . . . . . . . . . . . . 118

4.6.2 Robotic Experiments in Real Environments . . . . . . . . . 121

4.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

4.8 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

5 Conclusion

126

vi

LIST OF TABLES 2.1 Recognition results for Cornell grasping dataset. . . . . . . 41 2.2 Recognition results for diﬀerent modalities, for a deep net-
work pre-trained using SAE. . . . . . . . . . . . . . . . . . . . . . 43 2.3 Detection results for point and rectangle metrics, for various
learning algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.4 Results for robotic experiments for Baxter, sorted by object
category, for a total of 100 trials. Tr. indicates number of trials, Acc. indicates accuracy (in terms of success percentage.) . . . . . . 50 2.5 Results for robotic experiments for PR2, sorted by object category, for a total of 100 trials. Tr. indicates number of trials, Acc. indicates accuracy (in terms of success percentage.) . . . . . . 50 3.1 Conﬁdence at 0.5s: Mean L2 error and 95% conﬁdence interval at prediction time of 0.5s (all in mm) . . . . . . . . . . . . . . . . . 91 4.1 Obstacle classiﬁcation results: Classiﬁer precision and recall, in percent, for four obstacle classes, and average across classes. Results presented for baseline local classiﬁer and MRF model, with and without spatially varying model, and varying weight cardinalities.116 4.2 Robotic experiment results: Error rates presented for classiﬁcation, motion execution, and overall successful avoidance . . . . . 121
vii

LIST OF FIGURES
1.1 Variety of robotic tasks: In the real world, robots must perform a huge range of tasks in order to function eﬀectively. Here, I show several tasks which I address in this work – from left to right, grasping novel objects, cutting novel food items, and avoiding various obstacles. Each of these tasks themselves also involves a huge amount of variety – the robot must be able to pick up any household object, cut any food item, or avoid any obstacle. . . . . 1
1.2 Deep network and auto-encoder: Left: A deep network with two hidden layers, which transform the input representation, and a logistic classiﬁer at the top layer, which uses the features from the second hidden layer to predict output. Right: An auto-encoder, used for pretraining. A set of weights projects input features to a hidden layer. The same weights are then used to project these hidden unit outputs to a reconstruction of the inputs. In the sparse auto-encoder (SAE) algorithm, the hidden unit activations are also penalized. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Detecting robotic grasps: Left: A cluttered lab scene labeled with rectangles corresponding to robotic grasps for objects in the scene. Green lines correspond to robotic gripper plates. I use a twostage system based on deep learning to learn features and perform detection for robotic grasping. Center: A Baxter robot “Yogi” successfully executing a grasp detected by my algorithm. Right: The grasp detected for this case, in the RGB (top) and depth (bottom) images obtained from Kinect. . . . . . . . . . . . . . . . . . . . . . 18
2.2 Detecting and executing grasps: From left to right: My system obtains an RGB-D image from a Kinect mounted on the robot, and searches over a large space of possible grasps, for which some candidates are shown. For each of these, it extracts a set of raw features corresponding to the color and depth images and surface normals, then uses these as inputs to a deep network which scores each rectangle. Finally, the top-ranked rectangle is selected and the corresponding grasp is executed using the parameters of the detected rectangle and the surface normal at its center. Red and green lines correspond to gripper plates, blue in RGB-D features indicates masked-out pixels. . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Illustration of my two-stage detection process: Given an image of an object to grasp, a small deep network is used to exhaustively search potential rectangles, producing a small set of topranked rectangles. A larger deep network is then used to ﬁnd the top-ranked rectangle from these candidates, producing a single optimal grasp for the given object. . . . . . . . . . . . . . . . . . . . 26
viii

2.4 Preserving aspect ratio: Left: a pair of sunglasses with a potential grasping rectangle. Red edges indicate gripper plates. Center: image taken from the rectangle and rescaled to ﬁt a square aspect ratio. Right: same image, padded and centered in the receptive ﬁeld. Blue areas indicate masked-out padding. When rescaled, the rectangle incorrectly appears graspable. Preserving aspect ratio and padding allows the rectangle to correctly appear non-graspable. 31
2.5 Improvement from mask-based scaling: Left: Result without mask-based scaling. Right: Result with mask-based scaling. . . . . 31
2.6 Three possible models for multimodal deep learning: Left: fully dense model—all visible features are concatenated and modality information is ignored. Middle: modality-speciﬁc sparse model - separate ﬁrst layer features are trained for each modality. Right: group-sparse model—a structured regularization term encourages features to use only a subset of the input modes. . . . . . . . . . . 32
2.7 Features learned from grasping data: Each feature contains seven channels - from left to right, depth, Y, U, and V image channels, and X, Y, and Z surface normal components. Vertical edges correspond to gripper plates. Left: eight features with the strong positive correlations to rectangle graspability. Right: similar, but negative correlations. Group regularization eliminates many modalities from many of these features, making them more robust. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.8 Example objects from the Cornell grasping dataset: [61]. This dataset contains objects from a large variety of categories. . . 38
2.9 Learned 3D depth features: 3D meshes for depth channels of the four features with strongest positive (top) and negative(bottom) correlations to rectangle graspability. Here X and Y coordinates corresponds to positions in the deep network’s receptive ﬁeld, and Z coordinates corresponds to weight values to the depth channel for each location. Feature shapes clearly correspond to graspable and non-graspable structures, respectively. . . . . . . . . . . . . . . 41
2.10 Visualization of grasping scores for diﬀerent grippers: Red indicates maximum score for a grasp with left gripper plane centered at each point, blue is similar for the right plate. Best-scoring rectangle shown in green/yellow. . . . . . . . . . . . . . . . . . . . 45
2.11 Improvements from group regularization: Cases where my group regularization approach produces a viable grasp (shown in green and yellow), while a network trained only with simple L1 regularization does not (shown in blue and red). Top: RGB image, bottom: depth channel. Green and blue edges correspond to gripper. 45
ix

2.12 Improvements from two-stage system: Example cases where the two-stage system produces a viable grasp (shown in green and yellow), while the single-stage system does not (shown in blue and red). Top: RGB image, bottom: depth channel. Green and blue edges correspond to gripper. . . . . . . . . . . . . . . . . . . . . . . 47
2.13 Robotic experiment objects: Several of the objects used in experiments, including challenging cases such as an oddly-shaped RC car controller, a cloth towel, plush cat, and white ice cube tray. 48
2.14 Robots executing grasps: My robots grasping several objects from the experimental dataset. Top row: Baxter grasping a quadrotor casing, coﬀee mug, ice cube tray, knife, and electric shaver. Middle row: Baxter grasping a desk lamp, cheese grater, umbrella, cloth towel, and hot glue gun. Bottom row: PR2 grasping a plush cat, RC car controller, cereal box, toy elephant, and glove. . . . . . 49
3.1 Cutting food: My PR2 robot uses my algorithms to perform complex, precise food-cutting operations. Given the large variety of material properties, it is challenging to design appropriate controllers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.2 Food materials: Some of the 20 diverse food materials which make up my material interaction dataset. These include tough vegetables like carrots and potatoes, thick-skinned fruits like lemons and limes, and soft items like butter and bananas, all of which require diﬀerent techniques to cut properly. . . . . . . . . . . . . . 65
3.3 Variation in cutting dynamics: plots showing desired (green) and actual (blue) trajectories, along with error (red) obtained using a stiﬀness controller while cutting butter (left) and a lemon at low (middle) and high (right) rates of vertical motion. Butter resists the knife signiﬁcantly less than the lemon. Even though only the vertical cutting rate is the only change between the middle and right-hand plots, dynamics along the sawing axis also change signiﬁcantly. Dynamics also vary with time for the lemon as the knife cuts through the skin and into the ﬂesh. . . . . . . . . . . . . . . 70
3.4 Gripper axes: PR2’s gripper with knife grasped, showing the axes used in this chapter. The X (“sawing”) axis points along the blade of the knife, Y points normal to the blade, and Z points vertically. 72
3.5 Deep predictive model: Architecture of my recurrent conditional deep predictive dynamics model. Transforming recurrent units (TRUs) on the left model time-varying latent properties which aﬀect system dynamics. On the right, conditional multiplicative modulation is used again to condition future system responses on past observed dynamics and latent features. . . . . . . . . . . . . 77
x

3.6 Online system: Block diagram of my DeepMPC system. Parameters learned using my three-stage deep learning algorithm are loaded by the optimization process, which then continually predicts future states and updates future controls based on these predictions. The control process takes state information from the robot, transmits it to the optimization process, and transmits controls optimized by that process to the robot. . . . . . . . . . . . . . . . 87
3.7 Prediction error: Mean L2 distance (in mm) from predicted to ground-truth trajectory from 0.01s to 0.5s in the future. While most models give similar performance up to 0.1s, models with linear components give very weak long-term results. Non-parametric methods give better results, but are hampered by expensive inference which scales poorly. Recurrent deep networks give the best results, with my approach outperforming all others after 0.1s, reducing error at 0.5s by 46% as compared to the baseline recurrent network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.8 Robotic experiment results: Mean cutting rates, with bars showing normalized standard deviation, for ten diverse materials. Red bar uses the same controller for all materials, blue bar uses the same for each cluster given by [42], purple uses a tuned stiﬀness controller for each, and green is my online MPC method. My algorithm consistently gives higher mean rates, making statistically signiﬁcant improvements for all materials except butter and tofu. Particularly signiﬁcant improvements are seen for tough, varying materials such as carrots and potatoes. . . . . . . . . . . . . . . . . 93
3.9 Cutting food: Time-series of my PR2 robot using my DeepMPC controller to cut several of the food items in my dataset. My algorithm is able to adapt its strategy for diﬀerent materials. Note in particular that it picks up the knife slightly, then chops back down when cutting the carrot, and uses more “sawing” motion on tougher materials. Video of these experiments is available at http://deepmpc.cs.cornell.edu . . . . . . . . . . . . . . . . . . 94
3.10 Diﬀerent tools: Diﬀerent knives used to test my algorithm. From left to right, the paring knife used to collect data and train the algorithm, a shorter, sharper paring knife, a long kitchen knife, a wedge-shaped chef’s knife, and a serrated steak knife. In all cases except the serrated knife, my algorithm, trained only with the paring knife on the left, was able to maintain comparable cutting rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.1 Avoiding obstacles: I use low-power parallel hardware to compute an obstacle map, given a single image from the camera onboard the aerial robot. I then use these results to select an evasive maneuver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
xi

4.2 Classiﬁcation improvements from MRF: Left: input images. Middle: initial classiﬁcation. Right: classiﬁcation with full MRF model with belief propagation. Initial classiﬁcation results which would present problems for navigation, but are greatly improved by integrating non-local information using my MRF. . . . . . . . . 109
4.3 Filter set: My ﬁlter set, which includes two scales of Gabor ﬁlters at six orientations and two scales of diﬀerence-of-Gaussian ﬁlters. . 112
4.4 BP in action: Time series of belief propagation in action. Left: original image and baseline classiﬁcation result. Top: internal node potential. Middle: spike locations. Bottom: spike counts. Time proceeds from left to right. Best viewed in color. . . . . . . . . . . 115
4.5 Visual results: Figure showing the results of my algorithm for a variety of obstacles. Top: Input image. Bottom: Inferred obstacle labels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.6 Precision-recall curve: Precision vs. recall for cell-based error metric, demonstrating improvements over baseline results for navigation purposes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.7 Varying lighting: Classiﬁcation results for a tree trunk classiﬁer using my algorithm, on the same tree under very diﬀerent lighting conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.8 Avoiding obstacles in series: My aerial robot avoiding a fence and a pole in sequence. Left: Overhead map of area, red indicates obstacles avoided, blue robot path. Robot started at the blue dot, behind and below the level of the fence, moved upwards to a safe altitude, and proceeded over the fence and through the clear area. It then stopped at the pole, detected it as an obstacle, moved diagonally to the right to avoid it, and then forwards again through the following clear area. . . . . . . . . . . . . . . . . . . . . . . . . 122
4.9 Avoiding obstacles: AirRobot avoiding obstacles based on my classiﬁcation results (robot circled in red in cases where it’s diﬃcult to see) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
xii

CHAPTER 1 INTRODUCTION
Figure 1.1: Variety of robotic tasks: In the real world, robots must perform a huge range of tasks in order to function eﬀectively. Here, I show several tasks which I address in this work – from left to right, grasping novel objects, cutting novel food items, and avoiding various obstacles. Each of these tasks themselves also involves a huge amount of variety – the robot must be able to pick up any household object, cut any food item, or avoid any obstacle.
The ﬁeld of robotics is at a very exciting point. Owing both to advances in algorithms and increasing computational power, robots are poised to move out of the lab and other special purpose applications such as manufacturing, and into our everyday lives. This can be seen especially from the recent upsurge in interest in autonomous vehicles, but robotics has the potential for even greater impact.
As robotics begins to move from the lab to the real world, robots face many new challenges, some of which are shown in Figure 1.1. Consider a household personal assistant robot. Such a robot must perform many complex tasks, such as sorting and folding clothes, operating appliances, picking up and cleaning, and preparing food in the kitchen. Moreover, it must handle the huge variety of objects, materials, and the like associated with these tasks – for example, picking up diﬀerent objects, some of which it may never have seen before, or preparing diﬀerent food items. For many of these problems, there exists only an abstract relationship between the
1

robot’s visible inputs and the task at hand – for example, attempting to control from vision data, or determine material properties from haptic feedback.
Traditionally, a roboticist, or team thereof, would hand-design controllers for each task we want a robot to perform. Even for tasks which human users can perform intuitively, such as grasping objects or cutting food, these controllers can be very diﬃcult to design because we aren’t able to easily translate this natural intuition into code. It can also be extremely challenging to scale these approaches up to the huge amount of variety our robots must deal with in the real world – grasping every object in your home, cutting any food item, etc.
For these reasons, in recent years machine learning algorithms have seen widespread use for robotics applications. Rather than forcing the engineer to hand-code an entire end-to-end robotic system, machine learning allows portions of the system to be learned from some training data. This approach allows us to model concepts which might be diﬃcult or impossible to properly hand-model. It also allows for adaptable models – as long as the form of the model is general, it can be adapted to more or diﬀerent cases simply by providing training data for these new cases.
While machine learning algorithms have many advantages for robotic applications, they can still be diﬃcult to apply to new problems. First, many learning algorithms require time-consuming optimization to perform inference, making them infeasible for robotic applications with strict time constraints. Designing models general enough to apply to other cases of the same problem, while still speciﬁc enough to be a good ﬁt for the problem, can be very challenging, particularly with the huge variety seen in real-world robotics. In addition, most machine learning algorithms require signiﬁcant hand-engineering in terms of designing features –
2

transformations of the raw input, e.g. images, dynamics information, etc. given to the robot, into a form more useful for the learning algorithm. While designing good features is critical to the success of a machine learning algorithm for a particular problem, such features are often unitiutive, problem-speciﬁc, and take signiﬁcant eﬀort to design.
More recently, deep learning approaches have shown impressive performance across a wide range of domains, including computer vision, audio processing, natural language processing, and others. These algorithms are based on neural networks, highly-parameterized models which use multiple layers of representation to transform data into a task-speciﬁc representation. By using unsupervised feature learning algorithms, deep learning approaches are able to pre-initialize these networks with useful features, avoiding the overﬁtting problems commonly seen when neural networks are trained without this initialization. This and several other useful properties, described in more detail below, make deep networks an excellent choice for robotic applications.
In this work, I will begin with a general description of deep learning algorithms and their strengths. I will then discuss their particular advantages as learning algorithms for robotics applications. Finally, I will present three applications of deep/neural network algorithms to diverse robotics problems – grasping, cutting unknown food materials, and aerial robot obstacle avoidance – highlighting the strengths of these algorithms in real-world robotics applications.
3

Figure 1.2: Deep network and auto-encoder: Left: A deep network with two hidden layers, which transform the input representation, and a logistic classiﬁer at the top layer, which uses the features from the second hidden layer to predict output. Right: An autoencoder, used for pretraining. A set of weights projects input features to a hidden layer. The same weights are then used to project these hidden unit outputs to a reconstruction of the inputs. In the sparse auto-encoder (SAE) algorithm, the hidden unit activations are also penalized.
1.1 Deep Learning
While deep learning is a broad class of learning algorithms, with many diﬀerent associated inference models, most approaches learn a set of connection weights to be used by a neural network model for inference. A neural network, such as that shown in Fig. 1.2-left typically consists of multiple layers of artiﬁcial “neurons.” Each neuron has weighted connections to each neuron in the previous layer, or to the network’s inputs if the neuron sits at the lowest layer. The neuron then forms output by passing the weighted sum of its inputs through some nonlinear activation function (for example a sigmoid, σ(a) = 1/(1 + exp(−a)).) This output is then sent to the next layer. Ideally, each layer of features will represent a better abstraction of the input data, so that the ﬁnal layer of features will be a better representation for some classiﬁer than the raw features fed into the network.
4

For example, take W [ℓ] as the network weights for layer ℓ, h[ℓ] as the corresponding “hidden” representation generated using these weights, Kℓ as the size of this hidden representation for layer l, x as the raw input features, N as the number of these features, and yˆ as the network’s predicted output, e.g. we want to model P (yˆ = 1|x). Then, a simple two-layer deep network with a logistic classiﬁer at the top layer might proceed as:

h[j1] = σ h[j2] = σ P (yˆ = 1|x; Θ) = σ

N
xiWi[,1j]
i=1
K1
h[i1]Wi[,2j]
i=1
K2
h[i2]Wi[3]
i=1

(1.1)

Note that, while the above network performs a binary classiﬁcation task, it is simple to swap out the top-layer classiﬁcation function for other tasks, e.g. using a softmax function for multi-class classiﬁcation, or a linear weighting of the toplayer features for regression. This modularity is a major strength of neural network approaches.

Back-propagation: The above deﬁned the inference procedure for a neural network, assuming the weights W had already been initialized. However, for this network to be useful, these weights must be learned from data to represent the nonlinearity we want to model. As with most learning algorithms, we have some cost function C(P (yˆ = 1|x), y∗) which gives the cost of prediction probability P (yˆ = 1|x) given ground-truth label y∗. During learning, we will optimize this cost over our entire dataset of training examples (x, y∗), e.g. taking Θ = {W [1], W [2], W [3]} and using t to index M total training cases, and C(X, Y, Θ)
5

as the cost across the entire dataset, learning proceeds as:

M
C(X, Y, Θ) = C(P (yˆ(t) = 1|x(t)), y∗(t))
k=1
Θ∗ = arg min C(X, Y, Θ)
Θ

(1.2) (1.3)

In most cases, we will use a gradient-based learning algorithm to optimize Θ. This requires the gradient of the cost function with respect to each parameter being optimized, i.e. ∂C(X, Y, Θ)/∂Θ.

While gradients for the top-layer weights W [3] are the same as those for a

standard logistic classiﬁer taking h[2] as input, gradients for the previous layers’

weights are more complicated. For these, we use back-propagation, iteratively

computing the gradient of the cost function with respect to the previous layer’s

hidden units, then using this gradient to compute the gradient with respect to that

layer’s weights, continuing this process until we have gradients for all these weights.

Since the gradient of the cost function with respect to each layer’s hidden units

depends only on the same gradient for the next layer and the next layer’s weights,

this back-propagation requires only a single backwards pass through the network

to compute all these gradients. Simplifying notation by deﬁning ∂C(t)/∂h[2](t) as

the gradient of the cost for case t with respect to h[2], we would back-propagate to

the ﬁrst-layer weights as:

∂ C (t) ∂ Wi[1]

=

∂C(t) ∂h[2](t) ∂h[1](t) ∂h[2](t) ∂h[1](t) ∂Wi[1]

(1.4)

This gives an eﬃcient method for computing the gradients of the cost function with respect to each layer’s weights. In Chapter 3, I will use a similar idea to eﬃciently compute cost-function gradients with respect to network inputs.

6

Unsupervised Feature Learning: Even moderately-sized neural networks will have a huge number of parameters which must be learned – for example, a network passing a 20x20 pixel image through two 200-unit layers will have 120,000 (20x20x200 + 200x200) diﬀerent weights. This huge degree of parameterization has both advantages and disadvantages. While it allows these models to act as general nonlinear learners, capable of ﬁtting nearly any function, it also makes them extremely vulnerable to overﬁtting, learning models which perform well on their training data but do not generalize to new cases. Such generalization is extremely important for robotics, which must operate in the real world, handling cases never seen before by the learning algorithm. This can be partly addressed using regularization, or by increasing the amount of training data available. In Chapter 2 I will show that using specialized regularization, in particular, improves results. However, regularization is not a complete solution to overﬁtting, and can impact model performance if applied too strongly. Additional training data is more eﬀective, but can be time-consuming to collect for complex robotic tasks.
Modern deep learning approaches use new learning algorithms to avoid these overﬁtting issues. Historical neural network methods would simply randomly initialize network weights, then back-propagate some cost function as described above. Since neural network optimization is inherently non-convex, it will converge only to a local minimum. Initializing the network randomly often leads this optimization to reach minima which overﬁt the training data. The core problem here is that, while each layer of the network is supposed to represent a more-useful abstraction of the input data, randomly initialized weights will not do so, forcing the algorithm to simultaneously learn a feature representation and a classiﬁer from that initial poor feature representation.
7

Modern deep learning algorithms remedy this problem by using unsupervised feature learning algorithms. Rather than simultaneously learning a classiﬁer and the features used for that classiﬁer, these algorithms use unsupervised feature learning algorithms to initialize each layer of the network to a good representation. For example, we might intialize the previous network’s ﬁrst-layer weights using the sparse autoencoder (SAE) algorithm [46], as illustrated in Fig. 1.2-right, which optimizes a layer’s weights to give a sparse representation which is able to reconstruct the layer’s inputs. Taking xˆ as the reconstruction of x, g(a) as some sparsity function penalizing hidden-unit activations (e.g. the L1 sum of unit activations), and λ as a scaling factor deﬁning the weighting between the two, we can initialize W [1] as:

M K1

W [1]∗ = arg min

(||xˆ(t) − x(t)||22 + λ

g(h[j1](t)))

W [1]

t=1

j=1

N

h[j1](t) = σ(

x(it)Wi[,1j])

i=1

K

xˆ(it) =

h(jt)Wi[,1j]

j=1

(1.5) (1.6)

Since this algorithm is generic to its inputs, it could be re-used similarly to learn W [2] to give a sparse representation for h[2] which can reconstruct h[1]. In this way, we can use the same feature learning algorithm to iteratively initialize our entire network, ﬁrst learning lower layers, then ﬁxing their outputs and learning features from them to learn the next layer’s weights.

These feature learning approaches are one of the major strengths of modern deep learning methods. Since these algorithms are able to learn good features from data, they are much less sensitive to input representations than other conventional

8

learning algorithms such as support vector machines, Gaussian processes, and others. Deep learning algorithms are able to learn good representations and solve problems even from basic representations such as raw pixels, avoiding the need to hand-design features as with other learning algorithms, saving signiﬁcant engineering eﬀort for many of the complex problems encountered in robotics, where features can be unintuitive and hard to design.
1.2 Deep Learning for Robotics
Robotics presents many unique challenges for learning algorithms. First, robots must perform a wide range of tasks, and it is often time-consuming or even infeasible to code completely new learning algorithms and features for each task. Second, robots must handle a huge amount of variety in the real world, which is diﬃcult for many learning algorithms to handle. Finally, time is at a premium in most robotic applications, so learning algorithms must lend themselves to fast inference to be useful for robotic applications. Below, I will describe how the strengths of deep learning algorithms make them ideal choices for robotics.
Generality: Because deep networks are non-linear models with an extremely high number of parameters (typically on the order of millions), they are eﬀectively general non-linear models, capable of learning any functional mapping from inputs to outputs. This is extremely useful for robotics applications, which typically encounter a huge range of nonlinearities, many of which are diﬃcult or impossible to model. For example, both the mapping of pixels to graspability shown in Chapter 2 and the food-cutting dynamics shown in Chapter 3 would be extremely diﬃcult to model by hand, but are able to be modeled by a deep network.
9

However, one signiﬁcant caveat is that the huge number of learned parameters of a deep network also makes these models susceptible to high degrees of overﬁtting. This can be mitigated by designing network structures and learning algorithms which are a better ﬁt for the form of the function being modeled. In Chapter 2, I will give a new learning algorithm which improves results for multimodal RGB-D data, and in Chapter 3 I will give a new recurrent deep architecture which learns latent features integrating long-term information for food-cutting and other time-varying dynamics tasks. These two applications demonstrate the need for careful design of networks and learning algorithms, as using the right approach signiﬁcantly improves results.
Feature Learning: As described above, modern deep learning techniques make use of unsupervised feature learning algorithms to learn good features from data to initialize the network. This allows the ﬁnal back-propagation step to obtain better, more general results by starting from a good representation of the problem.
These feature learning methods are particularly important in robotics, as for many robotic tasks, it is very diﬃcult to design useful features by hand. For example, hand-designing visual features useful for grasping, as discussed in Chapter 2 or features allowing us to model the complex dynamics involved in tasks like the food cutting discussed in Chapter 3 is extremely challenging. In Chapter 2, I will show that learned features can even outperform carefully hand-designed features.
Feature learning also aids genearlizability of these algorithms by adapting even the basic features used by the algorithm based on the given training data. This is useful because even applications using similar input data might require very diﬀerent representations of that data. By contrast, using a traditional hand-engineering approach, we would either have to use the same features for both applications,
10

which might weaken performance, or design new features for the second application, which would require signiﬁcant eﬀort. For example, adapting a system designed to cut food, as in Chapter 3, to another problem, such as scrubbing dishes, would require a very diﬀerent set of features, but the deep learning algorithm I present in that chapter would be able to automatically learn these from data for that task.
Eﬃciency and Parallelism: Another major advantage of deep networks is their eﬃcient, and natively parallel, inference. Fast inference is very important in robotics – a grasp detection algorithm which takes ten minutes to detect a grasp is not very useful in the real world. In more extreme cases, eﬃciency is a hard requirement – for example, real-time controllers often operate at rates of 100 or even 1000 Hz, and models used for model-predictive control (MPC), as in Chapter 3, must operate at similar rates to ensure the given controls are truly optimal.
In deep networks, inference typically consists of a series of matrix multiplications to weight inputs followed by element-wise non-linear operations (e.g. applying a sigmoid activation function.) Thus, inference does not require optimization, as is the case in other models such as conditional random ﬁelds (CRFs) and many others, allowing my model in Chapter 3 to predict at a rate of 1.2 kHz. Furthermore, such operations are extremely parallelizable – GPU implementations of deep network inference have become ubiquitious, signiﬁcantly increasing performance. However, the structure of a deep network, composed of many individual “neurons,” all peforming the same operation (except using diﬀerent parameters), also lends itself well to direct parallel hardware implementations, which can be extremely efﬁcient both in terms of time and power consumption, as shown in Chapter 4. The
11

later, in particular, is critical for many battery-limited robotic platforms, such as miniature aerial vehicles.
Deep network inference time is also easily scalable – if faster inference (or a lower hardware footprint) is required, the number of units in the network can simply be reduced. While this might trade accuracy for performance, a similar tradeoﬀ will likely be encountered when trying to speed up any machine learning algorithm. However, most other learning algorithms would require signiﬁcant work to similarly reduce the feature set, since the engineer would have to test diﬀerent feature sets and weight the tradeoﬀs each gives in terms of accuracy vs. performance. Deep learning approaches let us simply deﬁne the size of the feature set to be learned and allow the algorithm to learn an optimal task-speciﬁc feature set of that size. I will demonstrate the power of this scalability in Chapter 3 as it allows that system to run at real-time rates with only a slight decrease in accuracy.
1.3 Related Work
In this section, I will describe some general related work in deep learning and robotic manipulation. In the following chapters, I will describe work related to each chapter’s speciﬁc applications and methods.
1.3.1 Deep Learning
Modern deep learning methods retain the advantages of neural networks, while using new algorithms and network architectures to overcome their drawbacks. Due to their eﬀectiveness as general non-linear learners [7], deep learning has been applied
12

to a broad spectrum of problems, including visual recognition [50, 79], natural language processing [27], acoustic modeling [95], and many others. Recurrent deep networks have proven particularly eﬀective for time-dependent tasks such as text generation [141] and speech recognition [47]. Factored conditional models using multiplicative interactions have also been shown to work well for modeling shortterm temporal transformations in images [91]. More recently Taylor and Hinton [143] applied these models to human motion, but did not model any control inputs, and treated the conditioning features as a set of fully-observed “motion styles”. In Chapter 3, I will use both recurrent and factored conditional units to model the response of a dynamic system to control inputs.
Deep Learning for Manipulation
A few works have applied deep learning directly to robotic manipulation. Sung et al. [139] use deep learning to perform transfer learning for trajectories for manipulating household appliances. In both cases, their deep learning methods are limited to determining a manipulation plan – a grasping pose in the former case, and an end-eﬀector trajectory in the latter – and then standard motion control algorithms are used to execute this plan. No deep networks are used for online control. Levine et al. [82] use a deep network to learn control policies. Their approach does not apply deep learning to modeling system dynamics, as I do in Chapter 3, and will be discussed in more detail in that chapter.
Deep Learning for Visual Detection: The majority of work in deep learning focuses on classiﬁcation problems. Only a handful of previous works have applied these methods to detection problems [107, 78, 24]. For example, Osadchy et al. [107] and LeCun et al. [78] applied a deep energy-based model to the problem of face detection, Sermanet et al. [131] applied a convolutional neural network for
13

pedestrian detection, and Coates et al. [24] used a deep learning approach to detect text in images. Girshick et al. [44] used learned convolutional features over image regions for object detection, while Szegedy et al. [142] used a multi-scale approach based on deep networks for the same task.
All these approaches focused on object detection and similar problems, in which the goal is to ﬁnd a bounding box which tightly contains the item to be detected, and for each item, all valid bounding boxes will be similar. In Chapter 2 I will apply a deep network to visual detection for robotic grasping with RGB-D data. In the grasp detection problem, there may be several valid grasps for an object in diﬀerent regions, making it more important to select the one with the highest chance success. In addition, orientation matters much more to robotic grasp detection, as most grasps will only be viable for a small subset of the possible gripper orientations. My approach to grasp detection will also generalize across object classes, and even to classes never seen before by the system, as opposed to the class-speciﬁc nature of object detection.
Multimodal Deep Learning: Recent works in deep learning have extended these methods to handle multiple modalities of input data, such as audio and video [104], text and image data [138], and even RGB-D data [134, 14]. However, all of these approaches have fallen into two camps - either learning completely separate low-level features for each modality [104, 138], or simply concatenating the modalities [134, 14]. The former approaches have proven eﬀective for data where the basic modalities diﬀer signiﬁcantly, such as the aforementioned case of text and images, while the latter is more eﬀective in cases where the modalities are more similar, such as RGB-D data.
For some new combinations of modalities and tasks, it may not be clear which
14

of these approaches will give better performance. In fact, in the ideal feature set, diﬀerent features may use diﬀerent subsets of the modalities. In Chapter 2, I will give a structured regularization method which guides the learning algorithm to select such subsets, without imposing hard constraints on network structure.
1.3.2 Robotic Manipulation
In this section, I will give a brief overview of related work in robotic manipulation problems similar to those presented in Chapters 2 and 3. I will present more detailed related work on those applications in those chapters.
Robotic Grasping
Robotic grasping has been an active research area for many years, but continues to be a challenging, unsolved problem. Many current works [32, 45, 148] synthesize grasps assuming a known 3D object model. While this allows for high-quality grasps, it does not allow these algorithms to generalize to new, unknown objects. My approach, presented in Chapter 2, and many others [128, 36, 54, 76, 61] use learning algorithms to detect grasps for novel objects from vision data. In contrast to these existing methods, my approach will learn even the basic features used to detect grasps, from raw pixel data. For a much more extensive overview of these and other robotic grasping methods, see Section 2.2.
Manipulating Deformable Objects
Many robotics works which manipulate deformable objects create task-speciﬁc systems and controllers. For example, Bollini et al. [16] developed such a system for baking cookies, Beetz et al. [6] for making pancakes, and Maitin-Shepard et al.
15

[86] for towel folding. Gemici and Saxena [42] developed a more general system which learns to perform a range of tasks based on deﬁned object properties. In Chapter 3, I will present a system which is general both to the task at hand and the speciﬁc objects and materials involved, learning a latent representation of these objects’ properties. I refer the reader to Section 3.2 for a more detailed overview of these and other related works.
1.4 This Work
The remainder of this work will present three diverse applications of either deep learning or neural network methods to robotic tasks:
Chapter 2 – Deep Learning for Detecting Robotic Grasps: Here, the robot’s goal is to grasp a novel object – i.e. one the robot has never seen before – using only a single frame of image and depth (RGB-D ) data taken from a sensor such as a Microsoft Kinect. I treat this as a visual detection problem, and use a deep network to rank candidate grasps. This shows deep networks’ ability to learn even abstract nonlinearities such as mapping RGB-D pixels to graspability. To improve detection eﬃciency, I employ a two-pass detection system where a smaller deep network is used to obtain a small set of top-ranked candidate grasps, which are then re-ranked by a larger network. I also present a new algorithm which uses structured regularization to learn more robust multimodal features, better integrating, for example, color and depth information obtained from Kinect. I validate these algorithms both in oﬄine detection experiments on a large-scale grasping dataset and in real-world robotic experiments on both a PR2 and Baxter robot.
16

Chapter 3 – DeepMPC: Learning Deep Latent Features for Model Predictive Control: In this work, the robot’s goal is to cut through an unknown food item at the fastest rate possible. This is an interesting control problem because of the complex, nonlinear dynamics (static friction, adhesion, deformation, etc.) and variety (diﬀerent materials, varying temperatures, varying layers, etc.) involved in the problem. Due to this complexity, hand-coding eﬀective controllers is extremely diﬃcult, so instead I use a model-predictive controller (MPC) which optimizes control inputs for some cost over predicted future system states. The chief diﬃculty in implementing an MPC algorithm, particularly for such a complex problem, lies in obtaining an accurate predictive dynamics model. To this end, I develop a new deep network and learning algorithm designed to handle the complex dynamics and variations present in food-cutting and other similar problems. I validate this model both in oﬄine prediction experiments and using a real-time MPC system on a PR2 robot.
Chapter 4 – Low-Power Parallel Algorithms for Single Image based Obstacle Avoidance in Aerial Robots: Here, the goal is for an aerial robot to detect and avoid various types of obstacles from monocular vision. For aerial robots with limited battery life, power is at a premium, typically making processing hardware powerful enough to run e.g. vision algorithms. Thus, here, I use novel neural hardware with extremely low power consumption to implement the obstacle detection system. Since the initial obstacle maps inferred using only local classiﬁers are extremely noisy and would not be well-suited to use for realworld control, I implement belief propagation over a conditional random ﬁeld in this neural hardware to integrate non-local information, signiﬁciantly improving results.
17

CHAPTER 2 DEEP LEARNING FOR DETECTING ROBOTIC GRASPS
Figure 2.1: Detecting robotic grasps: Left: A cluttered lab scene labeled with rectangles corresponding to robotic grasps for objects in the scene. Green lines correspond to robotic gripper plates. I use a two-stage system based on deep learning to learn features and perform detection for robotic grasping. Center: A Baxter robot “Yogi” successfully executing a grasp detected by my algorithm. Right: The grasp detected for this case, in the RGB (top) and depth (bottom) images obtained from Kinect.
2.1 Introduction
Robotic grasping is a challenging problem involving perception, planning, and control. Some recent works [124, 128, 61, 151] address the perception aspect of this problem by converting it into a detection problem in which, given a noisy, partial view of the object from a camera, the goal is to infer the top locations where a robotic gripper could be placed (see Figure 2.1). Unlike generic vision problems based on static images, such robotic perception problems are often used in closed loop with controllers, so there are stringent requirements on performance and computational speed. In the past, hand-designing features has been the most popular method for several robotic tasks [85, 71]. However, this is cumbersome and time-consuming, especially when incorporating new input modalities such as RGB-D cameras.
This work originally presented as a conference paper at Robotics: Science and Systems (RSS) 2013, and in the International Journal of Robotics Research (IJRR) Special Issue on Robot Vision
18

Recent methods based on deep learning [7] have demonstrated state-of-the-art performance in a wide variety of tasks, including visual recognition [75, 136], audio recognition [80, 95], and natural language processing [28]. These techniques are especially powerful because they are capable of learning useful features directly from both unlabeled and labeled data, avoiding the need for hand-engineering.
However, most work in deep learning has been applied in the context of recognition. Grasping is inherently a detection problem, and previous applications of deep learning to detection have typically focused on speciﬁc vision applications such as face detection [107] and pedestrian detection [131]. The goal here is not only to infer a viable grasp, but to infer the optimal grasp for a given object that maximizes the chance of successfully grasping it. This diﬀers signiﬁcantly from the problem of object detection. Thus, the ﬁrst major contribution of my work is to apply deep learning to the problem of robotic grasping, in a fashion which could generalize to similar detection problems.
The second major contribution of my work is to propose a new method for handling multimodal data in the context of feature learning. The use of RGB-D data, as opposed to simple 2D image data, has been shown to signiﬁcantly improve grasp detection results [61, 32, 128]. In this work, I present a multimodal feature learning algorithm which adds a structured regularization penalty to the objective function to be optimized during learning. As opposed to previous works in deep learning, which either ignore modality information at the ﬁrst layer (i.e., encourage all features to use all modalities) [134] or train separate ﬁrst-layer features for each modality [104, 138], my approach allows for a middle-ground in which each feature is encouraged to use only a subset of the input modalities, but is not forced to use only particular ones.
2015. This was joint work with Honglak Lee and Ashutosh Saxena.
19

I also propose a two-stage cascaded detection system based on deep learning. Here, I use fewer features for the ﬁrst pass, providing faster, but only approximately accurate detections. The second pass uses more features, giving more accurate detections. In my experiments, I found that the ﬁrst deep network, with fewer features, was better at avoiding overﬁtting but less accurate. I feed the top-ranked rectangles from the ﬁrst layer into the second layer, leading to robust early rejection of false positives. Unlike manually designed two-step features as in [61], my method uses deep learning, which allows me to learn detectors that not only give higher performance, but are also computationally eﬃcient.
I test my approach on a challenging dataset, where I show that my algorithm improves both recognition and detection performance for grasping rectangle data. I also show that my two-stage approach is not only able to match the performance of a single-stage system, but, in fact, improves results while signiﬁcantly reducing the computational time needed for detection.
In summary, the contributions of this work are:
• I present a deep learning algorithm for detecting robotic grasps. To the best of my knowledge, this is the ﬁrst work to do so.
• In order to handle multimodal inputs, I present a new way to apply structured regularization to the weights to these inputs based on multimodal group regularization.
• I present a multi-step cascaded system for detection, signiﬁcantly reducing its computational cost.
• My method outperforms the state-of-the-art for rectangle-based grasp detection, as well as previous deep learning algorithms.
• I implement my algorithm on both a Baxter and a PR2 robot, and show
20

success rates of 84% and 89%, respectively, for executing grasps on a highly varied set of objects.
The rest of this work is organized as follows: I discuss related work in Section 2.2. I present my two-step cascaded detection system in Section 2.3, and some additional details in Section 2.4. I then describe my feature learning algorithm and structured regularization method in Section 2.5. I present my experiments in Section 2.6, and discuss results in Section 2.7. I then present experiments on both Baxter and PR2 robots in Section 2.8. I present several interesting directions for future work in Section 2.9, then conclude in Section 2.10.
2.2 Related Work
2.2.1 Robotic Grasping
In this section, I will focus on perception- and learning-based approaches for robotic grasping. For a more complete review of the ﬁeld, I refer the reader to review papers by Bohg et al. [15], Sahbani et al. [122], Bicchi and Kumar [12] and Shimoga [133].
Most works deﬁne a “grasp” as an end-eﬀector conﬁguration which achieves partial or complete form- or force-closure of a given object. This is a challenging problem because it depends on the pose and conﬁguration of the robotic gripper as well as the shape and physical properties of the object to be grasped, and typically requires a search over a large number of possible gripper conﬁgurations. Early works [74, 105, 113] focused on testing for form- and force-closure, and synthesizing grasps fulﬁlling these properties according to some hand-designed “quality score”
21

[39]. More recent works have reﬁned these deﬁnitions [117]. These works assumed full knowledge of object shape and physical properties.
Grasping Given 3D Model: Fast synthesis of grasps for known 3D models remains an active research topic [32, 45, 148], with recent methods using advanced physical simulation to ﬁnd optimal grasps. Gallegos et al. [41] performed optimization of grasps given both a 3D model of the object to be grasped and the desired contact points for the robotic gripper. Pokorny et al. [112] deﬁne spaces of graspable objects, then map new objects to these spaces to discover grasps. However, these works are only applicable when the full 3D model of the object is exactly known, which may not be the case when a robot is interacting with a new environment. I note that some of these physics-based approaches might be combined with my approach in a multi-pass system, discussed further in Sec. 2.9.
Sensing for Grasping: In a real-world robotic setting, a robot will not have full knowledge of the 3D model and pose of an object to be grasped, but rather only incomplete information from some set of sensors such as color or depth cameras, tactile sensors, etc. This makes the problem of grasping signiﬁcantly more challenging [15], as the algorithm must use more limited and potentially noisier information to detect a good grasp. While some works [25, 108] simply attempt to estimate the poses of known objects and then apply full-model grasping algorithms based on these results, others avoid this assumption, functioning on novel objects which the algorithm has not seen before.
Such works often made use of other simplifying assumptions, such as assuming that objects belong to one of a set of primitive shapes [110, 17], or are planar [98]. Other works produced impressive results for speciﬁc cases, such as grasping the corners of towels [85]. While such works escape the assumption of a fully-
22

known object model, hand-coded grasping rules have a hard time dealing with the wide range of objects seen in real-world human environments, and are diﬃcult and time-consuming to create.
Learning for Grasping: Machine learning methods have proven eﬀective for a wide range of perception problems [146, 50, 79, 134, 14], allowing a perception system to learn a mapping from some feature set to various visual properties. Early work by Kamon et al. [65] showed that learning approaches could also be applied to the problem of grasping from vision, introducing a learning component to grasp quality scores.
Recent works have employed richer features and learning methods, allowing robots to grasp known objects which might be partially occluded [60] or in an unknown pose [31] as well as fully novel objects which the system has not seen before [124]. Here, I will address the latter case. Earlier work focused on detecting only a single grasping point from 2D partial-view data, using heuristic methods to determine a gripper pose based on this point. [126]. The use of 3D data was shown to signiﬁcantly improve these results [128] thanks to giving direct physical information about the object in question. With the advent of low-cost RGB-D sensors such as the Kinect, the use of depth data for robotic grasping has become ubiquitous.
Several other works attempted to use the learning algorithm to more fully constrain the detected grasps. Ekvall and Kragic [36] and Huebner and Kragic [54] used shape-based approximations as bases for learning algorithms which directly gave an approach vector. Le et al. [76] treated grasp detection as a ranking problem over sets of contact points in image space. Jiang et al. [61] represented a grasp as a 2D oriented rectangle in image space, with two edges corresponding to the grip-
23

per plates, using surface normals to determine the grasp approach vector. These approaches allow the detection algorithm to detect more exactly the gripper pose which should be used for grasping. In this work, I will follow the rectangle-based method.
Learning-based approaches have shown impressive results in grasping novel objects, showing that learning some parameters of the detection system can outperform human tuning. However, these approaches still require a signiﬁcant degree of hand-engineering in the form of designing good input features.
Other Applications with RGBD Data. Due to the availability of inexpensive depth sensors, RGB-D data has been a signiﬁcant research focus in recent years for various robotics applications. For example, Jiang et al. [63] consider robotic placement of objects, while Teuliere and Marchand [144] used RGB-D data for visual servoing. Several works, including those of Endres et al. [37] and Whelan et al. [149] have extended and improved Simultaneous Localization and Mapping (SLAM) for RGB-D data. Object detection and recognition has been a major focus in research on RGB-D data [26, 73, 20]. Most such works use hand-engineered features such as [121]. The few works that perform feature learning for RGB-D data [134, 14] largely ignore the multimodal nature of the data, not distinguishing the color and depth channels. Here, I present a structured regularization approach which allows me to learn more robust features for RGB-D and other multimodal data.
2.2.2 Deep Learning
Structured Learning and Structured Regularization: Several approaches have been proposed which attempt to use a specially-designed regularization func-
24

Figure 2.2: Detecting and executing grasps: From left to right: My system obtains an RGB-D image from a Kinect mounted on the robot, and searches over a large space of possible grasps, for which some candidates are shown. For each of these, it extracts a set of raw features corresponding to the color and depth images and surface normals, then uses these as inputs to a deep network which scores each rectangle. Finally, the topranked rectangle is selected and the corresponding grasp is executed using the parameters of the detected rectangle and the surface normal at its center. Red and green lines correspond to gripper plates, blue in RGB-D features indicates masked-out pixels.
tion to impose structure on a set of learned parameters without directly enforcing it. Jalali et al. [59] used a group regularization function in the multitask learning setting, where one set of features is used for multiple tasks. This function applies high-order regularization separately to particular groups of parameters. Their function regularized the number of features used for each task in a set of multi-class classiﬁcation tasks solved by softmax regression. Intuitively, this encodes the belief that only some subset of the input features will be useful for each task, but this set of useful features might vary between tasks.
A few works have also explored the use of structured regularization in deep learning. The Topographic ICA algorithm [55] is a feature-learning approach that applies a similar penalty term to feature activations, but not to the weights themselves. Coates and Ng [23] investigate the problem of selecting receptive ﬁelds, i.e., subsets of the input features to be used together in a higher-level feature. The structure of the network is learned ﬁrst, then ﬁxed before learning the parameters of the network.
25

Figure 2.3: Illustration of my two-stage detection process: Given an image of an object to grasp, a small deep network is used to exhaustively search potential rectangles, producing a small set of top-ranked rectangles. A larger deep network is then used to ﬁnd the top-ranked rectangle from these candidates, producing a single optimal grasp for the given object.
2.3 Deep Learning for Grasp Detection:
System and Model
In this work, I will present an algorithm for robotic grasp detection from a single RGB-D view. My approach will be based on machine learning, but distinguish itself from previous approaches by learning not only the weights used to rank prospective grasps, but also the features used to rank them, which were previously hand-engineered.
I will do this using deep learning methods, learning a set of RGB-D features which will be extracted from each candidate grasp, then used to score that grasp. My approach will include a structured multimodal regularization method which improves the quality of the features learned from RGB-D data without constraining network structure.
In my system for robotic grasping, as shown in Fig. 2.2, the robot ﬁrst obtains an RGB-D image of the scene containing objects to be grasped. A small deep network is used to score potential grasps in this image, and a small candidate set of the top-ranked grasps is provided to a larger deep network, which yields a single
26

best-ranked grasp.
In this work, I will represent potential grasps using oriented rectangles in the image plane as seen on the left in Fig. 2.2, with one pair of parallel edges corresponding to the robotic gripper [61]. Each rectangle is thus parameterized by the X and Y coordinates of its upper-left corner, its width, height, and orientation in the image plane, giving a ﬁve-dimensional search space for potential grasps. Grasps will be ranked based on features extracted from the RGB-D image region contained inside their corresponding rectangle, aligned to the gripper plates, as seen in the center of Fig. 2.2.
To translate a rectangle such as that shown on the right in Fig. 2.2 into a gripper pose for grasping I ﬁnd the point with the minimum depth inside the central third (horizontally) of the rectangle. I then use the averaged surface normal around this point to determine the approach vector for the gripper. The orientation of the detected rectangle is translated to a rotation around this vector to orient the gripper. I use the X-Y coordinates of the rectangle center along with the depth of the closest point to determine a grasping point in the robot’s coordinate frame. I compute a pre-grasp position by shifting 10 cm back from the grasping point along this approach vector and position the gripper at this point. I then approach the object along the approach vector and grasp it.
Using a standard feature learning approach such as sparse auto-encoder [46], a deep network can be trained for the problem of grasping rectangle recognition (i.e., does a given rectangle in image space correspond to a valid robotic grasp?). However, in a real-world robotic setting, my system needs to perform detection (i.e., given an image containing an object, how should the robot grasp it?). This task is signiﬁcantly more challenging than simple recognition.
27

Two-stage Cascaded Detection: In order to perform detection, one naive approach could be to consider each possible oriented rectangle in the image (perhaps discretized to some level), and evaluate each rectangle with a deep network trained for recognition. However, such near-exhaustive search of possible rectangles (based on positions, sizes, and orientations) can be quite expensive in practice for real-time robotic grasping.
Motivated by multi-step cascaded approaches in previous work [61, 146], I instead take a two-stage approach to detection: First, I use a reduced feature set to determine a set of top candidates. Then, I use a larger, more robust feature set to rank these candidates.
However, these approaches require the design of two separate sets of features. In particular, it can be diﬃcult to manually design a small set of ﬁrst-stage features which is both quick to compute and robust enough to produce a good set of candidate detections for the second stage. Using deep learning allows me to circumvent the costly manual design of features by simply training networks of two diﬀerent sizes, using the smaller for the exhaustive ﬁrst pass, and the larger to re-rank the candidate detection results.
Model: To detect robotic grasps from the rectangle representation, I model the probability of a rectangle G(t), with features x(t) ∈ RN being graspable, using a random variable yˆ(t) ∈ {0, 1} which indicates whether or not I predict G(t) to be graspable. I use a deep network similar to that previously deﬁned in Section 1.1 to model the probability of this rectangle being graspable, P (yˆ(t) = 1|x(t)).
28

2.3.1 Inference and Learning

During inference, my goal is to ﬁnd the single grasping rectangle with the maximum probability of being graspable for some new object. With G representing a particular grasping rectangle position, orientation, and size, I ﬁnd this best rectangle as:

G∗ = arg max P (yˆ(t) = 1|φ(G); Θ)
G

(2.1)

Here, the function φ extracts the appropriate input representation for rectangle G.

During learning, my goal is to learn the parameters Θ that optimize the recognition accuracy of my system. Here, input data is given as a set of pairs of features x(t) ∈ RN and ground-truth labels y(t) ∈ {0, 1} for t = 1, . . . , M . As described above I use a two-phase learning approach. First, I pre-train the hiddenlayer weights W [1] and W [2] using an algorithm similar to the SAE approach given in Equation 1.6, but also including a regularization term, f (W ), weighted by β:

MK

W ∗ = arg min (||xˆ(t) − x(t)||22 + λ g(h(jt))) + βf (W )

W t=1

j=1

(2.2)

During the supervised phase of the learning algorithm, I then jointly learn classiﬁer weights W [3] and ﬁne-tune hidden layer weights W [1] and W [2] for recognition,

using back-propagation as described in Section 1.1. I maximize the log-likelihood

of the data along with regularization penalties on hidden layer weights:

M

Θ∗ = arg max

log P (yˆ(t) = y(t)|x(t); Θ)

Θ t=1

− β1f (W [1]) − β2f (W [2])

(2.3)

29

Two-stage Detection Model: During inference for two-stage detection, I will ﬁrst use a smaller network to produce a set of the top T rectangles with the highest probability of being graspable according to network parameters Θ1. I will then use a larger network with a separate set of parameters Θ2 to re-rank these T rectangles and obtain a single best one. The only change to learning for the two-stage model is that these two sets of parameters are learned separately, using the same approach.
2.4 System Details
In this section, I will deﬁne the set of raw features which my system will use, forming x in the equations above, and how they are extracted from an RGB-D image. Some examples of these features are shown in Fig 2.2.
My algorithm uses only local information - speciﬁcally, I extract the RGB-D sub-image contained within each rectangle, and use this to generate features for that rectangle. This image is rotated so that its left and right edges correspond to the gripper plates, and then re-scaled to ﬁt inside the network’s receptive ﬁeld.
From this 24x24 pixel image, seven channels’ worth of features are extracted, giving 24x24x7 = 4032 input features. The ﬁrst three channels are the image in YUV color space, used because it represents image intensity and color separately. The next is simply the depth channel of the image. The last three are the X, Y, and Z components of surface normals computed based on the depth channel. These are computed after the image is aligned to the gripper so that they are always relative to the gripper plates.
30

Figure 2.4: Preserving aspect ratio: Left: a pair of sunglasses with a potential grasping rectangle. Red edges indicate gripper plates. Center: image taken from the rectangle and rescaled to ﬁt a square aspect ratio. Right: same image, padded and centered in the receptive ﬁeld. Blue areas indicate masked-out padding. When rescaled, the rectangle incorrectly appears graspable. Preserving aspect ratio and padding allows the rectangle to correctly appear non-graspable.
Figure 2.5: Improvement from mask-based scaling: Left: Result without maskbased scaling. Right: Result with mask-based scaling.
2.4.1 Data Pre-Processing
Whitening data is critical for deep learning approaches to work well, especially in cases such as multimodal data where the statistics of the input data may vary greatly. While PCA-based approaches have been shown to be eﬀective [56], they are diﬃcult to apply in cases such as mine where large portions of the data may be masked out.
Depth data, in particular, can be diﬃcult to whiten because the range of values may be very diﬀerent for diﬀerent patches in the image. Thus, I ﬁrst whiten each depth patch individually, subtracting the patch-wise mean and dividing by the
31

Figure 2.6: Three possible models for multimodal deep learning: Left: fully dense model—all visible features are concatenated and modality information is ignored. Middle: modality-speciﬁc sparse model - separate ﬁrst layer features are trained for each modality. Right: group-sparse model—a structured regularization term encourages features to use only a subset of the input modes.
patch-wise standard deviation, down to some minimum.
For multimodal data, the statistics of the data for each modality should match as closely as possible, to avoid learning features which are biased towards or away from using particular modes. This is particularly important when regularizing each modality separately, as in my approach. Thus, I drop mean values for each feature separately, but scale the data for each channel by dividing by the standard deviation of all its features combined.
2.4.2 Preserving Aspect Ratio
It is important for to preserve aspect ratio when feeding features into the network. This is because distorting image features may cause non-graspable rectangles to appear graspable, as shown in Fig. 2.4. However, padding with zeros can cause rectangles with less padding to receive higher graspability scores, as the network will have more nonzero inputs. It is important to account for this because in many cases the ideal grasp for an object might be represented by a thin rectangle which would thus contain many zero values in its receptive ﬁeld from padding.
32

To address this problem, I scale up the magnitude of the available input for each rectangle based on the fraction of the rectangle which is masked out. In particular, I deﬁne a multiplicative scaling factor for the inputs from each modality, based on the fraction of each mode which is masked out, since each mode may have a diﬀerent mask.

In the multimodal setting, I assume that the input data x is known to come from

R distinct modalities, for example audio and video data, or depth and RGB data.

I deﬁne the modality matrix S as an RxN binary matrix, where each element

Sr,i indicates membership of visible unit xi in a particular modality r, such as

depth or image intensity. The scaling factor for mode r is then deﬁned as: Ψ(rt) =

N i=1

Sr,i/

N i=1

Sr,i

µ(it)

, where µ(it) is 1 if x(it) is masked in, 0 otherwise.

The

scaling factor for case i is: ψi(t) =

R r=1

Sr,i

Ψ(rt).

I could simply scale up each value of x by its corresponding scale factor when

training

my

model,

as

x′

(t) i

=

ψi(t)x(it).

However, since my sparse autoencoder

penalizes squared error, scaling x linearly will scale the error for the corresponding

cases quadratically, causing the learning algorithm to lend increased signiﬁcance

to cases where more data is masked out. Instead, I can use the scaled x′ as input

to the network, but penalize reconstruction based on the original x, only scaling

after the squared error has been computed:

M
W ∗ = arg min
W t=1

NK

ψi(t)(xˆ(it) − x(it))2 + λ

g(h(jt))

i=1 j=1

(2.4)

I redeﬁne the hidden units to use the scaled visible input:

h(jt) = σ

N
x′(it)Wi,j
i=1

(2.5)

This approach is equivalent to adding additional, potentially fractional, ‘virtual’

visible units to the model based on the scaling factor for each mode. In practice,

33

(a) Features corresponding to positive grasps. (b) Features corresponding to negative grasps.
Figure 2.7: Features learned from grasping data: Each feature contains seven channels - from left to right, depth, Y, U, and V image channels, and X, Y, and Z surface normal components. Vertical edges correspond to gripper plates. Left: eight features with the strong positive correlations to rectangle graspability. Right: similar, but negative correlations. Group regularization eliminates many modalities from many of these features, making them more robust.
I found it necessary to limit the scaling factor to a maximum of some value c, as Ψ′(rt) = min(Ψ(rt), c).
As shown in Table 2.3 my mask-based scaling technique at the visible layer improves grasping results by over 25% for both metrics. As seen in Figure 2.5, it removes the network’s inherent bias towards square rectangles, exhibiting a much wider range of aspect ratios that more closely matches that of the ground-truth data.
2.5 Structured Regularization for Feature Learning
A naive way of applying feature learning to multimodal data is to simply take x (as a concatenated vector) as input to the model described above, ignoring information about speciﬁc modalities, as seen on the lefthand side of Figure 2.6. This approach may either 1) prematurely learn features which include all modalities, which can lead to overﬁtting, or 2) fail to learn associations between modalities with very diﬀerent underlying statistics.
34

Instead of concatenating multimodal input as a vector, Ngiam et al. [104] proposed training a ﬁrst layer representation for each modality separately, as shown in Figure 2.6-middle. This approach makes the assumption that the ideal low-level features for each modality are purely unimodal, while higher-layer features are purely multimodal. This approach may work better for some problems where the modalities have very diﬀerent basic representations, such as the video and audio data (as used in [104]), so that separate ﬁrst layer features may give better performance. However, for modalities such as RGB-D data, where the input modes represent diﬀerent channels of an image, learning low-level correlations can lead to more robust features – my experiments in Section 2.6 show that simply concatenating the input modalities signiﬁcantly outperforms training separate ﬁrst-layer features for robotic grasp detection from RGB-D data.
For many problems, it may be diﬃcult to tell which of these approaches will perform better, and time-consuming to tune and comparatively evaluate multiple algorithms. In addition, the ideal feature set for some problems may contain features which use some, but not all, of the input modalities, a case which neither of these approaches are designed to handle.
To solve these problems, I propose a new algorithm for feature learning for multimodal data. My approach incorporates a structured penalty term into the optimization problem to be solved during learning. This technique allows the model to learn correlated features between multiple input modalities, but regularizes the number of modalities used per feature (hidden unit), discouraging the model from learning weak correlations between modalities. With this regularization term, the algorithm can specify how mode-sparse or mode-dense the features should be, representing a continuum between the two extremes outlined above.
35

Regularization in Deep Learning: In a typical deep learning model, L2 regularization (i.e., f (W ) = ||W ||22) or L1 regularization (i.e., f (W ) = ||W ||1) are commonly used in training (e.g., as speciﬁed in Equations (2.2) and (2.3)). These are often called a “weight cost” (or “weight decay”), and are left implicit in many works.
Applying regularization is well known to improve the generalization performance of feature learning algorithms. One might expect that a simple L1 penalty would eliminate weak correlations in multimodal features, leading to features which use only a subset of the modes each. However, I found that in practice, a value of β large enough to cause this also degraded the quality of features for the remaining modes and lead to decreased task performance.

Multimodal Regularization: Structured regularization, such as in [59], takes

a set of groups of weights, and applies some regularization function (typically

high-order) separately to each group. In my structured multimodal regularization

algorithm, each modality will be used as a regularization group separately for each

hidden unit. For example, a group-wise p-norm would be applied as:

KR
f (W ) =
j=1 r=1

N
Sr,i|Wip,j |
i=1

1/p

(2.6)

where Sr,i is 1 if feature i belongs to group r and 0 otherwise. Using a high value of p allows me to penalize higher-valued weights from each mode to each feature

more strongly than lower-valued ones. This also means that forming a high-valued

weight in a group with other high-valued weights will accrue a lower additional

penalty than doing so for a group with only low-valued weights. At the limit

(p → ∞), this group regularization becomes equivalent to the inﬁnity (or max)

36

norm:

KR

f (W ) =

max i

Sr,i|Wi,j

|

j=1 r=1

(2.7)

which penalizes only the maximum weight from each mode to each feature. In

practice, the inﬁnity norm is not diﬀerentiable and therefore is diﬃcult to apply

gradient-based optimization methods; here, I use the log-sum-exponential as a

diﬀerentiable approximation to the max norm.

In experiments, this regularization function produces ﬁrst-layer weights concentrated in fewer modes per feature. However, I found that at values of β suﬃcient to induce the desired mode-wise sparsity patterns, penalizing the maximum also had the undesirable side-eﬀect of causing many of the weights for other modes to saturate at their mode’s maximum, suggesting that the features were overly constrained. In some cases, constraining the weights in this manner also caused the algorithm to learn duplicate (or redundant) features, in eﬀect scaling up the feature’s contribution to reconstruction to compensate for its constrained maximum. This is obviously an undesirable eﬀect, as it reduces the eﬀective size (or diversity) of the learned feature set.

This suggests that the max-norm may be overly constraining. A more desirable

sparsity function would penalize nonzero weight maxima for each mode for each

feature without additional penalty for larger values of these maxima. I can achieve

this eﬀect by applying the L0 norm, which takes a value of 0 for an input of 0, and

1 otherwise, on top of the max-norm from above:

KR

f (W ) =

I{(max i

Sr,i |Wi,j |)

>

0}

j=1 r=1

(2.8)

where I is the indicator function, which takes a value of 1 if its argument is true,

0 otherwise. Again, for a gradient-based method, I used an approximation to the

37

Figure 2.8: Example objects from the Cornell grasping dataset: [61]. This dataset contains objects from a large variety of categories.
L0 norm, such as log(1 + x2). This regularization function now encodes a direct penalty on the number of modes used for each weight, without further constraining the weights of modes with nonzero maxima.
Figure 2.7 shows features learned from the unsupervised stage of my groupregularized deep learning algorithm. I discuss these features, and their implications for robotic grasping, in Section 2.7.
2.6 Experiments
2.6.1 Dataset
I used the extended version of the Cornell grasping dataset for my experiments. This dataset, along with code for this chapter, is available at
38

http://pr.cs.cornell.edu/deepgrasping. I note that this is an updated version of the dataset used in [61], containing several more complex objects, and thus results for their algorithms will be diﬀerent from those in [61]. This dataset contains 1035 images of 280 graspable objects, several of which are shown in Fig. 2.8. Each image is annotated with several ground-truth positive and negative grasping rectangles. While the vast majority of possible rectangles for most objects will be non-graspable, the dataset contains roughly equal numbers of graspable and nongraspable rectangles. I will show that this is useful for an unsupervised learning algorithm, as it allows learning a good representation for graspable rectangles even from unlabeled data.
I performed ﬁve-fold cross-validation, and present results for splits on per image (i.e., the training set and the validation set do not share the same image) and per object (i.e., the training set and the validation set do not share any images from the same object) basis. Hyper-parameters were selected by validating performance on a separate set of 300 grasps not used in any of the cross-validation splits.
I take seven 24x24 pixel channels as described in Section 2.4 as input, giving 4032 input features to each network. I trained a deep network with 200 hidden units each at the ﬁrst and second layers using my learning algorithm as described in Sections 2.3 and 2.5. Training this network took roughly 30 minutes. For trials involving my two-pass system, I trained a second network with 50 hidden units at each layer in the same manner. During inference I performed an exhaustive search using this network, then used the 200-unit network to re-rank the 100 highestranked rectangles found by the 50-unit network.
39

2.6.2 Baselines
I compare my recognition results in the Cornell grasping dataset with the features from [61], as well as the combination of these features and Fast Point Feature Histogram (FPFH) features [120]. I used a linear SVM for classiﬁcation, which gave the best results among all other kernels. I also report chance performance, obtained by randomly selecting a label in the recognition case, and randomly assigning scores to rectangles in the detection case.
I also compare my algorithm to other deep learning approaches. I compare to a network trained only with standard L1 regularization, and a network trained in a manner similar to [104], where three separate sets of ﬁrst layer features are learned for the depth channel, the combination of the Y, U, and V channels, and the combination of the X, Y, and Z surface normal components.
2.6.3 Metrics for Detection
For detection, I compare the top-ranked rectangle for each method with the set of ground-truth rectangles for each image. I present results using two metrics, the “point” and “rectangle” metric.
For the point metric, similar to Saxena et al. [126], I compute the center point of the predicted rectangle, and consider the grasp a success if it is within some distance from at least one ground-truth rectangle center. I note that this metric ignores grasp orientation, and therefore might overestimate the performance of an algorithm for robotic applications.
For the rectangle metric, similar to Jiang et al. [61], let G be the top-ranked
40

Negative Positive

Figure 2.9: Learned 3D depth features: 3D meshes for depth channels of the four features with strongest positive (top) and negative(bottom) correlations to rectangle graspability. Here X and Y coordinates corresponds to positions in the deep network’s receptive ﬁeld, and Z coordinates corresponds to weight values to the depth channel for each location. Feature shapes clearly correspond to graspable and non-graspable structures, respectively.

Table 2.1: Recognition results for Cornell grasping dataset.

Algorithm Chance Jiang et al. [61] Jiang et al. [61] + FPFH Sparse AE, separate layer-1 feat. Sparse AE Sparse AE, group reg.

Accuracy (%) 50
84.7 89.6 92.8 93.7
93.7

grasping rectangle predicted by the algorithm, and G∗ be a ground-truth rectangle. Any rectangles with an orientation error of more than 30o from G are rejected. From the remaining set, I use the common bounding box evaluation metric of intersection divided by union - i.e. Area(G ∩ G∗)/Area(G ∪ G∗). Since a groundtruth rectangle can deﬁne a large space of graspable rectangles (e.g., covering the entire length of a pen), I consider a prediction to be correct if it scores at least 25% by this metric.

41

2.7 Results and Discussion
2.7.1 Deep Learning for Robotic Grasp Detection
Figure 2.7 shows the features learned by the unsupervised phase of my algorithm which have a high correlation to positive and negative grasping cases. Many of these features show non-zero weights to the depth channel, indicating that it learns the correlation of depths to graspability. I can see that weights to many of the modalities for these features have been eliminated by my structured regularization approach. In particular, many of these features lack weights to the U and V (3rd and 4th) channels, which correspond to color, allowing the system to be more robust to diﬀerent-colored objects.
Figure 2.9 shows 3D meshes for the depth channels of the four features with the strongest positive and negative correlations to valid grasps. Even without any supervised information, my algorithm was able to learn several features which correlate strongly to graspable cases and non-graspable cases. The ﬁrst two positivecorrelated features represent handles, or other cases with a raised region in the center, while the second two represent circular rims or handles. The negativelycorrelated features represent obviously non-graspable cases, such as ridges perpendicular to the gripper plane and “valleys” between the gripper plates. From these features, I can see that even during unsupervised feature learning, my approach is able to learn a representation useful for the task at hand, thanks purely to the fact that the data used is composed of half graspable and half non-graspable cases.
From Table 2.1, I see that the recognition performance is signiﬁcantly improved with deep learning methods, improving 9% over the features from [61] and 4.1%
42

Table 2.2: Recognition results for diﬀerent modalities, for a deep network pretrained using SAE.

Modes Chance RGB Depth Surf. Normals Depth + Surf. Normals RGB + Depth + Surf. Normals

Accuracy (%) 50
90.3 92.4 90.3 92.8 93.7

over those features combined with FPFH features. Both L1 and group regularization performed similarly for recognition, but training separate ﬁrst layer features decreased performance slightly. This shows that learned features, in addition to avoiding hand-design, are able to improve performance signiﬁcantly over the state of the art. It demonstrates that a deep network is able to learn the concept of “graspability” in a way that generalizes to new objects it hasn’t seen before.
Table 2.2 shows that even using any one of the three input modalities (RGB, depth, or surface normals), my algorithm is able to learn features which outperform hand-engineered ones for recognition. Depth gives the highest performance of any single-mode network. Combining depth and normal information improves results over either alone, indicating that they give non-redundant information.
The highest accuracy is still obtained by using all the input modalities. This shows that combining depth and color information leads to a system which is more robust than either modality alone. This is due to the fact that some graspable cases (rims of monochromatic objects, etc.) can only be detected using depth information, while in others, the depth channel may be extremely noisy, requiring the use of color information. From this, I can see that integrating multimodal information, a major focus of this work, is important in recognizing good robotic grasps.

43

Table 2.3: Detection results for point and rectangle metrics, for various learning algorithms.

Algorithm
Chance Jiang et al. [61] SAE, no mask-based scaling SAE, separate layer-1 feat. SAE, L1 reg. SAE, struct. reg., 1st pass only SAE, struct. reg., 2nd pass only SAE, struct. reg. two-stage

Image-wise split Point Rect 35.9 6.7 75.3 60.5 62.1 39.9 70.3 43.3 87.2 72.9 86.4 70.6 87.5 73.8 88.4 73.9

Object-wise split Point Rect 35.9 6.7 74.9 58.3 56.2 35.4 70.7 40.0 88.7 71.4 85.2 64.9 87.6 73.2 88.1 75.6

Table 2.3 shows that the performance gains from deep learning for recognition carry over to detection, as well. Once mask-based scaling has been applied, all deep learning approaches except for training separate ﬁrst-layer features outperform the hand-engineered features from [61] by up to 13% for the point metric and 17% for the rectangle metric, while also avoiding the need to design task-speciﬁc features. Without mask-based scaling, the system performs poorly, due to the bias illustrated in Fig. 2.5. Separate ﬁrst-layer features also give weak detection performance, indicating that the relative scores assigned by this form of network are less robust than those learned using my structured regularization approach.
Using structured multimodal regularization also improves results over standard L1regularization by up to 1.8%, showing that my method also learns more robust features than standard approaches which ignore modality information. Even though using the ﬁrst-pass network alone underperforms the second-pass network alone by up to 8.3%, integrating both in my two-pass system outperforms the solo second-pass network by up to 2.4%. This shows that the two-pass system improves not only eﬃciency, but accuracy as well. The performance gains from multimodal regularization and the two-pass system are discussed in detail below.
My system outperforms all baseline approaches by all metrics except for the
44

Thin gripper Wide gripper

0 degrees

45 degrees

90 degrees

Figure 2.10: Visualization of grasping scores for diﬀerent grippers: Red indicates maximum score for a grasp with left gripper plane centered at each point, blue is similar for the right plate. Best-scoring rectangle shown in green/yellow.

Figure 2.11: Improvements from group regularization: Cases where my group regularization approach produces a viable grasp (shown in green and yellow), while a network trained only with simple L1 regularization does not (shown in blue and red). Top: RGB image, bottom: depth channel. Green and blue edges correspond to gripper.
point metric in the object-wise split case. However, I can see that the chance performance is much higher for the point metric than for the rectangle metric. This shows that the point metric can overstate performance, and the rectangle metric is a better indicator of the accuracy of a grasp detection system.
Adaptability: One important advantage of my detection system is that I can ﬂexibly specify the constraints of the gripper in my detection system. This is
45

particularly important for a robot like Baxter, where diﬀerent objects might require diﬀerent gripper settings to grasp. I can constrain the detectors to handle this. Figure 2.10 shows detection scores for systems constrained based on two diﬀerent settings of Baxter’s gripper, one wide and one thin. The implications of these results for other types of grippers will be discussed in Section 2.9.
2.7.2 Multimodal Group Regularization
My group regularization term improves detection accuracy over simple L1 regularization. The improvement is more signiﬁcant for the object-wise split than for the image-wise split because the group regularization helps the network to avoid overﬁtting, which will tend to occur more when the learning algorithm is evaluated on unseen objects.
Figure 2.11 shows typical cases where a network trained using my group regularization ﬁnds a valid grasp, but a network trained with L1 regularization does not. In these cases, the grasp chosen by the L1-regularized network appears valid for some modalities – the depth channel for the sunglasses and nail polish bottle, and the RGB channels for the scissors. However, when all modalities are considered, the grasp is clearly invalid. The group-regularized network does a better job of combining information from all modalities and is more robust to noise and missing data in the depth channel, as seen in these cases.
46

Figure 2.12: Improvements from two-stage system: Example cases where the twostage system produces a viable grasp (shown in green and yellow), while the single-stage system does not (shown in blue and red). Top: RGB image, bottom: depth channel. Green and blue edges correspond to gripper.
2.7.3 Two-stage Detection System
Using my two-pass system enhanced both computational performance and accuracy. The number of rectangles the full-size network needed to evaluate was reduced by roughly a factor of 1000. Meanwhile, detection performance increased by up to 2.4% as compared to a single pass with the large-size network, even though using the small network alone signiﬁcantly underperforms the larger network. In most cases, the top 100 rectangles from the ﬁrst pass contained the top-ranked rectangle from an exhaustive search using the second-stage network, and thus results were unaﬀected.
Figure 2.12 shows some cases where the ﬁrst-stage network pruned away rectangles corresponding to weak grasps which might otherwise be chosen by the secondstage network. In these cases, the grasp chosen by the single-stage system might be feasible for a robotic gripper, but the rectangle chosen by the two-stage system represents a grasp which would clearly be successful.
47

Figure 2.13: Robotic experiment objects: Several of the objects used in experiments, including challenging cases such as an oddly-shaped RC car controller, a cloth towel, plush cat, and white ice cube tray.
The two-stage system also signiﬁcantly increases the computational eﬃciency of my detection system. Average inference time for a MATLAB implementation of the deep network was reduced from 24.6s/image for an exhaustive search using the larger network to 13.5s/image using the two-stage system.
2.8 Robotic Experiments
In order to evaluate the performance of my algorithms in the real world, I ran an extensive series of robotic experiments. To explore the generalizability and eﬀect of the robot on the success rate of my algorithms, I performed experiments on two diﬀerent robotic platforms, a Baxter Research Robot (“Yogi”) and a PR2 (“Kodiak”).
Baxter: The ﬁrst platform used is a Baxter Research Robot, which I call “Yogi.”
48

Figure 2.14: Robots executing grasps: My robots grasping several objects from the experimental dataset. Top row: Baxter grasping a quad-rotor casing, coﬀee mug, ice cube tray, knife, and electric shaver. Middle row: Baxter grasping a desk lamp, cheese grater, umbrella, cloth towel, and hot glue gun. Bottom row: PR2 grasping a plush cat, RC car controller, cereal box, toy elephant, and glove.
49

Table 2.4: Results for robotic experiments for Baxter, sorted by object category, for a total of 100 trials. Tr. indicates number of trials, Acc. indicates accuracy (in terms of success percentage.)

Kitchen tools

Object

Tr. Acc.

Can opener 3 100

Knife

3 100

Brush

3 100

Tongs

3 100

Towel

3 100

Grater

3 100

Average

100

Overall

84

Lab tools

Object

Tr.

Kinect

5

Wire bundle 3

Mouse

3

Hot glue gun 3

Quad-rotor

4

Duct tape roll 4

Average

Acc. 100 100 100 67 75 100 90

Containers

Object

Tr.

Colored cereal box 3

White cereal box 4

Cap-shaped bowl 3

Coﬀee mug

3

Ice cube tray

3

Martini glass

3

Average

Acc. 100 50 100 100 100
0 75

Toys

Object

Tr.

Plastic whale

4

Plastic elephant 4

Plush cat

4

RC controller

3

XBox controller 4

Plastic frog

3

Average

Acc. 75 100 75 67 50 67 72

Others

Object

Tr.

Electric shaver 3

Umbrella

4

Desk lamp

3

Remote control 5

Metal bookend 3

Glove

3

Average

Acc. 100 75 100 100 33 100 85

50

Table 2.5: Results for robotic experiments for PR2, sorted by object category, for a total of 100 trials. Tr. indicates number of trials, Acc. indicates accuracy (in terms of success percentage.)

Kitchen tools

Object

Tr. Acc.

Can opener 3 100

Knife

3 100

Brush

3 100

Tongs

3 100

Towel

3 100

Grater

3 100

Average

100

Overall

89

Lab tools

Object

Tr.

Kinect

5

Wire bundle 3

Mouse

3

Hot glue gun 3

Quad-rotor

4

Duct tape roll 4

Average

Acc. 100 100 100 67 100 100 95

Containers

Object

Tr.

Colored cereal box 3

White cereal box 4

Cap-shaped bowl 3

Coﬀee mug

3

Ice cube tray

3

Martini glass

3

Average

Acc. 100 100 100 100 100
0 83

Toys

Object

Tr.

Plastic whale

4

Plastic elephant 4

Plush cat

4

RC controller

3

XBox controller 4

Plastic frog

3

Average

Acc. 75 100 100 67 25 67 72

Others

Object

Tr.

Electric shaver 3

Umbrella

4

Desk lamp

3

Remote control 5

Metal bookend 3

Glove

3

Average

Acc. 100 100 100 100 67 100 95

Baxter has two arms with seven degrees of freedom each and a maximum reach of 104 cm, although I used only the left arm for these experiments. The end-eﬀector for this arm is a two-ﬁnger parallel gripper. I augmented the gripper tips using rubber bands for additional friction. Baxter’s grippers are interchangable, and I used two settings for these experiments - a “wide” setting with an open width of 8 cm and closed width of 4 cm, and a “thin” setting with an open width of 4 cm and a closed width of 0 cm (completely closed, gripper tips touching).
To detect grasps, I mounted a Kinect sensor to Yogi’s head, approximately 1.75 m above the ground. angled downwards at roughly a 75o angle towards a table in front of it. The Kinect gives RGB-D images at a resolution of 640x480 pixels. I calibrated the transformation between the Kinect’s and Yogi’s coordinate frames by marking four points corresponding to a set of 3D axes, and obtaining the coordinates of these points in both Kinect’s and Yogi’s frames.
All control for Baxter was done by specifying an end-eﬀector position and orientation, and using the inverse kinematics provided with Baxter to determine a set of joint angles for this pose. Baxter’s built-in control systems were used to drive the arm to these new joint angles.
PR2: My second platform was a PR2 robot, “Kodiak.” Similar to Baxter, PR2 has two 7-DoF arms with approximately 1 m reach, and I used only the left for these experiments. PR2’s grippers open to a width of 8 cm, and are capable of closing completely from that span, so I did not need to use two settings as with Baxter. I augmented PR2’s gripper friction with gaﬀer tape on the ﬁngertips.
For the experiments on PR2, I used the Kinect already mounted to Kodiak’s head, and used ROS’s built-in functionality to obtain 3D locations from that Kinect
51

and transform these to Kodiak’s body frame for manipulation. Control was performed using the ee cart stiﬀness controller [16] with trajectories provided by my own custom MATLAB code.
Experimental Setup: For each experiment, I placed a single object within a 25 cm x 25 cm square on the table, approximately 1.2 m below the mounting point of the Kinect. This square was chosen to be well-contained within each robot’s workspace, allowing objects to be reached from most approach vectors. Object positions and orientations were varied between trials, although objects were always placed in conﬁgurations in which at least one viable grasp was visible and accessible to the robot.
When using Baxter, due to the limited stroke (span from open to closed) of its gripper, I pre-selected one of the two gripper settings discussed above for each object. I constrained the search space as illustrated in Fig. 2.10 to ﬁnd grasps for that particular setting.
To detect grasps, I ﬁrst took an RGB-D image from the Kinect with no objects in the scene as a background image. The depth channel of this image was used to segment objects from the scene, and to correct for the slant of the Kinect. Once an object was segmented, I used my algorithm, as described above, to obtain a single best-ranked grasping rectangle.
The search space for the ﬁrst-pass network progressed in 15-degree increments from 15 to 180 degrees (angles larger than 180 being mirror-images of grasps already tested), searching over 10-pixel increments across the image for the X and Y coordinates of the upper-left corner of the rectangle. For the thin gripper setting, rectangle widths and heights from 10 to 40 pixels in 10-pixel increments were
52

searched, while for the thick setting these ranged from 40 pixels to 100 pixels in 20-pixel increments. In both cases, rectangles taller than they were wide were ignored. Once a single best-scoring grasp was detected, I translated it to a robotic grasp consisting of a grasping point and an approach vector using the rectangle’s parameters and the surface normal at the rectangle’s center as described above.
To execute the grasp, I ﬁrst positioned the gripper at a location 10 cm back from the grasping point along the approach vector. The gripper was oriented to the approach vector, and rotated around it based on the orientation of the detected grasping rectangle.
Since Baxter’s arms are highly compliant, slight imprecisions in end-eﬀector positioning are to be expected – I found that errors of up to 2 cm were typical. Thus, I implemented a visual servoing system using its hand camera, which provides RGB images at a resolution of 320x200 pixels. I used color segmentation to separate the object from the background, and used its lateral position in image space to drive Yogi’s end-eﬀector to center the object. I did not implement visual servoing for PR2 because its gripper positioning was found to be precise to within 0.5 cm.
After visual servoing was completed, I drove the gripper 14 cm forwards from its current position along the approach vector, so that the grasping point was wellcontained within it. I then closed the gripper, grasping the object, and moved it 30 cm upwards. A grasp was determined to be successful if it was suﬃcient to lift the object and hold it for one second.
Objects to be Grasped: For my robotic experiments, I collected a diverse set of 35 objects within a size of .3 m x .3 m x .3 m and weighing at most 2.5
53

kg (although most were less than 1 kg) from my oﬃces, homes, and lab. Many of them are shown in Fig. 2.13. Most of these objects were not present in the training dataset, and thus were completely new to the grasp detection algorithm.
Due to the physical limitations of the robots’ grippers, I found that ﬁve of these objects were not graspable even when given a hand-chosen grasp. The small pair of pliers was too low to the table to grip properly. The spray paint can was too smooth for the gripper to get enough friction to lift it. The weight of the hammer was too imbalanced, causing the hammer to rotate and slip out of the gripper when grasped. Similar problems were encountered with the bicycle U-lock. The bevel spatula’s handle was too close to the thin-set size of Baxter’s gripper, so that I could not position it precisely enough to grasp it reliably. I did not consider these objects for purposes of my experimental results, since my focus was on evaluating the performance of my grasp detection algorithm.
Results: Table 2.4 shows the results of my robotic experiments on Baxter for the remaining 30 objects, a total of 100 trials. Using my algorithm, Yogi was able to successfully execute a grasp in 84% of the trials. Figure 2.14 shows Yogi executing several of these grasps. In 8% of the trials, my algorithm detected a valid grasp which was not executed correctly by Yogi. Thus, I were able to successfully detect a good grasp in 92% of the trials. Video of some of these trials is available at http://pr.cs.cornell.edu/deepgrasping.
PR2 yielded a higher success rate as seen in Table 2.5, succeeding in 89% of trials. This is largely due to the much wider span of PR2’s gripper from open to closed and its ability to fully close from its widest position, as well as PR2’s ability to apply a larger gripping force. Some speciﬁc instances where PR2 and Baxter’s performance diﬀered are discussed below.
54

For comparison purposes, I ran a small set of control experiments for 16 of the objects in the dataset. The control algorithm simply returned a ﬁxed-size rectangle centered at the object’s center of mass, as determined by depth segmentation from the background. The rectangle was aligned so that the gripper plates ran parallel to the object’s principal axis. This algorithm was only successful in 31% of cases, signiﬁcantly underperforming my system.
On Baxter, my algorithm sometimes detected a grasp which was not realizable by the current setting of its gripper, but might be executable by others. For example, my algorithm detected grasps across the leg of the plush cat, and the region between the handle and body of the umbrella, both too thin for the wide setting of Baxter’s gripper to grasp since it has a minimum span of 4 cm. Since PR2’s gripper can close completely from any position, it did not encounter these issues and thus achieved a 100% success rate for both these objects.
The XBox controller proved to be a very diﬃcult object for either robot to grasp. From a top-down angle, there is only a small space of viable grasps with a span of less than 8 cm, but many which have either a slightly larger span (making them non-realizable by either gripper), or are subtly non-viable (e.g. grasps across the two “handles,” which tend to slip oﬀ.) All viable grasps are very near to the 8 cm span of both grippers, meaning that even slight imprecision in positioning can lead to failure. Due to this, Baxter achieved a higher success rate for the XBox controller thanks to visual servoing, succeeding in 50% of cases as compared to the 25% success rate for PR2.
My algorithm was able to consistently detect and execute valid grasps for a red cereal box, but had some failures on a white and yellow one. This is because the background for all objects in the dataset is white, leading the algorithm to learn
55

features relating white areas at the edges of the gripper region to graspable cases. However, it was able to detect and execute correct grasps for an all-white ice cube tray, and so does not fail for all white objects. This could be remedied by extending the dataset to include cases with diﬀerent background colors. Interestingly, even though the parameters of grasps detected for the white box were similar for PR2 and Baxter, PR2 was able to succeed in every case while Baxter succeeded only half the time. This is because PR2’s increased gripper strength allowed it to execute grasps across corners of the box, crushing it slightly in the process.
Other failures were due to the limitations of the Kinect sensor. I were never able to properly grasp the martini glass because its glossy ﬁnish prevented Kinect from returning any depth estimates for it. Even if a valid grasp were detected using color information only, there was no way to infer a proper grasping position without depth information. Grasps for the metal bookend failed for similar reasons, but it was not as glossy as the martini glass, and gave enough returns for some to succeed.
However, my algorithm also had many noteworthy successes. It was able to consistently detect and execute grasps for a crumpled cloth towel, a complex and irregular case which bore little resemblance to any object in the dataset. It was also able to ﬁnd and grasp the rims of objects such as the plastic baseball cap and coﬀee mug, cases where there is little visual distinction between the rim and body of the object. These objects underscore the importance of the depth channel for robotic grasping, as none of these grasps would be detectable without depth information.
My algorithm was also able to successfully detect and execute many grasps for which the approach vector was non-vertical. The grasps shown for the coﬀee mug,
56

desk lamp, cereal box, RC car controller, and toy elephant shown in Fig. 2.14 were all executed by aligning the gripper to such an approach vector. Indeed, many of these grasps may have failed had the gripper been aligned vertically. This shows that my algorithm is not restricted to detecting top-down grasps, but rather encodes a more general notion of graspability which can be applied to grasps from many angles, albeit within the constraints of visibility from a single-view perspective.
While a few failures occurred, my algorithm still achieved a high rate of accuracy for other oddly-shaped objects such as the quad-rotor casing, RC car controller, and glue gun. For objects with clearly deﬁned handles, such as the cheese grater, kitchen tongs, can opener, and knife, my algorithm was able to detect and execute successful grasps in every trial, showing that there is a wide range of objects which it can grasp extremely consistently.
2.9 Discussion and Future Work
My algorithm focuses on the problem of grasp detection for a two-ﬁngered parallelplate style gripper. It would be directly applicable to other grippers with ﬁxed conﬁgurations, simply requiring new training data labeled with grasps for the gripper in question. My system would allow even the basic features used for grasp detection to adapt to the gripper. This might be useful in cases such as jamming grippers [62], or two-ﬁngered grippers with diﬀerently-shaped contact surfaces, which might require diﬀerent features to determine a graspable area.
My detection algorithm does not directly address the problem of 3D orientation of the gripper – this orientation is determined only after an optimal rectangle has
57

been detected, orienting the grasp based on the object’s surface normals. However, just as my approach here considers aligns a 2D feature window to the gripper, an extension of this work might align a 3D window – using voxels, rather than pixels, as its basic unit of representation for input features to the network. This would allow the system to search across the full 6-DoF 3D pose of the gripper, while still leveraging the power of feature learning.
My system gives only a gripper pose as output, but multi-ﬁngered reconﬁgurable hands also require a conﬁguration of the ﬁngers in order to grasp an object. In this case, my algorithm could be used as a heuristic to ﬁnd one or more locations likely to be graspable (similar to the ﬁrst pass in my two-pass system), greatly reducing the search space needed to ﬁnd an optimal gripper conﬁguration.
My algorithm also depends only on local features to determine grasping locations. However, many household objects may have some areas which are strongly preferable to grasp over others - for example, a knife might be graspable by the blade, or a hot glue gun by the barrel, but both should actually be grasped by their respective handles. Since these regions are more likely to be labeled as graspable in the data, my system already weakly encodes this, but some may not be readily distinguishable using only local information. Adding a term modeling the probability of each region of the image being a semantically-appropriate area to grasp the object would allow me to incorporate this information. This term could be computed once for the entire image, then added to each local detection score, keeping detection eﬃcient.
In this work, my visual-servoing algorithm was purely heuristic, simply attempting to center the segmented object underneath the hand camera. However, in future work, a similar feature-learning approach might be applied to hand cam-
58

era images of graspable and non-graspable regions, improving the visual servoing system’s ability to ﬁne-tune gripper position to ensure a good grasp.
Many robotics problems require the use of perceptual information, but can be diﬃcult and time-consuming to engineer good features for, particularly when using RGB-D data. In future work, my approach could be extended to a wide range of such problems. My system could easily be applied to other detection problems such as object detection or obstacle detection. However, it could also be adapted to other similar problems, such as object tracking and visual servoing.
Multimodal data has become extremely important for robotics, due both to the advent of new sensors such as the Kinect and the application of robots to more challenging tasks which require multiple modalities of information to perform well. However, it can be very diﬃcult to design features which do a good job of integrating many modalities. While my work focuses on color, depth, and surface normals as input modes, my structured multimodal regularization algorithm might also be applied to others. This approach could improve performance while allowing roboticists to focus on other engineering challenges.
2.10 Conclusions
I presented a system for detecting robotic grasps from RGB-D data using a deep learning approach. My method has several advantages over current state-of-the-art methods. First, using deep learning allows me to avoid hand-engineering features, learning them instead. Second, my results show that deep learning methods signiﬁcantly outperform even well-designed hand-engineered features from previous work.
59

I also presented a novel feature learning algorithm for multimodal data based on group regularization. In extensive experiments, I demonstrated that this algorithm produces better features for robotic grasp detection than existing deep learning approaches to multimodal data. My experiments and results, both oﬄine and on real robotic platforms, show that my two-stage deep learning system with group regularization is capable of robustly detecting grasps for a wide range of objects, even those previously unseen by the system.
60

CHAPTER 3 DEEPMPC: LEARNING DEEP LATENT FEATURES FOR MODEL
PREDICTIVE CONTROL
Figure 3.1: Cutting food: My PR2 robot uses my algorithms to perform complex, precise food-cutting operations. Given the large variety of material properties, it is challenging to design appropriate controllers.
3.1 Introduction
As robots perform tasks in the real world, they must be able to handle a large variety of environments, objects, materials, and more. Traditional robotics approaches which hand-code controllers and models are ill-equipped to deal with this, both because it is time-consuming to create controllers for all such cases, and because a
61

human programmer cannot possibly anticipate the large variety of situations that may be encountered.
Most real-world tasks involve interactions with complex, non-linear dynamics. Although practiced humans are able to control these interactions intuitively, developing robotic controllers for them is very diﬃcult. Several common household activities fall into this category, including scrubbing surfaces, folding clothes, interacting with appliances, and cutting food. Other applications include surgery, assembly, and locomotion. These interactions are characterized by hard-to-model eﬀects, involving friction, deformation, and hysteresis. The compound interaction of materials, tools, environments, and manipulators further alters these eﬀects. Consequently, the design of controllers for such tasks is highly challenging.
In recent years, “feed-forward” model-predictive control (MPC) has proven eﬀective for many complex tasks, including quad-rotor control [132], mobile robot maneuvering [51], full-body control of humanoid robots [38], and many others [88, 49, 33]. The key insight of MPC is that an accurate predictive model allows me to optimize control inputs for some cost over both inputs and predicted future outputs. Such a cost function is often easier and more intuitive to design than completely hand-designing a controller. The chief diﬃculty in MPC lies instead in designing an accurate dynamics model.
Let us consider the dynamics involved in cutting food items, as shown in Fig. 3.1, for the wide range of materials shown in Fig. 3.2. An eﬀective cutting strategy depends heavily on properties of the food, including its coeﬃcient of friction with the knife, elastic modulus, fracture eﬀects, and hysteretic eﬀects
This work originally presented as a conference paper at Robotics: Science and Systems (RSS) 2015, and is under submission to the International Journal of Robotics Research (IJRR). This was joint work with Ross Knepper and Ashutosh Saxena.
62

such as plastic deformation [96]. These variations lead humans to such diverse cutting strategies as slicing, sawing, and chopping. In fact, properties can even vary within a single material – compare cutting through the skin of a lemon to cutting its ﬂesh. Thus, a major challenge of this work is to design a model which can estimate and make use of global environmental properties such as the material and tool in question and temporally-changing properties such as the current rate of motion, depth of cutting, enclosure of the knife by the material, and layer of the material the knife is in contact with. While some works [42] attempt to deﬁne parameters modeling these properties, it is very diﬃcult to design a set that truly captures all these complex inter- and intra-material variations.
Developing a model for such a complex task is thus extremely challenging. The model must be able to handle a wide range of non-linear dynamics. It must also be able to model the eﬀects of a huge range of properties and variations on these eﬀects, and infer these properties online while acting. They must be able to model the entire range of variations in dynamics the model might see in the real world, extremely challenging to do with hand-deﬁned properties. In order to be useful for MPC, the model’s outputs must be diﬀerentiable with respect to its inputs, and both forward prediction and backwards gradient computation must be time-eﬃcient in order to allow real-time optimization of the control inputs.
For these reasons, I take a deep learning approach. In the recent past, such methods have proven eﬀective for learning latent task-speciﬁc features across many domains [7, 50, 79, 27, 95, 58, 140]. In this chapter, I give a novel deep architecture for physical prediction for complex tasks such as food cutting. When this model is used for predictive control, it yields a DeepMPC controller which is able to learn task-speciﬁc controls.
63

Since deep networks can act as universal function approximators [7], they can model even complex, non-linear dynamics. I make use of conditional multiplicative structures [91] to model the dependence of these dynamics on diﬀerent properties. I treat these properties as latent, allowing the model itself to learn features which are useful for physical prediction. I apply temporal recurrence [119] to allow the model to reuse previous information both to reﬁne estimates of these properties and to model temporal variation in properties. Deep networks are an excellent choice as a model for real-time MPC because they are easily and eﬃciently diﬀerentiable with respect to their inputs using the same back-propagation algorithms used in learning, and because network sizes can simply be adjusted to trade oﬀ between prediction accuracy and computational speed.
My model, optimized for receding-horizon prediction, learns latent material properties directly from data. My architecture uses multiplicative conditional interactions and temporal recurrence to model both inter-material and timedependent intra-material variations. I also present a novel learning algorithm for this recurrent network which avoids overﬁtting and the “exploding gradient” problem commonly seen when training recurrent networks [8]. Once learned, inference for my model is extremely fast - when predicting to a time-horizon of 1s (100 samples) in the future, my model and its gradients can be evaluated at a rate of 1.2kHz.
In extensive experiments on my large-scale dataset comprising 1488 examples of robotic cutting across 20 diﬀerent material types, I demonstrate that my featurelearning approach outperforms other state-of-the-art methods for physical prediction. I also implement an online real-time model-predictive controller using my model. In a series of over 450 real-world robotic trials, I show that my controller
64

Figure 3.2: Food materials: Some of the 20 diverse food materials which make up my material interaction dataset. These include tough vegetables like carrots and potatoes, thick-skinned fruits like lemons and limes, and soft items like butter and bananas, all of which require diﬀerent techniques to cut properly.
gives extremely strong performance for robotic food-cutting, even compared to methods tuned for speciﬁc material classes.
In summary, the contributions of this chapter are:
• I combine deep learning and model-predictive control in a DeepMPC controller that uses learned task dynamics.
• I propose a novel deep architecture which is able to model dynamics conditioned on learned latent properties and a multi-stage pre-training algorithm that avoids common problems in training recurrent neural networks.
• I implement a real-time model predictive control system using my learned dynamics model on a PR2 robot.
• I demonstrate that my model and controller give strong performance for the diﬃcult task of robotic food-cutting.
The remainder of this chapter is organized as follows: I present related work, including an overview of model learning for control and related methods in deep
65

learning in Section 3.2. I then introduce and deﬁne the food-cutting problem and model predictive control in Section 3.3. I motivate and present my deep architecture for modeling complex, varying dynamics in Section 3.4, then present my learning and inference algorithms for it in Section 3.5. I give other system details in Section 3.6, then present my real-time MPC implementation on a PR2 robot in Section 3.7. I present experiments and results for my approach for modeling the complex dynamics involved in food-cutting in Section 3.8, and for real-world robotic control on a PR2 robot in Section 3.9. Finally, I conclude and present directions for future work in Section 3.10.
3.2 Related Work
3.2.1 Robotic Control
Reactive feedback controllers, where a control signal is generated based on error from current state to some set-point, have been applied since the 19th century [9]. Stiﬀness control, where error in robot end-eﬀector pose is used to determine endeﬀector forces, remains a common technique for compliant, force-based activities [16, 6, 42]. Such approaches are limited because they require a trajectory to be given beforehand, making it diﬃcult to adapt to diﬀerent conditions.
Markov Decision Processes (MDPs) [114] are another popular approach to robotic control. While these methods give a tractable, general approach to solving many robotic problems such as autonomous helicopter ﬂight [1], robotic soccer [116] and many others, they are limited to problems with discrete, fully-observable states. In my food-cutting problem and many other robotic problems, I are deal-
66

ing with continuous states (physical positions), and some environmental properties (e.g. material types) may not be directly observable. Partially-Observable MDPs (POMDPs), which have been applied to such diverse problems as action anticipation for table tennis [147], robotic grasping [53], navigation [40], avoid these assumptions. However, they typically still make others, such as locally linear dynamics [18] or discrete action spaces [42]. In this work, both my states and actions will be continous-valued, and I will directly model the fact that task dynamics depend on some unobserved properties.
Feed-forward model-predictive control allows controls to adapt online by optimizing some cost function over predicted future states. These approaches have gained increased attention as modern computing power makes it feasible to perform optimization in real time. Shim et al. [132] used MPC to control multiple quad-rotors in simulation, while Howard et al. [51] performed intricate maneuvers with real-world mobile robots. Erez et al. [38] used MPC for full-body control of a humanoid robot. These approaches have been extended to many other tasks, including underwater vehicle control [88], visual servoing [49], and even heart surgery [33]. However, all these works assume the task dynamics model is fully speciﬁed.
3.2.2 Model Learning for Control
Model learning for robot control has also been a very active area, and I refer the reader to a review of work in the area by Nguyen-Tuong and Peters [106]. While early works in model learning [2, 103] ﬁt parameters of some hand-designed task-speciﬁc model to data, such models can be diﬃcult to design and may not generalize well to new tasks. Thus, several recent works attempt to learn more general dynamics models such as Gaussian mixture models [21, 66] and Gaussian
67

processes [68]. Neural networks [22, 19] are another common choice for learning general non-linear dynamics models. The highly parameterized nature of these models allows them to ﬁt a wide variety of data well, but also makes them very susceptible to overﬁtting.
3.2.3 Policy Learning
Several recent approaches to control learning ﬁrst learn a dynamics model, then use this model to learn a policy which maps from system state to control inputs. These works often iteratively use this policy to collect more data and re-learn a new policy in an online learning approach. Levine et al. [82] use a Gaussian mixture model (GMM) where linear models are ﬁt to each cluster, while Deisenroth and Rasmussen [30] use a Gaussian process (GP.) Experimentally, both these models gave less accurate predictions than mine for robotic food-cutting. The GP also had very long inference times (roughly 106 times longer than mine) due to the large amount of training data needed. For details, see Section 3.8. This weak performance is because they use only temporally-local information, while my model uses learned recurrent features to integrate long-term information and model unobserved system properties such as materials.
These works focus on online policy search, while here I focus on modeling and application to real-time MPC. My model could be used along with them in a policy-learning approach, allowing them to model dynamics with environmental and temporal variations. However, my model is eﬃcient enough to optimize for predictive control at run-time. This avoids the possibility of learned policies overﬁtting the training data and allows the cost function and its parameters to be changed online. It also allows my model to be used with other algorithms which
68

use its predictions directly.
3.2.4 Robotic Manipulation
Luo and Hauser [84] developed a system which adapts manipulation to unknown system parameters, but requires a parameterized dynamics model. Koval et al. [70] developed a new algorithm for planar contact manipulation which decomposes preand post-contact policies. Maitin-Shepard et al. [86] developed a system for robotic towel-folding. This system focused on the perception aspects of the problem, and assumes uniformity and compliance in the material being manipulated.
Several recent works have applied robotic manipulation to kitchen operations. Bollini et al. [16] developed a vision-based robotic system for preparing and baking cookies, while Beetz et al. [6] developed a system for preparing pancakes. Gemici and Saxena [42] presented a learning system for manipulating deformable objects which infers a set of material properties, then uses these properties to map objects to a latent set of haptic categories which are used to determine how to manipulate the object. However, their approach requires a predeﬁned set of properties (plasticity, brittleness, etc.), and chooses between a small set of discrete actions. By contrast, my approach performs continuous-space real-time control, and uses learned latent features to model material properties and other variations, avoiding the need for hand-design. All three of these works also apply non-reactive stiﬀness controllers.
69

Position (m)

Vertical Axis Sawing Axis

0.03 0.02 0.01
0 −0.01 −0.02 −0.030
0.015 0.01
0.005 0
−0.005 −0.01
−0.0150

Butter 2 Time (s) 4 2 Time (s) 4

0.03 0.02 0.01
0 −0.01 −0.02 6 −0.030
0.015 0.01
0.005 0
−0.005 −0.01
6 −0.0150

Position (m)

Position (m)

Lemon 2 Time (s) 4 2 Time (s) 4

0.03 0.02 0.01
0 −0.01 −0.02 6 −0.030
0.015 0.01
0.005 0
−0.005 −0.01
6 −0.0150

Position (m)

Position (m)

Lemon, Faster 2 Time (s) 4 2 Time (s) 4

6 6

Position (m)

70

Figure 3.3: Variation in cutting dynamics: plots showing desired (green) and actual (blue) trajectories, along with error (red) obtained using a stiﬀness controller while cutting butter (left) and a lemon at low (middle) and high (right) rates of vertical motion. Butter resists the knife signiﬁcantly less than the lemon. Even though only the vertical cutting rate is the only change between the middle and right-hand plots, dynamics along the sawing axis also change signiﬁcantly. Dynamics also vary with time for the lemon as the knife cuts through the skin and into the ﬂesh.

3.3 Problem Deﬁnition and System
In this work, I focus on the task of cutting a wide range of food items. This problem is a good testbed for my algorithms because of the variety of dynamics involved in cutting diﬀerent materials. Designing individual controllers for each material would be very time-consuming, and hand-designing accurate dynamics models for each would be nearly infeasible.
For the task of cutting, I deﬁne gripper axes as shown in Fig. 3.4, such that the X axis points out of the point of the knife, Y axis normal to the blade, and Z axis vertically. Here, I consider linear cutting, where the goal is to make a cut of some given length along the Z axis. The control inputs to the system are denoted as u(t) = (Fx(t), Fy(t), Fz(t)), where Fx(t) represents the force, in Newtons, applied along the end-eﬀector X axis at time t. The physical state of the system is x(t) = (Px(t), Py(t), Pz(t)) where Px(t) is the X coordinate of the end-eﬀector’s position at time t.
A simple approach to control for this problem might use a ﬁxed-trajectory stiﬀness controller, where control inputs are proportional to the diﬀerence between the current state x(t) and some desired state x∗(t) taken from a given trajectory.
Fig. 3.3 shows some examples which demonstrate the diﬃculties inherent in this approach. While some materials, such as the butter shown on the left, oﬀer very little resistance, allowing a stiﬀness controller to accurately follow a given trajectory, others, such as the lemon shown in the remaining two plots, oﬀer more resistance, causing signiﬁcant deviation from the desired trajectory. When cutting a lemon, I can also see that the dynamics change with time, resisting the knife more as it cuts through the skin, then less once it enters the ﬂesh of the lemon.
71

Figure 3.4: Gripper axes: PR2’s gripper with knife grasped, showing the axes used in this chapter. The X (“sawing”) axis points along the blade of the knife, Y points normal to the blade, and Z points vertically.
The dynamics of the sawing and vertical axes are also coupled - increasing the rate of vertical motion increases error along the sawing axis, even though the same controls are used for that axis. This coupled behavior presents additional challenges for modeling and control, as these axes must be considered together.

In my approach, I ﬁx the orientation of the end-eﬀector, as well as the position

of the knife along its Y axis, using stiﬀness control to stabilize these. However,

even though my primary goal is to move the knife along its Z axis, as shown

in Fig. 3.3, the X and Z axes are strongly coupled for this problem. Thus, my

algorithm performs control along both the X and Z axes. This allows “sawing”

and “slicing” motions in which movement along the X axis is used to break static

friction along the Z axis and enable forward progress. A nonlinear function f

predicts future states:

xˆ(t+1) = f (x(t), u(t+1))

(3.1)

This formula can then be applied recurrently to predict further into the future, e.g. xˆ(t+2) = f (xˆ(t+1), u(t+2)). When performing recurrent prediction as such, an accurate dynamics model is extremely important as errors will accumulate over

72

multiple timesteps.

3.3.1 Model-Predictive Control: Background

In this work, I use a model-predictive controller to control the cutting hand. Such controllers have been shown to work extremely well for a wide variety of tasks for which hand-deﬁned controllers are either diﬃcult to deﬁne or simply cannot suﬃce [51, 38, 88, 33]. Deﬁning Xt:k as the system state from time t through time k, and Ut:k similarly for system inputs, a model-predictive controller works by ﬁnding a set of optimal inputs Ut∗+1:t+T which minimize some cost function C(Xˆt+1:t+T , Ut+1:t+T ) over predicted state Xˆ and control inputs U for some ﬁnite time horizon T :

Ut∗+1:t+T = arg max C(Xˆt+1:t+T , Ut+1:t+T ) Ut+1:t+T

(3.2)

This approach is powerful, as it allows the robot to leverage its knowledge of task dynamics f (x, u) directly, predicting future interactions and proactively avoiding mistakes rather than simply reacting to past observations. It is also versatile, as I have the freedom to design C to deﬁne optimality for some task. The chief diﬃculty lies in modeling the task dynamics f (x, u) in a way that is both diﬀerentiable and quick to evaluate, to allow online optimization.

3.4 Modeling Time-Varying Non-Linear Dynamics with Deep Networks
Hand-designing models for the entire range of potential interactions encountered in complex tasks such as cutting food would be nearly impossible. My main challenge
73

in this work is then to design a model capable of learning non-linear, time-varying dynamics. This model must be able to respond to short-term changes, such as breaking static friction, and must be able to identify and model variations caused by varying materials and other properties. It must be diﬀerentiable with respect to system inputs, and the system outputs and gradients must be fast to compute to be useful for real-time control.
I choose to base my model on deep learning algorithms, a strong choice for my problem for several reasons. They have been shown to be general non-linear learners [7], but remain diﬀerentiable using eﬃcent back-propagation algorithms. When time is an issue, as in my case, network sizes can be scaled down to trade accuracy for computational performance.
Although deep networks can learn any non-linear function, care must still be taken to design a task-appropriate model. As shown in Fig. 3.7, a simple deep network gives relatively weak performance for this problem. Thus, one major contribution of this work is to design a novel deep architecture for modeling dynamics which might vary both with the environment, material, etc., and with time while acting. In this section, I describe my architecture, shown in Fig. 3.5 and motivate my design decisions in the context of modeling such dynamics.
3.4.1 Deep Learning - Background
Before describing my new architecture for handling complex, varying dynamics, I will ﬁrst describe previous work which makes up some components of this architecture and algorithm.
74

Unsupervised Feature Learning: Unsupervised feature learning is one of the major strengths of modern deep learning approaches. Even for unlabled data, these algorithms are capable of learning useful features which can be used to initialize the network before supervised learning. Since they are generic to the type of data used as input, they can be applied to learned features to learn multiple layers of representation. Here, I will apply a variant of the sparse auto-encoder (SAE) algorithm [46], which learns features which reconstruct the training data well while activating sparsely (e.g. for a given case, only a few features should have high values.) Initializing the network in this way helps to avoid overﬁtting by giving supervised learning a better, more general starting point.
Back-propagation: During both supervised ﬁne-tuning and some gradientbased unsupervised learning algorithms such as SAE, I use back-propagation to eﬃciently compute cost function gradients with respect to each parameter of the network. This works by ﬁrst performing forward inference in the network, then computing the cost function and its gradient with respect to its inputs (from the network.) We can then iteratively propagate this gradient through the network, computing the gradient of the cost function with respect to each hidden unit and weight.
A similar approach can be used to compute cost function gradients with respect to system inputs during online MPC. Here, gradients with respect to network parameters are irrelevant, so I propagate cost function gradients backwards through each layer of hidden features until they reach the system inputs. This lets me quickly and eﬃciently compute the gradient of some cost function over network outputs with respect to system inputs, allowing me to perform real-time gradientbased optimization for MPC.
75

Conditional Features: When modeling dynamics which depend on the robot’s environment, I want to be able to condition the current dynamic response on some set of environmental properties. Factored conditional nodes [91, 143] are able to model conditional structures such as this by learning weights from each set of inputs to some hidden “factors.” Each factor then multiplies all inputs it recieves, eﬀectively scaling the contribution of each set of inputs based on features extracted from the others. When modeling dynamics, this is useful as it allows diﬀerent dynamics to be activated or deactivated depending on environmental properties. Such features have also been shown to be useful in detecting transformations, such as shifts and rotations in natural images [91], a behavior which I will use when modeling time-varying properties.
Temporal Recurrence: When modeling time-dependent behavior, it is useful to re-use past information and features. Temporal recurrence [119] allows us to do so by forming weights to a set of features from the same features for the previous timestep, and, in turn, feeding these features forwards to the next. This allows us to naturally re-use features and integrate long-term information, while still allowing features to change over time. This is also memory-eﬃcient, as the system need only remember one additional set of features (the previous timestep’s) and do not need to record all observed features.
However, recurrent models introduce additional diﬃculty during training, in particular in cases such as recurrent dynamic prediction where the model outputs are also used recurrently. If the network is not initialized well, inaccurate predictions will be fed forwards, causing increased inaccuracy in future timesteps, leading to the “exploding gradient” problem when already-huge error gradients are further scaled up during back-propagation [8]. Such gradients cause problems for
76

h[l]

l(b−1) h[lp]

l(b) h[lc]

xˆ(b+1) h[f ]

h[c]

v(b−1)

v(b)

u(b+1)

Figure 3.5: Deep predictive model: Architecture of my recurrent conditional deep predictive dynamics model. Transforming recurrent units (TRUs) on the left model time-varying latent properties which aﬀect system dynamics. On the right, conditional multiplicative modulation is used again to condition future system responses on past observed dynamics and latent features.

gradient-based learning algorithms and can often lead to overﬁtting the training data. Thus, in this work, I present a new algorithm for training my recurrent model for physical prediction which iteratively initializes portions of it to avoid these issues.

3.4.2 DeepMPC Architecture
In order to properly model complex dynamics conditioned on environmental properties which might vary with time, I now deﬁne a new deep architecture. This architecture retains the strengths of standard deep learning algorithms while giving superior predictive performance for complex tasks such as food cutting, as shown in Section 3.8.

77

Dynamic Response Features: When modeling physical dynamics, it is important to capture short-term input-output responses. Thus, rather than learning features separately for system inputs u and outputs x, the basic input features used in my model are a concatenation of both. It is also important to capture high-order and delayed-response modes of interaction. Thus, rather than considering only a single timestep, I consider blocks thereof when computing these features, so that for block b, with block size B, I have visible input features v(b) = (Xb∗B:(b+1)∗B−1, Ub∗B:(b+1)∗B−1). For known timesteps, I use the observed values of x, while for future timesteps, I use xˆ as predicted by my model. For more details on my feature pre-processing, see Section 3.6.
Conditional Dynamic Responses: For tasks such as material cutting, local dynamics might be conditioned on both time-invariant and time-varying properties. Thus, I must design a model which operates conditional on past information. I do so by introducting factored conditional units [91], where features from some number of inputs modulate multiplicatively and are then weighted to form network outputs. Intuitively, this allows me to blend a set of models based on features extracted from past information. Since my model needs to incorporate both shortand long-term information, I allow three sets of features to interact – the current control inputs, the past block’s dynamic response, and latent features modeling long-term observations, described below. Although the past block’s response is also included when forming latent features, including it directly in this conditional model frees my latent features from having to model such short-term dependencies.
I use c to denote the current timeblock, f to denote the immediate future one, l for latent features, and o for outputs. Take Nv as the number of features v, Nx as the number of states x, and Nu as the number of inputs u in a block, and Nl
78

as the number of latent features l. With h[c](b) ∈ RNoh as the hidden features from the current timestep, formed using weights W [c] ∈ RNv×Noh (similar for f and l), and W [o] ∈ RNoh×Nx as the output weights, my predictive model is then:

hj[c](b) = σ

Nv
Wi[,cj]vi(b)

i=1

h[jf](b) = σ

Nu
Wi[,fj]u(ib+1)

i=1

hj[l](b) = σ

Nl
Wi[,lj] li(b)

i=1

Noh

xˆj(b+1) =

Wi[,oj]hi[c](b)hi[f ](b)hi[l](b)

i=1

(3.3) (3.4) (3.5) (3.6)

Long-Term Recurrent Latent Features: Another major challenge in modeling time-dependent dynamics is integrating long-term information while still allowing for transitions in dynamics, such as moving from the skin to the ﬂesh of a lemon. To this end, I introduce transforming recurrent units (TRUs). To retain state information from previous observations, my TRUs use temporal recurrence, where each latent unit has weights to the previous timestep’s latent features. To allow this state to transition based on locally observed transformations in dynamics, they use the paired-ﬁlter behavior of multiplicative interactions to detect transitions in the dynamic response of the system and update the latent state accordingly. In previous work, multiplicative factored conditional units have been shown to work well in modeling transformations in images [91] and physical states [143], making them a good choice here. Each TRU thus takes input from the previous TRU’s output and the short-term response features for the current and previous time-blocks. With ll denoting recurrent weights, lc denoting current-step for the latent features, lp previous-step, and lo output, and Nlh as the number of

79

TRU hidden units, my latent feature model is then:

hj[lc](b) = σ hj[lp](b) = σ
lj(b) = σ

Nv
Wi[,cj]vi(b)

i=1

Nv
Wi[,fj]vi(b−1)

i=1

Nlh

Nl

Wi[,ljo]hi[lc](b)hi[lp](b) +

Wk[l,lj]lk(b−1)

i=1 k=1

(3.7) (3.8) (3.9)

Finally, Fig. 3.5 shows the full architecture of my deep predictive model, as described above.

3.5 Learning and Inference
In this section, I deﬁne the learning and inference procedures for the model deﬁned above. The online inference approach is a direct result of my model. However, there are many possible approaches to learning its parameters. Neural networks require a huge number of parameters (weights) to be learned, making them particularly susceptible to overﬁtting, and recurrent networks often suﬀer from instability in future predictions, causing large gradients which make optimization diﬃcult (the “exploding gradient” problem [8]).
To avoid these issues, I deﬁne a new three-stage learning approach which pretrains the network weights before using them for recurrent modeling. Deep learning methods are non-convex, and converge to only a local optimum, making my approach important in ensuring that a good optimum which does not overﬁt the training data is reached.
Inference: During inference for MPC, we are currently at some time-block
80

b with latent state l(b), known system state x(b) and control inputs u(b). Future control inputs Ut+1:t+T are also given, and the goal is then to predict the future system states Xˆt+1:t+T up to time-horizon T , along with the gradients ∂X/∂U for all pairs of x and u. These gradients will then be used by MPC to iteratively optimize the control inputs u.
I perform this inference by applying my model recurrently to predict future states up to time-horizon T , using predicted states xˆ and latent features l as inputs to my predictive model for subsequent timesteps, e.g. when predicting x(b+2), I use the known x(b) along with the predicted xˆ(b+1) and l(b+1) as inputs.
My model’s outputs (xˆ) are diﬀerentiable with respect to all its inputs, allowing me to take gradients ∂X/∂U using an approach similar to the backpropagationthrough-time algorithm used to optimize model parameters during learning, as shown in Algorithm 1. I can in turn use these gradients with any gradient-based optimization algorithm to optimize Ut+1:t+T with respect to some diﬀerentiable cost function C(X, U ). No online optimization is necessary to perform inference for my model. For details on my online optimization approach, see Section 3.7.
Learning: During learning, my objective is to use my training data to learn a set of model parameters Θ = (W [f], W [c], W [l], W [o], W [lp], W [lc], W [ll], W [lo]) which minimize prediction error while avoiding overﬁtting.
A naive approach to learning might randomly initialize Θ, then optimize the entire recurrent model for prediction error. However, random weights would likely cause the model to make inaccurate predictions, which will in turn be fed forwards to future timesteps. This could cause huge errors at time-horizon T , which will in turn cause large gradients to be back-propagated, resulting in instability in the
81

learning and overﬁtting to the training data. To remedy this, I propose a multistage pre-training approach which ﬁrst optimizes some subsets of the weights, leading to much more accurate predictions and less instability when optimizing the ﬁnal recurrent network. I show in Fig. 3.7 that my learning algorithm signiﬁcantly outperforms random initialization.
Phase 1: Unsupervised Pre-Training: In order to obtain a good initial set of features for l, I apply an unsupervised learning algorithm similar to the sparse auto-encoder algorithm [46] to train the non-recurrent parameters of the TRUs. This algorithm ﬁrst projects from the TRU inputs up to l, then uses the projected l to reconstruct these inputs. The TRU weights are optimized for a combination of reconstruction error and sparsity in the outputs of l. Taking vm,k as the visible features for the kth time-block of training case m, M as the number of training cases, and Tm as the number of timesteps for case m, and g(l) as some function penalizing latent feature activation to induce sparsity, my unsupervised pre-training phase proceeds as:

M Tm/B

Θ∗ = arg min

||vˆ(m,b−1) − v(m,b−1)||22

Θ m=1 b=2

K
+ ||vˆ(m,b) − v(m,b)||22 + λ g(lj(b))
j=1

Nl

h[jll](b) =

Wj[,lal]la(b)

a=1

Nlh

vˆi(b−1) =

Wi[,ljp]h[jll](b)hj[lc](b)

j=1

Nlh

vˆi(t) =

Wi[,ljp]h[jll](b)hj[lp](b)

j=1

(3.10) (3.11) (3.12) (3.13)

82

Phase 2: Short-term Prediction Training: While I could now use these parameters as a starting point to optimize a fully recurrent multi-step prediction system, I found that in practice, this lead to instability in the predicted values, since inaccuracies in initial predictions might “blow up” and cause huge deviations in future timesteps.

Instead, I include a second pre-training phase, where I train the model to predict a single timestep into the future. This allows the model to adjust from the task of reconstruction to that of physical prediction, without risking the aforementioned instability. For this stage, I remove the recurrent weights from the TRUs, eﬀectively setting all W [ll] to zero and ignoring them for this phase of optimization.

Taking x(m,k) as the state for the kth time-block of training case m, M as the

number of training cases, and Bm as the number of timeblocks for case m, this

stage optimizes:

M Bm−1

Θ∗ = arg min

||xˆ(m,b+1) − x(m,b+1)||22

Θ m=1 b=2

(3.14)

Phase 3: Warm-Latent Recurrent Training: Once Θ has been pre-trained by these two phases, I use them to initialize a recurrent prediction system which performs inference as described above. I then optimize this system to minimize the sum-squared prediction error up to T timesteps in the future, using a variant of the backpropagation-through-time algorithm commonly used for recurrent neural networks [119].

When run online, my model will typically have some amount of past information, as I allow a short period where I optimize forces while a stiﬀness controller makes an initial inwards motion. Thus, simply initializing the latent state “cold” from some intial state and immediately penalizing prediction error does not match

83

well with the actual use of the network, and might in fact introduce overﬁtting by forcing the model to rely more heavily on short-term information. Instead, I train my model for a “warm” start. For some number of initial time-blocks Bw, I propagate latent state l, but do not predict or penalize system state xˆ, only doing so after this warm-up phase. I still back-propagate errors from future timesteps through the warm-up latent states as normal.

3.6 System Details

Learning System: I used the L-BFGS algorithm, shown to give strong results for deep learning methods [77], to optimize my model during learning. While larger network sizes gave slightly (∼10%) less error, I found that setting Nlh = 50, Nl = 50, and Noh = 100 was a good tradeoﬀ between accuracy and computational performance. I found that block size B = 10, giving blocks of 0.1s, gave the best performance. When implemented on the GPU in MATLAB, all phases of my learning algorithm took roughly 30 minutes to optimize.

MPC Cost Function: In order to perform MPC, I need to deﬁne a cost function C(X, U ) for my task. For food cutting, I design a cost function with two main components, with β deﬁning the weighting between them:

C(X, U ) = Ccut(X) + βCsaw(X)

(3.15)

The ﬁrst, Ccut, drives the controller to move the knife downwards. It penalizes the height of the knife at each timestep, with an additional penalty at the ﬁnal

timestep allowing a tradeoﬀ between immediate and eventual downwards motion:

t+T

Ccut(X) =

Pz(k) + γPz(t+T )

k=t

(3.16)

84

The second term, Csaw, keeps the tip of the knife inside some reasonable “sawing

range” along the X axis, ensuring that it actually cuts through the food. Since

any valid position is acceptable, this term will be zero inside some margins from

this range, then become quadratic once it passes those margins. Taking Px∗ as the center point of the sawing range, ds as the range, and λ as the margin, I deﬁne

this term as:

t+T

Csaw(X) =

max 0, |Px(k) − Px∗| − ds + λ 2

k=t

(3.17)

I also include terms performing ﬁrst- and second-order smoothing and L2 regularization on the control forces.

Data Pre-Processing: In order to allow my learning algorithm to learn a better model for my data, I perform light pre-processing. I represent position features for each block relative to the last position in the previous block – e.g. if the previous block ended with an X-position of 0.4 m and the current block started with an X-position of 0.5 m, the ﬁrst X-position feature for the current block would be 0.1 m. This representation avoids overﬁtting to absolute positions, while still representing relative motions and allowing me to easily reconstruct an absolute-position trajectory. Since I want to capture absolute, not relative, input forces, I do not oﬀset them in this way.
The only whitening I perform on these features is to scale them so all features for a particular channel (e.g. X-positions, Z-forces, etc.) have unit standard deviation. I scale per-channel – applying the same scaling to all B features for a particular channel – rather than per-feature in order to preserve relative values within a channel. For similar reasons, I do not shift values e.g. to set the mean to zero as is common in other whitening approaches.

85

One advantage to my whitening approach is that it allows me to transfer this scaling to the input-layer weights during inference. For example, if I applied a scaling factor of 0.1 to X-position inputs during learning, I can simply scale the weights to the X-position used during inference by 0.1 and use un-whitened Xposition values (still oﬀset as above.) This saves computation time by avoiding performing scaling on new input features.
3.7 Real-time Robotic DeepMPC System
Robotic Platform: For both data collection and online evaluation of my algorithms, I used a PR2 robot. The PR2 has two 7-DoF manipulators with parallelplate grippers, and a reach of roughly 1m. For safety reasons, I limit the forces applied by PR2’s arms to 30N along each axis, which was suﬃcient to cut every material tested. PR2 natively runs the Robot Operating System (ROS) [118]. Its arm controllers recieve robot state information in the form of joint angles and must publish desired motor torques at a hard real-time rate of 1KHz.
Online Model-Predictive Control System: The main challenge in designing a real-time model-predictive controller for this architecture lies in allowing prediction and optimization to run continuously to ensure optimality of controls, while providing the model with the most recent state information and performing control at the required real-time rate. As shown in Fig. 3.6, I solve this by separating my online system into two processes (ROS nodes), one performing continuous optimization, and the other real-time control. These processes use a shared memory space for high-rate inter-process communication. This approach is modular and ﬂexible - the optimization process is generic to the robot involved (given an appro-
86

Figure 3.6: Online system: Block diagram of my DeepMPC system. Parameters learned using my three-stage deep learning algorithm are loaded by the optimization process, which then continually predicts future states and updates future controls based on these predictions. The control process takes state information from the robot, transmits it to the optimization process, and transmits controls optimized by that process to the robot.
priate model), while the control process is robot-speciﬁc, but generic to the task at hand. In fact, models for the optimization process do not even need to be learned locally, but could be shared using an online platform [130].
The control process is designed to perform minimal computation so that it can be called at a rate of 1KHz. It recieves robot state information in the form of joint angles, and control information from the optimization process as end-eﬀector forces. It performs forward kinematics to determine end-eﬀector pose, transmits it to the optimization process, and uses it to determine restoring forces for axes not
87

controlled by MPC. It translates the combination of these forces and those recieved from MPC to a set of joint torques sent to the arm. All operations performed by the control process are at most quadratic in terms of the number of degrees of freedom of the arm, allowing each call to run in roughly 0.1 ms on PR2.
The optimization process runs as a continuous loop. When started, it loads model parameters (network weights) from disk. Cost function parameters are loaded from a ROS parameter server, allowing them to be changed online. The optimization loop ﬁrst uses past robot states (received from the control process) and control inputs along with past latent state and the future forces being optimized to predict future state using my model. It then uses this state to compute the gradients of the MPC cost function and back-propagates these through my model, yielding gradients with respect to the future forces. It optimizes these forces using a variant of the AdaGrad algorithm [34], a form of gradient descent in which gradient contributions are scaled by the L2 norm of past gradients, chosen because it is eﬃcient in terms of function evaluations while avoiding scaling issues. This process is implemented using the Eigen matrix library [35], allowing the optimization loop to run at a rate of over 1.2kHz.
3.8 Prediction Experiments
In order to evaluate my model learning approach as compared to other state-ofthe-art methods, I performed experiments evaluating prediction accuracy on my extensive dataset of 1488 material cuts. In the next section, I will also evaluate my algorithms on a real-world PR2 robot.
Dataset: My material interaction dataset contains 1488 examples of robotic food-
88

cutting for 20 diﬀerent materials (Fig. 3.2). I collected data from three diﬀerent settings. First, a ﬁxed-parameter setting in which trajectories as shown in the leftmost two columns of Fig. 3.3 were used with a stiﬀness controller. Second, for 8 of the 20 materials in question, data was collected while a human tuned a stiﬀness controller to improve cutting rate. This data was not collected for all materials to avoid giving the learning algorithm and controller near-optimal cases for all materials. Third, a randomized setting where most parameters of the controller, including cutting and sawing rate and stiﬀnesses, but excluding sawing range (still ﬁxed at 4cm) were randomized for each stroke of the knife. This helped to obtain data spanning the entire range of possible interactions.
Setting: In order to test my model, I examine its predictive accuracy compared to several other approaches. Each model was given 0.7s worth of past trajectory information (forces and known poses) and 0.5s of future forces and then asked to predict the future end-eﬀector trajectory. For this experiment, I used 70% of my data for training, 10% for validation, and 20% for testing, sampling to keep each set class-balanced.
Baselines: For prediction, I include a linear state-space model, an ARMAX model which also weights a range of past states, and a K-Nearest Neighbors (KNN) model (5-NN gave the best results) as baseline algorithms. I also include a GMM-linear model [81], and a Gaussian process (GP) model [30], trained using the GPML package [115]. Additionally, I compare to standard recurrent and nonrecurrent two-layer deep networks, and versions of my approach without recurrence and without pre-training (randomly initializing weights, then training for recurrent prediction).
Results: Fig. 3.7 shows performance for each model as mean L2 distance from
89

Mean L2 error (mm)

12 Linear SS GMM−Linear
10 ARMAX Simple Deep 5−NN
8 Ours, Non−Recur. Gaussian Process Recur. Deep Ours, Rand. Init
6 Ours
4
2
0 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5
Time (s)
Figure 3.7: Prediction error: Mean L2 distance (in mm) from predicted to groundtruth trajectory from 0.01s to 0.5s in the future. While most models give similar performance up to 0.1s, models with linear components give very weak long-term results. Non-parametric methods give better results, but are hampered by expensive inference which scales poorly. Recurrent deep networks give the best results, with my approach outperforming all others after 0.1s, reducing error at 0.5s by 46% as compared to the baseline recurrent network
predicted to ground truth trajectory vs. prediction time in the future. Temporallylocal (piecewise-) linear methods (linear SS, GMM-linear, and ARMAX) gave weak performance for this problem, each yielding an average error of over 8mm at 0.5s. This shows, as expected, that linear models are a poor ﬁt for my highly non-linear problem.
Instance-based learning methods – K-NN and Gaussian processes – gave better performance, at an average of 4.25mm and 3.56mm, respectively. Both outperformed the baseline two-layer non-recurrent deep network, which gave 4.90mm error. The GP gave the best performance of any temporally-local model, although this came at the cost of extreme inference time, taking an average of 3361s (56 minutes) to predict 0.5s into the future, 1.18x106 times slower than my algorithm, whose MATLAB implementation took only 3.1ms.
90

Table 3.1: Conﬁdence at 0.5s: Mean L2 error and 95% conﬁdence interval at prediction time of 0.5s (all in mm)

Algorithm
Linear SS GMM-Linear ARMAX Simple Deep 5-NN Mine, Non-Recur. Gaussian Process Recur. Deep Mine, Rand. Init Mine

Mean
11.46 8.96 8.66 4.90 4.25 3.80 3.56 3.27 2.78 1.78

95% Conf. Min. Max. 0.89 38.58 0.58 31.80 0.79 31.36 0.52 18.31 0.22 19.24 0.35 15.03 0.27 14.85 0.47 12.14 0.33 10.41 0.16 7.55

The relatively unimpressive performance of a standard two-layer deep network for this problem underscores the need for task-appropriate architectures and learning algorithms. Including conditional structures, as in the non-recurrent version of my model and temporal recurrence reduced this error to 3.80mm and 3.27mm, respectively. Even when randomly initialized, my model outperformed the baseline recurrent network, giving only 2.78mm error, showing the importance of using an appropriate architecture. Using my learning algorithm further reduced error to 1.78mm, 36% less than the randomly-initialized version and 46% less than the baseline recurrent model, demonstrating the need for careful pre-training.
For online model-predictive control, particularly when dealing with real-world variety, worst-case performance is very important, since I need to ensure that my model will perform well under any conditions. As shown in Table 3.1, my approach also gave a tighter and lower 95% conﬁdence interval of prediction error, from 0.16-7.55mm at 0.5s, a width of 7.39mm, compared to the baseline recurrent net’s interval of 0.47-12.14mm, a width of 11.67mm, and the GP’s interval of 0.27-

91

14.85mm, a width of 14.58mm. In fact, my method was the only approach whose entire 95% conﬁdence interval lay under 1cm of error.
3.9 Robotic Experiments
To evaluate my algorithm’s performance for real-world robotic control, I performed an extensive series of over 450 experiments on my PR2 robot using the system described in Sec. 3.7.
Setting: In these experiments, the robot’s goal was to make a linear cut between two given points. I selected a subset of 10 of the 20 materials in my dataset for these experiments, aiming to span the range of variation in material properties. I evaluated these experiments in terms of cutting rate, i.e. the vertical distance traveled by the knife divided by time taken.
Baselines: For control, I validate my algorithm against several other control methods with varying levels of class information. First, I compare to a classgeneric stiﬀness controller, using the same controller for all classes. This controller was tuned to give a good (>90%) rate of success for all classes, meaning that it is much slower than necessary for easy-to-cut classes such as tofu. I also validate against controllers tuned separately for each of the test classes, using the same criteria as above, showing the maximum cutting rate that can be expected from ﬁxed-trajectory stiﬀness control.
As a middleground, I compare to an algorithm similar to that of Gemici and Saxena [42], where a set of class-speciﬁc material properties are mapped to learned haptic categories. I found that learning ﬁve such categories assigned all data for
92

Cutting rate (cm/s)

25 Stiff., class−general
20 Stiff., clustered (Gemici) Stiff., class−specific Ours
15
10
5
0
Figure 3.8: Robotic experiment results: Mean cutting rates, with bars showing normalized standard deviation, for ten diverse materials. Red bar uses the same controller for all materials, blue bar uses the same for each cluster given by [42], purple uses a tuned stiﬀness controller for each, and green is my online MPC method. My algorithm consistently gives higher mean rates, making statistically signiﬁcant improvements for all materials except butter and tofu. Particularly signiﬁcant improvements are seen for tough, varying materials such as carrots and potatoes.
each class to exactly one cluster, and thus used the same controller for all classes assigned to each cluster. In cases where this controller was the same as used for the class-tuned case, I used the same results for both.
Results: Figure 3.8 shows the results of my series of over 450 robotic experiments. For all materials except butter and tofu, a paired t-test showed that my DeepMPC controller made a statistically signiﬁcant improvement in cutting rate with 95% conﬁdence. This makes sense as butter and tofu are relatively soft and easy-tocut materials. However, for the four materials for which stiﬀness control gave the weakest results – lemons, potatoes, carrots, and apples – my algorithm more than tripled the mean cutting rate, from 1.5 cm/s to 5.1 cm/s.
One major advantage my approach has over the others tested is that it treats material properties and classes as latent and continuous-valued, rather than supervised and discrete. For intra-class variations which aﬀect dynamics, such as
93

Figure 3.9: Cutting food: Time-series of my PR2 robot using my DeepMPC controller to cut several of the food items in my dataset. My algorithm is able to adapt its strategy for diﬀerent materials. Note in particular that it picks up the knife slightly, then chops back down when cutting the carrot, and uses more “sawing” motion on tougher materials. Video of these experiments is available at http://deepmpc.cs.cornell.edu
diﬀerent varieties of apples or cheeses, diﬀerent radii of carrots or potatoes, or varying material temperature, even the class-speciﬁc stiﬀness controllers were typically limited by the hardest-to-cut variation. However, my approach’s latent material properties allowed it to adapt to these, signiﬁcantly increasing cutting rates. This was particularly evident for carrots, whose thickness causes huge variations in dynamics. While all approaches were tested on both thick and thin sections of carrot, only mine was able to properly adapt, slicing easily through thin sections and more carefully through thicker ones, increasing mean cutting rate from 0.4 cm/s to 4.7 cm/s. Similar results were observed for potatoes, increasing mean rate from 3.1 cm/s to 6.8 cm/s.
Another advantage of my approach is its ability to respond to time-dependent changes in dynamics, thanks to the time-varying nature of my latent features and the online adaptation performed by MPC. Such changes occur to some degree as the knife enters and becomes enclosed by most materials, particularly in irregular shapes such as potatoes where the degree of enclosure varies throughout the cut.
94

They are even more apparent in non-uniform materials, such as lemons, with variation between the skin and ﬂesh, and apples, which are much denser and harder to cut closer to the core. Again, stiﬀness control was limited by the toughest of these dynamics, while my approach was able to adapt, typically performing more sawing for more diﬃcult regions, and quickly moving downwards for easier ones. This allowed my controller to increase the mean cutting rate for lemons from 1.3 cm/s to 4.5 cm/s, and for apples from 1.4 cm/s to 4.6 cm/s.
Optimizing its trajectory online also enabled my DeepMPC controller to exhibit a much more diverse range of behaviors. Most tuned stiﬀness controllers were forced to make use of high-amplitude sawing to ensure continuous motion. However, my controller was able to use more aggressive cutting strategies, typically executing smooth slicing motions until it found its progress impeded. It then used a variety of techniques to break static friction and continue motion, including highamplitude sawing, low-amplitude “wiggle” motions, and reducing and re-applying vertical pressure, even to the point of picking up the knife slightly in some cases. The latter behavior, in particular, underscores the strength of predictive control, as it trades oﬀ short-term losses for long-term gains. Stiﬀness controllers sometimes became stuck in tough materials such as thick potatoes and carrots and the cores of apples, and remained so because downwards force grew as vertical error increased. My controller, however, was able to detect and break such cases using these techniques.
Some examples of the diverse behaviors of my DeepMPC controller can be seen in Fig. 3.9 and in the video at http://deepmpc.cs.cornell.edu.
95

3.9.1 Handling Variety
In addition to my main experiments, presented above, I performed a series of additional experiments to test the adaptive abilities of my controller. For both these experiments, I chose to use potatoes, as they represent a challenging, dense material which responds in interesting ways to changes in temperature and tools (as detailed below.)
Varying Temperature: While the temperature of a material is not detectable by most robotic platforms (with some notable exceptions [11]), it can have a huge eﬀect on the cutting dynamics of that material. I performed a series of experiments to validate the robustness of my algorithm to this variation. For these experiments, I prepared several “cold” potatoes by placing them in a freezer for 30 minutes, and several “warm” potatoes by microwaving them for ﬁve minutes. Both of these operations signiﬁcantly altered the texture of the potato.
I did not tune or alter any controllers to reﬂect these new cases. As a baseline, I compare to the class-speciﬁc stiﬀness controller tuned for potatoes described above. For my approach, I used the same learned controller as used in all other robotic experiments, which has no experience with either warm or cold potatoes, only room-temperature. These are the same controllers presented as the purple and green bars, respectively, in Fig. 3.8, which gave cutting rates of 3.1 cm/s and 6.8 cm/s for room temperature potatoes.
The warm potatoes were much softer and easier to cut than at room temperature. This allowed the stiﬀness controller to track its input trajectory almost exactly, increasing cutting rate slightly to 3.5 cm/s. My approach, however, was able to properly adapt to these new conditions, almost doubling its cutting rate
96

Figure 3.10: Diﬀerent tools: Diﬀerent knives used to test my algorithm. From left to right, the paring knife used to collect data and train the algorithm, a shorter, sharper paring knife, a long kitchen knife, a wedge-shaped chef’s knife, and a serrated steak knife. In all cases except the serrated knife, my algorithm, trained only with the paring knife on the left, was able to maintain comparable cutting rates.
to 12.0 cm/s.
The cold potatoes were signiﬁcantly more diﬃcult to cut. In over half of my trials with cold potatoes, the baseline controller became stuck, with downwards progress halting after some point. For purposes of rate computation, I considered the controller to still be cutting until it reached the end of its input trajectory, leading to an average cutting rate of only 1.8 cm/s. My controller, once again, was able to properly adapt to this new case. While its forward progress was sometimes temporarily halted, it was able to perform motions to break out and continue cutting, allowing it to achieve a rate of 3.4 cm/s, almost double that of the baseline stiﬀness controller.
Varying Tools: I also tested my controller, learned using the paring knife shown in Fig. 3.4, with a series of other knives, as shown in Fig. 3.10. These included a
97

much sharper and slightly shorter paring knife, a longer kitchen knife, a wedgeshaped chef’s knife, and a serrated steak knife. My controller was able to adapt to the ﬁrst three knives, giving cutting rates similar to the results in Fig. 3.8. The sharp paring knife increased the rate slightly, to 7.8 cm/s, while the long and wedge knives both lead to slightly decreased rates of 5.5 and 5.6 cm/s, respectively. This makes sense, as the sharp paring knife was very similar to the training knife, only sharper, wheras the long and wedge-shaped knives were signiﬁcantly diﬀerent, leading to diﬀerent cutting dynamics, particularly in the case of the wedge-shaped knife. It is notable that my algorithm experienced only a slight (roughly 17%) decrease in performance for such a signiﬁcant change in tools.
My algorithm, trained using a straight-edged paring knife, was not able to use the serrated steak knife to cut a potato. This makes sense for two reasons:
First, a serrated knife is not, in general, a good tool for cutting dense vegetables like potatoes. Typically, such knives are only used for meat, where the main mode of cutting is splitting muscle ﬁbers, or soft, skinned vegetables such as tomatoes, where the serration helps to cut through the skin without smashing the vegetable, which oﬀers minimal resistance. For dense vegetables, a sharp, straight edge is necessary to split the vegetable, and even a skilled human user would have a hard time using a serrated knife.
Second, the dynamics of cutting using a serrated knife are very diﬀerent – while a straight-edged knife acts primarily by splitting the vegetable apart, a serrated knife acts by “eroding” the food item with its serrations. Food will respond very diﬀerently to the same motion executed by a serrated knife, so such a knife requires a signiﬁcantly diﬀerent dynamics model and cutting strategy. In particular, straight downwards motions are ineﬀective, requiring instead more “sawing”
98

and less downwards force to allow the serrations to act. Thus, experience with a straight-edge knife should not be expected to transfer to a serrated one. However, given new training data, my algorithm could also learn to use such a knife.
In both these experiments, my algorithm was able to successfully handle variations that it was not explicitly trained for. While varying temperature and tools both signiﬁcantly alter cutting dynamics, in all cases but the serrated knife, my algorithm was able to adapt to these and maintain comparable cutting rates, even improving them in some cases.
3.10 Conclusion
In this work, I presented DeepMPC, a novel approach to model-predictive control for complex non-linear dynamics which might vary both with environment properties and with time while acting. Instead of hand-designing predictive dynamics models, which is extremely diﬃcult and time-consuming for such tasks, my approach uses a new deep architecture and learning algorithm to accurately model even such complicated dynamics. In experiments on my large-scale dataset of 1488 material cuts over 20 diverse materials, I showed that my approach improves accuracy by 46% as compared to a standard recurrent deep network. In a series of over 450 real-world robotic experiments for the challenging problem of robotic foodcutting, I showed that my algorithm produced signiﬁcant improvements in cutting rate for all but the easiest-to-cut materials, and over tripled average cutting rates for the most diﬃcult ones.
Here, I presented a system which displays adaptability while learning a good latent representation for complex tasks. While food-cutting and many other ma-
99

nipulation tasks lend themselves well to intuitive cost functions, for others I could envision also learning the cost function used for MPC from data. In future work, this system could also be extended to handle multimodal information, e.g. incorporating haptic, visual, audio, or other data to enhance manipulation. While it would be extremely diﬃcult to design features which properly integrate all this information, deep learning allows me to learn these features directly from data, and my system would allow me to integrate them into real-time model-predictive control.
100

Algorithm 1 Recurrent Prediction and Cost Gradients for MPC
Input: Previous and current dynamic responses v(b−1), v(b) Current latent state l(b) Future control inputs Ut+1:t+T
Output: Predicted future state Xˆb+1:b+T Cost function value C(Xˆb+1:b+T , Ub+1:b+T ) Cost gradients ∂C(Xˆb+1:b+T , Ub+1:b+T )/∂Ub+1:b+T
Deﬁne C(xˆ(i), U ) as the direct contribution of xˆ(i) to C(Xˆt+1:t+T , Ut+1:t+T ) (ignoring recurrent eﬀects.) Deﬁne C(U ) as the direct contribution of U to the cost function (e.g. via smoothing/regularization terms)
Forward prediction: for k = 1:T
Compute l(b+k) from eq. (3.9) Compute xˆ(b+k) from eq. (3.6) end
Compute C(Xˆt+1:t+T , Ut+1:t+T )
Back-propagating gradients: dXBack1 = 0; dXBack2 = 0; dUBack1 = 0; dUBack2 = 0; dLBack1 = 0;
for k = B:1 Compute current cost function gradient, back-prop to u dXCur = ∂C(xˆ(b+k), U )/∂xˆ(b+k) dXCur = dXCur + dXBack1; ∂C(Xˆ , U )/∂u(b+k) = dXCur * ∂xˆ(b+k)/∂u(b+k) + ∂C(U )/∂u(b+k) + dUBack1
Back-propagate gradients to previous timesteps dXBack1 = dXBack2 + dXCur * ∂xˆ(b+k)/∂xˆ(b+k−1)
+ dLBack1 * ∂l(b+k)/∂xˆ(b+k−1) dUBack1 = dUBack2 + dXCur * ∂xˆ(b+k)/∂u(b+k−1)
+ dLBack1 * ∂l(b+k)/∂uˆ(b+k−1) dXBack2 = dXCur * ∂xˆ(b+k)/∂xˆ(b+k−2)
+ dLBack1 * ∂l(b+k)/∂xˆ(b+k−2) dUBack2 = dXCur * ∂xˆ(b+k)/∂u(b+k−2)
+ dLBack1 * ∂l(b+k)/∂uˆ(b+k−2) dLBack1 = dLBack1 * ∂l(b+k)/∂l(b+k−1)
+ dXCur * ∂xˆ(b+k)/∂l(b+k−1) end
101

CHAPTER 4 LOW-POWER PARALLEL ALGORITHMS FOR SINGLE IMAGE
BASED OBSTACLE AVOIDANCE IN AERIAL ROBOTS
4.1 Introduction
Perceiving obstacles is extremely important for an aerial robot in order to avoid collisions. Methods based on stereo vision [101, 52] are fundamentally limited by the ﬁnite baseline between the stereo pairs [97], and fail in textureless regions and in presence of specular reﬂections [10]. Active range-ﬁnding devices (e.g., [150, 93]) are either designed for indoor low-light environments (e.g., the Kinect [3]), or are too heavy for aerial applications. More importantly, they demand more onboard power, which is at a premium for aerial vehicles.
In this work, I use a single monocular camera for obstacle perception. Recent works [129, 94, 123, 127, 125, 5] have shown that it is possible to obtain depth from a single monocular image. Multiple frames can also be used in combination to determine depth, but this approach does not work well on an aerial robot due to camera disturbances from robot motion and vibrations. Here, I present an algorithm that takes a single image as input and classiﬁes each region in the image as obstacle or not. I will deﬁne an obstacle as an object which the robot could not safely pass through. This approach is very attractive for aerial robots because cameras are small and draw little power.
My second key contribution is to formulate a Markov Random Field classiﬁca-
This work was originally presented as a conference paper at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2012. It is joint work with Mevlana Gemici and Ashutosh Saxena.
102

tion model and design fast inference algorithms for estimating obstacles in a new low-power parallel hardware that implements a leaky integrate and ﬁre neuron architecture. One implementation of such hardware [92] consumes only 45 picoJoules per spike (see Section 4.3 for an overview). The combination of using a camera (miniature cameras that consume extremely low power are available) and this hardware could allow miniature aerial robots to successfully ﬂy amidst obstacles even in unknown environments. Inference in my MRF is solved using Belief Propagation (BP). While performing belief propagation in a traditional computer is expensive, my design allows the low-power hardware to do so natively. Each logical clock cycle performs a full parallel update of BP. I obtain good obstacle estimates once the BP network has converged (see Figure 4.1). I then use the estimated obstacle map to select a evasive manuever for avoiding the obstacle.
In detail, I represent the energy function of the MRF over the neuron architecture, and use spikes to propagate beliefs during inference. My MRF uses the logistic function to model the local dependence of visual features to the obstacle label, where each spatial region of the image is represented using a diﬀerent set of neurons—this allows parallel computation. Furthermore, I use diﬀerent parameters for diﬀerent spatial regions of the image [83] for improved performance. Finally, I induce sparsity in the parameters of the model to satisfy constraints on the number of parameters allowed by my hardware.
I performed extensive experiments in a variety of environments, containing obstacles such as trees, fences, and poles. In learning experiments, I obtained an average precision and recall of 81.9% and 93.6% respectively. In 53 outdoor robotic experiments, my algorithm was able to successfully perceive obstacles in every case, and avoid them in 51 cases. The two failures were due to communication delays
103

(a) Aerial Robot.

(b) Single image from camera.

(c) Initial inferred obstacle map.

(d) Inferred obstacle map.

Figure 4.1: Avoiding obstacles: I use low-power parallel hardware to compute an obstacle map, given a single image from the camera onboard the aerial robot. I then use these results to select an evasive maneuver.
and robot drift. Some of these experiments involved avoiding a series of obstacles
of multiple types.

4.2 Related Work
In aerial robotics, most works which perform obstacle avoidance either make strong assumptions on precise knowledge of 3D location of obstacles [90], or use sensors that are not onboard, such as GPS (together with known obstacle map). Other work such as [48] focuses on mapping obstacles from overhead images. For a small aerial robot to ﬂy autonomously in a real environment full of obstacles, these techniques do not directly apply.
104

Navigation by labeling obstacles in images has been used for several ground robots. For example, Ghosh and Mulligan [43] use a ground segmentation approach for navigation, while Nabbe and Hebert [102] use ground-vertical segmentation for extending the path planning horizon for ground robots. Work such as [67, 135] employs learning algorithms to determine terrain traversability for ground vehicles. Michels, Saxena and Ng [94] and Plagemann et. al. [111] attempt to determine range directly from monocular images. However, these works use only a local feature based classiﬁer for navigating a ground vehicle. An aerial robot is typically severely constrained by onboard power, and I present methods that allow even a complex inference method such as BP to be eﬃciently computed in low-powered hardware.
There are other works that consider single monocular image based obstacle avoidance for aerial robots. McGee et. al. [89] use sky segmentation for detecting obstacles, but apply only a local classiﬁer. Soundararaj, Sujeeth and Saxena [137] and Courbon et. al. [29] use vision techniques to navigate aerial robots, but are limited to known indoor environments. Bills, Chen and Saxena [13] and Zingg et. al [152] perform similar work to unknown environments, but still handle only a few known types of indoor environment. On the other hand, I consider general outdoor environments, employing learning algorithms which allow my approach to be easily adapted to new obstacle types and integrate non-local information to enhance classiﬁcation.
Vision algorithms have implemented in neural architectures or embedded systems, such as [69, 4, 64]. Other works [99, 57] used spiking neurons for basic obstacle detection and navigation. However, these approaches do not generalize to real-world outdoor cases.
105

4.3 Neuromorphic Hardware

My goal is to develop algorithms for obstacle mapping that can be implemented in low-power parallel hardware. In particular, I use a neuromorphic hardware platform that comprises a network of linear-leak integrate and ﬁre (LLIF) artiﬁcial neurons as in [92]. The LLIF neurons represent an extremely versatile high-level primitive which couples memory and processing. More importantly, this architecture uses extremely low power, as discussed below.

Each neuron integrates the weighted synaptic inputs from other neurons and ﬁres if the integrated value exceeds a preset threshold. More formally, each neuron has some integer-valued internal potential Z and binary-valued spiking output S. For ith neuron, S and Z update as:

Zti+ = Zti +

wijYtj − λi

j∈N (i)

Zti+1 = 1{Zti+ < αi}Zti+

Sti+1 = 1{Zti+ ≥ αi}

(4.1)

where N (i) are the neighbors of neuron i, wi,j indicates the synaptic weight from neuron j to neuron i, α indicates neuron threshold, and λ is a constant decay. 1{...} is the indicator function, which takes the value one if its argument is true and zero otherwise.

Since these are spiking neurons, one major restriction is that the inputs and outputs be binary-valued. I address this by using representations where the spike count over a particular time window is proportional to the value being represented. Since each neuron can integrate over time, this is still a useful representation. The expected spike rate given the input x is simply a ramp function with limits, as

106

follows:


0    
g(x, α) = x/α
    1

if x ≤ 0 if 0 < x < αi if x ≥ αi

(4.2)

Cardinality constraints in hardware. To allow for a more compact design with lower power consumption, hardware such as that in [92] typically imposes a constraint on the cardinality of the weights w. That is to say, each neuron’s weights may take at most k unique nonzero values.

Power consumption in hardware. In a well-designed hardware platform such as [92], power consumption will be proportional to the number of spikes and the density of connections. In particular, the hardware in [92] takes only 45 pico-Joules (pJ) per spike, and has very low quiescent power draw in their absence.

4.4 Obstacle Classiﬁcation
The primary goal of my approach is to produce an obstacle map of suﬃcient quality for obstacle avoidance. This is a challenging problem, as outdoor environments are perceptually complex, with variations in obstacle appearance, lighting, background appearance, and other factors. My model will deﬁne obstacles as objects which project upwards from the ground and thus present a navigational challenge to the robot. I will use the labels it produces to select an evasive maneuver for the perceived obstacles.
My classiﬁcation model is a Markov Random Field (MRF) model (e.g., see [72]), where I use an Ising pairwise potential for modeling dependencies between
107

neighboring image regions. More formally, an MRF is an undirected graph G = (V, E), where I represent each labeled location in the image as a vertex V, and edges E connecting neighboring image locations. Let Y i ∈ {−1, +1} represent the binary labels indicating presence or absence of an obstacle at the ith location in the image, and Xi represent the input visual features at that location.

I model the joint conditional likelihood of the labels given the features as:

P (Y |X, θ) ∝ exp (−E(Y |X, θ))

(4.3)

where the E(Y |X, θ) is an energy function containing three terms:

E(Y |X, θ) = − Y iA(Xi, θ) −

wijY iY j + β||θ||1

i∈V (i,j)∈E

(4.4)

The ﬁrst term uses a logistic model, with θ as its parameters, to model the de-

pendence of the label on the local visual features. More formally, I model the

association potential as A(X, θ) = 2 ∗ σ(XT θ) − 1, where σ(x) = 1/(1 + exp(−x))

is the sigmoid function. The second term prefers neighboring labels to be similar,

with w indicating the relative importance of the two terms. Finally, β controls

the strength of an L1 regularization term on the local feature weights, which helps with the weight cardinality constraints in the hardware.

During learning, I will be given a set of labeled examples and my goal is to ﬁnd the optimum value of the parameters. During inference, the system is given a new image and my goal is to ﬁnd the optimal value of Y using the LLIF neuron architecture.

108

Figure 4.2: Classiﬁcation improvements from MRF: Left: input images. Middle: initial classiﬁcation. Right: classiﬁcation with full MRF model with belief propagation. Initial classiﬁcation results which would present problems for navigation, but are greatly improved by integrating non-local information using my MRF.
4.4.1 Learning

I manually set parameters wij, and learn parameters θ by maximizing the pseudoconditional log-likelihood. The system is given M labeled ground-truth pairs as

{(Xm, Ym) : m = 1, . . . , M }, and I learn θ∗ as:

M

θ∗ = arg min θ

E(Ym|Xm, θ)

m=1

M

= arg min θ

− Ymi A(Xmi , θ)

m=1

i∈V

+ β||θ||1

(4.5) (4.6)

Here, Ymi indicates Y i from the mth training example, and Xmi is similar for input

features. This sub-problem is convex, and is equivalent to solving logistic regression

with an additional L1 penalty term. I vary the β parameter until the number of

109

non-zero weights ﬁts within what is allowed in the hardware.

4.4.2 Inference

The goal of inference is to ﬁnd an optimal value for the label estimates Y , given features X and parameters θ:

Y ∗ = arg max P (Y |X, θ)
Y

(4.7)

My inference algorithm is based on loopy belief propagation [109, 100]. In order to derive the update rule for node i, I assume that the values of the other nodes are known and compute the messages as follows:

ψ(Xi) = σ(θT Xi)

µi(Xi) = ψ(Xi)

µj (X j )wij

(i,j)∈E

P (Y i = 1|X) = µi(Xi)

(4.8) (4.9)
(4.10)

My goal is to perform obstacle avoidance in new environments, and the product above may give zero probability of being an obstacle if any of terms is zero. This is not preferable, and I need to account for such cases. Following ideas of additive smoothing in statistics [145], I include an additive smoothing term in ψ and µ. This is a small constant factor ǫ added to each function. Denote the versions of these functions with additive smoothing as ψs and µs.
If I consider these update equations in log-space, they become sum of weighted terms, and thereby can be implemented in neuromorphic hardware (Eqn. 4.1). In such a case, the additive smoothing term becomes a constant lower bound on the
110

log probability terms. Thus, in log-space, I have:

log ψs(Xi) = max(log ǫ, log σ(θT Xi))

(4.11)

log µs,i(Xi) = max log ǫ, log ψs(Xi)

+ wij log µs,j(Xj)
(i,j)∈E

(4.12)

To implement this in hardware, I will use one neuron each to represent ψs(Xi) and µs,i(Xi) for each node. Each ψ unit will take input in spike-rate from local features, weighted as θ. Each µ unit will take input from the corresponding ψ unit

and neighboring µ units, weighted as w.

The log-sigmoid ψs term can be approximated as a linear function which saturates at 0. With the lower bound from the additive smoothing function, this becomes a scaled and shifted version of Eqn. 4.2. I will refer to the version of g(x, α) adjusted to ﬁt the log-sigmoid function as gψ(x). log µs,i(Xi) is also well approximated by a thresholded linear function, and can thus also be modeled by g(x, α), as gµ(x). In both cases, α is ﬁxed to some value which gives the best ﬁt. The equations for my system are then:

log ψs(Xi) = gψ(θT Xi) − log Zψ

(4.13)

log µs,i(Xi) = gµ(wiiψs(Xi) +

wijµs,j(Xj)) − log Zµ

(i,j)∈E

(4.14)

Where the Z’s are two separate constants, necessary to include here to preserve

exact equality, but unnecessary to implement in hardware since relative values

are preserved. Except for the error introduced by approximating the log-sigmoid

function with gψ(x) and discretization, this is exactly Eqn. 4.12. To infer optimal values for Y , I can simply threshold the gµ terms once the network has converged.

111

4.4.3 Visual Features
My basic features are a standard set of texture ﬁlter responses, similar to those in [87]. They consist of oriented edge ﬁlters (in my case, Gabor ﬁlters) and centersurround ﬁlters (diﬀerence-of-Gaussian ﬁlters) as shown in Figure 4.3. I also include color information in the form of patch-averaged RGB values in my feature set. In addition to raw ﬁlter responses, I include the absolute value of these responses and the maximum and average of this absolute value over a local area.
I found it diﬃcult to design features which ﬁt the hardware constraints of [92], so the features given here do not. They do, however, rely only on simple local computations, and thus are amenable to eﬃcient implementation in circuitry or parallel hardware. Only features given nonzero weights by some classiﬁer would need to be implemented in this hardware, signiﬁcantly reducing complexity.
Figure 4.3: Filter set: My ﬁlter set, which includes two scales of Gabor ﬁlters at six orientations and two scales of diﬀerence-of-Gaussian ﬁlters.
4.4.4 Spatially-varying models and multiple models
Often the statistics of the dependence of the obstacles on the visual features varies with their location in the image [83]. For example, while leaves on the trees are typically obstacles, they are not if they fall on the ground in the Fall season. Thus
112

my parameters θ should be diﬀerent for diﬀerent rows in the image. In order to do so, I divide the image vertically into three equally sized areas. This is equivalent to considering a separate θi for each i ∈ V, and tying weights within regions. By dividing the classiﬁcation as such, each region can have a diﬀerent feature set. Tying the parameters of these diﬀerent models helps to avoid over-ﬁtting. Since my neuron architecture is distributed for diﬀerent spatial regions in the image, each neuron gets the appropriate synaptic weights depending on which region in the image it is representing.
To detect multiple visually diﬀerent obstacle classes, I train a model for each class and use the combined output of these models. In order to exercise maximum caution, I combine outputs using the logical OR of the results, ie if any classiﬁer classiﬁes a region as obstacle, it is considered to be an obstacle for purposes of navigation.
Since my algorithms run natively in parallel hardware, neither of these changes require a change to the architecture or cause an increase in runtime.
4.4.5 Tuning for Power Consumption
One major strength of the hardware described in [92] is that its power consumption can be estimated a priori from simulation with high accuracy. This is because power consumption in this hardware is determined largely by two factors: a) connection density and b) spike count. While a) is ﬁxed for a particular local connection pattern, b) can change drastically depending on the relative weight between local classiﬁer estimates and neighboring labels, as well as neuron threshold and decay. In particular, faster convergence of the MRF model in terms of hard-
113

ware timesteps can signiﬁcantly reduce power consumption, as the model can be terminated earlier and thus generates fewer spikes. Therefore, in practice, I tune my models for a tradeoﬀ between power and accuracy, aiming to produce results which produce fewer spikes while still yielding results which would be useful for obstacle avoidance.
In practice, I obtained good performance in terms of both accuracy and performance for the following parameter settings for each neuron:
• wii: set slightly below α (ﬁring threshold). • wij: set to a small local connectivity pattern (4- or 8-way), with uniform
weights set to 10-20% of α. • λ (decay): set to 5-10% of α.
4.5 Obstacle Avoidance Manuevers
The goal of my obstacle avoidance algorithm is to avoid obstacles in near to midrange (e.g., about 2m to 10m). My motion selection algorithm uses a library of paired motions and visual-space masks. If less than some threshold’s worth of pixels within a masked area are labeled as obstacles, the corresponding motion is considered to be safe. The eﬀects of sweeping this threshold are shown in Figure 4.6. A ﬁxed preference order is used to select motions in cases where more than one is found to be safe.
For vertical obstacles such as trees and poles, the motions, in order of preference, were diagonal-left, diagonal-right, and forwards. For horizontal obstacles such as fences, the motions were forwards or upwards.
114

115

Figure 4.4: BP in action: Time series of belief propagation in action. Left: original image and baseline classiﬁcation result. Top: internal node potential. Middle: spike locations. Bottom: spike counts. Time proceeds from left to right. Best viewed in color.

Table 4.1: Obstacle classiﬁcation results: Classiﬁer precision and recall, in percent, for four obstacle classes, and average across classes. Results presented for baseline local classiﬁer and MRF model, with and without spatially varying model, and varying weight cardinalities.

116

Algorithm Chance Local association term only, no spatial-varying model
Local association term only, spatial-varying model
Full MRF, no spatialy varying model
Full MRF, spatialy varying model

Card.
110
8 5 3 110 8 5 3 110 8 5 3 110 8 5 3

Trees Prec. Rec. 50.7 52.0 80.0 81.5 78.6 77.2 77.4 75.4 77.2 75.3 75.9 85.5 74.1 84.0 72.8 81.3 70.5 77.9 79.8 94.8 76.3 91.9 76.0 92.0 71.5 88.1 81.9 94.0 78.9 92.7 79.3 92.0 74.4 90.1

Buildings Prec. Rec. 70.1 65.3 83.6 91.0 86.9 89.4 90.3 73.2 88.7 78.2 91.2 86.1 91.7 85.3 91.0 85.6 90.7 79.9 89.0 94.2 84.8 95.0 90.3 89.6 87.1 90.0 87.7 96.3 87.6 96.2 87.2 95.6 86.9 95.8

Fences Prec. Rec. 28.9 21.5 61.5 92.6 68.3 90.0 67.6 90.1 66.8 90.7 80.3 92.1 78.2 93.0 77.4 93.2 66.6 93.4 82.1 87.4 89.2 85.3 84.1 87.9 90.2 89.2 95.0 94.0 92.2 93.9 91.1 93.6 90.3 92.5

Poles Prec. Rec.
5.4 4.4 56.6 94.7 62.2 92.6 47.9 95.6 47.7 95.4 54.0 86.7 59.8 86.6 56.3 86.8 57.0 86.8 80.8 91.7 82.1 90.7 80.6 90.2 81.8 89.9 63.1 90.1 65.3 89.1 65.7 89.6 64.2 88.5

Average Prec. Rec. 29.3 26.4 71.8 88.2 74.8 86.0 70.8 86.2 69.7 86.4 75.4 87.6 75.9 87.2 74.4 85.6 71.2 84.5 82.2 92.3 83.5 91.1 81.7 91.5 82.2 90.7 81.9 93.6 81.0 93.0 80.8 92.8 79.0 91.7

117

Figure 4.5: Visual results: Figure showing the results of my algorithm for a variety of obstacles. Top: Input image. Bottom: Inferred obstacle labels.

4.6 Experiments and Results
4.6.1 Oﬄine Learning Experiments
Dataset. My dataset includes 120 images taken from various locations around the Cornell University campus using the onboard camera of my aerial robot. The environments include four types of obstacles: trees, buildings, light poles, and fences. The dataset includes 63 images of trees, 45 of buildings, 10 of poles and 10 of fences. I used 80% of the images for training and 20% for testing.
Belief Propagation. During inference, my BP system exhibits distinct phases of operation as seen in Figure 4.4. First, there is a warm-up phase where nodes with high values of classiﬁer output build up energy. Eventually, some of these nodes’ local potential exceeds their thresholds and they ﬁre, spreading energy to neighboring nodes which may also ﬁre in response. Spikes begin to propagate across the network, which ﬁnally reaches a steady state from which inferred classiﬁcation labels are determined.
Results. Table 4.1 shows the performance of my algorithm. I present comparisons with diﬀerent models. First, I consider a model with only the association term, i.e., equivalent to a local logistic classiﬁer. Second, I also compare the eﬀects of training several spatially varying models. I compare the eﬀects of reducing the cardinality of local classiﬁer weights as well. Similar baseline results were observed using SVM.
My results show that the spiking neuron based BP system has proven very eﬀective in producing cleaner, more usable results for obstacle avoidance, as com-
118

pared to baseline results using the local logistic term only. Since my feature set contains edge and center-surround ﬁlter responses, local classiﬁer estimates are generally stronger at obstacle edges. BP propagates these inwards, allowing the entire obstacle to be classiﬁed, causing increases in recall of as much as 17%. This propagation behavior can sometimes smooth edges of perceived obstacles slightly, as seen in many cases in Figure 4.5, causing a slight decrease in precision oﬀset by gains elsewhere.
Most clear areas contain no strong local values and thus do not generate an initial spike, while true positive regions almost always do. This allows the system to avoid some false positive cases present in the baseline results. For obstacle avoidance purposes, recall is much more important than precision, and errors near an obstacle are less important than false positives in otherwise clear areas.
In order to evaluate the performance of my algorithm for obstacle avoidance purposes, I discretized the lower region of each test image into a 3x5 grid of cells, and considered a cell to be ground-truth occupied if it contained at least 10% ground-truth obstacles. I produced the curve shown in Figure 4.6 by sweeping thresholds for occupancy ratios. My algorithm outperforms the baseline in all cases. The most signiﬁcant improvement is for high values of recall, which are necessary for safe obstacle avoidance. Results presented are for trees, results for other classes were similar.
Weight Cardinality Limitations. Reducing the number of features available to the classiﬁer decreases performance on average, producing particularly signiﬁcant decreases for varied obstacles such as trees. However, with BP, the results are generally comparable for all weight cardinalities. This demonstrates that BP is able to resolve the errors produced by restricting weight cardinality, making it an
119

Figure 4.6: Precision-recall curve: Precision vs. recall for cell-based error metric, demonstrating improvements over baseline results for navigation purposes.
ideal choice as an inference algorithm in neuromorphic hardware which imposes that constraint.
Spatially-varying models. Baseline results were improved by spatially-varying models in most cases, and full MRF results were improved in all cases except poles. This suggests that spatially varying models are useful in most cases, but might be foregone for obstacles with spatially consistent visual appearance such as poles.
Performance in diﬀerent environments. Since my algorithm uses supervised learning to learn local classiﬁer weights, it is easily adapted to new obstacle classes. The four classes presented here are very visually diﬀerent, yet my model is able to detect obstacles eﬀectively in each. My model was also able to handle variations in lighting and obstacle appearance (such as leaves falling oﬀ the trees). Figure 4.7 shows an example of a tree trunk classiﬁer using my algorithm performing eﬀectively under a variation in lighting conditions.
120

Figure 4.7: Varying lighting: Classiﬁcation results for a tree trunk classiﬁer using my algorithm, on the same tree under very diﬀerent lighting conditions.
Table 4.2: Robotic experiment results: Error rates presented for classiﬁcation, motion execution, and overall successful avoidance

Type
Trees Fences Poles Total

Obstacles
3 3 2 8

Tests
20 25
8 53

Success Rate Class. Motion Overall 100.0% 95.0% 95.0% 100.0% 100.0% 100.0% 100.0% 87.5% 87.5% 100.0% 96.2% 96.2%

4.6.2 Robotic Experiments in Real Environments
I performed obstacle avoidance experiments on my aerial robot platform, an AirRobot with a single onboard camera and an onboard IMU for stabilization.1 Since the hardware described in [92] is still in production, processing was done using an
1Because of funding agency’s restrictions, I am not allowed to disclose more detailed speciﬁcations of my system.

121

Figure 4.8: Avoiding obstacles in series: My aerial robot avoiding a fence and a pole in sequence. Left: Overhead map of area, red indicates obstacles avoided, blue robot path. Robot started at the blue dot, behind and below the level of the fence, moved upwards to a safe altitude, and proceeded over the fence and through the clear area. It then stopped at the pole, detected it as an obstacle, moved diagonally to the right to avoid it, and then forwards again through the following clear area.
oﬀboard laptop computer running a MATLAB simulation of the hardware. Due to transmission latency and a lack of the necessary parallelism, this limited me to classifying a single image at a time, then taking a large step with the robot.
For each experiment, the robot was driven to a hovering position roughly 210 meters from the obstacle. Using the robot’s IMU, I determined when it was in a stable, level pose, then captured an image from its onboard camera. This image was transmitted to the laptop, which performed the classiﬁcation described in Section 4.4, using a classiﬁer trained for the appropriate obstacle class. I then used the algorithm described in Section 4.5 to select an appropriate avoidance maneuver. The robot then executed a predetermined, ﬁxed motion plan based on the maneuver chosen.
In Table 4.2, I report three types of success-rates: “classiﬁcation”: when the obstacle classiﬁcation was correct, “motion”: when the computed obstacle avoidance maneuver was correct, and “overall”: when the robot successfully executed the maneuver avoiding the obstacle.
122

Figure 4.9: Avoiding obstacles: AirRobot avoiding obstacles based on my classiﬁcation results (robot circled in red in cases where it’s diﬃcult to see)
In all the experiments in Table 4.2, classiﬁcation results from my model were of suﬃcient accuracy to allow the controller described in Section 4.5 to properly select a safe motion. My model identiﬁed both foreground obstacles and clear areas consistently in these experiments.
As seen in Figure 4.9, my algorithm worked even in environments with visually cluttered backgrounds such as buildings and background trees. While these backgrounds did cause some false positives, as seen in some cases in Figure 4.5, these were only in the top region of the image, which was not considered by the controller. Since my algorithm works at visual range, it was able to perceive obstacles from long distances, allowing the robot to make large, predetermined motions to avoid them.
123

Some obstacles encountered were semi-transparent, such as the fence in the bottom row of Figure 4.9. Even though the visual appearance of the fence varied depending on viewing angle and background, and the fence exhibited specular reﬂections in some places, my algorithm was able to produce an extremely clean labeling of the fence as seen in Figure 4.5.
I also performed an experiment where the robot avoided multiple classes of obstacles in sequence. The classiﬁcation models for fences and poles were run in parallel, and only motions determined to be safe by both were considered. The robot was re-oriented to its goal travel direction after each motion. The robot was able to travel roughly 25 meters while avoiding obstacles, as shown in Figure 4.8. This approach was able to produce good results because the models were able to correctly report a lack of foreground obstacles when there were none, allowing only the classiﬁers for which foreground obstacles were present to dictate maneuver selection.
Video showing my robot avoiding obstacles using my algorithm is available at: http://mav.cs.cornell.edu
4.7 Conclusions
I presented a learning algorithm that takes as input a single image and outputs an obstacle map. My algorithm uses a Markov Random Field to model the mapping from visual features to obstacles and the relations between neighboring regions in the image. I use parallel neuromorphic hardware for performing inference in the model. This hardware is well-suited for aerial robots because of its extremely low power requirements. My MRF model also considers the cardinality constraints of
124

the synaptic weights, and tries to minimize the power requirements for inference by minimizing the number of spikes. In upcoming hardware, my algorithms would allow several frames per second to be processed, while consuming less than 1 W of power.
I evaluated my algorithms in both learning experiments and robotic experiments. In learning experiments, my MRF model made signiﬁcant improvements in classiﬁer accuracy both quantitatively and qualitatively for the purpose of obstacle avoidance. In robotic experiments, my algorithm was able to correctly identify the locations of forgeround obstacles in all tests, allowing the robot to select an evasive maneuver to avoid them. Some of these tests involved multiple obstacles of diﬀerent classes in sequence, demonstrating that inference results from multiple models can be eﬀectively combined.
4.8 Acknowledgements
I would like to thank Dharmendra Modha, Shyamal Chandra, Thomas Zimmerman, Stefano Carpin, Steve Esser, Myron Flickner, Jerry Yeh, and Dale Cassidy for useful disussions, and Jasdeep Hundal and Brian Wojcik for their help with experiments. This work was partially supported by DARPA under grant #HR001109-C-0002 and by an Alfred P. Sloan research fellowship.
125

CHAPTER 5 CONCLUSION
Here, I presented several works which apply deep learning algorithms and neural networks to robotics problems. These works showed that by identifying hard-tomodel nonlinearities and designing deep networks and learning algorithms to model them, we can enable robotic systems to learn to solve a wide range of problems from data. I showed that these networks can model even abstract nonlinearities such as mapping RGB-D image data to graspability in Chapter 2, and complex timevarying nonlinearities like food-cutting dynamics in Chapter 3. These algorithms were able to improve prediction accuracy in both cases, giving more accurate results than even learning approaches with hand-engineered features.
In my work on aerial robot obstacle avoidance in Chapter 4, I showed the promise of up-and-coming neural hardware, which lends itself well to implementing deep learning algorithms. Such hardware would allow these algorithms to operate completely in parallel, allowing us to run hundreds of thousands of neurons at a time while maintaining an update rate fast enough to be useful for real-time robotic applications.
In all these works, I applied these algorithms to real-world robotic systems. I developed three diﬀerent end-to-end systems which were able to autonomously solve their given tasks. While these systems still required signiﬁcant engineering eﬀorts to develop, using learning algorithms allowed me to avoid having to model such complex nonlinearities by hand. This shows that these deep learning algorithms are not only capable of producing impressive oﬄine results, but are capable of producing strong results in online, real-world robotic systems.
126

While these works present a strong starting point for the incorporation of deep networks into real-world robotics, they also expose several exciting directions for future work. One area which has not been explored here, but could be hugely helpful in deploying new real-world systems is the incorporation of online learning and “exploration” type behavior. Iteratively learning and re-applying controllers allows for less manual data collection and more reﬁnement of learned models.
Incorporating uncertainty into these models is another exciting direction which has several beneﬁts. First, this would allow a system like that in Chapter 3 to naturally perform information-collecting operations, for example to reﬁne its knowledge of the material being cut. This would also aid online learning, as the model would then have knowledge of portions of its input state space for which it has high uncertainty, and could focus data collection in those regions.
While the system presented in Chapter 3 showed very strong performance for the challenging food-cutting task, another exciting direction for future work would be to extend this system to new tasks, especially those integrating multimodal data such as visual and tactile information. An architecture like that presented in Chapter 2 could allow such a system to deal more robustly with multimodal data.
While food cutting and many other tasks lend themselves well to handdesigning an intuitive cost function for predictive control, some others might not. For such tasks, the system presented in Chapter 3 could be extended to learn the cost function for the task at hand as well, in a manner similar to contemporary work on learning cost functions from weak reinforcement. Deep learning could allow such a system to learn even cost functions which are too abstract or diﬃcult for human engineers to design.
127

The hardware discussed in Chapter 4 demonstrates a ﬁnal, very exciting, direction for future work. Since the algorithms presented in Chapters 2 and 3 use neurons as the core of their inference algorithms, the systems could potentially be implemented using such hardware. However, several challenges remain in doing so. First, the learning algorithms for these approaches must be adapted to learn networks which reﬂect the hardware constraints discussed in Chapter 4. Second, the portions of these algorithms not directly implemented by deep networks (e.g. search, optimization, cost function computation, etc.) would also need to be implemented using hardware neurons, which might require signiﬁcant engineering eﬀort.
The combination of these future directions shows the potential of this approach to solve many of the current challenges of robotics. Such a system would be capable of performing online learning for even complex tasks involving multimodal data such as visual feedback, given only weak reinforcement signals which could easily be produced by human users. It would avoid the need for hand-engineering controllers or features, while allowing the system to continuously reﬁne its own controllers and models. Implementation in neural hardware would allow such a system to perform inference extremely quickly, even in parallel with other systems, while consuming minimal power, allowing high-performance predictive control even on miniature aerial or ground vehicles with limited battery supply.
128

BIBLIOGRAPHY
[1] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng. An application of reinforcement learning to aerobatic helicopter ﬂight. In NIPS, 2007.
[2] C. G. Atkeson, C. H. An, and J. M. Hollerbach. Estimation of inertial parameters of manipulator loads and links. Int. J. Rob. Res., 5(3):101–119, Sept. 1986.
[3] A. Bachrach, S. Prentice, R. He, and N. Roy. Range - robust autonomous navigation in gps-denied environments. Journal of Field Robotics, 28(5): 644–666, 2011.
[4] C. Bartolozzi, F. Rea, C. Clercq, M. Hofst¨atter, D. Fasnacht, G. Indiveri, and G. Metta. Embedded neuromorphic vision for humanoid robots. In ECVW, 2011.
[5] D. Batra and A. Saxena. Learning the right model: Eﬃcient max-margin learning in laplacian crfs. In CVPR, 2012.
[6] M. Beetz, U. Klank, I. Kresse, A. Maldonado, L. Mo¨senlechner, D. Pangercic, T. Ru¨hr, and M. Tenorth. Robotic Roommates Making Pancakes. In Humanoids, 2011.
[7] Y. Bengio. Learning deep architectures for AI. FTML, 2(1):1–127, 2009.
[8] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is diﬃcult. IEEE Transactions on Neural Networks, 5 (2):157–166, 1994.
[9] S. Bennett. A brief history of automatic control. Control Systems, IEEE, 16 (3):17–25, Jun 1996. ISSN 1066-033X.
129

[10] D. Bhat and S. Nayar. Stereo in the presence of specular reﬂection. In ICCV, 1995.
[11] T. Bhattacharjee, J. Wade, and C. C. Kemp. Material recognition from heat transfer given varying initial conditions and short-duration contact. In RSS, 2015.
[12] A. Bicchi and V. Kumar. Robotic grasping and contact: a review. In ICRA, 2000.
[13] C. Bills, J. Chen, and A. Saxena. Autonomous mav ﬂight in indoor environments using single image perspective cues. In ICRA, 2011.
[14] L. Bo, X. Ren, and D. Fox. Unsupervised Feature Learning for RGB-D Based Object Recognition. In ISER, 2012.
[15] J. Bohg, A. Morales, T. Asfour, and D. Kragic. Data-driven grasp synthesis - a survey. accepted.
[16] M. Bollini, J. Barry, and D. Rus. Bakebot: Baking cookies with the pr2. In IROS PR2 Workshop, 2011.
[17] D. Bowers and R. Lumia. Manipulation of unmodeled objects using intelligent grasping schemes. IEEE Trans Fuzzy Sys, 11(3), 2003.
[18] E. Brunskill, L. Kaelbling, T. Lozano-P´erez, and N. Roy. Continuous-state POMDPs with hybrid dynamics. In ISAIM, 2008.
[19] M. V. Butz, O. Herbort, and J. Hoﬀmann. Exploiting redundancy for ﬂexible behavior: Unsupervised learning in a modular sensorimotor control architecture. Psychological Review, 114:1015–1046, 2007.
130

[20] C. Cadena and J. Kosecka. Semantic parsing for priming object detection in rgb-d scenes. In ICRA Workshop on Semantic Perception, Mapping and Exploration, 2013.
[21] S. Chernova and M. Veloso. Conﬁdence-based policy learning from demonstration using gaussian mixture models. In AAMAS, 2007.
[22] C.-M. Chow, A. G. Kuznetsov, and D. W. Clarke. Successive one-step-ahead predictions in multiple model predictive control. Int. J. Systems Science, 29 (9):971–979, 1998.
[23] A. Coates and A. Y. Ng. Selecting receptive ﬁelds in deep networks. In NIPS, 2011.
[24] A. Coates, B. Carpenter, C. Case, S. Satheesh, B. Suresh, T. Wang, D. J. Wu, and A. Y. Ng. Text detection and character recognition in scene images with unsupervised feature learning. In ICDAR, 2011.
[25] A. Collet Romea, D. Berenson, S. Srinivasa, and D. Ferguson . Object recognition and full pose registration from a single image for robotic manipulation. In ICRA, 2009.
[26] A. Collet Romea, M. Martinez Torres, and S. Srinivasa. The moped framework: Object recognition and pose estimation for manipulation. IJRR, 30 (10):1284 – 1306, 2011.
[27] R. Collobert and J. Weston. A uniﬁed architecture for natural language processing: deep neural networks with multitask learning. In ICML, 2008.
[28] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 12: 2493–2537, 2011.
131

[29] J. Courbon, Y. Mezouar, N. Guenard, and P. Martinet. Visual navigation of a quadrotor aerial vehicle. In IROS, 2009.
[30] M. P. Deisenroth and C. E. Rasmussen. Pilco: A model-based and dataeﬃcient approach to policy search. In ICML, 2011.
[31] R. Detry, E. Baseski, M. Popovic, Y. Touati, N. Kruger, O. Kroemer, J. Peters, and J. Piater. Learning object-speciﬁc grasp aﬀordance densities. In ICDL, 2009.
[32] M. Dogar, K. Hsiao, M. Ciocarlie, and S. Srinivasa. Physics-based grasp planning through clutter. In RSS, 2012.
[33] M. Dominici and R. Cortesao. Model predictive control architectures with force feedback for robotic-assisted beating heart surgery. In ICRA, 2014.
[34] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159, July 2011.
[35] Eigen Matrix Library. http://eigen.tuxfamily.org.
[36] S. Ekvall and D. Kragic. Learning and evaluation of the approach vector for automatic grasp generation and planning. In ICRA, 2007.
[37] F. Endres, J. Hess, J. Sturm, D. Cremers, and W. Burgard. 3d mapping with an RGB-D camera. International Journal of Robotics Research (IJRR), 2013.
[38] T. Erez, K. Lowrey, Y. Tassa, V. Kumar, S. Kolev, and E. Todorov. An integrated system for real-time model-predictive control of humanoid robots. In ICHR, 2013.
[39] C. Ferrari and J. Canny. Planning optimal grasps. ICRA, 1992.
132

[40] A. Foka and P. Trahanias. Real-time hierarchical pomdps for autonomous robot navigation. Robot. Auton. Syst., 55(7):561–571, Jul 2007.
[41] C. R. Gallegos, J. Porta, and L. Ros. Global optimization of robotic grasps. In RSS, 2011.
[42] M. Gemici and A. Saxena. Learning haptic representation for manipulating deformable food objects. In IROS, 2014.
[43] S. Ghosh and J. Mulligan. A segmentation guided label propagation scheme for autonomous navigation. In ICRA, 2010.
[44] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[45] C. Goldfeder, M. Ciocarlie, H. Dang, and P. K. Allen. The Columbia grasp database. In ICRA, 2009.
[46] I. Goodfellow, Q. Le, A. Saxe, H. Lee, and A. Y. Ng. Measuring invariances in deep networks. In NIPS, 2009.
[47] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
[48] H. K. Heidarsson and G. S. Sukhatme. Obstacle detection from overhead imagery using self-supervised learning for autonomous surface vehicles. In IROS, 2011.
[49] S. Heshmati-alamdari, G. K. Karavas, A. Eqtami, M. Drossakis, and K. Kyriakopoulos. Robustness analysis of model predictive control for constrained image-based visual servoing. In ICRA, 2014.
133

[50] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[51] T. Howard, C. Green, and A. Kelly. Receding horizon model-predictive control for mobile robot navigation of intricate paths. In International Conference on Field and Service Robotics, July 2009.
[52] S. Hrabar. 3d path planning and stereo-based obstacle avoidance for rotorcraft uavs. In IROS, 2008.
[53] K. Hsiao, L. P. Kaelbling, and T. Lozano-P´erez. Grasping POMDPs. In IEEE International Conference on Robotics and Automation, 2007. URL http://lis.csail.mit.edu/pubs/tlp/pomdp-grasp-final.pdf.
[54] K. Huebner and D. Kragic. Selection of Robot Pre-Grasps using Box-Based Shape Approximation. In IROS, 2008.
[55] A. Hyva¨rinen, P. O. Hoyer, and M. Inki. Topographic independent component analysis. Neural computation, 13(7):1527–1558, 2001.
[56] A. Hyva¨rinen, J. Karhunen, and E. Oja. Principal Component Analysis and Whitening, chapter 6, pages 125–144. John Wiley & Sons, Inc., 2002.
[57] S. B. i Badia, P. Pyk, and P. F. M. J. Verschure. A ﬂy-locust based neuronal control system applied to an unmanned aerial vehicle: the invertebrate neuronal principles for course stabilization, altitude control and collision avoidance. IJRR, 26(7):759–772, 2007.
[58] A. Jain, A. Singh, H. S. Koppula, S. Soh, and A. Saxena. Recurrent neural networks for driver activity anticipation via sensory-fusion architecture. In Cornell Tech Report, 2015.
134

[59] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multitask learning. In NIPS, 2010.
[60] N. R. Jared Glover, Daniela Rus. Probabilistic models of object geometry for grasp planning. In RSS, 2008.
[61] Y. Jiang, S. Moseson, and A. Saxena. Eﬃcient grasping from RGBD images: Learning using a new rectangle representation. In ICRA, 2011.
[62] Y. Jiang, J. R. Amend, H. Lipson, and A. Saxena. Learning hardware agnostic grasps for a universal jamming gripper. In ICRA, 2012.
[63] Y. Jiang, M. Lim, C. Zheng, and A. Saxena. Learning to place new objects in a scene. IJRR, 31(9), 2012.
[64] T. Jochem, D. Pomerleau, and C. Thorpe . Vision-based neural network road and intersection detection and traversal. In IROS, 1995.
[65] I. Kamon, T. Flash, and S. Edelman. Learning to grasp using visual information. In ICRA, 1996.
[66] S. Khansari-Zadeh and A. Billard. Learning stable nonlinear dynamical systems with gaussian mixture models. Robotics, IEEE Transactions on, 27(5): 943–957, Oct 2011.
[67] D. Kim, J. Sun, S. Min, O. James, M. Rehg, and A. F. Bobick. Traversability classiﬁcation using unsupervised on-line visual learning for outdoor robot navigation. In ICRA, 2006.
[68] J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and A. Girard. Gaussian process model based predictive control. In American Control Conference, 2004.
135

[69] A. Konno, R. Uchikura, T. Ishihara, T. Tsujita, T. Sugimura, J. Deguchi, M. Koyanagi, and M. Uchiyama. Development of a high speed vision system for mobile robots. In IROS, 2006.
[70] M. Koval, N. Pollard , and S. Srinivasa. Pre- and post-contact policy decomposition for planar contact manipulation under uncertainty. IJRR, August 2015.
[71] D. Kragic and H. I. Christensen. Robust visual servoing. IJRR, 2003. [72] S. Kumar and M. Hebert. Discriminative random ﬁelds: A discriminative
framework for contextual interaction in classiﬁcation. In ICCV, 2003. [73] K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view
rgb-d object dataset. In ICRA, 2011. [74] K. Lakshminarayana. Mechanics of form closure. ASME, 1978. [75] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and
A. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012. [76] Q. V. Le, D. Kamm, A. F. Kara, and A. Y. Ng. Learning to grasp objects with multiple contact points. In ICRA, 2010. [77] Q. V. Le, A. Coates, B. Prochnow, and A. Y. Ng. On optimization methods for deep learning. In ICML, 2011. [78] Y. LeCun, F. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, 2004.
136

[79] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
[80] H. Lee, Y. Largman, P. Pham, and A. Y. Ng. Unsupervised feature learning for audio classiﬁcation using convolutional deep belief networks. In NIPS, 2009.
[81] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In NIPS, 2015.
[82] S. Levine, N. Wagener, and P. Abbeel. Learning contact-rich manipulation skills with guided policy search. ICRA, 2015.
[83] C. Li, A. Saxena, and T. Chen. θ-mrf: Capturing spatial and semantic structure in the parameters for scene understanding. In NIPS, 2011.
[84] J. Luo and K. Hauser. Robust trajectory optimization under frictional contact with iterative learning. In RSS, 2015.
[85] J. Maitin-shepard, M. Cusumano-towner, J. Lei, and P. Abbeel. Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In ICRA, 2010.
[86] J. Maitin-Shepard, J. L. M. Cusumano-Towner, and P. Abbeel. Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In ICRA, 2010.
[87] J. Malik, S. Belongie, J. Shi, and T. Leung. Textons, contours and regions: Cue integration in image segmentation. In ICCV, 1999.
137

[88] C. McFarland and L. Whitcomb. Experimental evaluation of adaptive modelbased control for underwater vehicles in the presence of unmodeled actuator dynamics. In ICRA, 2014.
[89] T. McGee, R. Sengupta, and K. Hedrick. Obstacle detection for small autonomous aircraft using sky segmentation. In ICRA, 2005.
[90] D. Mellinger, N. Michael, and V. Kumar. Trajectory generation and control for precise aggressive maneuvers with quadrotors. In ISER, Dec 2010.
[91] R. Memisevic and G. E. Hinton. Learning to represent spatial transformations with factored higher-order boltzmann machines. Neural Computation, 22(6):1473–1492, June 2010.
[92] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. Modha. A digital neurosynaptic core using embedded crossbar memory with 45pj per spike in 45nm. In CICC, 2011.
[93] T. Merz and F. Kendoul. Beyond visual range obstacle avoidance and infrastructure inspection by an autonomous helicopter. In IROS, 2011.
[94] J. Michels, A. Saxena, and A. Y. Ng. High speed obstacle avoidance using monocular vision and reinforcement learning. In ICML, 2005.
[95] A.-R. Mohamed, G. Dahl, and G. E. Hinton. Acoustic modeling using deep belief networks. IEEE Trans Audio, Speech, and Language Processing, 20 (1):14–22, 2012.
[96] F. C. Moon and T. Kalm´ar-Nagy. Nonlinear models for complex dynamics in cutting materials. Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 359(1781):695– 711, 2001.
138

[97] R. Moore, S. Thurrowgood, D. Bland, D. Soccol, and M. Srinivasan. A stereo vision system for uav guidance. In IROS, 2009.
[98] A. Morales, P. J. Sanz, and A` ngel P. del Pobil. Vision-based computation of three-ﬁnger grasps on unknown planar objects. In IROS, 2002.
[99] R. Mudra, R. Hahnloser, and R. J. Douglas. Neuromorphic active vision used in simple navigation behavior for a robot. In Proc. 7th Int. Conf. On Microelectronics for Neural Networks, pages 7–9, 1999.
[100] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In UAI, 1999.
[101] I. Na, S. H. Han, and H. Jeong. Stereo-based road obstacle detection and tracking. In ICACT, 2011.
[102] B. Nabbe and M. Hebert. Extending the path-planning horizon. volume 26, pages 997–1024, 2007.
[103] K. S. Narendra and A. M. Annaswamy. Stable Adaptive Systems. PrenticeHall, Inc., Upper Saddle River, NJ, USA, 1989. ISBN 0-13-839994-8.
[104] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML, 2011.
[105] V. Nguyen. Constructing stable force-closure grasps. In ACM Fall joint computer conf, 1986.
[106] D. Nguyen-Tuong and J. Peters. Model learning for robot control: a survey. Cognitive Processing, 12(4), 2011.
[107] M. Osadchy, Y. LeCun, and M. Miller. Synergistic face detection and pose estimation with energy-based models. JMLR, 8:1197–1215, 2007.
139

[108] C. Papazov, S. Haddadin, S. Parusel, K. Krieger, and D. Burschka. Rigid 3d geometry matching for grasping of known objects in cluttered scenes. IJRR, 31(4):538–553, Apr. 2012.
[109] J. Pearl. Fusion, propagation, and structuring in belief networks. Artif. Intell., 29(3):241–288, 1986.
[110] J. H. Piater. Learning visual features to predict hand orientations. In ICML, 2002.
[111] C. Plagemann, F. Endres, J. M. Hess, C. Stachniss, and W. Burgard. Monocular range sensing: A non-parametric learning approach. In ICRA, 2008.
[112] F. T. Pokorny, K. Hang, and D. Kragic. Grasp moduli spaces. In RSS, 2013. [113] J. Ponce, D. Stam, and B. Faverjon. On computing two-ﬁnger force-closure
grasps of curved 2D objects. IJRR, 12(3):263, 1993. [114] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic
Programming. Wiley-Interscience, 1994. [115] C. E. Rasmussen and H. Nickisch. GPML Matlab Code version 3.5. http:
//www.gaussianprocess.org/gpml/code/matlab/doc/. [116] M. Riedmiller, T. Gabel, R. Hafner, and S. Lange. Reinforcement learning
for robot soccer. Autonomous Robots, 27(1):55–73, 2009. [117] A. Rodriguez, M. Mason, and S. Ferry. From caging to grasping. In RSS,
2011. [118] ROS: Robot Operating System. http://www.ros.org.
140

[119] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, pages 318–362. MIT Press, 1986.
[120] R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (FPFH) for 3D registration. In ICRA, 2009.
[121] R. B. Rusu, G. Bradski, R. Thibaux, and J. Hsu. Fast 3D recognition and pose using the viewpoint feature histogram. In IROS, 2010.
[122] A. Sahbani, S. El-Khoury, and P. Bidaud. An overview of 3d object grasp synthesis algorithms. Robot. Auton. Syst., 60(3):326–336, Mar. 2012.
[123] A. Saxena, S. Chung, and A. Ng. Learning depth from single monocular images. In NIPS, 2005.
[124] A. Saxena, J. Driemeyer, J. Kearns, and A. Ng. Robotic grasping of novel objects. In NIPS, 2006.
[125] A. Saxena, S. Chung, and A. Ng. 3-d depth reconstruction from a single still image. IJCV, 76(1):53–69, 2008.
[126] A. Saxena, J. Driemeyer, and A. Y. Ng. Robotic grasping of novel objects using vision. IJRR, 27(2):157–173, 2008.
[127] A. Saxena, M. Sun, and A. Y. Ng. Make3d: Depth perception from a single still image. In AAAI, 2008.
[128] A. Saxena, L. L. S. Wong, and A. Y. Ng. Learning grasp strategies with partial shape information. In AAAI, 2008.
141

[129] A. Saxena, M. Sun, and A. Ng. Make3d: Learning 3d scene structure from a single still image. PAMI, 2009.
[130] A. Saxena, A. Jain, O. Sener, A. Jami, D. K. Misra, and H. S. Koppula. Robo brain: Large-scale knowledge engine for robots. Tech Report, Aug 2014.
[131] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In CVPR. 2013.
[132] D. Shim, H. Kim, and S. Sastry. Decentralized reﬂective model predictive control of multiple ﬂying robots in dynamic environment. In Conference on Decision and Control, 2003.
[133] K. B. Shimoga. Robot grasp synthesis algorithms: A survey. IJRR, 15(3): 230–266, June 1996.
[134] R. Socher, B. Huval, B. Bhat, C. D. Manning, and A. Y. Ng. Convolutionalrecursive deep learning for 3D object classiﬁcation. In NIPS, 2012.
[135] B. Sofman, E. L. Ratliﬀ, J. A. D. Bagnell, J. Cole, N. Vandapel, and A. T. Stentz. Improving robot navigation through self-supervised online learning. Journal of Field Robotics, 23(12), 2006.
[136] K. Sohn, D. Y. Jung, H. Lee, and A. Hero III. Eﬃcient learning of sparse, distributed, convolutional feature representations for object recognition. In ICCV, 2011.
[137] S. P. Soundararaj, A. K. Sujeeth, and A. Saxena. Autonomous indoor helicopter ﬂight using a single onboard camera. In IROS, 2009.
[138] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In NIPS, 2012.
142

[139] J. Sung, S. H. Jin, and A. Saxena. Robobarista: Object part-based transfer of manipulation trajectories from crowd-sourcing in 3d pointclouds. In Cornell Tech Report, 2015.
[140] J. Sung, S. H. Jin, and A. Saxena. Robobarista: Object part based transfer of manipulation trajectories from crowd-sourcing in 3d pointclouds. In International Symposium on Robotics Research (ISRR), 2015.
[141] I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neural networks. In ICML, 2011.
[142] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In NIPS. 2013.
[143] G. W. Taylor and G. E. Hinton. Factored conditional restricted boltzmann machines for modeling motion style. In ICML, 2009.
[144] C. Teuliere and E. Marchand. Direct 3d servoing using dense depth maps. In IROS, 2012.
[145] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.
[146] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.
[147] Z. Wang, A. Boularias, K. Muelling, B. Schoelkopf, and J. Peters. Anticipatory action selection for human-robot table tennis. accepted.
[148] J. Weisz and P. K. Allen. Pose error robust grasping from contact wrench space metrics. In ICRA, 2012.
[149] T. Whelan, H. Johannsson, M. Kaess, J. Leonard, and J. McDonald. Robust real-time visual odometry for dense RGB-D mapping. In ICRA, 2013.
143

[150] K. M. Wurm, R. Ku¨mmerle, C. Stachniss, and W. Burgard. Improving robot navigation in structured outdoor environments by identifying vegetation from laser data. In IROS, 2009.
[151] L. Zhang, M. Ciocarlie, and K. Hsiao. Grasp evaluation with graspable feature matching. In RSS Workshop on Mobile Manipulation, 2011.
[152] S. Zingg, D. Scaramuzza, S. Weiss, and R. Siegwart. Mav navigation through indoor corridors using optical ﬂow. In ICRA, 2010.
144