GRAPH CUTS, SUM-OF-SUBMODULAR FLOW, AND LINEAR PROGRAMMING: EFFECTIVE INFERENCE IN HIGHER-ORDER MARKOV RANDOM FIELDS

A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by Alexander Fix
May 2017

© 2017 Alexander Fix
ALL RIGHTS RESERVED

Alexander Fix, Ph.D.
Cornell University 2017

Optimization algorithms have a long history of success in computer vision, providing effective algorithms for tasks as varied as segmentation, stereo estimation, image denoising and scene understanding. A notable example of this is Graph Cuts, in which the minimum-cut problem is used to solve a class of vision problems known as first-order Markov Random Fields (MRFs). Despite this success, first-order MRFs have their limitations. They cannot encode correlations between groups of pixels larger than two or easily express higher-order statistics of images.

In this thesis, we generalize graph cuts to higher-order MRFs, while still maintaining the properties that make graph cuts successful. In particular, we will examine three different mathematical techniques which have combined to make previously intractable higher-order inference problems practical within the last few years. The first is order-reducing reductions, which transform higher-order problems into familiar first-order MRFs. The second is a generalization of the min-cut problem to hypergraphs, called Sum-of-Submodular optimization. The third is linear programming relaxations based on the Local Marginal Polytope, which together with Sum-of-Submodular flow result in the highly effective primal-dual algorithm SoSPD. This thesis presents all mathematical background for these algorithms, as well as an implementation and an experimental comparison with the state of the art.
BIOGRAPHICAL SKETCH

Alexander Fix graduated with a Bachelor of Science in Computer Science and Mathematics from the University of Chicago in 2009. He is currently a PhD student at Cornell University, focusing on optimization algorithms with applications in Computer Vision. His advisor is Ramin Zabih. In the summer of 2013, he was a research intern at Google, advised by Sameer Agarwal. From 2013 to 2014, he completed his PhD research at Cornell Tech, in NYC. Since February 2015, he has been a researcher at Oculus Research in Redmond, WA.

ACKNOWLEDGEMENTS

First, I am most grateful to my PhD advisor, Ramin Zabih, who has supported me without fail for the last six years. There are too many things to list, but here's a start: Thank you for giving me the perfect environment to grow as a researcher, for independence when I needed it, and guidance when independence failed. Thank you for your example of how to be a member of a research community, and all your many introductions; here's to many more workshops in Italy. Thank you for bearing with me on the actual writing of this thesis. It's been a trial, but I think it's turned out in the end.

I would also like to thank my committee members David Williamson and David Shmoys, for teaching me everything I know about approximation algorithms and linear programming, and for all your questions along the way; this thesis wouldn't be half so interesting without them. I thank Endre Boros, without whom I would not have started on my first project, for being a continual fount of promising research ideas ever since. I thank Sameer Agarwal, and the rest of Steve Seitz's group at Google, for a truly wonderful internship, and for introducing me to the wonderful world of research in industry.

And finally, Loranne, for putting up with all the years, and all the travel. I couldn't have done it without you.

The research in this thesis has been funded by NSF grants IIS-0803705, IIS-1161860/1161476, and IIS-1161282.
TABLE OF CONTENTS

Biographical Sketch
Acknowledgements
Table of Contents

1 Introduction
  1.1 Notation
  1.2 Optimization Basics
    1.2.1 Constrained Optimization
    1.2.2 Constraint Indicator Functions
    1.2.3 Minimizing Elements
    1.2.4 Relaxations
    1.2.5 Equivalence of Optimization Problems
    1.2.6 Common Equivalences Between Problems
  1.3 Example: Image Segmentation
    1.3.1 Binary Labeling Problems
    1.3.2 Per-Pixel Cost Functions
    1.3.3 Spatial Relations Between Pixels
    1.3.4 The Potts Model
    1.3.5 Reduction to Graph Cut
    1.3.6 Discussion
  1.4 Markov Random Fields
    1.4.1 Labeling Problems
    1.4.2 Maximum A-Posteriori (MAP) Inference
    1.4.3 Log-probabilities
    1.4.4 MAP inference in Foreground-Background Segmentation
    1.4.5 Conditional Dependence
    1.4.6 The Hammersley-Clifford Theorem
    1.4.7 The Potts Model as an MRF
  1.5 First-order and Higher-order MRFs
    1.5.1 Advantages of Higher-Order Models
    1.5.2 Image Denoising and Patch-Based Priors
    1.5.3 Curvature Regularizing Priors for Stereo
  1.6 Conclusion

2 Mathematical Background
  2.1 Reparameterization
  2.2 Pseudoboolean functions
    2.2.1 Representations of MRFs
    2.2.2 Set Functions
    2.2.3 Multilinear Polynomials
    2.2.4 Properties of Multilinear Polynomials
    2.2.5 Computational Complexity and Hardness of Approximation
  2.3 Submodular Functions
    2.3.1 Decreasing Marginal Gains
    2.3.2 Equivalent Definitions of Submodularity
    2.3.3 Properties of Submodular Functions
    2.3.4 Submodular First-order Pseudoboolean Functions
  2.4 Local and Marginal Polytopes for MRFs
    2.4.1 Weighted Averages as Linear Programs
    2.4.2 Marginal polytopes
  2.5 Linear Programming
    2.5.1 Linear Cone Programming
  2.6 Convex Sets and Convex Functions
  2.7 Duality
    2.7.1 Exchanging Minimization and Maximization
    2.7.2 Linear Programming Duality: An Example
    2.7.3 Conic Duality
  2.8 Optimality for Linear Programs
  2.9 Duality for the Local Marginal Polytope
  2.10 First-order Binary MRFs and Minimum Cut
    2.10.1 Solving First-order Submodular MRFs with Graph Cuts
    2.10.2 Linear Programs for Min-Cut
    2.10.3 Local Marginal Polytope for First-order Binary Problems

3 Related Work
  3.1 Higher-Order Models in Computer Vision
  3.2 Inference Algorithms for Binary MRFs
    3.2.1 First-Order Submodular MRFs
    3.2.2 First-Order Nonsubmodular MRFs
    3.2.3 Higher-Order Reductions
    3.2.4 Higher-order Submodular Functions
  3.3 Primal Algorithms
  3.4 Dual Algorithms
  3.5 Primal-Dual Algorithms

4 Higher order reductions
  4.1 Introduction
  4.2 Related work
    4.2.1 Reduction by substitution
    4.2.2 Reducing negative-coefficient terms
    4.2.3 Reducing positive-coefficient terms
    4.2.4 Generalized Roof Duality
  4.3 Reducing groups of higher-order terms
    4.3.1 Our method
  4.4 Worst case performance
  4.5 Local completeness
    4.5.1 Performance on locally complete problems
  4.6 Locally complete energy functions in vision
  4.7 Experimental results

5 Sum of Submodular Minimization
  5.1 Sum of Submodular Minimization via Submodular Flow
    5.1.1 Definitions and Graph Construction
    5.1.2 Flow as a Reparameterization
    5.1.3 The Max-Flow Min-Cut Theorem for SoS Functions
  5.2 IBFS for Submodular Flow
    5.2.1 IBFS on Graphs
    5.2.2 Modifying IBFS for SoS Flow
    5.2.3 Running Time
  5.3 Proof of the “No Shortcuts” Lemma
  5.4 The Current Arc Heuristic

6 Submodular Upper Bounds for Higher Order Energy Functions
  6.1 Introduction
  6.2 Background and Related Work
    6.2.1 Notation
  6.3 Submodular Upper Bounds
  6.4 Upper Bound Approximations
    6.4.1 The Iterative Heuristic of SoSPD
    6.4.2 Quadratic-Based Submodular Upper Bounds
    6.4.3 Cardinality-Based Submodular Upper Bounds

7 A Primal-Dual Algorithm for Higher-Order Multilabel Markov Random Fields
  7.1 Higher-order Multi-label MRFs
    7.1.1 Summary of Our Method
  7.2 Related Work
    7.2.1 Graph Cut Methods and Higher-Order MRFs
    7.2.2 Linear Programming and Duality for MRFs
    7.2.3 Sum-of-Submodular Flow
  7.3 The SoS Primal Dual Algorithm
    7.3.1 Update-Duals-Primals
    7.3.2 Pre-Edit-Duals
    7.3.3 Post-Edit-Duals
    7.3.4 Proof of Convergence
    7.3.5 Approximation Bounds

8 Experimental Evaluation of the SoSPD Algorithm
  8.1 Benchmarks and Datasets
    8.1.1 Field of Experts Denoising
    8.1.2 Curvature Regularizing Stereo Reconstruction
  8.2 Comparison of Upper Bound Methods
    8.2.1 Experimental Setup
    8.2.2 Results
  8.3 Evaluation of SoSPD
    8.3.1 Stereo reconstruction
    8.3.2 Field of Experts denoising

9 Structured learning of sum-of-submodular higher order energy functions
  9.1 Introduction
  9.2 Related Work
  9.3 S3SVM: SoS Structured SVMs
    9.3.1 Structured SVMs
    9.3.2 Submodular Feature Encoding
    9.3.3 Solving the quadratic program
    9.3.4 Generalization to multi-label prediction
  9.4 Experimental Results
    9.4.1 Binary denoising
    9.4.2 Interactive segmentation

10 Conclusion

A Local Completeness
B Laplacian Equations
C Approximation Ratio for Cardinality Upper Bounds

Bibliography

CHAPTER 1
INTRODUCTION

Optimization algorithms have a long history of success in computer vision, providing effective algorithms for tasks as varied as segmentation, stereo estimation, image denoising and scene understanding. A particularly notable example of this is the method of Graph Cuts [11], in which minimum-cut algorithms are used to solve a class of vision problems known as first-order Markov Random Fields (MRFs).

There are two main reasons for Graph Cuts' success. First, min-cut is already a well-studied problem with highly efficient algorithms (and the popularity of Graph Cuts has encouraged the development of even more efficient algorithms tuned specifically to computer vision problems). Second, the class of problems solved by Graph Cuts (first-order MRFs) encapsulates the fundamental idea of image locality, i.e., that pixels in an image are highly correlated with their neighbors. This property makes MRFs well-suited to solving a wide range of inference problems in computer vision as well as machine learning and other fields.

Despite this success, first-order MRFs have their limitations. They cannot easily encode correlations between groups of pixels larger than two, and thus are unable to express higher-order statistics of images. In this thesis, we focus on removing this limitation. Our goal is to generalize graph cuts to a wider class of higher-order MRFs, greatly extending the class of models to which MRF inference can be applied, while keeping the fast algorithms that make graph cuts successful.
In a broader sense, this thesis is about the interaction between modeling and inference: by applying new advances in algorithms, we can now optimize a new class of models which were previously intractable, allowing much greater flexibility and power in the kinds of problems we can solve.

Our goal for the first two chapters is to cover the mathematical background for Markov Random Fields, and to introduce the main optimization problem considered in the later chapters, which is a minimization problem of the form:

$$\min_{\mathbf{x}} \; \sum_{i} f_i(x_i) + \sum_{C} f_C(\mathbf{x}_C) \tag{1.1}$$

where the vector of variables $\mathbf{x}$ comes from a discrete label space $\mathbf{x} \in \prod_i X_i$, and the function $f$ to be optimized is a sum of unary functions $f_i$ (each depending on a single variable $x_i$) and so-called clique functions $f_C$, each of which depends on the subvector $\mathbf{x}_C$ of variables in a clique $C$. The cliques are (possibly overlapping) subsets of the variables.

In particular, the main results of this thesis rely on the concepts of Linear Programming, duality, and linear programming relaxations. This introduction and the following chapter discuss the necessary background for these topics. We begin with the basic concepts of optimization, which may be unnecessary for readers familiar with the subject. However, we wish to put the MRF inference problem in the context of probabilistic inference, which informs the types of models which are useful in computer vision.

In this chapter, we will give a brief introduction to the use of optimization algorithms in computer vision, along with an extended example of how first-order MRFs and graph cuts are applied to a simple but typical vision task of binary segmentation. We will also explain how first-order MRFs can be generalized to include interactions between more than just pairs of pixels; such MRFs are called higher-order.
We will conclude with several applications where allowing these higher-order interactions is necessary for using more sophisticated models which cannot be expressed by simpler first-order MRFs.

1.1 Notation

All special notation will be introduced at the point of first use in the text, but is also repeated here for easy reference.

We will write vectors as bold lowercase symbols: $\mathbf{x}$. For any sets $X$ and $S$, $X^S$ is the set of all vectors with components in $X$ indexed by the elements of $S$. This allows, for instance, vectors whose index set is not just $\{1, \ldots, n\}$. We'll use $X^n$ as shorthand for $X^{\{1,\ldots,n\}}$. For $i \in S$ and $\mathbf{x} \in X^S$, $x_i$ is the $i$-th component of $\mathbf{x}$. For any subset $T \subseteq S$ and vector $\mathbf{x} \in X^S$, we'll write $\mathbf{x}_T$ for the subvector of $\mathbf{x}$ corresponding to just the components in $T$.

We will always use $V$ for the set of variable indices, so that the $x_i$ are indexed by $i \in V$. When summing over variables, we will write $\sum_i$ as shorthand for $\sum_{i \in V}$, as in (1.1). Similarly, we'll use $\mathcal{C}$ to denote the set of cliques, and will use $\sum_C$ as shorthand for $\sum_{C \in \mathcal{C}}$.

When the cliques are all pairs, $|C| = 2$, then we have a graph with unordered edges $\{i, j\}$. We'll write pairwise clique functions as $f_{i,j}$, which are similarly unordered, i.e., $f_{i,j}(x_i, x_j) = f_{j,i}(x_j, x_i)$. Sums over pairs are also unordered, so $\sum_{i,j}$ means $\sum_{i<j}$, including each unordered pair only once. In a graph, we will use $N(i)$ to denote the set of neighbors of $i$, $N(i) = \{ j \mid \{i, j\} \in E \}$.

The minimum value of a minimization problem (or maximum value of a maximization problem) is denoted OPT, so the minimum value of (1.1) is OPT(1.1). Minimizers are denoted by $\mathbf{x}^*$, and $X^*$ is the set of all minimizers.

For a finite set $X$, the set of probability distributions on $X$ is $P[X]$, i.e., the set of all $p : X \to \mathbb{R}$ with $p(x) \ge 0$ for all $x \in X$, and $\sum_{x \in X} p(x) = 1$. For any function $f : X \to \mathbb{R}$, the expectation of $f$ under the probability distribution $p$ is $\langle f, p \rangle$, which is given by $\langle f, p \rangle := \sum_{x \in X} f(x) p(x)$.
Note that this is the inner product of $f$ and $p$ when treated as vectors in $\mathbb{R}^X$, so we will use $\langle \cdot, \cdot \rangle$ for inner products in general.

The normal distribution with mean $\mu$ and standard deviation $\sigma$ is $N(\mu, \sigma)$. We write $x \sim N(\mu, \sigma)$ to denote a random variable drawn from this distribution. We use $\propto$ to denote proportionality, so the probability density function of $N(\mu, \sigma)$ is $p(x) \propto e^{-(x-\mu)^2 / (2\sigma^2)}$.

The Iverson bracket $[P(x)]$ is 1 or 0 depending on whether the condition $P(x)$ is true or false. So $f(x) = [x \text{ is even}]$ has $f(3) = 0$ and $f(6) = 1$.

1.2 Optimization Basics

Optimization, at its core, is a search problem: we have an exponentially large (or possibly infinite) set of choices, among which we want to find the "best" one. To be precise, the most general formulation of optimization is that we have a solution space $X$ (also called a state space or feasible set) and some objective function (or cost function) $f : X \to \mathbb{R}$, saying how good a given solution is. Our goal is to find an $x \in X$ which minimizes the objective value $f(x)$. Our standard notation for an optimization problem is:

$$\begin{aligned} \min_{x} \quad & f(x) \\ \text{s.t.} \quad & x \in X \end{aligned} \tag{1.2}$$

For small problems which fit on one line, we will also write

$$\min_x \{ f(x) \mid x \in X \}. \tag{1.3}$$

Most of the optimization problems we'll consider are minimization problems, where the goal is to find an $x$ with $f(x)$ as small as possible. We will write the optimum value (either minimum or maximum) of an optimization problem as OPT, for example OPT(1.2). Maximization problems will arise later, particularly when we come to the topic of duality. Note that we can convert back and forth between maximization and minimization problems by using the identity

$$\max_{x \in X} f(x) = -\min_{x \in X} -f(x). \tag{1.4}$$

By itself, having an objective function isn't much use: if we know nothing at all about the function $f$ (i.e., we have a black box which, given an $x \in X$, evaluates and returns $f(x)$), then this general optimization problem is as hard as a totally unguided search problem, and the best possible algorithm is to evaluate every single $f(x)$ and return the best one. Since the set $X$ is exponentially (or infinitely) large for most interesting problems, this tells us there cannot be an efficient optimization algorithm that doesn't "look inside" the function $f$.

Consequently, the study of optimization algorithms always involves taking advantage of problem structure, whether that structure comes from a particular form for the objective $f$ (as in the clique structure of the MRF objective (1.1)), or from structure in the feasible set $X$.

Figure 1.1: The constraints of the optimization problem (1.5). The feasible region is shaded grey. Note that any value which satisfies $2x_1 + x_2 \ge 2$ and $x_2 \ge 0$ also satisfies the inequality $x_1 + x_2 \ge 1$.

1.2.1 Constrained Optimization

Very commonly, problem structure comes from the state space $X$ being defined by some constraints. That is, the set $X$ is specified by a set of equations or inequalities that the elements $x \in X$ must satisfy. For example, in the simple optimization problem

$$\begin{aligned} \min_{x_1, x_2} \quad & 3x_1 + 4x_2 \\ \text{s.t.} \quad & 2x_1 + x_2 \ge 2 \\ & x_1 + x_2 \ge 1 \\ & x_1, x_2 \ge 0 \end{aligned} \tag{1.5}$$

we have 4 constraints, namely that $x_1, x_2$ satisfy each of the four inequalities: $2x_1 + x_2 \ge 2$, $x_1 + x_2 \ge 1$, $x_1 \ge 0$ and $x_2 \ge 0$. Solutions which violate any of these inequalities are called infeasible. For example, $(x_1, x_2) = (\tfrac{1}{2}, 0)$ is infeasible (meaning $(\tfrac{1}{2}, 0) \notin X$), because in the first inequality we would have $2 \cdot \tfrac{1}{2} + 0 = 1 < 2$.

In general, in a constrained optimization problem we are given some functions $g_j : \mathcal{X} \to \mathbb{R}$ indexed by $j \in J$, and our optimization problem is

$$\begin{aligned} \min_{x} \quad & f(x) \\ \text{s.t.} \quad & g_j(x) \ge 0 \quad \forall j \in J \\ & x \in \mathcal{X} \end{aligned} \tag{1.6}$$

We call each inequality $g_j(x) \ge 0$ a constraint, and the feasible set $X$ is defined to be the subset of $\mathcal{X}$ for which each constraint is satisfied, i.e., $X = \{x \in \mathcal{X} \mid g_j(x) \ge 0, \ \forall j \in J\}$. The set $\mathcal{X}$ is called the ambient space, and is typically $\mathbb{R}^n$ for some $n$.

Also note that we may have constraints of the form $g_j(x) \le 0$ or $g_j(x) = 0$. These can each be converted into the standard form of (1.6) by noting that $g_j(x) \le 0$ is equivalent to $-g_j(x) \ge 0$, and $g_j(x) = 0$ can be replaced by the two inequalities $g_j(x) \ge 0$ and $-g_j(x) \ge 0$.

Any optimization problem may have many ways of being written in terms of constraints, and some constraints may be redundant. For example, in (1.5) the constraint $x_1 + x_2 \ge 1$ is redundant, since any $x_1, x_2$ which satisfies $2x_1 + x_2 \ge 2$ and $x_2 \ge 0$ also satisfies $x_1 + x_2 \ge 1$ (see Figure 1.1 for illustration).

1.2.2 Constraint Indicator Functions

An important construction for turning constrained minimization problems into unconstrained minimization is the indicator function: for a minimization problem and constraint $g_j(x) \ge 0$, the indicator function is

$$I^{\min}_{g_j}(x) := \begin{cases} 0 & g_j(x) \ge 0 \\ \infty & \text{otherwise.} \end{cases} \tag{1.7}$$

That is, $I^{\min}_{g_j}$ is 0 whenever $x$ satisfies the constraint, and is infinite whenever the constraint is violated. Using the indicator function, we can replace all our constraints with terms in the objective, to get an unconstrained minimization $\min_x F(x)$ where

$$F(x) = f(x) + \sum_j I^{\min}_{g_j}(x). \tag{1.8}$$

Whenever $x$ is feasible (i.e., satisfies all the inequalities $g_j(x) \ge 0$) then $\sum_j I^{\min}_{g_j}(x) = 0$, so $F(x) = f(x)$. However, if $x$ is infeasible, it violates at least one inequality, so we will have $F(x) = \infty$. In other words, we have replaced 'disallowed' solutions which violate the constraints, by putting an infinite cost on those solutions. Therefore, minimizing the unconstrained $F$ is the same as minimizing $f$ with the constraints $g_j$.
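As an illustrative sketch (not part of the original text), the indicator-function construction can be checked numerically on the small problem (1.5). The finite grid of candidate points below is an assumption made purely so that brute-force search is possible, and the names `constraints`, `indicator_min` and `F` are hypothetical helpers.

```python
import math

def f(x1, x2):
    # Objective of problem (1.5).
    return 3 * x1 + 4 * x2

# Constraints of (1.5), each written in the standard form g_j(x) >= 0.
constraints = [
    lambda x1, x2: 2 * x1 + x2 - 2,
    lambda x1, x2: x1 + x2 - 1,
    lambda x1, x2: x1,
    lambda x1, x2: x2,
]

def indicator_min(g, x1, x2):
    # Indicator (1.7): 0 if the constraint holds, +infinity otherwise.
    return 0.0 if g(x1, x2) >= 0 else math.inf

def F(x1, x2):
    # Unconstrained objective (1.8): f plus one indicator per constraint.
    return f(x1, x2) + sum(indicator_min(g, x1, x2) for g in constraints)

# Assumption: a coarse grid over [-2, 3] x [-2, 3] stands in for the ambient space.
grid = [k / 4 for k in range(-8, 13)]
points = [(a, b) for a in grid for b in grid]

# Constrained minimum: search only over feasible grid points.
feasible = [p for p in points if all(g(*p) >= 0 for g in constraints)]
constrained_min = min(f(*p) for p in feasible)

# Unconstrained minimum of F: search over every grid point.
unconstrained_min = min(F(*p) for p in points)
best = min(points, key=lambda p: F(*p))

print(constrained_min, unconstrained_min, best)
```

On this grid both searches agree, and the infeasible point $(\tfrac{1}{2}, 0)$ discussed earlier receives infinite cost under `F`.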
Note that for maximization problems, we instead have that solutions with value $-\infty$ are infeasible, so the indicator function for a maximization problem is

$$I^{\max}_{g_j}(x) := \begin{cases} 0 & g_j(x) \ge 0 \\ -\infty & \text{otherwise.} \end{cases} \tag{1.9}$$

1.2.3 Minimizing Elements

We will reserve the notation $x^*$ for elements $x$ which are optimal, i.e., for which $f(x^*) = \min_x \{ f(x) \mid x \in X \}$. The set of all such $x$ is denoted by argmin, so that

$$\operatorname*{argmin}_{x \in X} \{ f(x) \} := \{ x \in X \mid f(x) \le f(x'), \ \forall x' \in X \}. \tag{1.10}$$

We will use the shorthand $X^*$ to denote the set of minimizers, when the problem we are referring to is clear from context.

Note that in general the set of minimizers may be empty or have more than one element. For example, $\operatorname{argmin}_{x \in \mathbb{R}} \{ e^x \} = \emptyset$ and $\operatorname{argmin}_{x \in \mathbb{R}} \{ (x-1)^2 (x+1)^2 \} = \{-1, 1\}$.

Remark 1. For almost all problems in this thesis, it will be the case that the set of minimizers is non-empty. We will make special note of problems where this is not necessarily the case. Correspondingly, all proofs will make use of the existence of optimizers where useful.

There are various conditions which ensure that minimizers exist. One of the most powerful is the following:

Theorem 2. If $X$ is topologically compact and $f : X \to \mathbb{R}$ is continuous, then $\operatorname{argmin}_{x \in X} f(x) \ne \emptyset$.

In particular, if $X$ is a finite set, then it is always compact and any function $f : X \to \mathbb{R}$ is continuous (with respect to the discrete topology). Therefore, in the common case of $|X|$ finite (also called discrete $X$) minimizers always exist. Another special case is when the feasible set $X$ is a subset of $\mathbb{R}^n$. A subset of $\mathbb{R}^n$ is compact if and only if it is closed and bounded (meaning there is some $R$ with $\|x\| < R$ for all $x \in X$), so in particular minimizers of a continuous function always exist on any closed, bounded subset of $\mathbb{R}^n$.

1.2.4 Relaxations

An important question is how to relate two optimization problems. The most common relation we will deal with is the notion of relaxation.
The basic idea behind relaxations comes from a seemingly trivial observation: if we minimize over a larger set, the minimum value can only go down or stay the same. For example, it's clear that $\min \{3, 8, 6\} \ge \min (\{3, 8, 6\} \cup \{1, 5\})$, since the latter, larger set contains the minimum value 3 of the smaller set, plus possibly some other elements which may be smaller (like 1).

This idea is the basic intuition for relaxations, in which we take a constrained optimization problem and we ignore, or relax, some constraints. For example, from our example (1.5) above, we can get a relaxation by removing the third constraint (that $x_1 \ge 0$) to get

$$\begin{aligned} \min_{x_1, x_2} \quad & 3x_1 + 4x_2 \\ \text{s.t.} \quad & 2x_1 + x_2 \ge 2 \\ & x_1 + x_2 \ge 1 \\ & x_2 \ge 0 \end{aligned} \tag{1.11}$$

In this case, we know that OPT(1.5) $\ge$ OPT(1.11), since the latter problem is minimizing over a larger set.

To cover all the cases we'll use later on, we'll extend this notion to work not just with constrained optimization. We'll also allow renaming of elements of $X$, by a function $g$ from $X$ to some other set $X'$.

Definition 3. An optimization problem $(f, X)$ embeds into another problem $(f', X')$ if there is a function $g : X \to X'$ with $f(x) = f'(g(x))$ for all $x \in X$.

The most important special case of embeddings is relaxation, where $X \subseteq X'$ and the mapping $g$ is just the inclusion map: $g(x) = x$.

Figure 1.2: Illustration of the proof of Lemma 5. As long as the function $g$ preserves objective values, then whatever the minimizer of $(f', X')$ is, it is at least as good as $g(x^*)$. [Figure labels: decreasing objective $f(x)$; sets $\Omega$ and $\Omega'$; map $g$; points $x^*$, $g(x^*)$, $x'^*$.]

From our example above, we should expect the minimum value of the relaxed problem to be no larger than that of the original problem. First, we have a very basic fact about lower bounds:

Proposition 4. If $L$ is a lower bound of a set of real numbers $A \subseteq \mathbb{R}$, meaning $L \le a$ for all $a \in A$, then $L \le \min A$.

Lemma 5. If $(f, X)$ embeds into $(f', X')$ then OPT$(f, X) \ge$ OPT$(f', X')$.

Proof. See Figure 1.2 for illustration. We'll show that OPT$(f', X')$ is a lower bound to $\{ f(x) \mid x \in X \}$.
The Lemma then follows immediately from Proposition 4. Let $x \in X$. We have that $g(x) \in X'$ and $f'(g(x)) = f(x)$ since $g$ is an embedding. Then, since $g(x)$ is feasible for $(f', X')$ we have $f'(g(x)) \ge$ OPT$(f', X')$ and therefore $f(x) \ge$ OPT$(f', X')$ as well.

1.2.5 Equivalence of Optimization Problems

Another question we might want to ask is: under what conditions can we use one problem to solve another? We can think of such problems as equivalent: a solution to one gives us a solution to the other, and vice versa. There are many ways of formalizing this notion of equivalence, but the following will be the most helpful for our purposes.¹

¹ In particular, we define problem equivalence so that we can convert from optimal solutions of one problem to optimal solutions of another problem. When investigating probabilistic inference, this definition is most suited to taking the log-probability of a Gibbs energy to get an energy function of the form (1.1). However, it does not preserve approximation algorithms, and two equivalent problems (according to this definition) may have very different best-possible approximation ratios.

Definition 6. Two optimization problems $(f, X)$ and $(f', X')$ are equivalent if there is a bijection $g : X \to X'$ which is order-preserving, i.e.,

$$f(x) \le f(y) \ \text{ implies } \ f'(g(x)) \le f'(g(y)). \tag{1.12}$$

If $g, g^{-1}$ are poly-time computable, then we'll say $(f, X)$ and $(f', X')$ are poly-time equivalent.

Given our identity (1.4) converting maximization problems to minimization problems, that $\min_x f(x) = -\max_x -f(x)$, we will say that a maximization problem $(f, X)$ and minimization problem $(f', X')$ are equivalent if there is an order-reversing bijection $g$ between them, meaning $f(x) \le f(y)$ implies $f'(g(x)) \ge f'(g(y))$.

As we claimed, given equivalent problems we can convert solutions of one to solutions of the other. Note that in the case of a poly-time equivalence, this is a reduction in the usual NP-completeness sense.

Lemma 7. If $(f, X)$ and $(f', X')$ are equivalent (with mappings $g : X \to X'$ and $g^{-1} : X' \to X$) the functions $g$ and $g^{-1}$ send minimizers to minimizers: $g(X^*) \subseteq X'^*$ and $g^{-1}(X'^*) \subseteq X^*$.

Proof. Let $x^*$ be a minimizer of $(f, X)$. We want to show that $g(x^*)$ is a minimizer of $(f', X')$, so let $x'$ be any other element of $X'$. Since $g$ is a bijection, we have that $g^{-1}(x') \in X$, and since $x^*$ is a minimizer of $f$, we must have

$$f(x^*) \le f(g^{-1}(x')). \tag{1.13}$$

Then, since $g$ is order-preserving, we have

$$f'(g(x^*)) \le f'(g(g^{-1}(x'))) = f'(x'). \tag{1.14}$$

Therefore, $f'(g(x^*))$ is no larger than the value of any other element of $X'$, so $g(x^*)$ is a minimizer of $(f', X')$. The reverse claim follows by symmetry.

1.2.6 Common Equivalences Between Problems

There are several common transformations we will apply to problems, that all deal with manipulating the objective function to get an equivalent problem. In particular, adding a constant to the objective function or multiplying the objective by a fixed positive constant leads to an equivalent optimization problem. These are both special cases of a general rule: applying a monotonic transformation to the objective is an equivalence.

Lemma 8. If $g : \mathbb{R} \to \mathbb{R}$ is monotonically increasing (i.e., $x \le y$ implies $g(x) \le g(y)$) then the optimization problems $(f, X)$ and $(g \circ f, X)$ are equivalent.

Proof. The identity function $\mathrm{id} : X \to X$ is of course a bijection. Furthermore, whenever $g$ is monotonic, then $\mathrm{id}$ is order-preserving, in the sense of Definition 6. Indeed, we need to show that

$$f(x) \le f(y) \ \text{ implies } \ f'(\mathrm{id}(x)) \le f'(\mathrm{id}(y)). \tag{1.15}$$

But plugging in $f' = g \circ f$ this is just

$$f(x) \le f(y) \ \text{ implies } \ g(f(x)) \le g(f(y)) \tag{1.16}$$

which follows from $g$ being monotonic.

Corollary 9. For any constant $b$ and positive constant $a > 0$, the optimization problems $(f, X)$ and $(af + b, X)$ are equivalent.

Proof. The function $ax + b$ is monotonic for $a > 0$.

In particular, we can always ignore constant terms in any optimization problem.
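As a small numerical sketch (an illustration added here, not taken from the original), Lemma 8 and Corollary 9 can be checked by brute force on a finite domain. The integer domain `X`, the constants 5 and 7, and the helper `argmin` are all hypothetical choices; the objective reuses the function $(x-1)^2(x+1)^2$ from Section 1.2.3, and both transformations used are strictly increasing.

```python
import math

# Assumption: a small finite domain, so that minimizers can be found by enumeration.
X = list(range(-3, 4))

def f(x):
    # Objective from Section 1.2.3, which has two minimizers on this domain.
    return (x - 1) ** 2 * (x + 1) ** 2

def argmin(objective, domain):
    # Return the set of all minimizers of `objective` over `domain`.
    best = min(objective(x) for x in domain)
    return {x for x in domain if objective(x) == best}

# Original problem (f, X).
base = argmin(f, X)

# Corollary 9: a*f + b with a > 0 (here a = 5, b = 7).
affine = argmin(lambda x: 5 * f(x) + 7, X)

# Lemma 8 with another increasing transformation, g = exp.
exponential = argmin(lambda x: math.exp(f(x)), X)

print(base, affine, exponential)
```

All three searches return the same set of minimizers, which is what the equivalence predicts for these particular transformations.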
For example, in (1.5) the objective $7 + x_1 + 2x_2$ can be replaced with $x_1 + 2x_2$, which is an equivalent problem according to Lemma 8. Other useful monotonic functions include log and exp, which we can combine with Lemma 8 to convert products to sums and vice versa.

1.3 Example: Image Segmentation

In this section, we will illustrate all of the above concepts by way of an extended example. In computer vision, a prototypical use of optimization is to compute a foreground-background segmentation of an image, most commonly solved by reduction to minimum cut in a graph. This example will be familiar to readers who already know graph cuts; however, it illustrates the key concepts of reduction, gadgets, and the tradeoff between more complex models and efficient optimization, all of which are main themes of this thesis.

The foreground-background segmentation problem is this: for each pixel, we want to give a binary label indicating that this pixel is either part of the foreground object or part of the background. With $n$ pixels, there are $2^n$ possible segmentations, so brute-force searching for the best one is impractical. However, we haven't yet specified $f$ or $X$, which together will give us the structure we need to solve the problem.

1.3.1 Binary Labeling Problems

First, we define the feasible set $X$. Binary segmentation is an example of a labeling problem, where we are trying to give every pixel of an image some discrete label. We formalize this by giving each pixel a corresponding variable. That is, if we let $V$ be the set of pixel indices, then for every $i \in V$ there is a corresponding variable $x_i$. These variables take values (also called labels) from $\{0, 1\}$, where a label of $x_i = 0$ indicates that pixel $i$ is assigned to the background, and $x_i = 1$ indicates that the pixel is in the foreground. Therefore, the feasible set (called a label space in a labeling problem) is $X = \{0, 1\}^V$.
In particular, because there are only two labels, this problem is called a binary labeling problem.

1.3.2 Per-Pixel Cost Functions

There are many possible choices for the cost function $f$. We will start with the simplest one, where we assume that for every pixel we have some (not necessarily accurate) idea of whether it is likely to be in the foreground or the background. For example, in the image in Figure 1.3, we are trying to segment out the banana as the foreground object, with the remainder of the image as the background. If (for example) a machine learning algorithm has seen many examples of segmented bananas, it could learn that yellow and brown pixels are likely to be foreground, while other colors are likely to be background.

Figure 1.3: (Left) An example input to a binary segmentation problem. The foreground object we want to segment out is the banana. (Right) The desired segmentation mask for the foreground object.

These rough ideas of likelihood are formalized as a cost function $f_i$ for each pixel: $f_i(0)$ gives the cost for the pixel being in the background, and $f_i(1)$ that of the pixel being in the foreground. We then have an objective function

$$ f(x) = \sum_i f_i(x_i). \tag{1.17} $$

If our costs are constructed such that a low cost $f(x)$ equates to a high likelihood of $x$, then the minimizing $x$ is the most likely solution.

The function (1.17) has a particularly simple structure to optimize: a function is called separable when it can be written as a sum of functions $f_i(x_i)$, each of which is a function of a single variable $x_i$, with no shared variables between them. That is, we can write $f(x) = \sum_i f_i(x_i)$. In this case, it is clear that we can find the minimizing $x$ by setting $x_i$ to be 0 if $f_i(0) \le f_i(1)$, and 1 otherwise. For separable objectives, making locally good choices gives a globally optimal solution, so these are among the simplest objectives to optimize.

Figure 1.4: (Left) The resulting segmentation using a unary-only model, as in (1.17).
(Right) The resulting segmentation after adding edge terms of the form (1.19). Note that the segmentation boundaries much more closely follow the actual boundaries of the foreground object.

1.3.3 Spatial Relations Between Pixels

This simple, separable model unfortunately does not give good results: even with very informative functions $f_i$, the segmentations obtained tend to be very noisy, as shown in Figure 1.4. The problem is that our model is missing important knowledge about the problem: it makes the (implicit) assumption that the label of a pixel is unconnected with the labels of its neighbors. We can see this by noting that the minimizing $x_i$ is found independently of the others (i.e., without reference to $f_j$ for $j \ne i$).

However, we actually know a lot about the relations between pixels in an image. For example, the foreground labels tend to form a connected region in the image, and in general, a given pixel being in the foreground is good evidence that its neighbors are likely to be foreground pixels as well (and similarly for background pixels).

The intuition we want to incorporate into improving (1.17) is that we should take advantage of the spatial relations between pixels in an image. To do so, we make use of the fact that the pixels $V$ are arranged in a grid, and we will say that pixels $i, j$ are neighbors if they are nearby in this grid. Typical neighbor sets are the 4-connected grid (where $i$ has up to 4 neighbors: up, down, left and right of it) and the 8-connected grid (which includes the 4-connected neighbors, as well as the 4 immediately diagonal neighbors). See Figure 1.5 for an illustration. We will let $N(i)$ be the set of pixels $j$ which are neighbors of $i$, and let $E = \{\{i, j\} \mid j \in N(i)\}$ be the set of all pairs of neighbors, called edges.

Figure 1.5: (Left) Typical 4-connected neighborhood on a pixel grid. (Right) 8-connected neighborhood. Neighbors $N(i)$ of pixel $i$ are bolded.
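The neighborhood structure just described is simple to build explicitly. The following is a minimal sketch, assuming pixels are indexed in row-major order; the function names `grid_edges` and `neighbors` are illustrative and do not come from the thesis.

```python
def grid_edges(height, width, connectivity=4):
    """Return the set E of unordered neighbor pairs on a height x width grid.

    Assumption: pixel (row, col) has index i = row * width + col.
    """
    if connectivity == 4:
        offsets = [(0, 1), (1, 0)]                    # right, down
    elif connectivity == 8:
        offsets = [(0, 1), (1, 0), (1, 1), (1, -1)]   # plus the two diagonals
    else:
        raise ValueError("connectivity must be 4 or 8")

    edges = set()
    for r in range(height):
        for c in range(width):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < height and 0 <= cc < width:
                    # frozenset makes {i, j} an unordered pair
                    edges.add(frozenset((r * width + c, rr * width + cc)))
    return edges


def neighbors(i, edges):
    """Return N(i), the set of pixels that share an edge with pixel i."""
    return {j for e in edges if i in e for j in e if j != i}


E4 = grid_edges(3, 3, connectivity=4)
E8 = grid_edges(3, 3, connectivity=8)
```

Only the "forward" offsets are enumerated, so each unordered pair is generated once; border pixels simply end up with fewer neighbors, matching the "up to 4 neighbors" wording above.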
1.3.4 The Potts Model

With this neighborhood structure, a simple model which encourages neighbors to take the same labels is the Potts model [75]. This model has the unary costs $f_i$ from (1.17), as well as pairwise costs $f_{i,j}$ between pairs of neighbors $\{i, j\} \in E$. The pairwise costs are of a particularly simple form: we pay a flat cost every time the endpoints of an edge $\{i, j\}$ have different labels. That is,

$$ f_{i,j}(x_i, x_j) = \begin{cases} 1 & x_i \ne x_j \\ 0 & \text{otherwise.} \end{cases} \tag{1.18} $$

Note that we'll index $f_{i,j}$ using unordered pairs, meaning $f_{i,j}(x_i, x_j) = f_{j,i}(x_j, x_i)$.

We incorporate these pairwise terms into (1.17) by letting

$$ f(x) = \sum_i f_i(x_i) + \lambda \sum_{\{i,j\} \in E} f_{i,j}(x_i, x_j). \tag{1.19} $$

Because we now pay a cost for neighbors with different labels, the minimum cost solution will trade off choosing likely foreground-background assignments against having too many discontinuities in the labeling. We can adjust the parameter $\lambda$ to change the relative weights in this tradeoff. Higher $\lambda$ encourages more cohesiveness in the segmentation: as $\lambda \to \infty$ the segmentation will eventually become a single all-foreground or all-background region; as $\lambda \to 0$ the objective becomes the same as the unary-only model of (1.17), and we get the noisy segmentations seen in Figure 1.4. The best segmentations will be obtained for $\lambda$ somewhere in the middle, and the best $\lambda$ must be tuned as a hyperparameter of the model (typically by evaluating many such $\lambda$ on a validation set, and choosing the value with the best segmentation results on this set of held-out images).

1.3.5 Reduction to Graph Cut

However, unlike (1.17), it is not immediately obvious how to find a minimum energy solution $x$ to (1.19). We cannot simply set each variable $x_i$ to the label with the smallest $f_i(x_i)$ as before, as this might cause a large penalty from the pairwise terms $f_{i,j}(x_i, x_j)$.
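The failure of the per-pixel rule is easy to see on a tiny instance. The sketch below evaluates the energy (1.19) on a hypothetical three-pixel chain (the unary costs and the value of λ are made up for illustration) and compares the per-pixel choice with an exhaustive search over all $2^3$ labelings.

```python
from itertools import product


def potts_energy(x, unary, edges, lam):
    """Energy (1.19): unary costs plus lam for each edge whose endpoints disagree."""
    data = sum(unary[i][x[i]] for i in range(len(x)))
    smooth = sum(1 for (i, j) in edges if x[i] != x[j])
    return data + lam * smooth


# Hypothetical 1-D "image" of three pixels; unary[i] = (f_i(0), f_i(1)).
unary = [(0.0, 1.0), (1.0, 0.8), (0.0, 1.0)]
edges = [(0, 1), (1, 2)]
lam = 0.5

# Per-pixel rule from Section 1.3.2: pick the cheaper label independently.
greedy = tuple(0 if f0 <= f1 else 1 for f0, f1 in unary)

# Exhaustive search, feasible only because the instance is tiny.
best = min(product((0, 1), repeat=len(unary)),
           key=lambda x: potts_energy(x, unary, edges, lam))

e_greedy = potts_energy(greedy, unary, edges, lam)
e_best = potts_energy(best, unary, edges, lam)
```

Here the middle pixel slightly prefers the foreground label on its own, but labeling it foreground cuts both of its edges, so the all-background labeling has lower total energy than the per-pixel choice.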
To solve this minimization problem we will use the standard optimization technique of reduction, whereby we transform (or reduce) some seemingly difficult problem into another problem which we already know how to solve. Minimizing (1.19) can be reduced to the well-known optimization problem known as minimum cut.

The min-cut problem is the following: we are given a graph with nodes $N$ and arcs² $A \subseteq N \times N$, along with capacities $c_{i,j}$ on each arc $(i, j) \in A$. There are two special nodes $s, t \in N$, called respectively the source and sink. A cut is a set $S \subseteq N$ that contains $s$ but not $t$ (i.e., $s \in S$ and $t \notin S$). The cost of a cut is the sum of all capacities of arcs with their first endpoint in $S$ and the other outside $S$, so $c(S) = \sum_{i \in S} \sum_{j \notin S} c_{i,j}$. Our goal is to find a cut $S$ minimizing $c(S)$.

There exist many efficient algorithms for solving these problems, including Augmenting Paths [23], Push-Relabel [14], as well as the current state of the art for inputs typical of vision problems, IBFS [30]. If we can transform our actual problem (1.19) into an input for the min-cut problem, then we can apply any of these algorithms to find the solution.

Fortunately, the min-cut problem has many similarities to the binary segmentation problem. If we think of membership-or-not in $S$ as a binary label, then the cost function $c(S)$ looks very much like the Potts terms $f_{i,j}$: we pay a cost $c_{i,j}$ whenever neighbors $i$ and $j$ take different labels (i.e., $i$ is in $S$ and $j$ is not in $S$). The only difficulty is handling the unary terms $f_i$.

To transform our problem into a min-cut input, we introduce a graph construction (sometimes called a gadget), which is a method for constructing a particular graph whose vertices and edges are chosen in such a way that finding the minimum cut in this graph gives a solution to the binary segmentation problem.
For binary segmentation, our constructed graph has a vertex for every pixel of the original problem, as well as two special nodes $s$ and $t$, called a super-source and super-sink, so $N = V \cup \{s, t\}$. For every undirected edge $e = \{i, j\}$ in the neighborhood structure, we get two directed arcs $(i, j)$ and $(j, i)$, both of which have capacity $\lambda$. Finally, we account for the unary terms by adding arcs $(s, i)$ and $(i, t)$ for every pixel $i$, with capacities $c_{s,i} = f_i(0)$ and $c_{i,t} = f_i(1)$.

² We distinguish edges, which are undirected (i.e., are unordered pairs $\{i, j\}$), from arcs, which are directed pairs $(i, j)$.

This is now a valid input to the min-cut problem, so the only remaining question is whether this is actually an equivalent optimization problem to our original objective (1.19). According to Definition 6, we need an order-preserving bijection between these two problems. We've already noted a natural way of identifying binary vectors $x$ with cuts $S$, namely, taking membership in $S$ as a binary label. To ensure that $s$ is always in the cut $S$ and $t$ is not in $S$, we define $g$ from binary labelings to cuts by $g(x) = \{s\} \cup \{i \in V \mid x_i = 1\}$. It is clear that this is a bijection, with inverse $g^{-1}(S)$ being the binary vector with $x_i = 1$ for $i \in S$ and $x_i = 0$ for $i \notin S$.

Finally, $g$ is not just order-preserving, but actually leaves the objective value the same. Indeed, for any $x$ we have

$$ f(x) = \sum_{i : x_i = 1} f_i(1) + \sum_{i : x_i = 0} f_i(0) + \lambda \sum_{(i,j) : x_i = 1,\, x_j = 0} 1, \tag{1.20} $$

whereas the cost of the corresponding cut $S$ is

$$ c(S) = \sum_{i \in S} f_i(1) + \sum_{i \notin S} f_i(0) + \lambda \sum_{(i,j) : i \in S,\, j \notin S} 1, \tag{1.21} $$

with $i$ and $j$ ranging over the pixels in $V$. Then, remembering that $x_i = 1$ if and only if $i \in S$, we see that these two equations are the same, so $f(x) = c(S)$.

1.3.6 Discussion

It's important to not let these details obscure the overall plan at work: because the binary segmentation problem is equivalent to the min-cut problem, Lemma 7 says we can take any optimal solution to the min-cut problem and get back an optimal solution to the segmentation problem.
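The construction of Section 1.3.5 can be checked numerically on a small instance. The sketch below builds the graph with capacities $c_{s,i} = f_i(0)$, $c_{i,t} = f_i(1)$ and $\lambda$ on each internal arc, finds a minimum cut, and compares $g^{-1}(S)$ with exhaustive search. The max-flow routine is a plain textbook Edmonds-Karp written for this illustration (it is not one of the solvers cited above), and the unary costs are hypothetical, nonnegative values.

```python
from collections import deque
from itertools import product


def min_cut_source_side(capacity, s, t):
    """Edmonds-Karp max-flow; returns the source side S of a minimum s-t cut.

    capacity is a dict of dicts with capacity[u][v] = c_{u,v} >= 0.
    """
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():          # add zero-capacity reverse arcs
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0.0)

    while True:
        # BFS from s over arcs with positive residual capacity.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, r in residual[u].items():
                if r > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)                # nodes still reachable from s
        # Recover the augmenting path and push its bottleneck capacity.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push


def segmentation_graph(unary, edges, lam):
    """Min-cut input for energy (1.19); unary[i] = (f_i(0), f_i(1))."""
    cap = {"s": {}, "t": {}}
    for i, (f0, f1) in enumerate(unary):
        cap["s"][i] = f0                      # cut when i is outside S (x_i = 0)
        cap.setdefault(i, {})["t"] = f1       # cut when i is inside S (x_i = 1)
    for i, j in edges:
        cap[i][j] = lam
        cap[j][i] = lam
    return cap


def energy(x, unary, edges, lam):
    return (sum(unary[i][x[i]] for i in range(len(x)))
            + lam * sum(1 for i, j in edges if x[i] != x[j]))


unary = [(0.0, 1.0), (1.0, 0.8), (0.0, 1.0), (2.0, 0.1)]   # hypothetical costs
edges = [(0, 1), (1, 2), (2, 3)]
lam = 0.5

S = min_cut_source_side(segmentation_graph(unary, edges, lam), "s", "t")
x_cut = tuple(1 if i in S else 0 for i in range(len(unary)))   # g^{-1}(S)
x_brute = min(product((0, 1), repeat=len(unary)),
              key=lambda x: energy(x, unary, edges, lam))
```

On this instance both approaches return the same labeling, as the equivalence argument predicts. Exhaustive search is used only as a check; it scales as $2^n$, whereas min-cut remains polynomial.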
In fact, this solution is as simple as taking the min-cut $S^{*}$, and then applying $g^{-1}$ to get a binary map $x^{*}$ out of it.

A final observation: the min-cut problem was already close enough to our segmentation problem that we could actually construct a gadget showing them to be equivalent. This is no coincidence: only by choosing to model the binary segmentation problem in this particular way (with an objective function of the form (1.19)) is this connection obvious. In fact, the existence of this graph construction is somewhat fragile: as we add more detail to the model, with more elaborate cost functions, it may break.

For example, if we want to modify (1.19) to allow a different cost $\lambda_{i,j}$ on each edge, then we are still fine: the constructed min-cut problem would now have a different cost $c_{i,j} = \lambda_{i,j}$ on each internal arc $(i, j)$, and min-cut will still find a solution. However, if we want to solve more general segmentation problems, for example semantic segmentation, where the label space is a larger set of semantic categories (e.g., sky, grass, building, etc.), then it is not obvious how to repair the construction, since min-cut is a binary problem, and multi-label generalizations of min-cut (such as multi-way cut) are generally NP-hard. Another change we could consider making is to allow some pairs of neighboring pixels to have a negative cost for taking different labels (i.e., allowing $\lambda_{i,j} < 0$). In this case, our construction would have negative capacities $c_{i,j} < 0$, and the minimization problem is again NP-hard (by reduction from max-cut).

This difficulty is the heart of the tension between modeling and inference: the limits of our inference algorithms constrain the expressiveness of how we can model problems from applications.

1.4 Markov Random Fields

Having seen how these ideas work in a practical example, we can now formalize all of the concepts that make foreground-background segmentation work.
The following definitions are all building towards the main topic of this thesis: optimization for Markov Random Fields.

1.4.1 Labeling Problems

In our segmentation example, our optimization problem had a variable $x_i$ for each pixel $i$, and each variable could take one of two labels, representing a choice of foreground or background for that pixel. Many problems have this general form, frequently with larger label sets than just $\{0, 1\}$. For example, in a semantic segmentation problem, the pixels take labels from a pre-defined set of semantic categories, such as Sky, Ground, Tree, etc. The label sets can be quite large in practice: in optical flow each pixel is labeled with a two-dimensional displacement, matching every pixel with its corresponding pixel in the previous video frame, so that the two pixels correspond to the same physical point. In this case the size of the label space is potentially the total number of pixels in the image (and even larger if subpixel displacements are desired).

Definition 10. A labeling problem is an optimization problem with a set of variables $x_i$ indexed by $i \in V$, each taking values from a label set $X_i$. That is, $X = \prod_{i \in V} X_i$. We will use vector notation $x = (x_1, \ldots, x_n)$ to indicate a feasible state, which we will call a labeling. We will write $x_C$ for the subvector of $x$ restricted to the indices $i \in C$, so that $x_C = (x_i)_{i \in C}$. Similarly, define $X_C = \prod_{i \in C} X_i$ to be the subspace of $X$ corresponding to $C$.

1.4.2 Maximum A-Posteriori (MAP) Inference

In defining the unary terms for equation (1.17), we indicated that the $f_i$ were chosen to correspond to how likely a particular pixel was to be background or foreground, given the information we observed in the image (i.e., that a particular yellow pixel was likely to be part of the banana, and therefore foreground). This rough idea is formalized by the notion of probabilistic inference.

For probabilistic inference, in addition to our unknown state $x \in X$, we also have a set of observations $y \in Y$.
For example, in binary segmentation our observations include all the color values $y_i$ of each pixel $i$ that form the image. Probabilistic inference further assumes that there is some joint probability distribution $p \in P[X \times Y]$ over all possible observations and their associated labelings.

We don't assume we have knowledge of this joint probability distribution, but it does allow us to consider the conditional probability $p[x \mid y]$: the probability of an unknown state $x \in X$, given that we observed $y \in Y$. This conditional distribution is known as the posterior probability (since it is the remaining probability after having observed a definite $y$). The Maximum A-Posteriori (MAP) problem is to find the state $x$ with the highest posterior probability:

$$ \max_{x \in X} p[x \mid y]. \tag{1.22} $$

That is, having seen (i.e., conditioning on) the values of the observable $y$, we choose the most probable $x$ as our answer. Note that MAP is not the only probabilistic decision method, but it is empirically quite successful, as well as amenable to an optimization approach.³

In order to compute this posterior probability, it is frequently easier to reason about another conditional distribution: $p[y \mid x]$. This is known as the likelihood function, and gives the probability of seeing an observation $y$ given that the true state was $x$. Likelihood functions are useful whenever we have a forward model of how our observations are formed from the hidden latent variables $x$.

Given this likelihood function, we can use Bayes' rule to compute the posterior distribution

$$ p[x \mid y] = \frac{p[y \mid x]\, p[x]}{p[y]}. \tag{1.23} $$

There are two other terms to explain. The probability $p[x]$ is the unconditional probability that the true state is $x$. Since this probability does not depend at all on our observation, it is known as the prior probability of the state $x$. The probability $p[y]$ is the unconditional probability of a particular observation $y$, regardless of the true state $x$.
Since in MAP inference we are optimizing over $x$, and $y$ is fixed, the factor $1/p[y]$ is a fixed positive constant multiplying our objective, so given Corollary 9 (on problem equivalence), we are free to ignore it. In contexts where $y$ can vary, this probability $p[y]$ is usually denoted $Z$ and is called the partition function.

³ It's worth noting that MAP inference is a Bayes-optimal decision procedure only when the loss function is 0-1, meaning we only care whether we have found the exact correct solution and not any nearby point. In practice, we often have a loss function which allows some amount of inaccuracy; however, the decision procedure in this case often requires more difficult (and frequently intractable) inference algorithms, such as finding the posterior marginal distributions for each variable.

An important point is that for almost all applications, the true underlying distribution $p[x, y]$ by which observations and hidden states are generated is both unknown and likely unknowable, and in any case much too complicated an object to perform any computation with. However, we may have a reasonable model which approximates this probability distribution, and which is close enough to the true distribution to give reasonable answers.

1.4.3 Log-probabilities

A useful transformation for probabilistic inference is to consider the negative log of the probability, converting the product in (1.23) into a sum:

$$ -\log(p[x \mid y]) = -\log(p[y \mid x]) - \log(p[x]) + \log(Z). \tag{1.24} $$

Because $-\log$ is monotonically decreasing on $[0, \infty)$, it is an order-reversing bijection, so the minimization problem

$$ \min_x -\log(p[x \mid y]) = \min_x \big( -\log(p[y \mid x]) - \log(p[x]) + \log(Z) \big) \tag{1.25} $$

is equivalent to our original probabilistic inference problem (1.22), by Lemma 8.

This form has a number of advantages over (1.22). Because optimization problems are invariant up to addition of constants, we can ignore the term $\log Z$ entirely. Then, define

$$ f_{\text{data}}(x) := -\log(p[y \mid x]), \tag{1.26} $$
$$ f_{\text{prior}}(x) := -\log(p[x]). \tag{1.27} $$

Then the MAP problem is equivalent to

$$ \min_x f_{\text{data}}(x) + f_{\text{prior}}(x). \tag{1.28} $$

The negative log transformation therefore lets us separate the minimization into two parts, one dealing with how well a state $x$ fits the observed data $y$, and another dealing only with the prior probability of $x$.

If some of our probabilities are independent of each other, we can simplify this even further. Many models assume that each observation $y_i$ is generated independently of the other $y_j$, and in fact depends only on a single unknown $x_i$. In this case, we have

$$ p[y \mid x] = \prod_i p[y_i \mid x_i]. \tag{1.29} $$

We will say that such a model has separable likelihood, because we can take the negative log of these probabilities, and define $f_i(x_i) := -\log(p[y_i \mid x_i])$, to get

$$ f_{\text{data}}(x) = \sum_i f_i(x_i). \tag{1.30} $$

This is a separable function of the variables $x_i$. In this case, the functions $f_i$ are called unary data terms, since they are functions of a single variable, depending only on a single piece of data.

1.4.4 MAP inference in Foreground-Background Segmentation

Returning to our example of foreground-background segmentation, we can now be explicit about the unary terms $f_i$ in (1.17), and how they relate to the likelihood of pixels taking certain labels. As we have seen, we can use Bayes' rule to calculate the posterior probability, but we need a likelihood function $p[y \mid x]$ and a prior $p[x]$.

In reality, there is no simple rule which takes a foreground-background labeling $x$ and gives a probability of different images $y$ with that particular segmentation. The space of "all possible images" is much too large to specify a distribution over, and moreover requires a distribution over images collected from the real world. We can make a reasonable approximation by choosing a much simpler, but still plausible choice for our likelihood function.
In foreground-background segmentation, one common choice is to use a color model, where we assume that the color of objects is drawn from a population, and we have distinct populations for the foreground and background objects. Specifically, we will choose for our unary likelihood terms a Gaussian Mixture Model (GMM), which models the individual pixels $y_i$ as independently chosen, and picked according to weighted sums of Gaussians. The Gaussians are different depending on whether the hidden state $x_i$ is foreground ($x_i = 1$) or background ($x_i = 0$).

Definition 11. A Gaussian Mixture Model is a distribution over $\mathbb{R}^n$ which is a weighted sum of Gaussians. A GMM has probability density function

$$ \mathrm{GMM}(c) = \sum_{i=1}^{k} w_i\, N_{\mu_i, \Sigma_i}(c) \tag{1.31} $$

for a color $c \in \mathbb{R}^3$, where $w_1, \ldots, w_k$ are weights summing to 1, and $N_{\mu_i, \Sigma_i}$ are normal distributions with mean $\mu_i$ and covariance $\Sigma_i$. Note that RGB colors come from $\mathbb{R}^3$, so we are using 3-dimensional Gaussians.

The GMM model for segmentation has two different distributions, $\mathrm{GMM}_{FG}$ and $\mathrm{GMM}_{BG}$. Each pixel $y_i$ in the image (part of our observations) is generated independently of the other pixels: if the corresponding label $x_i$ is foreground, then it is drawn from the distribution $\mathrm{GMM}_{FG}$; if $x_i$ is background, it is instead drawn from $\mathrm{GMM}_{BG}$. Therefore, the likelihood function is

$$ p[y_i \mid x_i] = \begin{cases} \mathrm{GMM}_{FG}(y_i) & x_i = 1 \\ \mathrm{GMM}_{BG}(y_i) & x_i = 0. \end{cases} \tag{1.32} $$

Taking negative log probabilities, we can define unary terms

$$ f_i(x_i) := -\log(p[y_i \mid x_i]) \tag{1.33} $$
$$ = \begin{cases} -\log(\mathrm{GMM}_{FG}(y_i)) & x_i = 1 \\ -\log(\mathrm{GMM}_{BG}(y_i)) & x_i = 0 \end{cases} \tag{1.34} $$
$$ = -\log(\mathrm{GMM}_{FG}(y_i))\, \delta_1(x_i) - \log(\mathrm{GMM}_{BG}(y_i))\, \delta_0(x_i), \tag{1.35} $$

where $\delta_z(x_i)$ is the delta function, defined to be 1 whenever $x_i = z$ and 0 otherwise. Then, the probabilistic inference problem with this choice for the likelihood function becomes

$$ \min_x \sum_i f_i(x_i) - \log(p[x]). \tag{1.36} $$

GMMs are an effective choice for the likelihood function since they capture the intuition of images being composed of a few like-colored objects.
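As a rough illustration of how the unary terms (1.33)-(1.35) are computed from two color models, the following sketch evaluates $f_i(0)$ and $f_i(1)$ for a single pixel. To stay self-contained it assumes diagonal covariance matrices (a simplification that Definition 11 does not require), and the mixture parameters are hypothetical values chosen by hand rather than learned from data.

```python
import math


def gaussian_pdf(c, mean, var):
    """Density of a Gaussian with diagonal covariance (per-channel variances)."""
    log_p = 0.0
    for ck, mk, vk in zip(c, mean, var):
        log_p += -0.5 * math.log(2 * math.pi * vk) - (ck - mk) ** 2 / (2 * vk)
    return math.exp(log_p)


def gmm_pdf(c, components):
    """Evaluate (1.31); components is a list of (weight, mean, var) triples."""
    return sum(w * gaussian_pdf(c, mean, var) for w, mean, var in components)


def unary_terms(y_i, gmm_fg, gmm_bg):
    """Return (f_i(0), f_i(1)) = (-log GMM_BG(y_i), -log GMM_FG(y_i))."""
    return (-math.log(gmm_pdf(y_i, gmm_bg)), -math.log(gmm_pdf(y_i, gmm_fg)))


# Hypothetical color models, RGB channels scaled to [0, 1].
gmm_fg = [(1.0, (0.9, 0.8, 0.1), (0.01, 0.01, 0.01))]     # yellowish foreground
gmm_bg = [(0.5, (0.1, 0.6, 0.1), (0.02, 0.02, 0.02)),     # green
          (0.5, (0.2, 0.3, 0.9), (0.02, 0.02, 0.02))]     # blue

yellow_f0, yellow_f1 = unary_terms((0.85, 0.75, 0.15), gmm_fg, gmm_bg)
green_f0, green_f1 = unary_terms((0.10, 0.60, 0.10), gmm_fg, gmm_bg)
```

A yellowish pixel receives a lower cost for the foreground label and a green pixel a lower cost for the background label. For colors far from every component the density can underflow to zero, so a practical implementation would work in the log domain throughout.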
For example, in a scene composed of a white cow on a green field with a blue sky, we will have two clusters of colors for background pixels, centered around blue and green respectively, and a single cluster of foreground pixels centered around white.

1.4.5 Conditional Dependence

One feature we have not discussed is the choice of prior $p[x]$. The first thing to note is that if $p[x]$ is the uniform prior over all labelings, then $\log(p[x])$ is a constant, so it can be removed from (1.36). Doing so, we get back the unary-only model of (1.17). Therefore, the separable optimization problem we discussed earlier is exactly the case where we assume no prior over possible segmentations $x$: every possible labeling is assumed to be equally likely.

Of course, this is not a very realistic prior. As we have discussed, pixels share a lot of information with their neighbors. Using this information, in the form of a prior $p[x]$, leads us to our next topic, which is conditional dependence between variables.

In computer vision, conditional dependence between variables is most closely related to spatial locality in images. That is, information at a particular pixel $i$ is strongly correlated with that of its neighbors $j \in N(i)$. In particular, we are interested in probability distributions for which variables $x_i$ depend only on their neighbors $x_j$ for $j \in N(i)$. Such distributions are called Markov [35].

The Markov property relates two different conditional distributions. The first is the probability of a given variable $x_i$ taking the label $a_i \in X_i$, conditioned on all other variables $x_{V-i}$ having some already determined labeling $a_{V-i}$. The second conditional probability is that of $x_i$ taking $a_i$, conditioned on just the variables neighboring $i$, $x_{N(i)}$, having labels $a_{N(i)}$.

Definition 12.
A probability distribution $p \in P[X]$ is Markov (with respect to the neighbor structure $N$) if (1) we have $p[x] > 0$ for all $x \in X$ and (2) for each $i$, and every labeling $a_V \in X$, we have

$$ p[x_i = a_i \mid x_{V-i} = a_{V-i}] = p[x_i = a_i \mid x_{N(i)} = a_{N(i)}]. \tag{1.37} $$

Such a distribution is called a Markov Random Field.

That is, if we specify the labels $x_j = a_j$ for all the neighbors $j \in N(i)$, then the probability that $x_i = a_i$ is the same, regardless of the labels of any non-neighbor. The variables $x_i$ and $x_k$ for $k \notin N(i)$ are conditionally independent, since after conditioning on the neighbors $x_j$ for $j \in N(i)$, the variables $x_i$ and $x_k$ are independent. The Markov property is a particularly strong form of image locality, in that it says non-neighbors have no direct effect on a variable $x_i$.

1.4.6 The Hammersley-Clifford Theorem

While we are interested in the Markov property because it captures our intuition of spatial locality, we get lucky: all such distributions can be written in a particularly simple form. In particular, the probability is a product over subsets of the variables called cliques.

Definition 13. For a neighborhood graph $N$, the set of (Markov) cliques $\mathcal{C}$ consists of all fully-connected subgraphs

$$ \mathcal{C} := \{ C \subseteq V \mid \{i, j\} \in E, \ \forall i, j \in C \}. \tag{1.38} $$

When we have a function $f_C(x_C)$ that depends only on $x_C$ (i.e., $f_C : X_C \to \mathbb{R}$), we will call $f_C$ a clique function.

Theorem 14 (Hammersley-Clifford [35]). A probability distribution $p$ is a Markov Random Field if and only if it can be written in the form

$$ p[x] = \prod_{C \in \mathcal{C}} e^{-f_C(x_C)}, \tag{1.39} $$

where each $f_C$ is a clique function depending only on $x_C$. Such a distribution is called a Gibbs distribution.
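One direction of Theorem 14 (a Gibbs distribution satisfies the Markov property) can be checked by brute force on a very small example. The sketch below uses a hypothetical three-variable chain 0 - 1 - 2 with binary labels and Potts-style pairwise clique functions. It normalizes explicitly by a constant Z, which (1.39) leaves implicit in the clique functions, and then compares the two conditional probabilities in (1.37) for variable 0, whose only neighbor is variable 1.

```python
from itertools import product
import math

# Hypothetical chain 0 - 1 - 2, so N(0) = {1}, N(1) = {0, 2}, N(2) = {1}.
edges = [(0, 1), (1, 2)]
lam = 0.7


def clique_energy(x):
    """Sum of pairwise clique functions: lam for every disagreeing edge."""
    return sum(lam for i, j in edges if x[i] != x[j])


states = list(product((0, 1), repeat=3))
Z = sum(math.exp(-clique_energy(x)) for x in states)      # explicit normalizer
p = {x: math.exp(-clique_energy(x)) / Z for x in states}


def conditional(i, a_i, given):
    """p[x_i = a_i | x_j = given[j] for every j in given], by enumeration."""
    match = [x for x in states if all(x[j] == v for j, v in given.items())]
    denom = sum(p[x] for x in match)
    return sum(p[x] for x in match if x[i] == a_i) / denom


# Largest gap between conditioning on everything else and on N(0) only.
max_gap = max(
    abs(conditional(0, a[0], {1: a[1], 2: a[2]}) - conditional(0, a[0], {1: a[1]}))
    for a in states
)
```

The gap is zero up to floating-point error: once the label of variable 1 is fixed, the label of variable 2 carries no further information about variable 0. Enumeration is only feasible for a handful of variables; it is used here purely as a sanity check.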
Therefore, given a posterior distribution $p[x \mid y]$ which is Markov, we can take negative log-probabilities to get a simple form for the MAP problem:

$$ -\max_x \log p[x] = \min_x -\log \Big( \prod_{C \in \mathcal{C}} e^{-f_C(x_C)} \Big) = \min_x \sum_{C \in \mathcal{C}} f_C(x_C). \tag{1.40} $$

Since the MAP problem for any Markov Random Field can be written in the form (1.40) (and vice versa), we will henceforth take this as our basic definition of an MRF.

Definition 15. Let $\mathcal{C}$ be a subset of $2^V$, i.e., any collection of subsets of $V$, let $X = \prod_i X_i$ be a label space, and let $f : X \to \mathbb{R}$ be of the form

$$ f(x) = \sum_{C \in \mathcal{C}} f_C(x_C). \tag{1.41} $$

Then minimizing $f$ is called the MAP problem for the MRF $f$, or an MRF inference problem with clique structure $\mathcal{C}$ and clique functions $f_C$.

Note that we don't require the cliques $\mathcal{C}$ to line up with any particular set of fully connected subsets, or to necessarily come from the probability of any probabilistic inference problem. Because the inference methods described in later chapters apply to any such functions, we will refer to any function of the form (1.41) as an MRF.

Because many of the problems we deal with have a separable likelihood, we will generally separate out the unary terms in (1.41), and write $f$ as

$$ f(x) = \sum_{i \in V} f_i(x_i) + \sum_{C \in \mathcal{C}} f_C(x_C). \tag{1.42} $$

Remark 16. We will generally assume that the cliques are small relative to the number of variables: $|C| \ll n$. In this case, the cliques represent local structure, in that they are small collections of non-independent variables.

1.4.7 The Potts Model as an MRF

Returning to our example of foreground-background segmentation, we can now fit the Potts model of Section 1.3.4 into this framework. Recall equation (1.19) for the binary segmentation energy:

$$ f(x) = \sum_i f_i(x_i) + \sum_{\{i,j\} \in E} f_{i,j}(x_i, x_j). \tag{1.43} $$

As we've already seen, the unary terms $f_i$ come from our probabilistic inference framework via the likelihood functions $p[y_i \mid x_i]$. The pairwise terms $f_{i,j}$, however, do not involve the observations $y$, and are therefore part of the prior, $p[x]$.
Note that (1.43) is a particular instance of (1.41), i.e., (1.43) defines an MRF. Here, the clique structure includes the neighbor pairs $\{i, j\} \in E$ for the pairwise terms and the singletons $\{i\}$ for the unary terms:

$$ \mathcal{C} = \{\{i, j\} \mid i \in N(j)\} \cup \{\{i\} \mid i \in V\}. \tag{1.44} $$

We can even give this particular choice of pairwise functions $f_{i,j}$ a probabilistic interpretation, by reversing the negative-log transformation we used to get (1.40):

$$ p[x \mid y] \propto \prod_i e^{-f_i(x_i)} \prod_{\{i,j\} \in E} e^{-f_{i,j}(x_i, x_j)}. \tag{1.45} $$

Note that we only get the probability up to proportionality: since the minimization problem is the same up to an additive constant, reversing the negative-log transformation only gives the probability up to a positive multiplicative constant.

Recalling that $f_{i,j}(x_i, x_j) = \lambda$ if $x_i \ne x_j$ and 0 otherwise, we can simplify the part of (1.45) dealing with the prior $p[x]$ to

$$ p[x] \propto \prod_{\{i,j\} \in E} e^{-f_{i,j}(x_i, x_j)} = \prod_{\{i,j\} \in E :\, x_i \ne x_j} e^{-\lambda}. \tag{1.46} $$

This distribution $p[x]$ is also known as the Ising model, which has been independently studied in physical models of spin states. In general, such Gibbs distributions [28] (distributions which are products of exponentials) are widely studied in statistical physics and other fields.

1.5 First-order and Higher-order MRFs

One of the most important factors for the complexity of solving an MRF inference problem is the order of the MRF.

Definition 17. The order of an MRF is the maximum size of any clique $C \in \mathcal{C}$, minus one. We distinguish two cases: first-order MRFs, which have maximum clique size two, and higher-order MRFs, which have cliques of size three or greater.

First-order MRFs were historically the first to have efficient inference algorithms. According to the definition of a first-order MRF, every clique has size at most 2, so is either a unary term, $|C| = 1$, or involves a pair of variables, $|C| = 2$.
Therefore, for first-order MRFs, we can form a graph $G = (V, E)$ with (unordered) edges $e = \{i, j\}$, such that

$$ f(x) = \sum_i f_i(x_i) + \sum_{\{i,j\} \in E} f_{i,j}(x_i, x_j). \tag{1.47} $$

Because the clique structure $\mathcal{C}$ forms a graph, we can employ existing graph algorithms for this problem. As seen in our example of binary segmentation, (1.47) is closely related to the min-cut problem in a graph, and can be exactly solved for binary problems. For multi-label problems, we can approximately solve the problem by repeated application of min-cut using algorithms such as alpha-expansion, which we will describe later.

In contrast to first-order MRFs, it is harder to develop efficient optimization algorithms for higher-order MRFs. The first difficulty is that even specifying the values of a clique function $f_C(x_C)$ for each of the labelings $x_C \in X_C$ requires storing $\ell^{|C|}$ values (where $\ell = |X_i|$). Additionally, whereas a first-order MRF naturally forms a graph, a higher-order MRF forms a hypergraph, where instead of edges consisting of pairs of vertices, we have hyperedges which are subsets of the vertices, of size possibly larger than 2. While many graph algorithms can be extended to work on hypergraphs, it is not always obvious how to do so. In particular, a generalized version of the min-cut algorithm for hypergraphs was only very recently applied to higher-order MRF optimization [53].

1.5.1 Advantages of Higher-Order Models

Despite these difficulties, higher-order MRFs have a number of advantages, the most important of which is that first-order MRFs are limited in expressiveness compared to general MRFs.

The primary advantage of using higher-order MRFs is that they allow greater flexibility in coming up with models that better match the true statistics of images. Probabilistic inference frequently makes simplifying assumptions about distributions in order to end up at a tractable optimization problem.
If we are ultimately interested in more accurate answers, then we require more complicated models that better represent the true prior and likelihood functions.

Compare the different segmentation results from the unary-only model of (1.17) versus the pairwise Potts model of (1.19), as seen in Figure 1.4. In our probabilistic inference framework, the unary-only model corresponds to a MAP problem with a totally uniform prior, $p[x] = 1/|X|$. This prior does not represent real-world segmentations well, and consequently the results we see in Figure 1.4 are noisy, with boundaries that do not closely match the actual object. By adding in the pairwise Potts terms, we have complicated the model (requiring a reduction to min-cut to solve, rather than the simple separable optimization of (1.17)); however, the observed segmentations correspond much more closely to what we expect real objects to look like. This was achieved by adding a prior $p[x]$ which explicitly prefers segmentations with a short boundary (a quality which is shared with the true distribution of objects in images).

Compared to first-order models, higher-order MRFs allow even more flexibility to match the underlying distribution, and in many cases can express properties of images that are inexpressible in first-order models. We will consider two examples, patch-based priors for denoising and curvature regularizing priors for stereo.

1.5.2 Image Denoising and Patch-Based Priors

A common problem in photography is that images suffer from noise. Low-light images in particular suffer from photon noise due to the quantized nature of light, thermal noise, and read noise from imperfect electronics. In the image denoising problem, we attempt to remove this noise from an image, in order to reconstruct the underlying scene being imaged.
We have a forward model of image formation, where we observe a noisy image $y$, which is obtained from an underlying (noise-free) image $x$ by the addition of a noise function $\eta$ to each pixel:

\[ y_i = x_i + \eta. \tag{1.48} \]

In our forward model, we will assume that $\eta$ is independent for each pixel $i$, and that $\eta$ is unbiased Gaussian noise, $\eta \sim N(0, \sigma)$. For concreteness, we'll assume that the label space $X_i$ is discretized to 256 intensity values, $X_i = \{0, \ldots, 255\}$.

Our likelihood function is determined by our forward model. Since $y_i = x_i + N(0, \sigma)$, we have (omitting the normalizing constant, which does not depend on $x_i$)

\[ p[y_i \mid x_i] = e^{-\frac{(y_i - x_i)^2}{2\sigma^2}}. \tag{1.49} \]

Taking negative-log probabilities (as in Section 1.4.3) we get unary terms

\[ f_i(x_i) = -\log p[y_i \mid x_i] = \frac{1}{2\sigma^2} (x_i - y_i)^2. \tag{1.50} \]

Note that this is an example of a separable likelihood function. Without any prior on our problem, the optimization problem is just

\[ \min_x \sum_i \frac{1}{2\sigma^2} (x_i - y_i)^2 \tag{1.51} \]

so the minimizing answer is just to set $x = y$. That is, if we have unbiased noise, and no prior beliefs about noise-free images, the MAP answer is to say that the original noise-free image is whatever we observed.

Of course, noisy images are quite different from noise-free images. In particular, we expect images to be composed of connected regions formed by individual objects (which may be actual objects, or patches of similar color within an object such as the spots on a cow) where the intensity within each object is roughly constant, and with a few sharp transitions where objects meet.

We can capture this intuition with an edge-preserving prior, which is a first-order MRF model. For our clique structure $\mathcal{C}$, we'll use the edges of either the 4- or 8-connected neighborhood model of Figure 1.5, so that $\mathcal{C} = E$. On each edge $\{i, j\}$ we will have what is called a robust distance function $f_{i,j}$ of the form

\[ f_{i,j}(x_i, x_j) = \min\{|x_i - x_j|, \tau\}. \tag{1.52} \]

This specific cost is called the truncated L1 cost.
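To make the combined objective concrete, the following Python sketch evaluates the sum of the unary terms (1.50) and the truncated L1 terms (1.52) on a 4-connected grid. It is only an illustration: the function names, the prior weight `lam`, and the list-of-lists image format are assumptions, not part of the model described above.

```python
def unary_cost(x_i, y_i, sigma):
    """Negative log-likelihood term of (1.50) for one pixel."""
    return (x_i - y_i) ** 2 / (2.0 * sigma ** 2)


def truncated_l1(x_i, x_j, tau):
    """Truncated L1 cost of (1.52) for one pair of neighboring pixels."""
    return min(abs(x_i - x_j), tau)


def denoising_energy(x, y, sigma, tau, lam=1.0):
    """Energy of labeling x given observation y on a 4-connected grid.

    x and y are lists of rows of intensities; lam is a hypothetical
    weight on the prior, introduced here only for illustration.
    """
    rows, cols = len(x), len(x[0])
    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            energy += unary_cost(x[r][c], y[r][c], sigma)
            # Count each horizontal and vertical edge exactly once.
            if c + 1 < cols:
                energy += lam * truncated_l1(x[r][c], x[r][c + 1], tau)
            if r + 1 < rows:
                energy += lam * truncated_l1(x[r][c], x[r + 1][c], tau)
    return energy


# Toy 2x2 observation with one outlier pixel.
noisy = [[10, 12], [200, 11]]
flat = [[10, 10], [10, 10]]
energy_keep_noise = denoising_energy(noisy, noisy, sigma=5.0, tau=20)
energy_flat = denoising_energy(flat, noisy, sigma=5.0, tau=20)
```

Keeping the noisy image pays only the pairwise costs, while the flat image pays only the unary costs; the minimizer trades the two off.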
To explain this choice, note that it has minimum cost when $x_i = x_j$, with gradually increasing cost as the intensities $x_i$ and $x_j$ move apart. This represents the part of our intuitive description that within an object, neighboring pixels should have similar intensities. Then, because we expect there to be edges between objects where the intensity can jump arbitrarily, we cap the maximum cost at $\tau$.

These edge-preserving priors do a reasonable job matching the observed statistics of neighboring pixel differences, and form a first-order MRF, so we have fast algorithms for optimizing them. However, they fail to take into account more complicated statistics of images, especially those related to image texture. In particular, the truncated L1 cost always prefers neighbors to have identical intensity values, so totally flat regions will always be preferred to regions of small texture variations.

It is difficult to come up with a texture-aware first-order MRF. However, with higher-order MRFs it is relatively easy. Instead of putting a cost $f_{i,j}$ on pairs of neighboring pixels, we will put a cost $f_C$ on patches of the image (such priors are called patch-based priors). These patch functions $f_C$ are designed to match the statistics of noise-free image patches, and can thus account for local intensity variation due to texture (among other features). We will focus on a particular prior known as Field of Experts [77], which consists of a set of $k$ linear patch filters $J_i$ run through a non-linear response function to give a cost function

\[ f_C(x_C) = \sum_{i=1}^{k} \alpha_i \log(1 + J_i^T x_C). \tag{1.53} \]

Figure 1.6: From left to right: (a) Original image, before noise. (b) Image after adding independent Gaussian noise to each pixel. (c) Denoised result using the pairwise edge-preserving prior of (1.52). (d) Denoised result using the Field-of-Experts prior of (1.53).

This cost function is not as clearly intuitive as the edge-preserving truncated L1 cost; however, we can use machine learning techniques so that the filters $J_i$ and weights $\alpha_i$ are chosen such that the resulting prior $p[x]$ matches the observed distribution of image patches as closely as possible.

The results of optimizing a denoising problem with a first-order MRF and a higher-order Field of Experts MRF are shown in Figure 1.6. As expected, the more complicated higher-order model is better able to capture the local variations due to shading gradations and texture, and leads to a less "flattened" result.

Of course, this complexity is not free: optimizing these higher-order models is more challenging. Consequently, we will use this particular denoising model as a benchmark for evaluating various higher-order optimization methods in later sections.

1.5.3 Curvature Regularizing Priors for Stereo

In some cases, certain image properties are simply impossible to express in a first-order MRF, as we will see from this example of stereo reconstruction.

In the stereo reconstruction problem, we are given a left and right image taken by two cameras separated by a baseline. The cameras are calibrated so that rows in one image correspond to rows in the other image, and if a point in space maps to a pixel at column $i$ of the right image, and column $i + \delta$ of the left image, then that point must be at a depth $\Delta/\delta$, where $\Delta$ is the separation of the cameras, known as the baseline. The difference $\delta$ is known as the disparity.

We will turn stereo reconstruction into a labeling problem by discretizing the disparity into a finite set of pixel differences $\{0, 1, \ldots, D\}$. Each variable $x_i$ gives the disparity for pixel $i$ in the right image, from which we can infer the depth (given the baseline $\Delta$). We are interested in what kinds of priors are appropriate for stereo reconstruction.
As with image denoising, we expect the depth image to consist of connected regions, so some sort of edge-preserving prior will be required. This prior should reflect the variation we expect within each region.

A simple prior is to put a cost on neighboring pixels whenever their disparities are different. There are other choices for this, but one possibility is the truncated L1 cost that we used for denoising:

\[ f_{i,j}(x_i, x_j) = \min\{|x_i - x_j|, \tau\}. \tag{1.54} \]

This function has minimum cost when the disparity values for $i$ and $j$ are the same, so this prior prefers the depth image to be composed of regions which are all at the same depth, i.e., regions which each lie on a plane parallel to the camera image plane. For this reason, such priors are called fronto-parallel priors.

However, even simple depth images are not fronto-parallel: for example, the depth image of a flat wall taken from a 45 degree angle is a slanting plane. Even with the simple assumption that real scenes are formed of totally flat objects, we would expect each region to be a slanting plane, but not necessarily fronto-parallel.

An example of a prior based on this intuition (from the work of [102]) is to penalize the curvature of the depth map, as flat planes have no curvature. We do this by putting a cost on $3 \times 1$ windows of the image, with a cost function $f_{i,j,k}$ of the three disparities $x_i$, $x_j$ and $x_k$ (with corresponding depths $\Delta/x_i$, $\Delta/x_j$, $\Delta/x_k$):

\[ f_{i,j,k}(x_i, x_j, x_k) = \min\left\{ \left| \frac{\Delta}{x_i} - 2\frac{\Delta}{x_j} + \frac{\Delta}{x_k} \right|, \tau \right\}. \tag{1.55} \]

The quantity $\frac{\Delta}{x_i} - 2\frac{\Delta}{x_j} + \frac{\Delta}{x_k}$ is a discrete approximation to the curvature of the depth map at pixel $j$, and as with the truncated L1 cost, we cap this cost at $\tau$ to not overly penalize discontinuities between objects.

As shown in Figure 1.7, the results achieved by including a curvature regularizing prior are much more realistic, and do not arbitrarily chop the result into fronto-parallel planes, as in the first-order prior.
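As a small illustration of the curvature cost (1.55), the sketch below computes it for one $3 \times 1$ window. It assumes that the magnitude of the discrete second difference is what gets truncated and that all disparities are strictly positive (so the depths are finite); the function name and the toy values are hypothetical.

```python
def curvature_cost(x_i, x_j, x_k, baseline, tau):
    """Truncated curvature cost for three neighboring disparities.

    Depths are baseline / disparity, so disparities must be nonzero.
    """
    d_i, d_j, d_k = (baseline / x for x in (x_i, x_j, x_k))
    curvature = d_i - 2.0 * d_j + d_k  # discrete second difference of depth
    return min(abs(curvature), tau)


# Disparities 6, 3, 2 with baseline 6 give depths 1, 2, 3: a slanted plane.
slanted_plane = curvature_cost(6, 3, 2, baseline=6.0, tau=1.5)
# Disparities 6, 3, 6 give depths 1, 2, 1: a corner, truncated at tau.
corner = curvature_cost(6, 3, 6, baseline=6.0, tau=1.5)
```

The slanted plane incurs no cost even though its disparities differ, which is exactly what the fronto-parallel prior (1.54) cannot express.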
It is not just more natural to express curvature priors as a higher-order MRF; it is actually impossible to express these constraints using only neighboring pixel differences. To see this, in Figure 1.8, we have two possible sets of depths for a group of three pixels. The first possibility is planar, and should have low cost, while the other has a corner, and should have higher cost. A higher-order MRF with cliques of size 3 can distinguish between these two cases, and assign a higher cost to the case with a corner. However, a first-order MRF sees 4 pairs of pixels, each with the same distance $|x_i - x_j|$ in the disparities. By symmetry, we shouldn't penalize pairs of pixels sloping back left-to-right versus right-to-left, so we must assign the same cost to the plane and the corner.

Figure 1.7: (Left) Example synthetic input to a stereo reconstruction problem. (Center) Reconstruction using a first-order prior, resulting in many fronto-parallel planes. (Right) Reconstruction using a second-order (clique size 3) curvature regularizing prior, resulting in smooth planes and curves. All images from [102].

Despite the extra (but necessary) expressiveness of these higher-order priors, they are challenging for existing optimization methods to handle. Ad-hoc methods have been proposed, including a reduction of these second-order cliques to a first-order model in the paper which introduced this model [102]. However, it is our goal to show how to optimize such models in general. As with patch-based image denoising, this model is an important benchmark of higher-order inference algorithms.

Figure 1.8: (Left) Two possible disparities for a set of three pixels. (Center) A higher-order model with size 3 cliques can see all three pixels, and assign different costs accordingly. (Right) A first-order model sees 4 different pairs of pixel differences, each with the same difference between the two pixels (with some slanting left to right, and some the opposite direction).
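The plane-versus-corner argument can be checked numerically. The sketch below is a simplification: it applies the second difference directly to three illustrative values instead of to the depths of (1.55), and the profiles `plane` and `corner` are made-up examples in the spirit of Figure 1.8.

```python
def neighbor_differences(z):
    """Magnitudes of neighboring differences, ignoring slope direction.

    A symmetric pairwise cost can depend on nothing more than these values.
    """
    return sorted(abs(a - b) for a, b in zip(z, z[1:]))


def second_difference_cost(z, tau):
    """Truncated second difference over a window of three values."""
    z_i, z_j, z_k = z
    return min(abs(z_i - 2 * z_j + z_k), tau)


plane = (0, 1, 2)   # constant slope
corner = (0, 1, 0)  # slope changes sign at the middle pixel

same_pairwise_view = neighbor_differences(plane) == neighbor_differences(corner)
plane_cost = second_difference_cost(plane, tau=5)
corner_cost = second_difference_cost(corner, tau=5)
```

Both profiles present the same pairwise differences, so any symmetric first-order cost scores them identically, while the size-3 clique separates them.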
1.6 Conclusion

There are three main points from this introduction, which will hopefully frame the problem of inference in higher-order MRFs.

First, we have seen that MRFs arise from probabilistic inference, in particular in probabilistic inference problems in which the variables have spatial locality, as in the conditional dependence of nearby pixels in an image.

Second, not all MRFs are probabilistic inference problems; many are hand-constructed without reference to forward models or prior distributions. However, all of the optimization techniques developed in further chapters will handle problems of the form

\[ \min_x \sum_{C} f_C(x_C) \tag{1.56} \]

regardless of how the clique functions are chosen. That is, this equation abstracts away the issues of probabilistic inference, to a form which can be directly handled by optimization algorithms.

Finally, we have seen how modeling more sophisticated priors leads to better answers, but can also result in harder inference problems. In particular, higher-order models allow much greater flexibility for modeling than first-order priors, sometimes including constraints on the solution that cannot be expressed using only first-order terms.

CHAPTER 2
MATHEMATICAL BACKGROUND

In this chapter, we cover the mathematical tools necessary to present both the related work and the main results of this thesis. To recall: our main problem is the MAP problem in MRFs, in which we are trying to minimize a function of the form

\[ f(x) = \sum_{C \in \mathcal{C}} f_C(x_C) \tag{2.1} \]

where the variables $x = (x_1, \ldots, x_n)$ come from a label space $X = \prod_i X_i$, and the functions $f_C$ are clique functions, each depending on a corresponding subset $C$ of the variables, for each $C$ in the clique set $\mathcal{C}$.

We will begin by looking at transformations of (2.1) called reparameterizations in Section 2.1, which will be a useful tool in many optimization algorithms.
The special case of MRFs with binary labels in Section 2.2 provides most of the known theoretical results on hardness of optimization and approximation. Section 2.3 covers submodular functions, which are a class of MRFs for which exact optimization is tractable. Convex relaxations of the MRF optimization problem, called the Marginal Polytopes, are presented in Section 2.4. The Marginal Polytope, and Linear Programming relaxations in general, form the basis for the main algorithms of this thesis, so we give an overview of Linear Programming in Section 2.5, along with the related concepts of convex functions (Section 2.6), linear programming duality (Section 2.7), and optimality conditions (Section 2.8). Finally, we give specific applications of these linear programming concepts to MRF inference, in particular the dual of the Local Marginal Polytope in Section 2.9. We conclude with a description of graph cuts algorithms, and how max-flow min-cut algorithms fit into our linear programming framework, in Section 2.10.

2.1 Reparameterization

A simple but useful fact is that any particular function may be written in many ways, and that different forms may reveal useful information about an optimization problem, or be otherwise more convenient for algorithms. These multiple representations of the same function are called reparameterizations.

Definition 18. Let $f, f' : X \to \mathbb{R}$ be MRFs on the same label space, where

\[ f(x) = \sum_{C \in \mathcal{C}} f_C(x_C), \qquad f'(x) = \sum_{C \in \mathcal{C}'} f'_C(x_C). \tag{2.2} \]

Then $f'$ is a reparameterization of $f$ if $f'(x) = f(x)$ for every $x \in X$. That is, they are equal as functions $X \to \mathbb{R}$.

Note that the two functions $f$ and $f'$ may have very different values for the clique functions $f_C$ and $f'_C$, and may even have different clique structures $\mathcal{C}$ and $\mathcal{C}'$.

A common use of reparameterization is to put an MRF into a particular normal form as a starting point for optimization.
For example, if we don't want to deal with negative values, we can always rewrite an MRF so that each clique function is nonnegative (except for a constant term which may be negative).

Lemma 19. Any MRF $f : X \to \mathbb{R}$ can be reparameterized to a form

\[ f'(x) = f'_\emptyset + \sum_{C \in \mathcal{C}} f'_C(x_C) \tag{2.3} \]

where for every $C \neq \emptyset$ we have $f'_C(x_C) \geq 0$ for all $x_C \in X_C$.

Proof. For every $C \in \mathcal{C}$, let $\delta_C = \min_{x_C} f_C(x_C)$. We'll set $f'_C = f_C - \delta_C$ for $C \neq \emptyset$, and

\[ f'_\emptyset = f_\emptyset + \sum_{C \neq \emptyset} \delta_C. \tag{2.4} \]

First, we have achieved our goal that the clique functions $f'_C$ are nonnegative, since $f'_C(x_C) = f_C(x_C) - \delta_C$ and $f_C(x_C) \geq \delta_C$ by choice of $\delta_C$, hence $f'_C(x_C) \geq 0$.

Now, to see that $f'$ is a reparameterization of $f$, we just expand out the terms to see that we have added and subtracted constants in a way that cancels out. For any $x$ we have

\[ \begin{aligned} f'(x) &= f'_\emptyset + \sum_{C \neq \emptyset} f'_C(x_C) \\ &= f_\emptyset + \sum_{C \neq \emptyset} \delta_C + \sum_{C \neq \emptyset} \bigl( f_C(x_C) - \delta_C \bigr) \\ &= f_\emptyset + \sum_{C \neq \emptyset} f_C(x_C) = f(x). \end{aligned} \tag{2.5} \]

Most of the reparameterizations used later follow this same basic form: we take some of the cost from one clique function $f_C$ and move it to another clique function $f_{C'}$, in such a way that the addition and subtraction balance out.

Even for the simple reparameterization in Lemma 19, we can prove useful facts about the original MRF $f$, in this case giving an easy lower bound on the value of the optimal solution.

Corollary 20. If $f$ is reparameterized as in Lemma 19, then $f(x) \geq f'_\emptyset$ for all $x \in X$. In particular, $\mathrm{OPT}(f) \geq f'_\emptyset$.

Proof. Since every term of $f'$ has $f'_C(x_C) \geq 0$ (except for possibly $f'_\emptyset$), we have $f(x) = f'(x) \geq f'_\emptyset$.

Another useful reparameterization that moves cost from higher-order terms to unary terms is the pencil reparameterization, from the Min-sum diffusion algorithm of [100]. In the terminology of [100], a pencil¹ consists of a label $a$ for a variable $x_i$, together with all the values $f_C(x_C)$ of a single clique $C$, for just the labels $x_C$ with $x_i = a$. We can get a reparameterization by subtracting a value $\delta$ from all the $f_C(x_C)$ in the pencil, and adding the same $\delta$ to the unary term $f_i(a)$.
\[ f'_C(x_C) = \begin{cases} f_C(x_C) - \delta & x_i = a \\ f_C(x_C) & \text{otherwise} \end{cases} \qquad f'_i(x_i) = \begin{cases} f_i(x_i) + \delta & x_i = a \\ f_i(x_i) & \text{otherwise} \end{cases} \tag{2.6} \]

This transformation doesn't change the cost: if $x$ happens to have $x_i = a$ then whatever the rest of the labels in $x_C$ are, we subtracted $\delta$ from $f_C(x_C)$ while adding $\delta$ to $f_i(x_i)$, and these cancel out. And if $x_i \neq a$ then we didn't change the value of $f_C$ or $f_i$.

2.2 Pseudoboolean functions

In the following two sections, we will restrict our attention to binary problems; that is, MRFs where the label set for each variable is $\{0, 1\}$. We have already seen that binary MRFs are of particular importance for graph-cuts methods, since the min-cut problem is itself binary (each vertex is either on the $s$ or $t$ side of the cut). Consequently, such functions have attracted a good deal of research in the combinatorial optimization literature, as far back as [34], where they are known as pseudoboolean functions.

¹ Presumably so-called because in the graph construction of [100] a pencil consists of edges from nodes representing all the $f_C(x_C)$ to the single node representing $x_i$. These edges come to a point, thus resembling a pencil tip.

Definition 21. A pseudoboolean function (of $n$ variables) is a function $f : \{0, 1\}^V \to \mathbb{R}$.

As with MRFs, many pseudoboolean functions have local structure in which they can be written as a sum of clique functions, so that $f(x) = \sum_C f_C(x_C)$. Each $f_C : \{0, 1\}^C \to \mathbb{R}$ is also a pseudoboolean function, defined just on the clique $C$. A pseudoboolean function $f$ is higher-order if $f$ has cliques $C$ of size 3 or greater, and $f$ is first-order if $|C| \leq 2$ for all cliques.

2.2.1 Representations of MRFs

For all MRFs (including multi-label MRFs), there are two main representations worth mentioning.

• In the explicit representation, the MRF $f$ is given as a table of values for each clique function $f_C$. That is, for every $C \in \mathcal{C}$ and $x_C \in X_C$, the value $f_C(x_C)$ is given as part of the input to the optimization algorithm.
• In the implicit (or black box) representation, the clique structure $\mathcal{C}$ is given explicitly (as a collection of subsets $C \subseteq V$) but each clique function $f_C$ is given as an oracle. That is, there is a piece of code or an algorithm which computes $f_C$ given $x_C$.

Generally, we assume that MRFs are given in the implicit representation, since this is the more general form², but both make sense in different scenarios. However, for the binary case, there are additional representations which make use of the combinatorial structure of binary labels: set functions and multilinear polynomials.

² The black box is more general from the point of view of the algorithm, since a black box algorithm can be used on any other representation.

2.2.2 Set Functions

In the binary segmentation example (Section 1.3), we used a mapping between binary labelings $\{0, 1\}^V$ and subsets $S \subseteq V$ in order to convert between minimum cuts in a graph and segmentations. In general, we can use this conversion to map any pseudoboolean function on $\{0, 1\}^V$ to a function taking subsets $S \subseteq V$ as input. Such functions are called set functions.

Definition 22. A set function with base set $V$ is a function $f : 2^V \to \mathbb{R}$.

We can convert from boolean vectors $x \in \{0, 1\}^V$ to subsets $S \subseteq V$ by defining

\[ S(x) = \{ i \in V \mid x_i = 1 \}. \tag{2.7} \]

In the other direction, we get a boolean vector $x$ from a subset $S$ by setting $x_i = 1$ if $i \in S$ and 0 otherwise. We'll denote this map $x(S)$. It is easily checked that these maps are inverses of each other.

From this bijection, it is clear that we can just as easily define pseudoboolean functions to be functions $f : 2^V \to \mathbb{R}$, i.e., set functions. In many of the following results, it will be convenient to use either the boolean vector representation or the set function representation depending on the situation, so we will switch between them freely.

Finally, note that we have chosen the convention that $x_i = 1$ if and only if $i \in S$.
The other choice of $x_i = 0$ if and only if $i \in S$ is also valid, but less common.

2.2.3 Multilinear Polynomials

A distinguishing feature of the set $\{0, 1\}$ is that for both $x = 0$ and $x = 1$ we have $x^2 = x$. This lets us write polynomial functions of $x$ very simply: higher powers $x_i^d$ collapse to linear terms $x_i$. For example, for $x \in \{0, 1\}^V$ we have $7 x_1^3 x_3^2 x_4^4 = 7 x_1 x_3 x_4$. That is, all polynomials over $\{0, 1\}^V$ are multilinear.

Therefore, to specify a monomial, we don't need the powers on each variable, only the subset $H \subseteq V$ containing the variables in the monomial, and a coefficient $a_H$. In the example $7 x_1^3 x_3^2 x_4^4$, we have $H = \{1, 3, 4\}$ and $a_H = 7$. With this notation, every monomial can be written $a_H \prod_{i \in H} x_i$ for some $H \subseteq V$ and $a_H \in \mathbb{R}$.

Definition 23. A multilinear polynomial is a pseudoboolean function of the form

\[ f(x) = \sum_{H \in \mathcal{H}} a_H \prod_{i \in H} x_i \tag{2.8} \]

where $\mathcal{H} \subseteq 2^V$ is the collection of variable sets of the monomials, and $a_H \in \mathbb{R}$ are the coefficients for each term.

2.2.4 Properties of Multilinear Polynomials

A non-obvious fact is that every pseudoboolean function can be written as a multilinear polynomial. In fact, this multilinear representation is unique, up to ignoring terms with coefficient $a_H = 0$. Since adding terms with $a_H = 0$ doesn't change the polynomial, we get a canonical form by including the terms for every $H \in 2^V$ (setting $a_H = 0$ for $H \in 2^V \setminus \mathcal{H}$).

Lemma 24. The multilinear polynomial representation of a pseudoboolean function $f$ is unique. In particular, the coefficients $a_H$ of $f$ are given by

\[ a_H = \sum_{S \subseteq H} (-1)^{|H \setminus S|} f(x(S)) \tag{2.9} \]

where $x(S)$ is the binary vector corresponding to the set $S$ (see Section 2.2.2).

We can simplify the proof of this lemma by noting the following formula for $f(x)$ in terms of the coefficients $a_H$:

Proposition 25. For a multilinear polynomial $f$, we have

\[ f(x(S)) = \sum_{H \subseteq S} a_H. \tag{2.10} \]

Proof. For $H \in \mathcal{H}$, we have two cases:

• If $H \subseteq S$, then in the binary vector $x(S)$, $x_i = 1$ for all $i \in H$, so $a_H \prod_{i \in H} x_i = a_H$.

• If $H \not\subseteq S$ then there is some $j \in H$ with $j \notin S$.
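So, in $x(S)$, we have $x_j = 0$, and hence $a_H \prod_{i \in H} x_i = 0$.

Therefore, we have that $a_H \prod_{i \in H} x_i = a_H [H \subseteq S]$, where $[H \subseteq S]$ is 1 if $H \subseteq S$ and 0 otherwise, so

\[ f(x(S)) = \sum_{H \in \mathcal{H}} a_H \prod_{i \in H} x_i = \sum_{H \in \mathcal{H}} a_H [H \subseteq S] = \sum_{H \subseteq S} a_H. \tag{2.11} \]

Using this fact, the lemma follows:

Proof of Lemma 24. We prove this by induction on $|H|$. For $H = \emptyset$, we have

\[ f(x(\emptyset)) = \sum_{H' \subseteq \emptyset} a_{H'} = a_\emptyset \tag{2.12} \]

so $a_\emptyset = f(x(\emptyset)) = (-1)^{|\emptyset|} f(x(\emptyset)) = \sum_{S \subseteq \emptyset} (-1)^{|\emptyset \setminus S|} f(x(S))$.

Then, for general $H$ we have

\[ f(x(H)) = \sum_{H' \subseteq H} a_{H'} = a_H + \sum_{H' \subsetneq H} a_{H'}. \tag{2.13} \]

By induction we expand out $a_{H'}$ to get

\[ \begin{aligned} f(x(H)) &= a_H + \sum_{H' \subsetneq H} \sum_{S \subseteq H'} (-1)^{|H' \setminus S|} f(x(S)) && (2.14) \\ &= a_H + \sum_{S \subseteq H} \; \sum_{H' : S \subseteq H' \subsetneq H} (-1)^{|H' \setminus S|} f(x(S)) && (2.15) \\ &= a_H + \sum_{S \subseteq H} f(x(S)) \sum_{H' : S \subseteq H' \subsetneq H} (-1)^{|H' \setminus S|} && (2.16) \\ &= a_H + \sum_{S \subsetneq H} -(-1)^{|H \setminus S|} f(x(S)). && (2.17) \end{aligned} \]

The last step uses the fact that for $S \subsetneq H$ the alternating sum $\sum_{H' : S \subseteq H' \subseteq H} (-1)^{|H' \setminus S|}$ over all intermediate sets is zero, so removing the term $H' = H$ leaves $-(-1)^{|H \setminus S|}$; for $S = H$ the inner sum is empty. Rearranging, we have

\[ \begin{aligned} a_H &= f(x(H)) + \sum_{S \subsetneq H} (-1)^{|H \setminus S|} f(x(S)) && (2.18) \\ &= \sum_{S \subseteq H} (-1)^{|H \setminus S|} f(x(S)). && (2.19) \end{aligned} \]

2.2.5 Computational Complexity and Hardness of Approximation

Pseudoboolean functions are closely linked with the family of NP-complete problems relating to boolean satisfiability. In particular, it is trivial to give a reduction from the maximum satisfiability problem (MAX-SAT) to the pseudoboolean optimization problem.

Recall that in boolean satisfiability problems, the literals (the boolean variables $x_i$, along with their negations $\bar{x}_i$) are combined into logical formulas by the operators conjunction, $\wedge$, and disjunction, $\vee$. In the MAX-SAT problem, we have a set of clauses, each of which is a disjunction of a subset of the literals, for example, $C = x_1 \vee \bar{x}_3 \vee \bar{x}_9$. MAX-SAT is an optimization problem: our objective is to find a setting of the variables which maximizes the total number of satisfied clauses.

It is clear that this objective is already a pseudoboolean function: $f(x) = |\{C \mid C \text{ is satisfied}\}|$ is a function $\{0, 1\}^n \to \mathbb{R}$. We can make this a minimization problem by minimizing $-f$, hence we have given a reduction from MAX-SAT to minimization of pseudoboolean functions, and therefore we have:

Theorem 26. Minimization of pseudoboolean functions is NP-hard.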
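The MAX-SAT objective and the coefficient formula (2.9) can be tried out together on a toy instance. The following Python sketch builds the objective for two made-up clauses, computes its multilinear coefficients by brute force, and evaluates the polynomial via (2.10). The helper names and the clause encoding are assumptions for illustration, and the enumeration is exponential in the number of variables, so this is only suitable for tiny examples.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def indicator(S, variables):
    """The binary vector x(S), stored as a dict from variable to 0 or 1."""
    return {i: 1 if i in S else 0 for i in variables}


def max_sat_objective(clauses):
    """Return f(x) = number of satisfied clauses.

    Each clause is a list of (variable, is_positive) literals.
    """
    def f(x):
        return sum(
            1 for clause in clauses
            if any(x[v] == (1 if positive else 0) for v, positive in clause)
        )
    return f


def multilinear_coefficients(f, variables):
    """Nonzero coefficients a_H computed from formula (2.9)."""
    coeffs = {}
    for H in subsets(variables):
        a_H = sum((-1) ** len(H - S) * f(indicator(S, variables))
                  for S in subsets(H))
        if a_H != 0:
            coeffs[H] = a_H
    return coeffs


def evaluate_polynomial(coeffs, x):
    """Evaluate the polynomial: sum of a_H over H whose variables are all 1."""
    return sum(a_H for H, a_H in coeffs.items() if all(x[i] == 1 for i in H))


# Toy instance: (x1 OR NOT x2) AND-counted-with (x2 OR x3).
variables = [1, 2, 3]
clauses = [[(1, True), (2, False)], [(2, True), (3, True)]]
f = max_sat_objective(clauses)
coeffs = multilinear_coefficients(f, variables)
```

For this instance the recovered polynomial is $1 + x_3 + x_1 x_2 - x_2 x_3$, which can be confirmed by expanding each clause by hand.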
Even restricting ourselves just to the simplest case of first-order pseudoboolean functions doesn't make things any easier. Like MRFs, satisfiability problems are also distinguished by their order: MAX-2SAT restricts all clauses to contain at most 2 literals, and is still NP-hard. In this case, the objective function $f$ can be written as a first-order pseudoboolean function. Denote the set of clauses as $\mathcal{C}$. For a clause $C$, we get a clique function³

\[ f_C(x_i, x_j) = \begin{cases} 1 & C \text{ satisfied by } x_i, x_j \\ 0 & \text{otherwise} \end{cases} \tag{2.20} \]

³ We are abusing notation to use $C$ for both the clause and the corresponding clique $C$, since both are just a collection of variables.

In this case, the MAX-2SAT objective is

\[ f(x) = \sum_{C \in \mathcal{C}} f_C(x_i, x_j). \tag{2.21} \]

Maximizing $f$ is the same as minimizing $-f$, which is a first-order pseudoboolean minimization problem, and therefore we have:

Theorem 27. Minimizing first-order pseudoboolean functions is NP-hard.

Having established that even simple pseudoboolean functions are NP-hard to optimize, we can ask the secondary question of whether any approximation algorithm is possible.

Definition 28. For a number $\alpha \geq 1$, we say that a set of minimization problems $\{f\}$ can be $\alpha$-approximated if there is a polynomial time algorithm which, for any instance $f$, returns an assignment of the variables $x$ with $f(x) \leq \alpha \, \mathrm{OPT}(f)$. Such an algorithm is called an $\alpha$-approximation for $\{f\}$.

In other words, we may not be able to find an optimal solution $x^*$ in polynomial time, but we can at least find a solution $x$ with objective cost at most $\alpha$ times the optimal cost. Clearly, we want $\alpha$ as close to 1 as possible, so that the cost of our polynomial-time-computable solutions will be not far from the optimum.

For pseudoboolean functions, the question of approximability is somewhat complicated by the fact that the optimal value may be negative. We cannot have $f(x) \leq \alpha \, \mathrm{OPT}(f)$ for $\alpha > 1$ and $\mathrm{OPT}(f) < 0$, since this would mean $f(x) < \mathrm{OPT}(f)$, which is impossible.
So the above definition is meaningless for pseudoboolean functions which can be negative. To get around issues with negative objectives, we define $P_+$ to be the set of all strictly positive pseudoboolean functions, and $P_+^1$ to be all the strictly positive first-order pseudoboolean functions.⁴ That is:

\[ \begin{aligned} P_+ &= \{ f : \{0, 1\}^n \to \mathbb{R} \mid f(x) > 0,\ \forall x \in \{0, 1\}^n \} \\ P_+^1 &= \{ f : \{0, 1\}^n \to \mathbb{R} \mid f(x) > 0,\ \forall x \in \{0, 1\}^n,\ f \text{ is first order} \} \end{aligned} \tag{2.22} \]

⁴ Finding an algorithm to optimize $P_+$ or $P_+^1$ is known as a promise problem, since we are given the additional promise that the function is strictly positive.

Even with this assurance that the function is always positive, we still cannot approximately optimize first-order pseudoboolean functions.

Theorem 29. There is no $\alpha$-approximation for $P_+^1$ unless P = NP.

Proof. We will give a reduction from graph coloring, since we have the following theorem from [69].

Theorem 30 (Lund, Yannakakis). There is an $\epsilon > 0$ such that graph-coloring cannot be $n^\epsilon$-approximated, unless P = NP.

Recall that in the graph-coloring problem, we are given a graph, and we must assign colors to each node such that no neighboring nodes share a color. The minimization problem for graph-coloring is to find the minimal number of colors for which there is a valid coloring.

Let $G = (V, E)$ be the given graph. Let $n = |V|$. Any graph is $n$-colorable, since we can just give every node its own distinct color. So, colorings of $G$ are functions $c : V \to \{1, \ldots, n\}$ with $c(i) \neq c(j)$ for $\{i, j\} \in E$. The graph-coloring problem is to minimize the number of used colors:

\[ \min_c |c(V)| \tag{2.23} \]

We denote the number of colors used by a coloring $c$ as $\chi(c)$, and the minimum possible number of colors as $\chi(G)$.

First, we'll write graph coloring as a higher-order pseudoboolean function. Graph coloring is a multi-label problem (each node can take a label from $\{1, \ldots, n\}$), however we can make it a binary problem by introducing variables $x_{i,j}$ where $x_{i,j} = 1$ if node $i$ takes color $j$, and 0 otherwise.
The binary vector $x$ defines a valid coloring if exactly one $x_{i,j} = 1$ for each $i$, and if whenever $i, i'$ are neighbors we never have $x_{i,j} = x_{i',j} = 1$. The corresponding coloring $c$ is $c(i) = j$ for the unique $j$ with $x_{i,j} = 1$. Then, the graph-coloring objective is a single higher-order term

\[ f(x) = \begin{cases} \chi(c) & x \text{ gives a valid coloring } c \\ n + 1 & \text{otherwise} \end{cases} \tag{2.24} \]

Finally, we can reduce this higher-order pseudoboolean function to first order, using the results from later in this thesis (Chapter 4). In particular, there exists a function $g$ (which we can compute in polynomial time) such that $g$ is a function of $x$ and some auxiliary variables $y$ with $\min_y g(x, y) = f(x)$. The minimum value of $g$ is $\min_{x,y} g(x, y) = \min_x f(x) = \chi(G)$, the same as the minimum value of the original coloring problem.

The function $g$ is in $P_+^1$, so if there exists an $\alpha$-approximation algorithm for this class, then it will compute a solution $(x', y')$ in polynomial time, with $g(x', y') \leq \alpha \chi(G)$. Throwing away the auxiliary variables $y'$, we can consider the cost of $x'$ alone. We have $f(x') = \min_y g(x', y) \leq g(x', y') \leq \alpha \chi(G)$. It's possible that $x'$ is not a valid coloring, but in this case $f(x') = n + 1$, so we replace $x'$ with the labeling corresponding to the coloring that gives every node a distinct color (i.e., $c(i) = i$ for $i = 1, \ldots, n$). This can only reduce $f(x')$, so now we have a valid coloring $c$ with $\chi(c) = f(x') \leq \alpha \chi(G)$, and thus we would have an $\alpha$-approximation for graph coloring.

Therefore, an $\alpha$-approximation algorithm for $P_+^1$ would give an $\alpha$-approximation algorithm for graph coloring, which we know cannot happen unless P = NP.

2.3 Submodular Functions

Despite the hardness of optimizing general pseudoboolean functions, there is an important subclass where we can efficiently do exact minimization. These functions, called submodular, are the basis for generalizing the min-cut problem to higher-order MRFs.
2.3.1 Decreasing Marginal Gains

Submodular functions are usually defined as set functions $f : 2^V \to \mathbb{R}$. Since we have already noted the equivalence between set functions and pseudoboolean functions (Section 2.2.2), all of these definitions will translate to pseudoboolean functions as well.

Intuitively, submodular functions capture the notion of diminishing marginal gains: specifically, diminishing marginal gains over a set of discrete binary choices.

As an example,⁵ consider the two (non-exclusive) choices of whether or not to have cake, and whether or not to have cookies. Having either cake or cookies is certainly better than having nothing, and having both cake and cookies is better than having either alone. However, having both is perhaps a bit too much, and the added benefit of having "cookies with cake" over "just having cake" isn't as large as that of having cookies over nothing. That is, the marginal gain of adding cookies has decreased. An example table of utilities is in Table 2.1.

              No Cake   Cake
No Cookies       0        5
Cookies          3        7

Table 2.1: Utilities for various dessert options. The marginal gain for adding cookies to nothing is $3 - 0 = 3$, whereas the marginal gain of adding cookies to {Cake} is $7 - 5 = 2$.

To formalize this notion, we have a ground set $V$ of choices, of which we may pick any subset. Each subset is assigned a value, or objective, $f(S) \in \mathbb{R}$. The marginal benefit of adding $i \in V$ to a subset $S \subseteq V$ is the difference in values $f(S \cup \{i\}) - f(S)$. To simplify notation, we will write $S \cup \{i\}$ as $S + i$, so this difference is $f(S + i) - f(S)$. To say we have decreasing marginal gain means that if we take a larger set $T$ (where larger means $T \supseteq S$), then the marginal gain $f(T + i) - f(T)$ is smaller.

Definition 31. A function $f : 2^V \to \mathbb{R}$ is submodular if for every $S \subseteq T \subseteq V$, and $i \in V \setminus T$, we have

\[ f(S + i) - f(S) \geq f(T + i) - f(T). \tag{2.25} \]

⁵ Admittedly whimsical.
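Definition 31 can be checked directly on small examples. The sketch below tests inequality (2.25) by brute force for the dessert utilities of Table 2.1. The function names are illustrative assumptions, the second table is a made-up counterexample, and the enumeration is exponential in $|V|$, so it is only meant for tiny ground sets.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_submodular(f, ground_set):
    """Check (2.25) for every S within T within V and every i outside T."""
    ground = frozenset(ground_set)
    for T in subsets(ground):
        for S in subsets(T):
            for i in ground - T:
                gain_small = f(S | {i}) - f(S)
                gain_large = f(T | {i}) - f(T)
                if gain_small < gain_large:
                    return False
    return True


# Utilities from Table 2.1.
utility = {
    frozenset(): 0,
    frozenset({"cake"}): 5,
    frozenset({"cookies"}): 3,
    frozenset({"cake", "cookies"}): 7,
}
dessert_is_submodular = is_submodular(utility.__getitem__, {"cake", "cookies"})

# Hypothetical variant where having both is worth more than the sum of the parts.
synergy = dict(utility)
synergy[frozenset({"cake", "cookies"})] = 9
synergy_is_submodular = is_submodular(synergy.__getitem__, {"cake", "cookies"})
```

The first table passes because adding cookies to cake gains 2, which is no more than the gain of 3 from adding cookies to nothing; in the variant the gain grows to 4, violating (2.25).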
2.3.2 Equivalent Definitions of Submodularity

There are many different properties equivalent to Definition 31, each useful in different situations.

Theorem 32. The following are equivalent:

1. $f$ is submodular.

2. For every $S, T \subseteq V$ we have
\[
f(S \cup T) + f(S \cap T) \le f(S) + f(T) \tag{2.26}
\]

3. For every $S \subseteq V$ and $i, j \in V \setminus S$ with $i \ne j$ we have
\[
f(S) + f(S + i + j) \le f(S + i) + f(S + j) \tag{2.27}
\]

Frequently, the second condition above is taken as the primary definition; however, since they are equivalent, we could have taken any of them as the definition of submodular. The third condition is notable for using many fewer constraints than the others (only $O(n^2 2^n)$ as opposed to $O(2^{2n})$ for (1) and (2)).

Proof. For (1) ⇒ (3), note that $S \subseteq S + j$, so from (2.25) we have

\[
f(S + i) - f(S) \ge f(S + i + j) - f(S + j) \tag{2.28}
\]

which can be rearranged to get (2.27) for every $S \subseteq V$ and $i, j \notin S$.

For (3) ⇒ (2), let $S, T$ be any subsets of $V$. If $S \subseteq T$ then $S \cap T = S$ and $S \cup T = T$, so the inequality (2.26) trivially holds. The same is true if $T \subseteq S$, so we will now consider the case where $S \not\subseteq T$ and $T \not\subseteq S$.

Let $R = S \cap T$. Enumerate the elements of $T \setminus S$ as $T \setminus S = \{i_1, \ldots, i_{k_1}\}$ and the elements of $S \setminus T$ as $S \setminus T = \{j_1, \ldots, j_{k_2}\}$. Consider the following sum, over all pairs $i_k, j_{k'}$:

\[
\sum_{k=1}^{k_1} \sum_{k'=1}^{k_2} \Big[ f(R + i_1 + \cdots + i_k + j_1 + \cdots + j_{k'}) - f(R + i_1 + \cdots + i_{k-1} + j_1 + \cdots + j_{k'}) - f(R + i_1 + \cdots + i_k + j_1 + \cdots + j_{k'-1}) + f(R + i_1 + \cdots + i_{k-1} + j_1 + \cdots + j_{k'-1}) \Big] \tag{2.29}
\]

Simplify the above by letting $S_{k,k'} = R + i_1 + \cdots + i_{k-1} + j_1 + \cdots + j_{k'-1}$, and we get that (2.29) is

\[
\sum_{k=1}^{k_1} \sum_{k'=1}^{k_2} \Big[ f(S_{k,k'} + i_k + j_{k'}) - f(S_{k,k'} + j_{k'}) - f(S_{k,k'} + i_k) + f(S_{k,k'}) \Big] \tag{2.30}
\]

Note that $S_{1,1} = S \cap T$, $S_{k_1+1,1} = T$, $S_{1,k_2+1} = S$ and $S_{k_1+1,k_2+1} = S \cup T$. Each term of this sum is a rearrangement of (2.27), applied to $S_{k,k'}$, so the whole sum is $\le 0$.
We can split up this sum into 4 separate summations, and reindexing we get that (2.30) is equal to

\[
\begin{aligned}
& \sum_{k=1}^{k_1} \sum_{k'=1}^{k_2} f(S_{k+1,k'+1}) - \sum_{k=1}^{k_1} \sum_{k'=1}^{k_2} f(S_{k+1,k'}) - \sum_{k=1}^{k_1} \sum_{k'=1}^{k_2} f(S_{k,k'+1}) + \sum_{k=1}^{k_1} \sum_{k'=1}^{k_2} f(S_{k,k'}) \\
&= \sum_{k=2}^{k_1+1} \sum_{k'=2}^{k_2+1} f(S_{k,k'}) - \sum_{k=2}^{k_1+1} \sum_{k'=1}^{k_2} f(S_{k,k'}) - \sum_{k=1}^{k_1} \sum_{k'=2}^{k_2+1} f(S_{k,k'}) + \sum_{k=1}^{k_1} \sum_{k'=1}^{k_2} f(S_{k,k'}) \\
&= f(S_{k_1+1,k_2+1}) - f(S_{k_1+1,1}) - f(S_{1,k_2+1}) + f(S_{1,1}) \\
&= f(S \cup T) - f(T) - f(S) + f(S \cap T)
\end{aligned} \tag{2.31}
\]

Therefore, $f(S \cup T) + f(S \cap T) \le f(S) + f(T)$ for all $S, T \subseteq V$, so we have that (3) ⇒ (2).

Finally, for (2) ⇒ (1), assume that (2.26) holds, and take any $S \subseteq T \subseteq V$ and $i \in V \setminus T$. Then $(S + i) \cup T = T + i$ and $(S + i) \cap T = S$, so we have

\[
f(T + i) + f(S) = f((S + i) \cup T) + f((S + i) \cap T) \le f(S + i) + f(T) \tag{2.32}
\]

Rearranging, we get that (2.25) holds, so $f$ is submodular.

2.3.3 Properties of Submodular Functions

Submodular functions are notable for sharing many properties with convex functions, and consequently they fill a similar role in discrete optimization problems as convex functions do for continuous optimization. See [68] for an excellent summary of the connections between convex and submodular functions.

The basic calculus of submodular functions is that they are closed under addition and multiplication by positive constants (but not subtraction or multiplication by negative constants).

Lemma 33. Submodular functions are closed under nonnegative linear combinations. That is, if $f_1, \ldots, f_k$ are submodular and $a_1, \ldots, a_k \in \mathbb{R}$ are nonnegative, then $a_1 f_1 + \cdots + a_k f_k$ is submodular.

Proof. Since $f_i$ is submodular and $a_i \ge 0$ we have $a_i f_i(S \cap T) + a_i f_i(S \cup T) \le a_i f_i(S) + a_i f_i(T)$. Sum these inequalities together, and we get

\[
\sum_i a_i f_i(S \cap T) + \sum_i a_i f_i(S \cup T) \le \sum_i a_i f_i(S) + \sum_i a_i f_i(T) \tag{2.33}
\]

A powerful tool in the analysis of convex functions is the subdifferential: linear functions $\ell(x)$ which are also lower bounds, $\ell(x) \le f(x)$.
Submodular functions similarly have linear lower bounds, called subbases. For binary labels, linear functions can be defined by a vector $\psi \in \mathbb{R}^V$. For each such $\psi$, we get a linear function (also called $\psi$ by abuse of notation) with $\psi(S) = \sum_{i \in S} \psi_i$.

Definition 34. For a submodular function $f$, a subbase is a linear function $\psi$ with $f(S) \ge \psi(S)$ for all $S \subseteq V$. A base of $f$ is a subbase with $\psi(V) = f(V)$.

There is a simple algorithm to compute a base for any submodular function $f$, as long as $f(\emptyset) \ge 0$. (If $f(\emptyset) < 0$ then there are no subbases of $f$ at all, since $\psi(\emptyset) = 0$ by definition, and we require $f(\emptyset) \ge \psi(\emptyset)$.) In fact, we can greedily construct this vector $\psi$ (this is known as Edmonds' algorithm [16]). We let $\psi_1 = f(\{1\})$ and for $i = 2, \ldots, n$ we set $\psi_i = f(\{1, \ldots, i\}) - f(\{1, \ldots, i - 1\})$.

Lemma 35. The vector $\psi$ defined above satisfies $f(S) \ge \psi(S)$ for all $S \subseteq V$ and $f(V) = \psi(V)$.

Proof. We prove the lemma inductively on the size of $S$. For $S = \emptyset$ we have $f(\emptyset) \ge \psi(\emptyset) = 0$. For $S \ne \emptyset$ let $i$ be the largest element of $S$. If $i = 1$ then $S = \{1\}$ and $f(S) = \psi_1 = \psi(S)$ directly. Otherwise, since $f$ is submodular, it has decreasing marginal gains, and $S - i \subseteq \{1, \ldots, i - 1\}$, so

\[
\begin{aligned}
f(S) &= f(S) - f(S - i) + f(S - i) \\
&\ge f(\{1, \ldots, i - 1\} + i) - f(\{1, \ldots, i - 1\}) + f(S - i) \\
&= \psi_i + f(S - i) \\
&\ge \psi_i + \sum_{j \in S - i} \psi_j = \psi(S)
\end{aligned} \tag{2.34}
\]

where the last inequality is the inductive hypothesis applied to $S - i$. For the second part of the claim, the sum telescopes: $\psi(V) = f(\{1\}) + \sum_{i=2}^{n} \big[ f(\{1, \ldots, i\}) - f(\{1, \ldots, i - 1\}) \big] = f(V)$.

Submodular functions have additional structure, beyond their similarity to convex functions. In particular, the set of all minimizers of a submodular function is closed under intersections and unions. This follows easily from one of the equivalent conditions for submodularity, Theorem 32. If $S^*$ and $T^*$ are both minimizers, then

\[
f(S^* \cap T^*) + f(S^* \cup T^*) \le f(S^*) + f(T^*) \tag{2.35}
\]

and since $S^*$ and $T^*$ are minimizers, each term on the left is at least $f(S^*) = f(T^*)$, so we must have $f(S^* \cap T^*) = f(S^* \cup T^*) = f(S^*)$.
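Returning to the greedy construction of Lemma 35, the following is a minimal Python sketch of it, under stated assumptions: the elements are numbered $1, \ldots, n$, the function is given as a Python callable on frozensets, and the example function $f(S) = \min(|S|, 2)$ is chosen here purely for illustration (it is submodular with $f(\emptyset) = 0$). The helper names are hypothetical.

```python
from itertools import combinations


def subsets(elements):
    """Yield every subset of `elements` as a frozenset."""
    elements = list(elements)
    for size in range(len(elements) + 1):
        for combo in combinations(elements, size):
            yield frozenset(combo)


def greedy_base(f, n):
    """psi_1 = f({1}); psi_i = f({1..i}) - f({1..i-1}) for i >= 2."""
    psi = {}
    prefix = frozenset()
    for i in range(1, n + 1):
        longer = prefix | {i}
        psi[i] = f(longer) if i == 1 else f(longer) - f(prefix)
        prefix = longer
    return psi


def linear_value(psi, subset):
    """Evaluate the linear function psi(S) = sum of psi_i over i in S."""
    return sum(psi[i] for i in subset)


def f(subset):
    # Assumed example: truncated cardinality, submodular with f(empty) = 0.
    return min(len(subset), 2)


n = 3
ground = frozenset(range(1, n + 1))
psi = greedy_base(f, n)
# Brute-force check of both claims of Lemma 35 on this small instance.
is_subbase = all(linear_value(psi, s) <= f(s) for s in subsets(ground))
is_base = is_subbase and linear_value(psi, ground) == f(ground)
print(psi, is_subbase, is_base)
```

The construction itself uses only $n$ evaluations of $f$; the exponential enumeration appears only in the verification step.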
In fact, we can generalize this to the sets where a submodular function is equal to a given subbase.

Definition 36. If $\psi$ is a subbase of a submodular function $f$, the tight sets $\mathcal{T}(f, \psi)$ are all $S$ for which $f(S) = \psi(S)$. That is, $\mathcal{T}(f, \psi) = \{S \subseteq V \mid f(S) = \psi(S)\}$.

Lemma 37. For a submodular function $f$ and subbase $\psi$, $\mathcal{T}(f, \psi)$ is a lattice, meaning it is closed under intersection and union.

Proof. Let $S, T \in \mathcal{T}(f, \psi)$. So, in particular we have $f(S) = \psi(S)$ and $f(T) = \psi(T)$. We want to show that $S \cap T$ and $S \cup T$ are in $\mathcal{T}(f, \psi)$, so we want $f(S \cap T) = \psi(S \cap T)$ and $f(S \cup T) = \psi(S \cup T)$.

Since $f \ge \psi$ we have $f(S \cap T) \ge \psi(S \cap T)$ and $f(S \cup T) \ge \psi(S \cup T)$. Now, since $f$ is submodular, we have

\[
\psi(S \cap T) + \psi(S \cup T) \le f(S \cap T) + f(S \cup T) \le f(S) + f(T) = \psi(S) + \psi(T) \tag{2.36}
\]

Because $\psi$ is linear, we have $\psi(S) + \psi(T) = \psi(S \cap T) + \psi(S \cup T)$, by inclusion-exclusion. Therefore, the inequalities in (2.36) are all equalities, and we have

\[
f(S \cup T) + f(S \cap T) = \psi(S \cup T) + \psi(S \cap T) \tag{2.37}
\]

Finally, since $f(S \cup T) \ge \psi(S \cup T)$ and $f(S \cap T) \ge \psi(S \cap T)$, we have that $f(S \cup T) = \psi(S \cup T)$ and $f(S \cap T) = \psi(S \cap T)$.

A particularly useful special case of the above lemma is when $f$ is nonnegative, meaning $f(S) \ge 0$ for all $S \subseteq V$. In this case, 0 is a subbase of $f$, and the tight sets $\mathcal{T}(f, 0)$ are the zero-valued sets, $Z(f) = \{S \subseteq V \mid f(S) = 0\}$.

Corollary 38. If $f$ is a nonnegative submodular function, then the zero sets $Z(f)$ form a lattice.

2.3.4 Submodular First-order Pseudoboolean Functions

For checking if an arbitrary set function is submodular, the above definitions all have at least $O(n^2 2^n)$ inequalities to verify. There is some redundancy between the inequalities, but determining whether a function is submodular is NP-hard in general [104].
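To make the cost of such a check concrete, the sketch below tests condition (3) of Theorem 32, inequality (2.27), by brute force and counts how many inequalities it examines. The two example functions (the cut function of a three-node path, and $|S|^2$) are assumptions chosen for illustration, and `pairwise_violations` is a hypothetical helper name.

```python
from itertools import combinations


def subsets(elements):
    """Yield every subset of `elements` as a frozenset."""
    elements = list(elements)
    for size in range(len(elements) + 1):
        for combo in combinations(elements, size):
            yield frozenset(combo)


def pairwise_violations(f, ground):
    """Return (number of inequalities (2.27) checked, list of violations)."""
    checked = 0
    violations = []
    for base in subsets(ground):
        for i, j in combinations(sorted(ground - base), 2):
            checked += 1
            lhs = f(base) + f(base | {i, j})
            rhs = f(base | {i}) + f(base | {j})
            if lhs > rhs:
                violations.append((base, i, j))
    return checked, violations


# Assumed example: cut function of the path 1 - 2 - 3.
edges = [(1, 2), (2, 3)]


def cut(subset):
    """Number of edges with exactly one endpoint in `subset`."""
    return sum(1 for u, v in edges if (u in subset) != (v in subset))


def squared_size(subset):
    # |S|^2 has increasing marginal gains, so every inequality fails.
    return len(subset) ** 2


ground = frozenset({1, 2, 3})
checked_cut, bad_cut = pairwise_violations(cut, ground)
checked_sq, bad_sq = pairwise_violations(squared_size, ground)
print(checked_cut, len(bad_cut), len(bad_sq))
```

Even in this form the number of inequalities grows as $O(n^2 2^n)$, which is why the structural tests for first-order functions are preferable when they apply.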
For first-order pseudoboolean functions, whether or not a function is submodular is particularly easy to verify: all we need to check is that each of the pairwise coefficients (in the polynomial representation) is non-positive.

Lemma 39. A first-order pseudoboolean function, given as a multilinear polynomial

\[
f(x) = a_\emptyset + \sum_i a_i x_i + \sum_{i,j} a_{i,j} x_i x_j \tag{2.38}
\]

is submodular if and only if $a_{i,j} \le 0$ for all $i, j \in V$.

Proof. Throughout this proof, we will treat $f$ as a set function, with $f(S) := f(x(S))$, where $x(S)$ is the corresponding binary vector for the set $S$. We use the explicit representation for the function values $f(S)$ given by Prop 25:

\[
f(S) = \sum_{H \subseteq S} a_H \tag{2.39}
\]

We can relate this to one of our conditions for submodularity (Theorem 32) by expanding out $f(S) + f(S + i + j) - f(S + i) - f(S + j) \le 0$:

\[
\begin{aligned}
0 &\ge f(S) + f(S + i + j) - f(S + i) - f(S + j) && (2.40) \\
&= \sum_{H \subseteq S} a_H + \sum_{H \subseteq S + i + j} a_H - \sum_{H \subseteq S + i} a_H - \sum_{H \subseteq S + j} a_H && (2.41) \\
&= \Big( \sum_{H \subseteq S + i + j} a_H - \sum_{H \subseteq S + j} a_H \Big) - \Big( \sum_{H \subseteq S + i} a_H - \sum_{H \subseteq S} a_H \Big) && (2.42)
\end{aligned}
\]

To complete the proof, we apply a simple proposition simplifying these sums over subsets:

Proposition 40. For any set of coefficients $a_H$ defined for subsets $H \subseteq V$, and any $i \notin S$, we have

\[
\sum_{H \subseteq S + i} a_H - \sum_{H \subseteq S} a_H = \sum_{H \subseteq S} a_{H + i} \tag{2.43}
\]

To see this, note that $2^{S + i} \setminus 2^S = \{H + i \mid H \in 2^S\}$. Using this fact, we have

\[
\begin{aligned}
& f(S) + f(S + i + j) - f(S + i) - f(S + j) && (2.44) \\
&= \sum_{H \subseteq S + j} a_{H + i} - \sum_{H \subseteq S} a_{H + i} && (2.45) \\
&= \sum_{H \subseteq S} a_{H + i + j} = a_{i,j} && (2.46)
\end{aligned}
\]

where the last line follows from $a_H = 0$ for $|H| > 2$ (because $f$ is first-order). Therefore, we have that $f(S) + f(S + i + j) - f(S + i) - f(S + j) \le 0$ if and only if $a_{i,j} \le 0$, hence $f$ is submodular if and only if $a_{i,j} \le 0$ for all $i, j$.

Another way to recognize first-order submodular functions is to check whether each individual pairwise term is submodular.
First, note that $f_{i,j}(x_i, x_j)$ is submodular if and only if a single inequality holds:

\[
f_{i,j}(0, 0) + f_{i,j}(1, 1) \le f_{i,j}(1, 0) + f_{i,j}(0, 1) \tag{2.47}
\]

Then, since sums of submodular functions are submodular (Lemma 33) we have

Lemma 41. A first-order pseudoboolean function

\[
f(x) = \sum_i f_i(x_i) + \sum_{i,j} f_{i,j}(x_i, x_j) \tag{2.48}
\]

is submodular if for each $i, j$ the term $f_{i,j}$ is submodular (i.e., it satisfies (2.47)).

Note that this is a sufficient but not a necessary condition.

2.4 Local and Marginal Polytopes for MRFs

We will now begin to consider multilabel problems, i.e., those with more than just two labels. The next several sections build up to a key theoretical tool called the Local Marginal Polytope, which is a linear programming formulation of the MRF inference problem. In particular, we want to introduce this linear programming relaxation, as well as the major tools for dealing with linear programs, including duality and complementary slackness.

A major difficulty in optimizing MRFs is that they are discrete problems: there is a combinatorial space $\prod_i X_i$ of possible states for $x$, and as we have seen, it is difficult to make global statements about the function $f$ (since efficiently finding either the minimum or any constant approximation to it would mean P = NP).

Contrast this with optimizing a convex, continuous function $f : \mathbb{R}^n \to \mathbb{R}$. A major feature of such functions is that every local optimum is also a global optimum, so simple algorithms (including gradient descent) that reach a local optimum have reached a global one. In this section we will show how MRF optimization can be cast as a particular kind of convex minimization problem called Linear Programming.

2.4.1 Weighted Averages as Linear Programs

To motivate Linear Programming, we will consider a simple example: finding the minimum element over a finite set $\{a_1, \ldots, a_k\}$. The brute-force approach is of course to simply examine each element in turn, and remember the smallest.
If we want to turn this into a continuous minimization problem instead, one way to do this is to consider weighted averages of the elements $a_i$. That is, we have non-negative weights $\mu_i$ for each $a_i$ with the weights summing to 1 (i.e., $\sum_i \mu_i = 1$), where the resulting weighted average, $a_\mu$, is equal to the sum

\[
a_\mu = \sum_i \mu_i a_i. \tag{2.49}
\]

We know that the weighted average has to be between the minimum and maximum elements, so $a_\mu \ge \min_i \{a_i\}$. And, if we put all the weight on the minimizing $i^*$ (i.e., $\mu^*_{i^*} = 1$ and $\mu^*_j = 0$ for $j \ne i^*$), then we have $a_{\mu^*} = a_{i^*} = \min_i \{a_i\}$. In other words, the minimum over all weighted averages of the $a_i$ is exactly the minimum element:

\[
\min_{\mu : \mu \ge 0,\ \sum_i \mu_i = 1} \sum_i \mu_i a_i = \min_i \{a_i\} \tag{2.50}
\]

This is a continuous optimization problem with a linear objective $\sum_i \mu_i a_i$, and linear constraints $\sum_i \mu_i = 1$ and $\mu_i \ge 0$ for $i = 1, \ldots, k$, which makes (2.50) an example of a linear program.

2.4.2 Marginal polytopes

Recall that an MRF is called discrete if the state space $X$ is finite. Equivalently, each variable $x_i$ has a finite label set, $|X_i| < \infty$. Let $f(x) = \sum_C f_C(x_C)$ be a discrete MRF, with $X = \prod_i X_i$. For simplicity, we will assume that $|X_i| = \ell$ for all $i$, although nothing in the following breaks if we have non-uniformly-sized label sets.

Given the discussion in the previous section, we can convert this discrete, combinatorial optimization problem to a continuous linear program by minimizing the weighted average of all solutions. We have a weight for every state $x \in X$, which we will write as $\mu(x)$, and get a minimization problem

\[
\begin{aligned}
\min_{\mu}\quad & \sum_x \mu(x) f(x) \\
\text{s.t.}\quad & \sum_x \mu(x) = 1 \\
& \mu \ge 0
\end{aligned} \tag{2.51}
\]

Note that in the above, $x$ is no longer a free variable: we are instead summing the weights $\mu(x)$ over all possible states $x$. In fact, if we treat the weighted average as instead giving a probability distribution over the states $x$, then the objective $\sum_x \mu(x) f(x)$ is exactly the expectation of $f(x)$ when each state is chosen with probability $\mu(x)$.
This explains our notation of $\mu(\cdot)$ as a probability density function on $X$.

Thinking in terms of probability distributions gives us a pre-existing toolbox with which to reason about our problem. For example, it is clear that the expectation of $f$ under the probability distribution $\mu$ is bounded between the minimum and maximum values of $f(x)$, and that if we are optimizing over all probability distributions, the best we can do is to choose the minimizing $x^*$ with probability 1 (i.e., set $\mu(x^*) = 1$ and $\mu(x) = 0$ for $x \ne x^*$).

For any discrete set $X$, let $P[X]$ be the set of all probability distributions on $X$. That is, $P[X] = \{\mu \in \mathbb{R}^X : \mu \ge 0, \sum_x \mu(x) = 1\}$. We will use $\langle \mu, f \rangle$ to denote the expectation of $f$ with respect to $\mu$, so that $\langle \mu, f \rangle = \sum_x \mu(x) f(x)$. With this notation, we get a very simple version of (2.51):

\[
\min_{\mu \in P[X]} \langle \mu, f \rangle \tag{2.52}
\]

A major drawback in optimizing over all probability distributions in $P[X]$ is that we have exploded the number of variables from $|V|$ to $\ell^{|V|}$ (with $\ell = |X_i|$ labels per variable). As a result, this linear program is much too large to solve efficiently even for small MRFs. However, we have not yet used the clique structure of the MRF $f$.

We can use the clique structure to get a much smaller number of variables. Our eventual goal is to specify a probability distribution $\mu_C \in P[X_C]$ just for each clique $C$ separately. For a clique of size $k$, this requires only $\ell^k$ variables, which is much smaller than the $\ell^n$ variables in $P[X]$ (assuming, per Remark 16, that $k \ll n$).

Our first step towards this goal is to use linearity of expectation to expand out $f$ in the objective of (2.52):

\[
\langle \mu, f \rangle = \Big\langle \mu, \sum_C f_C \Big\rangle = \sum_C \langle \mu, f_C \rangle \tag{2.53}
\]

Each of the terms $\langle \mu, f_C \rangle$ is an expectation of a function $f_C$ which only depends on a few variables, namely the subset of variables in $C$. Therefore, we only care about the probability that a clique labeling $x_C$ is chosen. That is, we are interested in the probability that $x$ restricted to $C$ is $x_C$. We can compute this by summing $\mu$ over all assignments of the remaining variables, $x_{V \setminus C}$.
To explain the notation, in the sum we get the combined vector $x = (x_C, x_{V \setminus C})$. We will write $\mu|_C$ for the marginal distribution of $\mu$ on the subset of variables in $C$. This marginal probability is

\[
\mu|_C(x_C) = \sum_{x_{V \setminus C}} \mu(x) \tag{2.54}
\]

A useful fact about marginalization is that it preserves expectations, as long as we marginalize onto the set of variables that a function depends on.

Proposition 42. If $\mu \in P[X]$ and $f_C$ is a clique function, $f_C : X_C \to \mathbb{R}$, then $\langle \mu, f_C \rangle = \langle \mu|_C, f_C \rangle$.

Proof. This equality is just a re-grouping of the sums in the definition of expectation:

\[
\begin{aligned}
\langle \mu, f_C \rangle &= \sum_x \mu(x) f_C(x_C) \\
&= \sum_{x_C} \sum_{x_{V \setminus C}} \mu(x) f_C(x_C) \\
&= \sum_{x_C} \mu|_C(x_C) f_C(x_C) = \langle \mu|_C, f_C \rangle
\end{aligned} \tag{2.55}
\]

This lets us re-write the expectation over $\mu$ in (2.52) as a sum over expectations on each clique:

\[
\min_{\mu \in P[X]} \sum_C \langle \mu|_C, f_C \rangle \tag{2.56}
\]

This particular linear program is known as the Marginal Polytope, because the variables are marginal probabilities of the joint distribution $\mu$. This problem is still equivalent to our original MRF optimization problem, but still has the problem of an exponentially large set of variables. We can fix the latter problem, but to do so, we must move to a relaxation of our original problem. Instead of ensuring we have a global probability distribution $\mu$, we will instead have a separate probability distribution $\mu_C$ on each clique $C$, and these distributions are required to agree only locally, meaning that if $C$ and $C'$ share a variable $i$, then they agree on the marginal distribution on that single variable: $\mu_C|_i = \mu_{C'}|_i$. We denote this single distribution for $x_i$ by $\mu_i$, which we constrain to be equal to $\mu_C|_i$ for all $C$ containing $i$.

\[
\begin{aligned}
\min_{\{\mu_C \in P[X_C]\}}\quad & \sum_C \langle \mu_C, f_C \rangle \\
\text{s.t.}\quad & \mu_C|_i = \mu_i \qquad \forall C,\ i \in C
\end{aligned} \tag{2.57}
\]

This linear program is known as the Local Marginal Polytope, as we have replaced marginalization from a global joint probability distribution $\mu$ with local constraints on the consistency of the distributions $\mu_C$.
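To make the local consistency constraints of (2.57) concrete, here is a small Python sketch on an assumed toy instance: three binary variables arranged in a cycle, with a pairwise cost of 1 whenever the endpoints of an edge agree. The instance, the candidate point, and the helper names are all illustrative assumptions. The sketch checks that a particular fractional point satisfies the constraints of (2.57) and compares its objective with the best integral labeling found by enumeration.

```python
from fractions import Fraction
from itertools import product

labels = (0, 1)
nodes = (0, 1, 2)
edges = ((0, 1), (1, 2), (0, 2))


def edge_cost(a, b):
    # Assumed toy clique function: cost 1 when the two endpoints agree.
    return 1 if a == b else 0


def energy(x):
    """MRF objective for a full labeling x (one label per node)."""
    return sum(edge_cost(x[i], x[j]) for i, j in edges)


# Best integral labeling, found by enumerating all 2^3 states.
integral_opt = min(energy(x) for x in product(labels, repeat=len(nodes)))

half = Fraction(1, 2)
# Candidate point: uniform node marginals, and each edge distribution
# puts weight 1/2 on each of the two disagreeing label pairs.
mu_node = {i: {a: half for a in labels} for i in nodes}
mu_edge = {
    e: {(a, b): (half if a != b else Fraction(0)) for a in labels for b in labels}
    for e in edges
}


def locally_consistent(mu_node, mu_edge):
    """Check each edge table is a distribution matching the node marginals."""
    for (i, j), table in mu_edge.items():
        if sum(table.values()) != 1 or min(table.values()) < 0:
            return False
        for a in labels:
            if sum(table[(a, b)] for b in labels) != mu_node[i][a]:
                return False
            if sum(table[(b, a)] for b in labels) != mu_node[j][a]:
                return False
    return True


relaxed_value = sum(
    table[(a, b)] * edge_cost(a, b)
    for table in mu_edge.values()
    for (a, b) in table
)
print(integral_opt, locally_consistent(mu_node, mu_edge), relaxed_value)
```

On this instance the locally consistent point has a strictly smaller objective than any integral labeling, so no global distribution can have these edge marginals.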
This linear program is a great simplification compared to the full marginal polytope, and still provides a global lower bound on the optimum of the original (integral) problem. However, we have moved to a relaxation, so it is possible that the optimal linear programming solution may not have any corresponding integral solution with as good a value. In particular, whenever the clique functions are non-submodular, or whenever there are cycles in the graph, the local marginal polytope may not be a tight relaxation.

2.5 Linear Programming

The marginal polytopes above are particular examples of a general class of problems called Linear Programs (LPs). A Linear Program is a constrained optimization problem, with variables $x_i \in \mathbb{R}$, a linear objective, and linear constraints. For now, we will only consider the case where we have finitely many variables and constraints, although generalizations to infinitely many variables are possible (footnote 7).

The constraints in a Linear Program may be equality constraints or inequality constraints (either greater-or-equal or less-than-or-equal), and additionally, we may require that some of the variables are either non-negative or non-positive. As a result, without some unifying notation (which we will present shortly) there are many cases to consider to write down a "general LP". As an example, the following LP with 3 variables and 3 constraints has each of these possibilities:

\[
\begin{aligned}
\min_{x_1, x_2, x_3}\quad & 3x_1 - 2x_2 + 7x_3 \\
\text{s.t.}\quad & x_1 - x_2 = 3 \\
& 2x_2 + 3x_3 \ge 2 \\
& -3x_1 + x_3 \le 4 \\
& x_1 \ge 0 \\
& x_2 \le 0 \\
& x_3 \in \mathbb{R}
\end{aligned} \tag{2.58}
\]

The objective is linear in the 3 variables, as are each of the three constraints, with one each of an $=$, $\ge$ and $\le$ constraint. Additionally, $x_1$ and $x_2$ are respectively required to be non-negative and non-positive, while $x_3$ may be positive or negative.
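The bookkeeping in (2.58) can be sketched in a few lines of Python. The encoding below (a list of rows with a relation string, plus a sign tag per variable) is an assumption made here for illustration, not a standard format, and the two test points were chosen by hand; the sketch only checks feasibility and evaluates the objective, it does not solve the LP.

```python
from fractions import Fraction

# Data of LP (2.58): objective c, and rows (coefficients, relation, rhs).
c = (3, -2, 7)
constraints = [
    ((1, -1, 0), "==", 3),
    ((0, 2, 3), ">=", 2),
    ((-3, 0, 1), "<=", 4),
]
# Sign restriction of each variable: ">=0", "<=0", or "free".
signs = (">=0", "<=0", "free")


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def is_feasible(x):
    """Check every row constraint and every sign restriction of (2.58)."""
    for row, relation, rhs in constraints:
        value = dot(row, x)
        if relation == "==" and value != rhs:
            return False
        if relation == ">=" and value < rhs:
            return False
        if relation == "<=" and value > rhs:
            return False
    for xi, sign in zip(x, signs):
        if sign == ">=0" and xi < 0:
            return False
        if sign == "<=0" and xi > 0:
            return False
    return True


# Hand-picked points (assumptions): one feasible, one violating x1 - x2 = 3.
x_ok = (Fraction(3), Fraction(0), Fraction(2, 3))
x_bad = (Fraction(3), Fraction(1), Fraction(0))
print(is_feasible(x_ok), dot(c, x_ok), is_feasible(x_bad))
```

The three relation types and three sign types are exactly the cases that the index sets of the general form separate.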
In general, an LP has $m_1$ greater-than-or-equal constraints, $m_2$ less-than-or-equal constraints and $m_3$ equality constraints (with $m_1 + m_2 + m_3 = m$), as well as $n_1$ non-negative variables, $n_2$ non-positive variables, and $n_3$ variables which can be any real number (with $n_1 + n_2 + n_3 = n$). Partitioning the indices by $I_1 = \{1, \ldots, n_1\}$, $I_2 = \{n_1 + 1, \ldots, n_1 + n_2\}$ and $I_3 = \{n_1 + n_2 + 1, \ldots, n\}$ (and similarly for $J_1, J_2, J_3$) we get the general form of an LP:

\[
\begin{aligned}
\min_{x}\quad & \sum_{i=1}^{n} c_i x_i \\
\text{s.t.}\quad & \sum_{i=1}^{n} a_{j,i} x_i \ge b_j && \forall j \in J_1 \\
& \sum_{i=1}^{n} a_{j,i} x_i \le b_j && \forall j \in J_2 \\
& \sum_{i=1}^{n} a_{j,i} x_i = b_j && \forall j \in J_3 \\
& x_i \ge 0 && \forall i \in I_1 \\
& x_i \le 0 && \forall i \in I_2
\end{aligned} \tag{2.59}
\]

Footnote 7: In particular, Linear Programming relaxations for continuous MRFs (where the label space is a continuous interval $[a, b] \subseteq \mathbb{R}$) are infinite dimensional. See the author's paper [20] for an example of how to generalize the marginal polytope to continuous MRFs.

We can more easily organize these linear functions by writing them as dot products and matrix-vector products: let $b = (b_1, \ldots, b_m)$ and $c = (c_1, \ldots, c_n)$ be vectors, and $A$ be the $m \times n$ matrix with entries $a_{j,i}$. To handle the different types of constraints, let $A_{J_k}$ be the submatrix of $A$ with just the rows corresponding to $J_k$ (for $k = 1, 2, 3$) and similarly $b_{J_k}$ the subvector of $b$ with rows from $J_k$. Then we can more compactly write (2.59) as

\[
\begin{aligned}
\min_{x}\quad & c^T x \\
\text{s.t.}\quad & A_{J_1} x \ge b_{J_1} \\
& A_{J_2} x \le b_{J_2} \\
& A_{J_3} x = b_{J_3} \\
& x_{I_1} \ge 0 \\
& x_{I_2} \le 0
\end{aligned} \tag{2.60}
\]

Note that inequalities regarding vectors (such as $x \ge 0$) are always treated componentwise (i.e., $x_i \ge 0$ for all $i$).

2.5.1 Linear Cone Programming

Keeping track of which variables are positive or negative, and which constraints are equalities vs. inequalities, is tedious and can complicate equations. Simplifying this notational complexity gives us a good excuse to move to a slight generalization of Linear Programming called Cone Programming.
Fortunately, the main theorems concerning Linear Programming (especially regarding duality) are most naturally stated using the theory of cones, so we will solve two problems at once by considering conic problems here.

In Cone Programming, we require the variables and constraints to lie in a type of convex subset of $\mathbb{R}^n$ called a cone.

Definition 43. A subset $K \subseteq \mathbb{R}^n$ is a cone if $K$ is closed under addition and multiplication by nonnegative scalars $c \in \mathbb{R}$, $c \ge 0$. That is, for $x, y \in K$ we have $x + y \in K$ and for $c \ge 0$ we have $cx \in K$.

The following 4 subsets of $\mathbb{R}$ are particularly useful cones:

• $K_0 := \{0\}$
• $K_{\mathbb{R}} := \mathbb{R}$
• $K_{\ge} := \{x \in \mathbb{R} \mid x \ge 0\}$
• $K_{\le} := \{x \in \mathbb{R} \mid x \le 0\}$

It is trivial to verify that these are each closed under addition and multiplication by nonnegative scalars.

Using these cones, we can re-write the constraints of an LP: for example, the constraint $x_i \ge 0$ is the same as $x_i \in K_{\ge}$, and the constraint $\sum_i a_{j,i} x_i = b_j$ is the same as $\sum_i a_{j,i} x_i - b_j \in K_0$. This lets us re-write all our constraints as membership in a cone. Given the partition $(I_1, I_2, I_3)$ of the variables into $x_i \ge 0$, $x_i \le 0$ and unconstrained $x_i$, we get a cone $K$ defined by

\[
K = \prod_{i \in I_1} K_{\ge} \times \prod_{i \in I_2} K_{\le} \times \prod_{i \in I_3} K_{\mathbb{R}} \tag{2.61}
\]

Then, the relation $x \in K$ is identical to the intersection of the various constraints that $x_i \ge 0$ for $i \in I_1$, $x_i \le 0$ for $i \in I_2$ and $x_i \in \mathbb{R}$ for $i \in I_3$. Similarly, given the partition $(J_1, J_2, J_3)$ of the constraints into $\ge$, $\le$ and $=$ relations, we can define

\[
K' = \prod_{j \in J_1} K_{\ge} \times \prod_{j \in J_2} K_{\le} \times \prod_{j \in J_3} K_0 \tag{2.62}
\]

so that $Ax - b \in K'$ is identical to the original constraints $A_{J_1} x \ge b_{J_1}$, $A_{J_2} x \le b_{J_2}$ and $A_{J_3} x = b_{J_3}$. Therefore, the general form of an LP (2.60) is much more simply expressed as

\[
\begin{aligned}
\min_{x}\quad & c^T x \\
\text{s.t.}\quad & Ax - b \in K' \\
& x \in K
\end{aligned} \tag{2.63}
\]

2.6 Convex Sets and Convex Functions

Much of the theory of Linear Programming comes from convex optimization. Specifically, since the feasible set for an LP is a convex set, and the linear objective is likewise convex, an LP is an instance of a convex program.
In this section, we will review the basics of convexity and convex programs; however, even the basics of convex optimization are large enough to fill a book [9].

The basic definition of convexity is that any line segment joining two elements of a convex set is also contained in the set.

Definition 44. A set $\Omega \subseteq \mathbb{R}^n$ is convex if for every $a, b \in \Omega$ and $t \in [0, 1]$ we have $ta + (1 - t)b \in \Omega$.

For functions, convexity says that the set of points lying above the graph of $f$ is convex. This set is called the epigraph.

Definition 45. A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for every $a, b \in \mathbb{R}^n$ and $t \in [0, 1]$ we have $f(ta + (1 - t)b) \le t f(a) + (1 - t) f(b)$. Equivalently, $f$ is convex if and only if the epigraph of $f$ (i.e., the set $S \subseteq \mathbb{R}^{n+1}$, where $S = \{(x, z) \mid z \ge f(x)\}$) is a convex set.

One of the most useful theorems regarding convex functions (for the purposes of optimization) is that all local minima of a convex function are also global minima.

Lemma 46. If $f : \mathbb{R}^n \to \mathbb{R}$ is convex, and $x^*$ is a local minimum of $f$ (meaning $f(x^*) \le f(x)$ for all $x \in U$, where $U$ is an open neighborhood of $x^*$) then $x^*$ is a global minimum of $f$, meaning $f(x^*) \le f(x)$ for all $x \in \mathbb{R}^n$.

Therefore, algorithms which converge to local optima (such as gradient descent methods) also find a global optimum.

For proving results relating to convexity, one of the most powerful theorems is the hyperplane separation theorem. Recall that a hyperplane in $\mathbb{R}^n$ is defined by an equation $v_1 x_1 + \cdots + v_n x_n = c$, or more compactly, $v^T x = c$. This divides $\mathbb{R}^n$ in half, with those $x$ for which $v^T x \ge c$ on one side, and those with $v^T x < c$ on the other.

The hyperplane separation theorem says that for any closed convex set $\Omega$, if we have any element $x$ outside $\Omega$, then there is a hyperplane such that $\Omega$ is on one side of the hyperplane, and $x$ is on the other.

Theorem 47.
For any closed convex set $\Omega \subseteq \mathbb{R}^n$, and for any point $x$ not in $\Omega$, there is a hyperplane separating $x$ from $\Omega$. That is, there is a vector $v$ and scalar $c \in \mathbb{R}$ such that for any $y \in \Omega$ we have $v^T y > c$ but $v^T x < c$.

This theorem may not seem especially important; however, in practice it holds a similar place to the Intermediate Value Theorem in one-dimensional calculus, in that many other more useful theorems follow directly from it.

Finally, we note that convexity immediately applies to our results of the previous section:

Lemma 48. A cone $K \subseteq \mathbb{R}^n$ is convex.

Proof. The cone $K$ is closed under addition and multiplication by non-negative scalars, so let $a, b \in K$, and $t \in [0, 1]$. Then $ta$ and $(1 - t)b$ are in $K$, and hence $ta + (1 - t)b \in K$ as well.

2.7 Duality

One of the most important tools for working with linear programs is the theory of duality. Every linear program has another linear program associated with it, called the dual program. This dual program can be obtained by a purely mechanical process (which we will describe shortly), and provides a great deal of information regarding solutions to our original linear program (henceforth called the primal problem).

At a high level, for a minimization problem (as in (2.60)) the dual program uses the constraints of the original problem to give a lower bound on the optimal objective. In fact, the dual program has a variable corresponding to every constraint of the primal program, and similarly a constraint corresponding to every variable of the primal. That is, dualization exchanges constraints and variables, which can be very helpful for problems with large numbers of constraints but few variables, or vice versa.

2.7.1 Exchanging Minimization and Maximization

The simple observation at the heart of duality is that exchanging nested minimization and maximization gives a lower bound.

Lemma 49.
If $f : X \times Y \to \mathbb{R}$ is any function (not necessarily continuous) and $X, Y$ are any sets, then

\[
\min_{x \in X} \max_{y \in Y} f(x, y) \ge \max_{y \in Y} \min_{x \in X} f(x, y) \tag{2.64}
\]

Proof. We define two helper functions, $g(x) = \max_{y \in Y} f(x, y)$ and $h(y) = \min_{x \in X} f(x, y)$. Note that, by definition, $\min_{x \in X} \max_{y \in Y} f(x, y) = \min_{x \in X} g(x)$, and similarly $\max_{y \in Y} \min_{x \in X} f(x, y) = \max_{y \in Y} h(y)$.

Let $x$ be any element of $X$, and $y$ any element of $Y$. Since $g(x)$ is the maximum over all $y'$ of $f(x, y')$, we have $g(x) \ge f(x, y)$, and similarly since $h(y)$ is the minimum over all $x'$ of $f(x', y)$ we have $h(y) \le f(x, y)$. In particular, for any $x$ and $y$ we always have

\[
g(x) \ge f(x, y) \ge h(y) \tag{2.65}
\]

A general fact regarding maxima and minima is that if we have two sets $A, B \subseteq \mathbb{R}$ with $a \ge b$ for any $a \in A$ and $b \in B$, then $\min A \ge \max B$. Therefore, we have

\[
\min_{x \in X} \max_{y \in Y} f(x, y) = \min_{x \in X} g(x) \ge \max_{y \in Y} h(y) = \max_{y \in Y} \min_{x \in X} f(x, y) \tag{2.66}
\]

2.7.2 Linear Programming Duality: An Example

The notion of duality just presented seems quite trivial: it is just a rule for interchanging minimization and maximization. However, in the context of Linear Programming, it becomes quite powerful. To see this in action, let us consider the following simple LP:

\[
\begin{aligned}
\min_{x_1, x_2}\quad & 3x_1 + 4x_2 \\
\text{s.t.}\quad & 2x_1 + x_2 \ge 2 \\
& x_1, x_2 \ge 0
\end{aligned} \tag{2.67}
\]

The feasible set of this LP is illustrated in Figure 2.1. We might guess that the minimizer of (2.67) is achieved at $(1, 0)$ with value 3. However, how can we prove this? With only 2 variables and 3 constraints, it is not especially difficult to prove that this is indeed the minimizer, for example by geometric arguments. With many variables and constraints, though, this becomes a much more difficult task.

[Figure 2.1: The feasible set of (2.67), shown in grey.]

If we can find some easily obtainable lower bound to the objective of (2.67), then it may be possible to prove something about the minimum value. In particular, if we could prove that the objective of (2.67) is always at least 3, then our proposed solution $(1, 0)$ with objective value 3 has to be the global minimizer (since no other solution can have value less than 3).

The duality lemma, Lemma 49, is our method for finding this lower bound. The general recipe for constructing the dual is:

1. Take the constraints of the original LP and move them into the objective, using indicator functions (Section 1.2.2).
2. Write these indicator functions as a maximization over a new variable (called a Lagrange multiplier) times the original constraint.
3. Exchange minimization and maximization, using Lemma 49.
4. Take any terms that look like indicator functions out of the objective, and make them constraints.

Let us examine each of these steps in turn, for our example LP.

Re-writing constraints as indicator functions

Recall from Section 1.2.2 that we can convert any constrained optimization problem to an unconstrained problem by using indicator functions. If $g_j(x) \ge 0$ is a constraint, then the indicator function in a minimization problem for $g_j$ takes value $\infty$ for all infeasible solutions:

\[
I^{\min}_{g_j(x) \ge 0}(x) = \begin{cases} 0 & g_j(x) \ge 0 \\ \infty & \text{otherwise} \end{cases} \tag{2.68}
\]

For a maximization problem, the indicator function takes value $-\infty$ for infeasible solutions:

\[
I^{\max}_{g_j(x) \ge 0}(x) = \begin{cases} 0 & g_j(x) \ge 0 \\ -\infty & \text{otherwise} \end{cases} \tag{2.69}
\]

We can take any constrained optimization problem and get an equivalent problem by removing the constraint $g_j(x) \ge 0$ and adding $I_{g_j(x) \ge 0}(x)$ to the objective.

For our example problem, we use this to eliminate the constraint $2x_1 + x_2 \ge 2$.
We get a new objective which is equal to the original objective, plus the indicator function for this constraint:

\[
F(x_1, x_2) = 3x_1 + 4x_2 + I^{\min}_{2x_1 + x_2 \ge 2}(x_1, x_2) \tag{2.70}
\]

As we have already noted, replacing constraints by indicator functions gives an unchanged optimization problem, so our original minimization (2.67) is equal to

\[
\min_{x_1, x_2 \ge 0} 3x_1 + 4x_2 + I^{\min}_{2x_1 + x_2 \ge 2}(x_1, x_2) \tag{2.71}
\]

Lagrange Multipliers

For the case of linear constraints, there is a simple way to re-write the indicator function which preserves the linear structure. To do so, let us look at what happens when we multiply the original constraints by a new, auxiliary variable (called a Lagrange Multiplier).

For the constraint $2x_1 + x_2 \ge 2$, we define the residual to be the quantity $2 - (2x_1 + x_2)$. When the residual is positive, $(x_1, x_2)$ is infeasible, and the residual is the amount by which the constraint has been violated. Conversely, when the residual is non-positive, the constraint is satisfied.

Consider multiplying the residual by a non-negative variable $y \ge 0$:

\[
y(2 - 2x_1 - x_2) \tag{2.72}
\]

If the residual is positive, then we can send (2.72) to $+\infty$ by making $y$ arbitrarily large. However, if the residual is non-positive then the biggest we can make (2.72) is 0, by setting $y = 0$. In other words, we have

\[
\max_{y \ge 0} y(2 - 2x_1 - x_2) = \begin{cases} 0 & 2x_1 + x_2 \ge 2 \\ \infty & \text{otherwise} \end{cases} = I^{\min}_{2x_1 + x_2 \ge 2}(x_1, x_2) \tag{2.73}
\]

Therefore, we have found a way to re-write the indicator function as a maximization over this Lagrange Multiplier. Applied to our example problem, we have that our original primal problem is equal to

\[
\min_{x_1, x_2 \ge 0} 3x_1 + 4x_2 + \max_{y \ge 0} y(2 - 2x_1 - x_2) \tag{2.74}
\]

Rearrange and exchange minimization and maximization

The next step of the process is purely algebraic. We want to exchange maximization and minimization, using Lemma 49, so we will group together all the terms containing $x_1$ and $x_2$, and then apply our lemma.
\[
\begin{aligned}
\min_{x_1, x_2 \ge 0} 3x_1 + 4x_2 + \max_{y \ge 0} y(2 - 2x_1 - x_2)
&= \min_{x_1, x_2 \ge 0} \max_{y \ge 0} \; x_1(3 - 2y) + x_2(4 - y) + 2y \\
&\ge \max_{y \ge 0} \min_{x_1, x_2 \ge 0} \; x_1(3 - 2y) + x_2(4 - y) + 2y \\
&= \max_{y \ge 0} \Big[ 2y + \min_{x_1 \ge 0} x_1(3 - 2y) + \min_{x_2 \ge 0} x_2(4 - y) \Big]
\end{aligned} \tag{2.75}
\]

Transform indicator functions to constraints

Now we notice something interesting: each inner minimization is a product of a variable $x_i$ times a linear function of the variable $y$. This is exactly the situation of our expression for the indicator function in (2.73), except now the roles of $x_i$ and $y$ have been reversed. In fact, we have

\[
\min_{x_1 \ge 0} x_1(3 - 2y) = \begin{cases} 0 & 2y \le 3 \\ -\infty & \text{otherwise} \end{cases}
\qquad
\min_{x_2 \ge 0} x_2(4 - y) = \begin{cases} 0 & y \le 4 \\ -\infty & \text{otherwise} \end{cases} \tag{2.76}
\]

These are the indicator functions for the constraints $2y \le 3$ and $y \le 4$, for the maximization problem over $y$. We can replace these indicator functions by the corresponding constraints to get a linear program:

\[
\begin{aligned}
\max_{y \ge 0}\quad & 2y \\
\text{s.t.}\quad & 2y \le 3 \\
& y \le 4
\end{aligned} \tag{2.77}
\]

This problem is called the dual of our original problem (2.67).

Obtaining a proof of optimality

We now return to our original question: how can we prove that the solution $(1, 0)$ with value 3 is optimal for the primal problem? Note that the dual program (2.77) is a lower bound on our primal problem (2.67), because we exchanged minimization with maximization (and all other steps maintained equality).

Let us consider a potential dual solution: $y = \tfrac{3}{2}$. It is much easier to see that this is optimal for the dual problem because there is just a single variable, and we have raised it as high as possible without violating the constraint $2y \le 3$. Furthermore, the value of this solution is 3. Since the dual is a lower bound on the primal problem, we now know that any primal solution must have value at least 3. Therefore, our proposed solution of $(1, 0)$ is indeed optimal.

The process by which we arrived at this proof was somewhat complicated; however, all the steps we performed were purely mechanical.
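The bound obtained above can be checked numerically. The following Python sketch samples the feasible sets of the primal (2.67) and the dual (2.77) on a grid of quarter steps in $[0, 5]$ (the grid and the helper names are assumptions made for illustration) and confirms that every sampled primal value is at least every sampled dual value, with the two meeting at 3. A grid check is of course not a proof; the proof is the duality argument in the text.

```python
from fractions import Fraction


def primal_value(x1, x2):
    """Objective of the primal LP (2.67)."""
    return 3 * x1 + 4 * x2


def primal_feasible(x1, x2):
    return 2 * x1 + x2 >= 2 and x1 >= 0 and x2 >= 0


def dual_value(y):
    """Objective of the dual LP (2.77)."""
    return 2 * y


def dual_feasible(y):
    return y >= 0 and 2 * y <= 3 and y <= 4


# Sample both feasible sets on a grid of quarter steps in [0, 5].
grid = [Fraction(k, 4) for k in range(0, 21)]
primal_points = [(a, b) for a in grid for b in grid if primal_feasible(a, b)]
dual_points = [y for y in grid if dual_feasible(y)]

# Weak duality: every feasible primal value bounds every feasible dual value.
weak_duality_holds = all(
    primal_value(a, b) >= dual_value(y)
    for (a, b) in primal_points
    for y in dual_points
)
best_primal = min(primal_value(a, b) for (a, b) in primal_points)
best_dual = max(dual_value(y) for y in dual_points)
print(weak_duality_holds, best_primal, best_dual)
```

The sampled optima are attained at the points discussed in the text, $(1, 0)$ for the primal and $y = 3/2$ for the dual.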
Using the language of cone programming, we can give a simple recipe for obtaining the dual program of any LP.

2.7.3 Conic Duality

The example above shows how to obtain the dual LP for a single linear constraint. We'll now show how to find the dual for any Linear Program, using concepts from cone programming. The key definition to make this work is that of a dual cone.

Definition 50. For a cone $K \subseteq \mathbb{R}^n$, its dual cone $K^*$ is the set of all $y \in \mathbb{R}^n$ whose dot product with all elements of $K$ is nonnegative. So,

\[ K^* := \{\, y \in \mathbb{R}^n \mid y^T x \ge 0, \ \forall x \in K \,\} \tag{2.78} \]

Using this definition, we can get a general formula for the dual of any linear cone program.

Theorem 51. The linear cone program

\[ \min_x \; c^T x \quad \text{s.t.} \quad Ax - b \in K', \quad x \in K \tag{2.79} \]

has a dual program

\[ \max_y \; b^T y \quad \text{s.t.} \quad A^T y - c \in -K^*, \quad y \in (K')^* \tag{2.80} \]

In particular, weak duality holds, so that OPT(2.79) $\ge$ OPT(2.80).

Provided we can actually compute the dual cones $K^*$ and $(K')^*$, this gives a simple formula for computing the dual of any LP. Let's see how this works for the cones that define our basic equality and inequality constraints.

For $K_0$, $K_{\mathbb{R}}$, $K_{\ge}$ and $K_{\le}$, we can compute their duals easily. Note that for $x, y \in \mathbb{R}$, $x^T y$ is just the product $xy$.

• $K_{\mathbb{R}}^* = K_0$: according to the definition, for $y$ to be in $K_{\mathbb{R}}^*$ we would need $yx \ge 0$ for all $x \in \mathbb{R}$. Consider 3 cases for $y$: either $y > 0$, $y < 0$ or $y = 0$. If $y > 0$, set $x = -1$ (which has $x \in K_{\mathbb{R}}$). Then $yx < 0$ for this $x \in K_{\mathbb{R}}$, so $y \notin K_{\mathbb{R}}^*$. If $y < 0$, set $x = 1$, and we have $yx < 0$ for $x \in K_{\mathbb{R}}$, and so $y \notin K_{\mathbb{R}}^*$. Finally, if $y = 0$ then $yx = 0$ for all $x$, so $y \in K_{\mathbb{R}}^*$.
• $K_0^* = K_{\mathbb{R}}$, since there is only one $x \in K_0$ and it is $x = 0$. Then $y \cdot 0 = 0$ for all $y \in \mathbb{R}$, and hence $K_0^* = \mathbb{R}$.
• $K_{\ge}^* = K_{\ge}$. We have two cases. If $y \ge 0$ then $yx \ge 0$ for any $x \ge 0$, and hence $y \in K_{\ge}^*$. If $y < 0$ then set $x = 1$ and we have $y \cdot 1 < 0$, and hence $y \notin K_{\ge}^*$.
• $K_{\le}^* = K_{\le}$ by a similar argument.
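The four one-dimensional dual cones listed above can be sanity-checked by sampling. This is a rough sketch under stated assumptions: the cone names, the sample grid on [-5, 5], and the helper functions are all illustrative. Testing finitely many points is not a proof; it only confirms that the definition (2.78) agrees with the case analysis on this grid.

```python
# Sample points on [-5, 5] with step 1/4 (exactly representable as floats).
SAMPLES = [k / 4 for k in range(-20, 21)]

# Membership tests for the four basic cones in R (names are illustrative).
CONES = {
    "K_0": lambda x: x == 0,    # equality constraints
    "K_R": lambda x: True,      # free variables
    "K_>=": lambda x: x >= 0,   # nonnegativity
    "K_<=": lambda x: x <= 0,   # nonpositivity
}

def in_dual(y, member):
    """Definition (2.78), restricted to the sample grid: y*x >= 0 for all x in K."""
    return all(y * x >= 0 for x in SAMPLES if member(x))

def dual_name(member):
    """Return the name of the basic cone that matches the sampled dual cone."""
    dual_points = {y for y in SAMPLES if in_dual(y, member)}
    for name, candidate in CONES.items():
        if dual_points == {y for y in SAMPLES if candidate(y)}:
            return name
    return None

duals = {name: dual_name(member) for name, member in CONES.items()}
```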
When we have multiple constraints, the cone $K$ is a cartesian product of cones, according to (2.62), so we need to know how taking the dual relates to cartesian products. Fortunately, the dual of a product is the product of the duals.

Proposition 52. If $K_1, \ldots, K_n$ are cones, then $(K_1 \times \cdots \times K_n)^* = K_1^* \times \cdots \times K_n^*$.

Proof. Let $K = K_1 \times \cdots \times K_n$. To see that $K_1^* \times \cdots \times K_n^*$ is the dual of $K$, let $x \in K$. We'll write $x$ as the concatenation of subvectors $x_i \in K_i$ so that $x = (x_1, \ldots, x_n)$. We know that for any $i$ and $y_i \in K_i^*$ that $y_i^T x_i \ge 0$, so in particular, letting $y = (y_1, \ldots, y_n)$ we have $y^T x = \sum_i y_i^T x_i \ge 0$. Therefore, $K_1^* \times \cdots \times K_n^* \subseteq K^*$. In the other direction, if $y \in K^*$ then for each $i$, we want to show that $y_i \in K_i^*$. Indeed, for any $x_i \in K_i$ let $x$ be $(0, \ldots, x_i, \ldots, 0)$, and we have $y^T x = y_i^T x_i$, which must be $\ge 0$. This is true for any $x_i \in K_i$, so $y_i \in K_i^*$, and hence $K^* \subseteq K_1^* \times \cdots \times K_n^*$.

In the remainder of the section, we prove Theorem 51. First, we can extend our observation about Lagrange variables resulting in indicator functions to the conic programming case.

Lemma 53. If $x \in K$ then $\max_{y \in K^*} -y^T x = 0$. If $x \notin K$, then $\max_{y \in K^*} -y^T x = \infty$. That is,

\[ I^{\min}_{x \in K}(x) = \max_{y \in K^*} -y^T x \tag{2.81} \]

Proof. If $x \in K$, then by definition of $K^*$, for every $y \in K^*$ we have $y^T x \ge 0$ and hence $-y^T x \le 0$. By choosing $y = 0$ we get that the maximum of $-y^T x$ is equal to 0.

Now, let $x \notin K$. Then, since $K$ is closed and convex, by the Separating Hyperplane Theorem (Theorem 47) there exist $\tilde{y}$ and $c$ with $\tilde{y}^T x < c$ but $\tilde{y}^T z > c$ for all $z \in K$. In particular, since $0 \in K$ we have $\tilde{y}^T 0 > c$, so $c < 0$.

It turns out that this $\tilde{y}$ defining the separating hyperplane is actually an element of the dual cone, $\tilde{y} \in K^*$. To show this, assume by way of contradiction that there's some $\tilde{z} \in K$ with $\tilde{y}^T \tilde{z} = \epsilon < 0$. Since $K$ is a cone, and $c / \epsilon > 0$, we have $\frac{c}{\epsilon} \tilde{z} \in K$. However, then $\tilde{y}^T \left( \frac{c}{\epsilon} \tilde{z} \right) = \frac{c}{\epsilon} \epsilon = c$, which contradicts $\tilde{y}^T z > c$ for all $z \in K$. Therefore, there is no $z \in K$ with $\tilde{y}^T z < 0$ and hence $\tilde{y} \in K^*$.
Now, for our $x$ which is not in $K$ we have that $\tilde{y}^T x < c$, so $-\tilde{y}^T x > -c > 0$. Since $K^*$ is a cone, we have that $\lambda \tilde{y} \in K^*$ for all $\lambda \ge 0$, so $-(\lambda \tilde{y})^T x \ge -\lambda c$ and hence

\[ \max_{y \in K^*} -y^T x \;\ge\; \max_{\lambda \ge 0} -(\lambda \tilde{y})^T x \;\ge\; \max_{\lambda \ge 0} -\lambda c \;=\; \infty \tag{2.82} \]

Note that this Lemma gives some intuition for separating hyperplanes, namely that they are directions along which we can send $y^T x$ to $-\infty$ for $x \notin K$, and furthermore, separating hyperplanes of cones are elements of the dual cone.

Corollary 54.

\[ I^{\max}_{x \in K}(x) = \min_{y \in K^*} y^T x \tag{2.83} \]

Proof. This follows from $I^{\max}_{x \in K}(x) = -I^{\min}_{x \in K}(x)$.

With this Lemma, we can prove the main theorem of this section.

Proof of Theorem 51. Let's start with our primal problem

\[ \min_x \; c^T x \quad \text{s.t.} \quad Ax - b \in K', \quad x \in K \tag{2.84} \]

We then follow the recipe for our single variable example: (1) replace the constraint $Ax - b \in K'$ with an indicator function, and then use Lemma 53 to write this as a maximization over a new variable $y$; (2) exchange maximization with minimization; (3) replace terms that look like indicator functions with constraints:

\[
\begin{aligned}
\min\ (2.84) &= \min_{x \in K} \; c^T x + I^{\min}_{Ax - b \in K'}(x) && \text{(replace constraints with indicators)} \\
&= \min_{x \in K} \; c^T x + \max_{y \in (K')^*} -y^T (Ax - b) && \text{(Lemma 53)} \\
&= \min_{x \in K} \max_{y \in (K')^*} \; y^T b + c^T x - y^T A x \\
&\ge \max_{y \in (K')^*} \min_{x \in K} \; y^T b + (c - A^T y)^T x && \text{(exchange max with min)} \\
&= \max_{y \in (K')^*} \; y^T b + \min_{x \in K} x^T (c - A^T y) \\
&= \max_y \{\, b^T y \mid y \in (K')^*, \ c - A^T y \in K^* \,\} && \text{(replace indicators with constraints)}
\end{aligned}
\tag{2.85}
\]

Finally, the last line is equal to our dual problem

\[ \max_y \; b^T y \quad \text{s.t.} \quad A^T y - c \in -K^*, \quad y \in (K')^* \tag{2.86} \]

2.8 Optimality for Linear Programs

The dual program is not just useful as a lower bound to the primal program. In fact, if we have a pair of solutions $(x^*, y^*)$ which are optimal for the primal and dual problems respectively, then these solutions obey a property called complementary slackness.
The main idea behind complementary slackness is that whenever a variable $x_i^*$ in the optimal solution is nonzero, the constraint corresponding to that variable in the dual program must be satisfied with equality (we say that such a constraint is tight). Conversely, whenever a dual constraint is not tight, the variable $x_i$ must be 0, so complementary slackness is useful for proving facts about the sparseness of optimal solutions (i.e., that only certain variables $x_i$ are nonzero).

Let's start with a linear conic program, where we have expanded out $K$ and $K'$ into a product of cones (as in (2.62)), with primal program

\[ \min_x \; c^T x \quad \text{s.t.} \quad Ax - b \in \prod_j K'_j, \quad x \in \prod_i K_i \tag{2.87} \]

and dual program

\[ \max_y \; b^T y \quad \text{s.t.} \quad A^T y - c \in -\prod_i K_i^*, \quad y \in \prod_j (K'_j)^* \tag{2.88} \]

For each constraint $j$, the slack is the value of the linear constraint, $(Ax - b)_j$. Complementary slackness says that for each $i$, the product of the primal variable $x_i$ and the dual slack $(A^T y - c)_i$ must be zero. Similarly, the product of the dual variable $y_j$ and the primal slack $(Ax - b)_j$ is also zero.

Theorem 55. Given the linear conic programs above, if there exist optimal solutions $x^*$ and $y^*$, then these satisfy $(x_i^*)^T (A^T y^* - c)_i = 0$ for all $i$ and $(y_j^*)^T (A x^* - b)_j = 0$ for all $j$.

We can specialize this to linear programs, where each cone $K_i$, $K'_j$ are just subsets of $\mathbb{R}^n$.

Corollary 56. If $x^*$ and $y^*$ are respectively primal and dual optimal for a Linear Program, and $s_j$, $t_i$ are the primal and dual slack variables, given by $s_j = (A x^* - b)_j$ and $t_i = (A^T y^* - c)_i$, then $x_i^* t_i = 0$ and $y_j^* s_j = 0$ for all $i, j$.

Finally, we conclude with the Strong Duality theorem, which says that for LPs, we do not actually lose anything by exchanging maximization and minimization.

Theorem 57 (Strong Duality). For a Linear Program, if both the primal and dual have feasible solutions, then their optimal values are equal.
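Complementary slackness and strong duality can be seen concretely on the small example of Section 2.7, with primal problem (2.67) and dual (2.77). The sketch below is illustrative only: the matrix encoding of that example and all variable names are assumptions. It pairs each primal variable with its dual slack $(A^T y - c)_i$ and each dual variable with its primal slack $(Ax - b)_j$, as in Theorem 55.

```python
from fractions import Fraction as F

# Example LP: min c^T x  s.t.  A x - b >= 0,  x >= 0
c = [F(3), F(4)]
A = [[F(2), F(1)]]      # one constraint: 2*x1 + x2 >= 2
b = [F(2)]

x_opt = [F(1), F(0)]    # primal optimum
y_opt = [F(3, 2)]       # dual optimum

n_vars, n_cons = len(c), len(b)

# Primal slack (A x - b)_j, one entry per constraint.
primal_slack = [sum(A[j][i] * x_opt[i] for i in range(n_vars)) - b[j]
                for j in range(n_cons)]
# Dual slack (A^T y - c)_i, one entry per primal variable.
dual_slack = [sum(A[j][i] * y_opt[j] for j in range(n_cons)) - c[i]
              for i in range(n_vars)]

# Complementary slackness products.
x_times_dual_slack = [x_opt[i] * dual_slack[i] for i in range(n_vars)]
y_times_primal_slack = [y_opt[j] * primal_slack[j] for j in range(n_cons)]

primal_objective = sum(c[i] * x_opt[i] for i in range(n_vars))
dual_objective = sum(b[j] * y_opt[j] for j in range(n_cons))
```

Here the dual constraint for $x_2$ is not tight, and correspondingly $x_2 = 0$ in the optimal solution.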
2.9 Duality for the Local Marginal Polytope

Recall the Local Marginal Polytope of Section 2.4

\[
\begin{aligned}
\min_{\mu} \quad & \sum_i \sum_{x_i} \mu_i(x_i) f_i(x_i) + \sum_C \sum_{x_C} \mu_C(x_C) f_C(x_C) \\
\text{s.t.} \quad & \sum_{x_{C \setminus i}} \mu_C(x_C) = \mu_i(x_i) && \forall C, \ i \in C, \ x_i \in X_i \\
& \sum_{x_i} \mu_i(x_i) = 1 && \forall i \\
& \mu \ge 0
\end{aligned}
\tag{2.89}
\]

Since we are interested in this linear program (as a relaxation of our MRF optimization problem), we should examine its dual, to see if there is any structure to the dual LP that can be exploited. As a bonus, we will also use this as an example of how to take the dual of an LP in practice.

The language of cones and dual cones is convenient for stating the main theorems of Linear Programming, but isn't necessarily the most convenient for doing calculations. However, the preceding sections have given us a recipe for computing the dual: (1) multiply the constraints by new Lagrange variables, and bring them into the objective as indicator functions, (2) exchange minimization with maximization, (3) group terms containing the original variables, and replace terms that look like indicator functions with constraints.

For the Local Marginal Polytope, we have two types of constraints, and correspondingly, two types of dual variables. The first set of constraints are for each clique $C$, each $i \in C$ and each label $x_i \in X_i$, so we'll denote the corresponding dual variable as $\lambda_{C,i}(x_i)$. (We could have denoted this variable as $\lambda_{C,i,x_i}$, but as we'll see shortly, these variables have a natural interpretation as functions of $x_i$.) The other constraints are for each $i$, with corresponding dual variable $\kappa_i$.

The indicator functions for these constraints are the max of the Lagrange variables times the residual, so we have $\lambda_{C,i}(x_i) \cdot \left( \mu_i(x_i) - \sum_{x_{C \setminus i}} \mu_C(x_C) \right)$ for the first set of constraints and $\kappa_i \left( 1 - \sum_{x_i} \mu_i(x_i) \right)$ for the second set of constraints. Since these are equality constraints, these maximizations are over the dual cone $(K_0)^* = \mathbb{R}$.

Therefore, we get that (2.89) is equal to

\[
\begin{aligned}
\min_{\mu \ge 0} \max_{\lambda, \kappa} \quad & \sum_i \sum_{x_i} \mu_i(x_i) f_i(x_i) + \sum_C \sum_{x_C} \mu_C(x_C) f_C(x_C) && (2.90) \\
& + \sum_{C, i \in C} \sum_{x_i} \lambda_{C,i}(x_i) \Big( \mu_i(x_i) - \sum_{x_{C \setminus i}} \mu_C(x_C) \Big) + \sum_i \kappa_i \Big( 1 - \sum_{x_i} \mu_i(x_i) \Big) && (2.91) \\
= \min_{\mu \ge 0} \max_{\lambda, \kappa} \quad & \sum_i \kappa_i + \sum_i \sum_{x_i} \mu_i(x_i) \Big[ f_i(x_i) - \kappa_i + \sum_{C : i \in C} \lambda_{C,i}(x_i) \Big] && (2.92) \\
& + \sum_C \sum_{x_C} \mu_C(x_C) \Big[ f_C(x_C) - \sum_{i \in C} \lambda_{C,i}(x_i) \Big] && (2.93)
\end{aligned}
\]

and hence

\[
\begin{aligned}
\ge \max_{\lambda, \kappa} \quad & \sum_i \kappa_i + \sum_C \sum_{x_C} \min_{\mu_C(x_C) \ge 0} \mu_C(x_C) \Big[ f_C(x_C) - \sum_{i \in C} \lambda_{C,i}(x_i) \Big] && (2.94) \\
& + \sum_i \sum_{x_i} \min_{\mu_i(x_i) \ge 0} \mu_i(x_i) \Big[ f_i(x_i) - \kappa_i + \sum_{C : i \in C} \lambda_{C,i}(x_i) \Big] && (2.95)
\end{aligned}
\]

We have two expressions of the form $\min_{a \ge 0} a \cdot b$, which is $-\infty$ when $b < 0$ and 0 otherwise, so this is the same as the indicator for the constraint that $b \ge 0$. Therefore, we can replace these expressions with the corresponding constraints to get the linear program

\[
\begin{aligned}
\max_{\kappa, \lambda} \quad & \sum_i \kappa_i \\
\text{s.t.} \quad & \sum_{i \in C} \lambda_{C,i}(x_i) \le f_C(x_C) && \forall C, \ x_C \\
& \kappa_i \le f_i(x_i) + \sum_{C : i \in C} \lambda_{C,i}(x_i) && \forall i, \ x_i
\end{aligned}
\tag{2.96}
\]

Note that we are maximizing over $\kappa$, and the only constraints involving $\kappa$ are all of the form $\kappa_i \le h_i(x_i)$ for some functions $h_i$. Specifically, let $h_i(x_i) = f_i(x_i) + \sum_{C : i \in C} \lambda_{C,i}(x_i)$, which we call the height of label $x_i$ at variable $i$ (following the language of [57]). Since we're maximizing over $\kappa$, we will always have $\kappa_i = \min_{x_i} h_i(x_i)$.

We can informally think of the dual variable $\lambda_{C,i}(x_i)$ as taking part of the cost $f_C(x_C)$, and redistributing it to the unary terms. The height functions $h_i(x_i)$ can be thought of as the original cost $f_i(x_i)$, plus any redistribution $\lambda_{C,i}$ from the cliques to the unary terms at $i$. The dual is always a lower bound on the value $f(x)$ of any labeling.

2.10 First-order Binary MRFs and Minimum Cut

We will conclude the chapter by returning to binary first-order problems, and in particular how we can use max-flow/min-cut to solve them.
First, we will give a general solution for solving any submodular first-order MRF with min-cut. Furthermore, we will also show how the Local Marginal Polytope for these problems is itself equivalent to a cut polytope; that is, max-flow/min-cut is not just a convenient algorithm for this purpose, but actually is the same problem from a Linear Programming perspective.

2.10.1 Solving First-order Submodular MRFs with Graph Cuts

We will start with a binary first-order submodular MRF, that is, a minimization problem of the form

\[ \min_x \; \sum_i f_i(x_i) + \sum_{i,j} f_{i,j}(x_i, x_j) \tag{2.97} \]

where the variables $x_i$ are in $\{0, 1\}$ and each $f_{i,j}$ is submodular, meaning it satisfies

\[ f_{i,j}(0, 0) + f_{i,j}(1, 1) \le f_{i,j}(1, 0) + f_{i,j}(0, 1). \tag{2.98} \]

In computer vision, Graph Cuts refers to the various different algorithms which solve MRF minimization problems using the minimum-cut/maximum-flow problem. For the simplest case of first-order submodular MRFs, we can reduce (2.97) directly to a minimum cut problem.

To do so, it is easiest to first reparameterize the problem. We have already seen in Section 2.1 that there are several ways to change the values of the $f_i$ and $f_{i,j}$ while keeping the total energy function the same for any labeling $x$. In this case, we want to find a reparameterization of (2.97) so that $f_{i,j}(x_i, x_j) \ge 0$ for all $x_i, x_j \in \{0, 1\}$, and so that $f_{i,j}(0, 0) = f_{i,j}(0, 1) = f_{i,j}(1, 1) = 0$. That is, each $f_{i,j}$ is nonzero only at the single labeling $(1, 0)$. (Note that we've broken the symmetry between $i$ and $j$ with these requirements, so we'll assume that each unordered edge $\{i, j\}$ has $i < j$.)

We use a series of reparameterizations to make this happen. For each $f_{i,j}$ we first subtract $f_{i,j}(0, 0)$ from all values $f_{i,j}(x_i, x_j)$, and add the corresponding amount to the constant term of $f$. Next, we use the pencil reparameterization (2.6) of Section 2.1 to subtract $\delta = f_{i,j}(1, 1) - f_{i,j}(0, 1)$ from $f_{i,j}(1, x_j)$ for $x_j \in \{0, 1\}$ and add $\delta$ to $f_i(1)$.
Similarly, we subtract $\delta = f_{i,j}(0, 1) - f_{i,j}(0, 0)$ from $f_{i,j}(x_i, 1)$ for $x_i \in \{0, 1\}$ and add $\delta$ to $f_j(1)$. Overall, we have a new energy function $f'_{i,j}$ given by:

\[
\begin{aligned}
f'_{i,j}(0, 0) &= f_{i,j}(0, 0) - f_{i,j}(0, 0) = 0 \\
f'_{i,j}(0, 1) &= f_{i,j}(0, 1) - f_{i,j}(0, 0) - \big( f_{i,j}(0, 1) - f_{i,j}(0, 0) \big) = 0 \\
f'_{i,j}(1, 0) &= f_{i,j}(1, 0) - f_{i,j}(0, 0) - \big( f_{i,j}(1, 1) - f_{i,j}(0, 1) \big) \\
f'_{i,j}(1, 1) &= f_{i,j}(1, 1) - f_{i,j}(0, 0) - \big( f_{i,j}(1, 1) - f_{i,j}(0, 1) \big) - \big( f_{i,j}(0, 1) - f_{i,j}(0, 0) \big) = 0
\end{aligned}
\tag{2.99}
\]

Note that, because $f_{i,j}$ is submodular, we have that $f'_{i,j}(1, 0) \ge 0$, as the expression above for $f'_{i,j}(1, 0)$ is just a rearrangement of the inequality defining submodularity (2.98). For the unary terms, we sum up the added $\delta$ for each edge $i, j$ to get fi (1) = fi(1) + fi, j(1, 1) − fi, j(0, 1) + fi, j(1, 0) + fi, j(0, 0) j:i< j j: j

\[
\begin{aligned}
f_{i,j}(a, b) &> 0 && \forall a \ne b \\
f_{i,j}(a, b) &= f_{i,j}(b, a) \\
f_{i,j}(a, c) &\le f_{i,j}(a, b) + f_{i,j}(b, c) && \forall a, b, c \in X
\end{aligned}
\tag{3.2}
\]

then the binary subproblem $g$ will always be submodular, and hence we can exactly find the optimal $S^*$ using min-cut.

Finally, α-expansion can have provable approximation bounds (in the sense of Definition 28, Section 2.2.5): when the $f_{i,j}$ are all Potts terms, with $f_{i,j}(x_i, x_j) = \lambda_{i,j}$ whenever $x_i \ne x_j$, and 0 otherwise, then α-expansion is always a 2-approximation [12]. When the $f_{i,j}$ are all the same, and form a metric, then the work of [47] showed that the approximation ratio is $2 f^{\max} / f^{\min}$, where $f^{\max}$ is the maximum value of $f_{i,j}$ and $f^{\min}$ is the minimum nonzero value of $f_{i,j}$.

For Fusion Moves [66], the binary subproblems may be non-submodular, in which case only approximate solutions to $g$ can be found. However, proposals can be specifically chosen to better explore the label space $X^V$. The original paper [66] proposed a number of variants, including Jump Moves, which search through label spaces for problems like Optical Flow, by allowing the current label to move in horizontal or vertical translations from the current solution $x$.
The work of [38] proposed Gradient Descent Fusion Moves, for energy functions which are differentiable (in particular, for Fields of Experts): the proposed move $y$ is along the direction of the energy gradient $\nabla f(x)$, which quickly moves $x$ towards lower energy solutions.

Finally, for special cases of energy functions, there are globally optimal solutions. In the case where the label set can be given an order $X_i = \{a_1 < a_2 < \cdots < a_\ell\}$, and the pairwise functions $f_{i,j}$ are all convex according to this order, there is a graph construction due to Ishikawa [39] which finds the globally optimal labeling in a single min-cut solve. This condition has been generalized to a multi-label submodularity condition by [84], which similarly has a globally optimal solution.

3.4 Dual Algorithms

The class of dual algorithms all make use of the Local Marginal Polytope, and in particular, of the dual program described in Section 2.9. The Local Marginal Polytope was originally introduced by [86], and extended to the higher-order case by [99]. Recall that the dual program is a maximization over dual variables $\lambda_{C,i}(x_i)$, and that these dual variables can be interpreted as a reparameterization, in which some of the energy of the clique functions is partitioned among the unary terms.

Another way of interpreting the dual variables is as messages between variables. In particular, in the pairwise case, the dual variables $\lambda_{C,i}$ for an edge $C = \{i, j\}$ become a message from variable $j$ to $i$, where $\lambda_{i,j}(x_i)$ communicates some function of node $j$'s belief that the true state for $i$ is $x_i$. For MRFs whose edges form a tree, this intuition can be made exact, with the Max-Product Belief Propagation algorithm [74]. For tree-structured MRFs, Max-Product BP can find the exact global optimum in linear time.
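To make the tree case concrete, here is a minimal sketch of message passing on a chain, the simplest tree. It works with energies, so it is written in min-sum form (the negative-log counterpart of max-product), and it is effectively the dynamic program that exact tree inference reduces to. The function names, data layout, and the small example are assumptions for illustration, not the formulation of [74].

```python
from itertools import product

def chain_min_sum(unary, pairwise):
    """Exact minimization on a chain MRF by one forward pass plus backtracking.

    unary[i][a]       : cost of label a at node i
    pairwise[i][a][b] : cost of labels (a, b) on the edge (i, i+1)
    """
    cost = list(unary[0])   # cost[a]: best energy of nodes 0..i given x_i = a
    backpointers = []
    for i in range(1, len(unary)):
        new_cost, argmins = [], []
        for b in range(len(unary[i])):
            # Message into node i for label b: minimize over the previous label.
            candidates = [cost[a] + pairwise[i - 1][a][b] for a in range(len(cost))]
            best = min(range(len(candidates)), key=candidates.__getitem__)
            new_cost.append(candidates[best] + unary[i][b])
            argmins.append(best)
        cost = new_cost
        backpointers.append(argmins)
    last = min(range(len(cost)), key=cost.__getitem__)
    labels = [last]
    for argmins in reversed(backpointers):
        labels.append(argmins[labels[-1]])
    labels.reverse()
    return cost[last], labels

def energy(unary, pairwise, x):
    total = sum(unary[i][x[i]] for i in range(len(x)))
    return total + sum(pairwise[i][x[i]][x[i + 1]] for i in range(len(x) - 1))

def brute_force(unary, pairwise):
    label_ranges = [range(len(u)) for u in unary]
    return min(energy(unary, pairwise, x) for x in product(*label_ranges))

# Hypothetical 3-node binary chain with Potts edges of weight 1.
unary = [[0, 2], [1, 0], [3, 0]]
potts = [[0, 1], [1, 0]]
pairwise = [potts, potts]
best_energy, best_labels = chain_min_sum(unary, pairwise)
```

The forward pass touches each edge once and does work proportional to the square of the label count per edge, which is linear in the number of variables.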
Early message passing algorithms were developed without knowledge of the Local Marginal Polytope; however, global optimality can be proved by noting that Max-Product BP on a tree is actually a form of Dynamic Programming.

For MRFs which are not trees (for example, image grids) Max-Product BP can be applied to each variable in an iterative algorithm called Loopy BP [25]. When the MRF graph has loops, LBP may fail to converge, and has been observed to enter cycles, without even reaching a local optimum. However, it has achieved good performance in many practical problems, and was a widely-used alternative to graph-cuts methods.

The first provably convergent message passing algorithm is the Tree-Reweighted Sequential message passing algorithm (TRW-S) of [50]. The analysis of TRW-S interprets the messages as reparameterizations of the energy function $f$, which provide a global lower bound. This lower bound is maximized, using max-product steps on subtrees of the graph. This lower bound is related to the Local Marginal Polytope, but using a different dual program (obtained by organizing the energy into higher-order terms on the subtrees before taking the dual).

Dual Decomposition [56] splits the objective into a set of overlapping terms, and uses subgradient ascent on the dual to enforce consistency among the labelings of the separate parts of the objective. In general, Dual Decomposition works for any splitting of the objective, so long as the subproblems can be exactly optimized efficiently. However, the subgradient ascent may take many iterations to converge.

Other message passing algorithms with provable convergence explicitly use the Local Marginal Polytope dual, including Max-Sum Diffusion [100] and MPLP [29]. The latter two algorithms can be interpreted as block-coordinate ascent on the dual program. Because the dual is not smooth (it is piecewise linear), block-coordinate ascent will converge to a solution, but it may not be dual-optimal.
Methods to smooth the dual objective can converge to the optimal dual solution: these include Accelerated Dual Decomposition [42] and Adaptive Diminishing Smoothing [81].

All of the above methods are for first-order MRFs; however, the same basic ideas translate to higher-order MRFs as well. A dual decomposition approach based on higher-order pattern-based priors, using Dynamic Programming as the optimizer for the subproblems, was proposed in [55]. Max-sum diffusion was applied to the higher-order case in [101], and TRW-S was similarly generalized to higher-order MRFs in [54]. These latter methods are based on the corresponding algorithms for first-order, using the higher-order Local Marginal Polytope proposed in [99].

3.5 Primal-Dual Algorithms

Finally, Primal-Dual methods use both the primal and dual programs simultaneously. The connection between graph cuts and primal-dual techniques was established by [57], which showed that α-expansion could be interpreted as simultaneously optimizing primal and dual solutions.

The general recipe of primal-dual algorithms is that they iteratively update a primal integer solution (i.e., a labeling $x$) similarly to α-expansion. However, they also maintain a dual solution $\lambda$, which guides the search during the binary min-cut solve. Furthermore, the primal and dual solutions are simultaneously updated, so as to satisfy invariants related to complementary slackness, which results in the final solution having provable approximation bounds. These technical details will be expanded on in Chapter 7.

The primal-dual algorithm of [57] overcomes the most important limitation of the α-expansion algorithm, which is the requirement that the pairwise energy must be a metric [12]. These methods also extend the approximation bounds for α-expansion with metric energies from [47]. The same approximation ratio still holds, but over a much broader class of energy functions.
Empirically, keeping track of the dual variables also allows a number of implementation speedups compared to α-expansion, resulting in the very efficient algorithm FastPD [59], which can be 3-9 times faster than α-expansion in practice.

For higher-order MRFs, the first primal-dual algorithm is the Sum-of-Submodular Primal Dual algorithm, SoSPD, which is covered in Chapter 7.

CHAPTER 4
HIGHER ORDER REDUCTIONS

We have already seen that first-order MRFs have well-understood and effective optimization algorithms, compared to higher-order MRFs. One way of bridging this gap is to find a way to transform higher-order MRFs into an equivalent first-order one. As described in Section 3.2.3, reduction methods can transform a binary MRF $f(x)$ to a quadratic function $g(x, y)$, such that $f(x) = \min_y g(x, y)$. Then, we can apply existing algorithms to the reduced first-order form $g$ to get a solution to the higher-order original problem.

In this chapter, we will focus on binary MRFs, since the reduction methods discussed herein all work on binary problems only. Multi-label problems can be handled by repeatedly solving binary subproblems, as in (for example) α-expansion or fusion moves.

The main result of this chapter is a reduction method which exploits the hypergraph structure of the cliques $\mathcal{C}$ to transform a group of terms at once. For $n$ binary variables, each of which appears in terms with $k$ other variables, at worst we produce $n$ non-submodular terms, while [37, 40] produces $O(nk)$. We identify a property (called local completeness) under which our method performs even better, and show that under certain assumptions several important vision problems (including common variants of fusion moves) have this property. We show experimentally that our method produces smaller weight of non-submodular edges, and that this metric is directly related to the effectiveness of QPBO [51].
Running on the same field of experts dataset used in [37, 40] we optimally label significantly more variables (96% versus 80%) and converge more rapidly to a lower energy. Preliminary experiments suggest that some other higher-order MRFs used in stereo [102] and segmentation [1] are also locally complete and would thus benefit from our work.

4.1 Introduction

While graph-cuts are a popular method for solving first-order MRFs, such as the benchmarks described in [90] and [45], they are much more difficult to apply to higher-order MRFs. As a result, until recently this powerful optimization method has only been used for a few specialized higher-order MRFs, such as [48, 102].

The first general-purpose practical graph-cuts method for higher-order MRFs is that of Ishikawa [37, 40]. This method works by transforming the higher-order input MRF into an equivalent quadratic (pairwise) MRF by adding additional auxiliary variables and edges. The general class of such methods are known as higher-order reductions; this particular reduction is commonly referred to as Higher-Order Clique Reduction (HOCR). Since the resulting first-order MRF is non-submodular, it is optimized using QPBO, which produces a partial labeling (see Section 3.2.2). The quality of this partial labeling (i.e., the number of labeled pixels) is highly sensitive to the energy function.

A more theoretically-motivated approach to finding higher-order reductions is Generalized Roof Duality (GRD) [43, 44], which proposed a class of submodular relaxations for an arbitrary higher-order MRF with degree at most 4. GRD finds the best such relaxation by solving a linear program, and optimizes the relaxed function exactly using graph cuts. The relaxations found by GRD provide very good approximate solutions to the original higher-order problem. Beyond the restriction on the MRF's degree, which appears difficult to overcome, GRD is also computationally much more intensive than HOCR.
In this chapter we propose an alternative construction to HOCR and GRD, with improved theoretical and experimental performance. Instead of considering terms in the energy function one at a time, we make use of the fact that the clique structure of an MRF is a hypergraph, in order to reduce many terms at once.

We will review existing reduction methods for solving higher-order MRFs with graph cuts in Section 4.2, present our new algorithm in Section 4.3, and analyze its worst case performance in Section 4.4. In Section 4.5 we show that for problems with a property called local completeness our method performs even better. Under certain assumptions we prove that some important vision problems are locally complete, including the fields of experts MRF considered by Ishikawa. Experimental results are given in Section 4.7, along with experimental evidence that other vision problems [1, 102] are also locally complete.

4.2 Related work

There are a number of methods for reducing an arbitrary multilinear polynomial over binary variables into a quadratic one. The performance of the different methods is summarized in Figure 4.1.

For all methods, we are interested in the size of the obtained quadratic function, including the number of additional vertices and edges required, as these directly affect the size of the min-cut problem which will be solved by QPBO. We make a particular note of the number of non-submodular edges as well as the weight of these edges, as these can negatively impact the solution returned by QPBO [90], as confirmed in our experiments. (Note that the total weight of non-submodular edges is not a perfect measure of the performance of QPBO. For example, many functions can have non-submodular edges, but after permuting some labels become submodular [84], and these functions can be exactly minimized by QPBO. Nevertheless, our experiments show this is a useful heuristic.)

                        Substitution  Negative  HOCR      GRD [43]  Ours          Ours (local
                        [76]          [24]      [37]      (d ≤ 4)   (worst case)  completeness)
New variables           O(nk)         t         O(td)     n + O(t)  n + O(td)     n + O(t)
Non-submodular edges    O(nk)         –         O(nk)     –         n             n
Submodular edges        O(nk)         td        O(td^2)   O(td)     O(td^2)       O(td)
Non-submodular weight   O(nkM)        –         O(d^2 W)  –         O(dW)         O(dW)

Figure 4.1: Resources required to reduce t terms of degrees up to d, for an energy function with n variables each of which occurs with up to k other variables. W is the total weight of all positive terms in the higher-order function. Unlike the other algorithms listed, GRD is only defined for terms of limited degree. There is no clear notion of non-submodular edges in the relaxation produced by GRD, so we mark these entries "–". Non-submodular weight is the total weight of non-submodular edges in the reduced function.

4.2.1 Reduction by substitution

The original reduction method was introduced by Rosenberg [76]. The reduction operates on the multilinear polynomial representation of $f$. The algorithm iteratively eliminates all occurrences of some product $x_i x_j$ by introducing a new variable $z$, replacing $x_i x_j$ by $z$ everywhere it occurs, and then adding the following penalty terms to the energy function: $M x_i x_j - 2M x_i z - 2M x_j z + 3M z$, where $M$ is a suitably large constant. This forces $z$ to take the value of $x_i x_j$ in any optimal solution.

If each variable is in terms with at most $k$ other variables, this reduction can be done with $O(nk)$ pairs, which results in $O(nk)$ new variables, $O(nk)$ non-submodular terms and $O(nk)$ submodular quadratic terms.

Note that the non-submodular terms have large coefficients. Experimentally it has been reported that QPBO performs very poorly on such energy functions (see, for example, [40, §8.3.4], which states that QPBO finds almost no persistencies).
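The behavior of the substitution penalty is easy to confirm by enumeration. The sketch below is illustrative: the single cubic term, its coefficient `a`, and the function names are assumptions, not part of [76]. It checks that the penalty vanishes exactly when $z = x_i x_j$, and that substituting $z$ into one cubic term preserves the minimum only when $M$ is large enough.

```python
from itertools import product

def rosenberg_penalty(xi, xj, z, M):
    """Penalty M*xi*xj - 2M*xi*z - 2M*xj*z + 3M*z from the substitution method."""
    return M * xi * xj - 2 * M * xi * z - 2 * M * xj * z + 3 * M * z

def cubic(x1, x2, x3, a):
    """Hypothetical single higher-order term a * x1 * x2 * x3."""
    return a * x1 * x2 * x3

def reduced(x1, x2, x3, z, a, M):
    """Quadratic form after replacing the product x1*x2 by the new variable z."""
    return a * z * x3 + rosenberg_penalty(x1, x2, z, M)

def reduction_is_exact(a, M):
    """True if min over z of the quadratic form equals the cubic term everywhere."""
    return all(
        cubic(x1, x2, x3, a) == min(reduced(x1, x2, x3, z, a, M) for z in (0, 1))
        for x1, x2, x3 in product((0, 1), repeat=3)
    )

def penalty_zero_iff_consistent(M):
    """True if the penalty is 0 exactly when z == xi*xj, and at least M otherwise."""
    return all(
        (rosenberg_penalty(xi, xj, z, M) == 0) if z == xi * xj
        else (rosenberg_penalty(xi, xj, z, M) >= M)
        for xi, xj, z in product((0, 1), repeat=3)
    )
```

For instance, `reduction_is_exact(-5, 10)` holds while `reduction_is_exact(-5, 1)` does not, which illustrates why $M$ must dominate the coefficients of the terms being rewritten.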
4.2.2 Reducing negative-coefficient terms

Kolmogorov and Zabih [52] for $d = 3$ and Freedman and Drineas [24] for $d \ge 3$ suggested the following transformation for negative higher degree terms:

\[ -x_1 \cdots x_d = \min_{y \in \mathbb{B}} \; y \Big( (d - 1) - \sum_{j=1}^{d} x_j \Big) \tag{4.1} \]

If we have $t$ negative-coefficient terms of degree $d$, this gives $t$ new variables and $td$ submodular quadratic terms, but no non-submodular terms.

Let us note that the above equality remains valid even if we replace some of the $x_j$ variables with their complements $\bar{x}_j = (1 - x_j)$. In [78] this was used to obtain a transformation for sparse functions, i.e., those for which only a few labelings $x_C$ have $f_C(x_C) \ne 0$ (see type-II transformations in [78]).

4.2.3 Reducing positive-coefficient terms

The HOCR transformation [37, 40] was the first practical method for general higher-order functions. For a term of degree $d$, let $n_d = \lfloor \frac{d-1}{2} \rfloor$ and set $c_{i,d} = 1$ if $i = n_d$ and $d$ is odd, and $c_{i,d} = 2$ otherwise. Each positive term is reduced by

\[ x_1 \cdots x_d = \min_{u_1, \ldots, u_{n_d}} \; \sum_{i=1}^{n_d} u_i \Big( c_{i,d} \Big( -\sum_{j=1}^{d} x_j + 2i \Big) - 1 \Big) + \sum_{i < j} x_i x_j \]

For each term of degree $d$, we get $O(d)$ new variables. Each new variable is connected to each original variable by a submodular edge, for a total of $O(d^2)$ submodular edges. We also get non-submodular edges between all pairs of original variables $x_i, x_j$ whenever $x_i$ and $x_j$ are in the same clique. If each variable occurs in terms with at most $k$ other variables, then the number of non-submodular edges is $O(nk)$ (note that if the pair $x_i, x_j$ occurs in multiple cliques, we only count the pair once, as they can be combined to a single edge in the flow network). Finally, if the positive term has positive weight $\alpha > 0$ then this term creates $\frac{d(d-1)}{2}$ non-submodular edges of weight $\alpha$. So, if the total weight of positive terms is $W$, then the quadratic function has non-submodular edges of total weight $O(d^2 W)$.

Note that this reduction uses a large number of non-submodular edges: the $d$ original variables are fully connected by positive weight edges.
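Both term-by-term identities of Sections 4.2.2 and 4.2.3 can be checked by brute force over all binary assignments. The following sketch rests on one stated assumption about the HOCR coefficients, namely that $c_{i,d} = 1$ exactly when $d$ is odd and $i = n_d = \lfloor (d-1)/2 \rfloor$, and 2 otherwise; the function names are illustrative. Since each auxiliary variable appears in its own summand, the minimum over the auxiliary variables separates and can be taken one variable at a time.

```python
from itertools import product
from math import prod

def negative_term_reduction(x):
    """Right-hand side of (4.1): min over y in {0,1} of y * ((d - 1) - sum_j x_j)."""
    d = len(x)
    return min(y * ((d - 1) - sum(x)) for y in (0, 1))

def hocr_reduction(x):
    """HOCR quadratic form for the positive monomial x_1 * ... * x_d, minimized over u."""
    d = len(x)
    n_d = (d - 1) // 2
    s1 = sum(x)
    s2 = sum(x[i] * x[j] for i in range(d) for j in range(i + 1, d))
    total = s2
    for i in range(1, n_d + 1):
        # Assumed convention: c = 1 only for the last auxiliary variable when d is odd.
        c = 1 if (d % 2 == 1 and i == n_d) else 2
        coefficient = c * (-s1 + 2 * i) - 1
        total += min(0, coefficient)   # u_i = 1 only if it lowers the energy
    return total

def identities_hold(d):
    """Check both identities on every assignment of d binary variables."""
    return all(
        negative_term_reduction(x) == -prod(x) and hocr_reduction(x) == prod(x)
        for x in product((0, 1), repeat=d)
    )
```

The quadratic term `s2` in `hocr_reduction` is the sum over all pairs of original variables, which is where the fully connected non-submodular edges come from.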
The large number of non-submodular edges is problematic, as it has been observed [90] that non-submodular edges can result in poor performance for graph cut optimizers like QPBO.

4.2.4 Generalized Roof Duality

Unlike the above methods, which are all rewrite rules for the individual terms of the multilinear polynomial, Generalized Roof Duality (GRD) [44] finds a reduction to quadratic form which is globally the best among a large class of candidates, called submodular relaxations.

GRD uses a characterization of all submodular functions expressible in quadratic form due to Živný et al. [104]. This reduction uses one additional variable for each term, as well as $O(d)$ additional edges for a degree $d$ term. The reduction also ensures the existence of persistencies, similar to the persistencies of Roof Duality [33] (and its implementation, QPBO [51]) whereby after solving the submodular relaxation, each variable is assigned a value in $\{0, 1, ?\}$, and every variable taking value 0 or 1 in the partial labeling actually takes that value in the (possibly unknown) global optimum.

To find the best submodular relaxation, GRD solves a linear program. Because GRD finds the tightest submodular relaxation, the returned labeling is typically of very high quality; however, solving the LP is computationally very expensive, making this algorithm impractical for large-sized problems. Instead of directly solving the LP, the authors also give heuristics to find nearly optimal submodular relaxations. These heuristic relaxations (denoted GRD-heur) also give very good labelings. However, we will show in our experiments that even these heuristic methods are several times slower than HOCR and our technique.

Finally, it is worth noting that GRD can only be applied when all terms have degree 3 or 4, and it is doubtful that the method can be generalized. The reduction [104] used by GRD to convert the submodular relaxation to quadratic form has only been described for functions of arity 4.
Furthermore, writing down an LP for the optimal submodular relaxation requires being able to compactly describe the set of submodular functions with terms of degree $d$, and this task is NP-hard for $d \ge 4$. In contrast, our method and Ishikawa's have no restriction on the degree of terms involved.

4.3 Reducing groups of higher-order terms

The terms of a multilinear polynomial form a hypergraph $\mathcal{H}$. The vertices are the polynomial's variables, and there is a hyperedge $H = \{x_1, \ldots, x_d\}$ with weight $\alpha_H$ whenever the polynomial has a term $\alpha_H x_1 \cdots x_d$. In contrast to earlier methods which reduce term-by-term, our new method uses this hypergraph structure to reduce a group of terms all at once.

The two theorems below are concerned with reducing, respectively, all the positive or all the negative terms containing a single variable (or small set of variables); we will write this common subset of variables as $U$. The most important special case of our reduction is shown in figure 4.2, where we consider all positive terms which contain the variable $x_1$, i.e., $U = \{x_1\}$.

Theorem 58. Let $\mathcal{H}$ be a set of terms such that each $H \in \mathcal{H}$ contains $U$, i.e., $U \subseteq H$. Furthermore, we require that all the hyperedges $H$ have positive weights $\alpha_H > 0$. Let $f(x) = \sum_{H \in \mathcal{H}} \alpha_H \prod_{j \in H} x_j$ be this polynomial. Then $f(x)$ is equal to

$$ \min_{y \in \{0,1\}} \; \Big( \sum_{H \in \mathcal{H}} \alpha_H \Big) \, y \prod_{j \in U} x_j + \sum_{H \in \mathcal{H}} \alpha_H \, \bar{y} \prod_{j \in H \setminus U} x_j. \tag{4.2} $$

Proof. Given any assignment of the variables $x_1, \ldots, x_n$, either (1) all the variables in $U$ are 1, or (2) some variable in $U$ is 0.

Case 1: Substituting 1 for the variables in $U$, $f(x)$ is equal to $\sum_{H \in \mathcal{H}} \alpha_H \prod_{j \in H \setminus U} x_j$ and (4.2) is $\min_y \big( \sum_{H \in \mathcal{H}} \alpha_H \big) y + \sum_{H \in \mathcal{H}} \alpha_H \bar{y} \prod_{j \in H \setminus U} x_j$. If we assign $y = 1$, then (4.2) becomes $\sum_{H \in \mathcal{H}} \alpha_H$, and if we assign $y = 0$, then it becomes $\sum_{H \in \mathcal{H}} \alpha_H \prod_{j \in H \setminus U} x_j$. This quantity is always less than or equal to $\sum_{H \in \mathcal{H}} \alpha_H$, so the minimum is achieved when $y = 0$, in which case $f(x)$ equals (4.2).

Case 2: The product $\prod_{j \in U} x_j$ is 0. Since all the terms of $f(x)$ share the common subset $U$, $f(x) = 0$.
Similarly, (4.2) is $\min_y \sum_{H \in \mathcal{H}} \alpha_H \bar{y} \prod_{j \in H \setminus U} x_j$. If we assign $y = 1$, then this sum is 0, whereas if we assign $y = 0$, then it is nonnegative, since each $\alpha_H$ is positive. Thus, the minimum is achieved when $y = 1$, in which case (4.2) is 0, hence equal to $f(x)$.

For every positive term containing the common subset $U$, equation (4.2) replaces it with a new term $\alpha_H \bar{y} \prod_{j \in H \setminus U} x_j$. To get a multilinear polynomial, we replace the negated variable $\bar{y}$ with $1 - y$, which splits each term into two:

$$ \alpha_H \, \bar{y} \prod_{j \in H \setminus U} x_j = \alpha_H \prod_{j \in H \setminus U} x_j - \alpha_H \, y \prod_{j \in H \setminus U} x_j \tag{4.3} $$

Corollary 59. When we apply equation (4.2) to a positive term, we obtain a positive term of smaller degree, and a negative term with $y$ replacing the common subset $U$.

For reducing the negative-coefficient terms all sharing some common subset, we have a similar theorem.

Theorem 60. Consider $\mathcal{H}$ and $U$ as above, where now the coefficients $\alpha_H$ are negative for all $H$. Let $g$ be the corresponding polynomial. Then for any assignment of the variables, $g(x)$ is

$$ \min_{y \in \{0,1\}} \; \sum_{H \in \mathcal{H}} -\alpha_H \Big( 1 - \prod_{j \in U} x_j - \prod_{j \in H \setminus U} x_j \Big) y \tag{4.4} $$

Proof. The proof is similar to the proof for Theorem 58. The minimum is achieved when $y = \prod_{j \in U} x_j$.

A crucial difference between this reduction and Theorem 58 is that in the positive case, we could let the common subset $U$ be a single variable. However,
applying Theorem 60 to $U = \{x_1\}$ removes the term $\alpha_H \prod_{j \in H} x_j$ and replaces it with $\alpha_H \, y \prod_{j \in H \setminus \{1\}} x_j$, another negative term of the same degree. Trying to apply this reduction repeatedly will thus never terminate. However, if $U$ consists of two or more variables, then grouping all terms containing $U$ and reducing results in smaller degree terms replacing every term that we start with.

Figure 4.2: Our main reduction. At left are all the original positive terms containing the common variable $x_1$ (so $\alpha_i > 0$). At right are all the new terms we obtain from equation (4.2). The positive terms on top are just the original terms minus $x_1$, and the negative terms on bottom are the original terms with $y$ replacing $x_1$.

4.3.1 Our method

Equations (4.2) and (4.4) can be used for different reduction strategies. Both depend upon the choice of common variables $U$. Besides choosing $|U|$, we can also decide the order in which to consider different choices of $U$; for example, which single variable to use to apply equation (4.2), or which pair of variables to use to apply equation (4.4).

We will focus on the simplest case: we let the common part $U$ be a single variable $x_i$, and reduce positive terms containing this variable via equation (4.2). Negative terms will be reduced using the method of section 4.2.2. Note that more complicated schemes are also possible, such as picking pairs of variables and reducing both positive and negative terms containing this pair via equations (4.2) and (4.4).

Our method reduces a multilinear polynomial with higher-order terms to quadratic form in two steps:

Step 1. Eliminate all higher-order positive terms by repeated application of Theorem 58, with the common subset $U$ set to a single variable $x_1$. Gather all terms containing $x_1$, and replace them with equation (4.2).
If $\mathcal{H}$ consists of all positive terms containing $x_1$, then

$$
\begin{aligned}
\sum_{H \in \mathcal{H}} \alpha_H \prod_{j \in H} x_j = \min_{y \in \{0,1\}} \quad & \sum_{H \in \mathcal{H}} \alpha_H \, x_1 y && \text{(4.5a)} \\
& + \sum_{H \in \mathcal{H}} \alpha_H \prod_{j \in H \setminus \{1\}} x_j && \text{(4.5b)} \\
& - \sum_{H \in \mathcal{H}} \alpha_H \, y \prod_{j \in H \setminus \{1\}} x_j && \text{(4.5c)}
\end{aligned}
$$

The positive terms now form a hypergraph on one fewer variable, so repeat with $x_2, \ldots, x_n$ until all positive terms are reduced.

Step 2. All higher-order terms now have negative coefficients. Reduce them term-by-term using the methods in section 4.2.2.

Note that equation (4.5) is simply the special case of equation (4.2) for a single variable. This special case is illustrated in figure 4.2.

4.4 Worst case performance

The results of applying equation (4.5) consist of three parts: a positive quadratic term (4.5a); and for each term, a positive term on the original variables with $x_1$ removed (4.5b); and a negative term with $y$ replacing $x_1$ (4.5c).

Note that in the course of the reduction, we may create a monomial on some variables $x_1, \ldots, x_d$ when another monomial on the same variables already existed in the input. In this case, we get lucky, since we can just sum the new term's coefficient with the existing monomial, which doesn't increase the size of the representation. To analyze the worst-case performance, we will assume that this never happens. In section 4.5 we will revisit this possibility.

Under this assumption, each positive term of degree $d$ that we start with will have a single variable removed every time we apply the reduction. To be fully reduced it must go through $d - 1$ applications of the rule, producing negative terms of degrees $2, \ldots, d$. Reducing these $d - 1$ negative terms by section 4.2.2 results in $O(d)$ new variables and $O(d^2)$ submodular quadratic terms.

Overall, to reduce $t$ positive terms of degree $d$ on $n$ variables, in the worst case our method requires $n + O(td)$ new variables, $O(td^2)$ submodular terms and at most $n$ non-submodular terms. Even in the worst case our algorithm's asymptotic performance is similar to HOCR (see figure 4.1).
However, our method produces at most $n$ non-submodular terms, compared to $O(nk)$ for HOCR.

For the weight of non-submodular edges, each positive term contributes $\alpha_H$ to (4.5a) each time it is reduced, for a total of $(d-1)\alpha_H$ weight in non-submodular edges. If the total weight of positive terms is $W$, we get $O(dW)$ non-submodular weight in the reduced form, a factor $d$ improvement over HOCR.

4.5 Local completeness

We can improve on this worst-case analysis for some common vision problems such as [77, 102]. We have identified a property of certain energy functions that we call local completeness, where our algorithm (unlike HOCR or GRD) has improved asymptotic performance.

The basic idea behind local completeness is that whenever a monomial $x_1 \cdots x_d$ occurs in the input, we are also likely to see the monomials on all subsets of these $d$ variables as well. In essence, local completeness argues that typical inputs to vision problems are "bad" in the sense of having lots of terms. If a certain problem is locally complete, then this is a lower-bound argument to show that the input is necessarily large, and hence methods (such as ours) which exploit the shared structure of the graph will be more successful.

To be precise, consider a multilinear polynomial on the binary variables $x_1, \ldots, x_n$, and denote by $\mathcal{H}$ the hypergraph of its monomials, as before. Note that $\mathcal{H}$ is not necessarily in minimal form (i.e., if $H' \subseteq H$ then $H$ and $H'$ may both be hyperedges in $\mathcal{H}$). Let $\mathcal{H}'$ be the "completed" hypergraph, formed by all subsets of edges in $\mathcal{H}$ (that is, $\mathcal{H}' = \bigcup_{H \in \mathcal{H}} 2^H$).

Definition 61. A polynomial is locally complete with completeness $c$ (or has local completeness $c$) if $|\mathcal{H}| \ge c |\mathcal{H}'|$ for some $c \in (0, 1]$.

To explain the terminology, note that the larger hypergraph $\mathcal{H}'$ is obtained by completing our input $\mathcal{H}$ to include all the subsets of every term that we started with.

Every polynomial is locally complete for some completeness $c$, as we can always choose $c = \frac{|\mathcal{H}|}{|\mathcal{H}'|}$.
However, we are interested in classes of problems which remain complete as the problem size grows, so we say that a family of polynomials is locally complete if there is a fixed $c$ such that all the polynomials have local completeness $c$. For example, a family $P$ of polynomials arising from a particular vision problem would be locally complete if we always had 1/2 of all subsets of terms appearing in all instances of $P$.

4.5.1 Performance on locally complete problems

Recall the procedure for reducing positive terms, using equation (4.5). We would like the extra positive term we create, with variables $H \setminus \{1\}$, to combine with some existing term. If it happens that $H \setminus \{1\}$ is already a term with coefficient $\beta_{H \setminus \{1\}}$, then we add $\alpha_H$ to this coefficient, and do not create a new term. This motivates the definition of local completeness: the new positive terms in (4.5b) have variables which are subsets of our original terms, so if our energy function has local completeness $c$, the new positive terms will combine with existing terms a fraction $c$ of the time.

Theorem 62. If an energy function has local completeness $c$, the procedure of Section 4.3.1 for reducing positive terms will result in at most $\frac{1}{c} |\mathcal{H}|$ negative-coefficient terms.

Proof. By the definition of local completeness, $|\mathcal{H}'| \le \frac{1}{c} |\mathcal{H}|$. As a notational convenience, add in all the extra subsets contained in $\mathcal{H}'$ as monomials with coefficient 0, so there are now $|\mathcal{H}'|$ terms. Having done this, since $\mathcal{H}'$ is closed under subsets, the positive terms produced by (4.5b) will always combine with existing terms. Applying equation (4.5) removes the term $\alpha_H \prod_{j \in H} x_j$, changes the coefficient on the term with variables $H \setminus \{1\}$, and adds a new negative term $-\alpha_H \, y \prod_{j \in H \setminus \{1\}} x_j$. The total number of terms remains constant. Therefore, when we have finished reducing all positive terms, we are left with only negative terms, and we have as many as we started with, namely $|\mathcal{H}'| \le \frac{1}{c} |\mathcal{H}|$.
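The counting argument in the proof of Theorem 62 can be checked on a small example. The following Python sketch is an illustration only, not the released implementation: it stores a polynomial as a dictionary from monomials (frozensets of variable indices) to integer coefficients, computes the completion $\mathcal{H}'$ and the ratio $|\mathcal{H}|/|\mathcal{H}'|$, and applies Step 1 of Section 4.3.1 with equation (4.5). The function names, the data layout, and the choice to count only nonempty subsets in $\mathcal{H}'$ are assumptions made for this example. Equality with the original polynomial is verified by brute force, which is only feasible for a handful of variables.

```python
from itertools import combinations, product


def completion(poly):
    """Completed hypergraph H': all nonempty subsets of the monomials of poly."""
    closed = set()
    for mono in poly:
        items = sorted(mono)
        for r in range(1, len(items) + 1):
            closed.update(frozenset(c) for c in combinations(items, r))
    return closed


def local_completeness(poly):
    """The ratio |H| / |H'| (the largest c satisfying Definition 61)."""
    return len(poly) / len(completion(poly))


def reduce_positive_terms(poly, n):
    """Step 1: for each original variable i, group the positive terms of
    degree >= 2 that contain i and rewrite them with equation (4.5).
    Returns the new polynomial and the total number of variables."""
    poly = dict(poly)
    next_var = n  # new variables y are numbered n, n+1, ...
    for i in range(n):
        group = [m for m, a in poly.items() if a > 0 and i in m and len(m) >= 2]
        if not group:
            continue
        y, next_var = next_var, next_var + 1
        total = 0
        for m in group:
            a = poly.pop(m)
            total += a
            rest = m - {i}
            poly[rest] = poly.get(rest, 0) + a      # (4.5b): merges if rest exists
            neg = rest | {y}
            poly[neg] = poly.get(neg, 0) - a        # (4.5c): negative term with y
        poly[frozenset({i, y})] = total             # (4.5a): one positive quadratic
    return {m: a for m, a in poly.items() if a != 0}, next_var


def evaluate(poly, x):
    """Value of the multilinear polynomial at the 0/1 tuple x."""
    return sum(a for m, a in poly.items() if all(x[v] for v in m))


def min_over_new_variables(poly, x, n, total_vars):
    """Minimize over the added variables y, for fixed original variables x."""
    return min(evaluate(poly, x + y)
               for y in product((0, 1), repeat=total_vars - n))


# A locally complete example on 3 variables: every nonempty subset is a term.
poly = {
    frozenset({0, 1, 2}): 3, frozenset({0, 1}): 2, frozenset({1, 2}): -4,
    frozenset({0, 2}): 1, frozenset({0}): -1, frozenset({1}): 2, frozenset({2}): 5,
}
reduced, total_vars = reduce_positive_terms(poly, 3)
num_negative = sum(1 for a in reduced.values() if a < 0)
print(local_completeness(poly), num_negative, len(completion(poly)))
```

In this sketch the positive terms created by (4.5b) are merged into existing dictionary entries, which is exactly the "lucky" case the theorem relies on; the remaining negative terms would then be handled by Step 2 (section 4.2.2), which is not implemented here.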
If we started with $t$ terms of up to degree $d$ on $n$ variables, the entire reduction results in at most $n + \frac{1}{c} t$ new variables, $\frac{1}{c} td$ submodular terms and $n$ non-submodular terms. For a family of locally complete inputs, $c$ is constant, giving the asymptotic results in figure 4.1.

Local completeness is stronger than strictly necessary: we only really need that when reducing terms containing $x_1$, all the terms $H \setminus \{1\}$ already exist. In this case, the analysis depends on the order in which we pick variables. Local completeness gives the stronger property that no matter what order we choose variables to reduce, we always get a large fraction of terms combining.

Finally, we reiterate that local completeness does not state that such problems are easy to solve, but rather the opposite: such problems (which include many vision problems, as shown below) have intrinsically large representations as multilinear polynomials; but nevertheless, in such cases our reduction uses few additional variables and edges.

4.6 Locally complete energy functions in vision

We can show that under some reasonable assumptions an important class of vision problems will have locally complete energy functions. Specifically, we consider fusion moves [66] under an FoE prior [77] with random proposals, as used as a benchmark for HOCR in [37, 40].

The original (non-binary) energy function can be written as a sum over cliques $C$ in the image, $\sum_C f_C(x_C)$. A single fusion move has an input image $I$ and a proposed image $I'$, and for every pixel there is a binary variable that encodes whether that pixel takes its intensity from $I$ or $I'$. This results in a binary energy function on these variables to compute the optimal fusion move.

We can better analyze fusion moves by moving to a continuous framework. Embed the original intensities in $\mathbb{R}$, and extend the clique energies $f_C$ to functions on $\mathbb{R}^d$.
We need two assumptions: (1) $f_C$ is $d - 1$ times continuously differentiable and (2) each of the $d$ different mixed partials $\frac{\partial^{d-1} f}{\partial x_1 \cdots \widehat{\partial x_i} \cdots \partial x_d}$ (where $\widehat{\partial x_i}$ means to omit the $i$-th partial) takes its zeros in a set of measure 0.

Theorem 63. Under these two assumptions, the set of proposed-current image pairs $(I, I')$ for which the fusion move binary energy function does not have local completeness 1 has measure 0 as a subset of $\mathbb{R}^n \times \mathbb{R}^n$.

We defer the proof of this theorem to Appendix A, and provide a proof sketch here. We write the fusion move binary energy function in terms of $n$ binary variables $b_i$. Writing this as a multilinear polynomial in $b$, each clique $C$ can result in terms $t_S$ for each subset $S$ of $C$. We can show that the energy function is locally complete if the coefficient on $t_S$ is almost never (i.e., with probability 0) zero.

For example, here is how to calculate the coefficient on the term $b_1 b_2$ in a clique of size 3. If $I_1, I_2, I_3$ are the labellings in the current image on $C$, and $I'_1, I'_2, I'_3$ are the proposed labellings, then the coefficient on $b_1 b_2$ is

$$ f_C(I'_1, I'_2, I_3) - f_C(I'_1, I_2, I_3) - f_C(I_1, I'_2, I_3) + f_C(I_1, I_2, I_3) \tag{4.6} $$

Since the labels are in $\mathbb{R}$, the four 3-pixel images mentioned in this coefficient lie on a rectangle in $\mathbb{R}^3$. If we give each of these points $v$ a height of $f_C(v)$, then (4.6) is 0 if and only if the four points are coplanar. In general, we do not expect 4 arbitrary points to lie on a plane. However, if $f_C$ has no curvature, then any 4 such points will be coplanar. In the full proof, we show that if there exists any open ball of image pairs with zero coefficient on $b_1 b_2$, then the energy function is flat ($\frac{\partial^2 f}{\partial x_1 \partial x_2} = 0$) in the same ball, contradicting our assumption that the partials are nonzero almost everywhere. We also extend this to larger degree terms, to prove the general case.

Corollary 64. The energy functions obtained from fusion moves with FoE priors and proposals chosen as random images are locally complete with probability 1.
Proof. The functions $f_C$ for the FoE model given in [77] are infinitely differentiable, and their mixed partials have their zeros in a set of measure 0. Since the proposed images are chosen from a continuous distribution over $\mathbb{R}^n$, events of measure 0 occur with probability 0.

4.7 Experimental results

Figure 4.3: Denoising examples. At left is the noisy input image, with our result in the middle and Ishikawa's at right. Results are shown after 30 iterations. More images are included in the supplemental material. To compare energy values with visual results, the images on the top row have energies 118,014, 26,103 and 38,304 respectively; those on the bottom have energies 118,391, 25,865 and 38,336.

[Plots: energy vs. time (seconds) and labeled fraction vs. time (seconds), each for our method, HOCR and GRD-heur; labeled fraction vs. fraction of non-submodular weight.]

Figure 4.4: Energy after each fusion move (left), and percentage of pixels labeled by QPBO (center), for the image at top of figure 4.3. Other images from [40] give very similar curves. (right) Fraction of pixels labeled by QPBO vs total weight of non-submodular edges (as a fraction of the total weight of all edges), along with best-fit lines for each method.

We have provided a freely available open source implementation of our algorithm at http://www.cs.cornell.edu/~afix/software.html. We experimentally compared our reduction with the two available, general-purpose higher-order reductions: HOCR [37, 40] and GRD [43, 44]. These three methods are all direct competitors in the types of energy functions they can handle (with the limitation that GRD is restricted to $d \le 4$).
All methods have publicly available code implemented in C++, and provide very similar interfaces for setting up and optimizing a higher-order MRF.

For all experiments, we only report the results from the heuristic version of GRD, GRD-heur. Because the exact version solves an LP, running this method on vision-sized inputs proved to be prohibitive. A single iteration of fusion move took an average of an hour, compared to 40 seconds for GRD-heur. Consequently, it was impossible to run this method on the full dataset. Fortunately, GRD-heur has been shown [43, 44] to have similar optimization quality to the exact GRD, with significantly less computation time.

Our benchmark for evaluating the methods is the Fields of Experts prior for image denoising with fusion moves, using a dataset of 200 images. This same experiment was used in the original evaluation of HOCR [37], and later in the evaluation of GRD [43]. We use the same MRF, and energy functions that are as similar as possible. The fusion moves alternate between a randomly generated uniform image and a blurred image, and the energy function has clique size 4. We do multiple fusion moves on multiple images, so the effects of randomness are minimal.

To compare the effectiveness of each method on the individual binary subproblems, we ran all three algorithms on the first 30 fusion moves for each image, using the same starting point and proposal for each method (we averaged over only 30 iterations, because later iterations reduce the energy by much smaller amounts, and wash out the differences in the methods). The results, averaged over the 200 × 30 = 6000 fusion moves, are summarized in figure 4.6.

Overall, we see that our method is strictly preferable to HOCR in all metrics, giving a better energy improvement per fusion move, labeling more pixels in QPBO, and taking less time.
Compared to GRD, we see that GRD does label more pixels, and reduces the energy by an additional 35%; however, it takes over 7 times as long per iteration to compute.

To compare the effect of total weight of non-submodular edges, we plotted the fraction of pixels labeled by QPBO vs. the fraction of non-submodular weight (divided by the total weight of all edges) in figure 4.4. There is a clear negative relation between these quantities; best fit lines had slopes of -15.6 and -8.2 for our reduction and HOCR respectively. Correspondingly, our method had a lower average fraction of non-submodular weight, 19.7% vs 26.0%, and a higher fraction labeled by QPBO, 76.1% vs 59.4%.

We also tested the effect of choosing which order to reduce variables. The default for all experiments was to reduce the variables in order $x_1, \ldots, x_n$. An alternative would be to choose the variables in decreasing order of number of positive terms. As predicted by our theoretical analysis, this difference had no measurable benefit. The "smart ordering" had 1% more average energy reduction, 0.4% fewer variables labeled, and won against the standard ordering (in terms of energy reduction) in only 42% of the 6000 fusion moves. We also tested a random ordering, which performed somewhat worse than the standard ordering (better in only 2% of fusion moves, and 2% worse energy reduction). Because of extra bookkeeping, the smart ordering took 51% longer on average; consequently, we recommend using the standard order over more complicated schemes.

                    HOCR             GRD-heur         Our method
Final energy        32,199 (+2.3%)   31,375 (−0.3%)   31,473
Time (seconds)      2,050 (+102%)    5,587 (+450%)    1,012

Figure 4.5: Comparison of end-to-end performance on benchmarks in [40] at convergence of the fusion move optimization, averaged over all images. Relative performance compared to our method is shown in parentheses.

                          HOCR            GRD-heur        Our method
Energy improvement        1,302 (−45%)    3,183 (+35%)    2,351
Percent labeled by QPBO   59.4% (−22%)    86.3% (+13%)    76.1%
Time (seconds)            14.1 (+150%)    40.2 (+620%)    5.6

Figure 4.6: Performance comparison of reductions, on benchmarks in [40], averaged over 30 iterations of fusion move. Relative performance compared to our method is shown in parentheses.

As a second experiment, we compared the end-to-end performance of using each optimizer all the way through a complete run of the fusion-move algorithm. These results are displayed in figures 4.4 and 4.5. [Footnote 2: We averaged together pairs of consecutive fusion moves in the graph shown at right in figure 4.4. This avoids the distracting sawtooth pattern visible in [37, 40], due to the alternation between random fusion moves and blurred fusion moves.] Our method converges much faster, despite GRD reducing the energy by slightly more each step. Overall, the computational inefficiency of GRD greatly outweighs the marginal improvement in per-subproblem solution quality for fusion move.

              Extra variables   Non-submodular terms   Total terms
HOCR          224,346           421,897                1,133,811
Our method    236,806 (+6%)     38,343 (−90%)          677,183 (−40%)

Figure 4.7: Total size of reductions, on Ishikawa's benchmarks in [40]. Relative performance of our method is shown in parentheses.

The sizes of the obtained reductions for our method and the other term-rewriting reduction, HOCR, are summarized in figure 4.7. Overall, our method does better in practice than the asymptotic analysis in figure 4.1 suggests. As predicted, we produce many fewer non-submodular terms, but we also produce fewer submodular terms (a relative improvement of 10%).

Visual results are shown in figure 4.3. In the boat image our results appear more accurate in smooth areas like the water, and the face image (shown magnified at bottom left) is also noticeably smoother. These results are after 30 fusion moves.
The images after convergence (shown, along with more examples, in the supplemental material) are visually similar, though we still obtain lower energy.

Finally, we experimentally computed the local completeness of two early vision problems that are quite far from denoising, namely stereo [102] (clique size 3) and segmentation [1] (clique size 4). We analyzed the binary energy functions produced from 60 iterations of [102]. These energy functions have a very high local completeness; on average the energy functions are $c$-complete for $c = 0.98$, and their least locally complete energy function had $c = 0.96$. We also discovered that the higher-order segmentation energy function of [1] is absolutely locally complete ($c = 1$). These results suggest that our method may be particularly well suited to a number of important vision problems.

CHAPTER 5

SUM OF SUBMODULAR MINIMIZATION

The reduction methods of the previous chapter provide an algebraic method for transforming higher-order binary MRFs to first-order. However, while this transformation preserves some features of the original energy function (e.g., the global minimum value remains the same), the resulting first-order functions are not always easily solvable. In particular, we have seen that non-submodular terms in the resulting reduced energy can lead to poor solutions from optimizers like QPBO.

An alternate approach to minimizing binary MRFs is to apply flow algorithms directly to the higher-order energy. Our goal is to preserve two key properties of max-flow based solvers: (1) global optimality of the solution obtained and (2) fast performance on typical inputs for vision problems. To do this, we will need a generalization of max-flow to the higher-order case; this generalization is called Sum-of-Submodular flow.

Sum-of-Submodular flow, and the corresponding cut problem, Sum-of-Submodular minimization, occupy a middle ground between the max-flow min-cut problem and general submodular function minimization.
A set function $f : 2^V \to \mathbb{R}$ is Sum-of-Submodular (SoS) if it is a sum

$$ f(S) = \sum_{i \in S} f_i + \sum_{i \notin S} f_{\bar{i}} + \sum_{C} f_C(S \cap C) \tag{5.1} $$

where each clique function $f_C$ is submodular.

Recall that a sum of submodular functions is itself submodular, so this is a special case of general submodular function minimization. However, we will show that the structure of $f$ (in particular, that the cliques $C$ form a hypergraph) allows much faster minimization than the $O(n^6)$ algorithm of [73].

Additionally, standard min-cut is a special case of SoS minimization, since for each directed arc $(i, j)$ with capacity $c_{i,j}$, the cost of that edge being cut can be written as a submodular clique function

$$ f_{i,j}(S) = \begin{cases} c_{i,j} & i \in S, \; j \notin S \\ 0 & \text{otherwise} \end{cases} \tag{5.2} $$

So, SoS minimization lies in between min-cut and general submodular minimization:

$$ \text{MIN-CUT} \subseteq \text{SOS MINIMIZATION} \subseteq \text{SUBMODULAR MINIMIZATION} \tag{5.3} $$

In this chapter, we describe how a Sum-of-Submodular (SoS) function can be minimized by means of an SoS flow network, and give a fast algorithm for solving this minimization, as originally presented in [19].

Throughout this chapter, we will assume that $f$ is a set function which is a sum of unary and clique functions of the form (5.1). We will also assume that $f$ has been reparameterized such that $f_C \ge 0$ and $f_C(\emptyset) = f_C(C) = 0$, and the linear terms have $f_i, f_{\bar{i}} \ge 0$.

5.1 Sum of Submodular Minimization via Submodular Flow

For SoS functions, the cut problem is easy to describe: minimize $f(S)$ over all sets $S \subseteq V$. In this section, we detail the dual problem: Sum-of-Submodular flow.

5.1.1 Definitions and Graph Construction

Submodular flow has existed in the combinatorial optimization literature for some time [15, 26]. However, these algorithms are designed for full-order (or $n$th order) submodular functions, meaning there is no internal clique structure on the function $f$.
The work of [53] was the first to develop an algorithm for the clique-structured case (SoS flow), and the problem formulation and mathematical notation of this section are based on that work.

SoS flow is similar to the max-flow problem, in that there is a network of nodes and arcs on which we want to push flow from $s$ to $t$. However, the notion of residual capacity will be slightly modified from that of standard max-flow.

We begin with a network $G = (V \cup \{s, t\}, A)$. We will denote $V + s + t$ by $\overline{V}$. As in the max-flow reduction for Graph Cuts, there are source and sink arcs $(s, i)$ and $(i, t)$ for every $i \in V$. Additionally, for each clique $C$, there is an arc $(i, j)_C$ for every pair $i, j \in C$. [Footnote 1: To explain the notation, note that $\{i, j\}$ might be in multiple cliques $C$, so we may have multiple edges $(i, j)$ (that is, $G$ is a multigraph). We distinguish between them by the subscript $C$.]

Every arc $a \in A$ also has an associated residual capacity $c_a$. The residual capacities of arcs $(s, i)$ and $(i, t)$ are the familiar residual capacities from max-flow: these arcs have starting capacities $c_{s,i}$ and $c_{i,t}$ (determined by the unary terms of $f$), and whenever we push flow on a source or sink arc, we decrease the residual capacity by the same amount.

For the interior arcs, we need one further piece of information. In addition to residual capacities, we also keep track of residual clique functions $\overline{f}_C(S)$, related to the flow values by the following rule: whenever we push $\delta$ units of flow on arc $(i, j)_C$, we update $\overline{f}_C(S)$ by

$$ \overline{f}_C(S) \leftarrow \begin{cases} \overline{f}_C(S) - \delta & i \in S, \; j \notin S \\ \overline{f}_C(S) + \delta & i \notin S, \; j \in S \\ \overline{f}_C(S) & \text{otherwise} \end{cases} \tag{5.4} $$

A flow $\phi$ is a function $\phi : A \to \mathbb{R}_{\ge 0}$ which satisfies the usual conservation constraints (meaning flow into a node $i$ is equal to the flow out of $i$ for all $i \neq s, t$).
The residual clique functions $\overline{f}_C$ result from applying (5.4) each time an arc carries $\delta$ units of flow, so we have that

$$ \overline{f}_C(S) = f_C(S) - \sum_{i \in S, \, j \notin S} \phi_{i,j,C} + \sum_{j \in S, \, i \notin S} \phi_{i,j,C} \tag{5.5} $$

That is, the residual clique function $\overline{f}_C(S)$ is the original clique function $f_C(S)$, minus the outflow from $S$ along arcs in $C$, plus the flow into $S$ along arcs in $C$.

The residual capacities of the interior arcs are chosen so that the $\overline{f}_C$ are always nonnegative. Accordingly, we define $c_{i,j,C} = \min_S \{ \overline{f}_C(S) \mid i \in S, \, j \notin S \}$. A flow is feasible if all residual capacities are nonnegative.

5.1.2 Flow as a Reparameterization

The key to understanding Sum-of-Submodular flow is that any flow gives a reparameterization of the original cost function. That is, flows in the graph defined above are just different ways of rewriting the function $f$. Finding the maximum flow will end up with a reparameterized $\overline{f}$ which is particularly easy to optimize, leading to an algorithm for finding the global minimum of $f$.

Interestingly, this same idea is used in several algorithms for energy minimization, including QPBO (see [51] for a very accessible review of this idea) and the dual-LP methods for optimizing multilabel MRFs described in Section 3.4. We will also see this idea in the development of the SoSPD algorithm of Chapter 7.

To define this reparameterization, we first define the residual cost function $\overline{f}(S)$ to be

$$ \overline{f}(S) = \sum_{i \in S} c_{i,t} + \sum_{i \notin S} c_{s,i} + \sum_{C} \overline{f}_C(S \cap C) \tag{5.6} $$

and also define the value of the flow $\phi$ to be the total outflow of the source:

$$ \nu(\phi) = \sum_{i \in V} \phi_{s,i} \tag{5.7} $$

Lemma 65. Any flow $\phi$ gives a reparameterization of the original cost function, with

$$ f(S) = \nu(\phi) + \overline{f}(S) \tag{5.8} $$

for all feasible flows $\phi$ and all sets $S \subseteq V$.

Recall that we defined a flow $\phi$ to be feasible whenever $\overline{f}_C \ge 0$ and $\overline{f}_i, \overline{f}_{\bar{i}} \ge 0$ (that is, the residual capacities $c_{i,t}$ and $c_{s,i}$ are nonnegative) for all $C$ and $i$. In particular, if $\phi$ is feasible then $\overline{f}(S) \ge 0$ for all $S$, which immediately gives the following lower bound on the minimum of $f$.

Corollary 66.
If $\phi$ is feasible, then

$$ f(S) \ge \nu(\phi) \tag{5.9} $$

for all $S \subseteq V$.

Proof of Lemma 65. As with most proofs of one function being a reparameterization of another, various quantities have been added to and subtracted from the energy in a way that cancels out. In this case, we expand out the residual capacities in (5.6):

$$
\begin{aligned}
\overline{f}(S) &= \sum_{i \in S} (f_i - \phi_{i,t}) + \sum_{i \notin S} (f_{\bar{i}} - \phi_{s,i}) + \sum_{C} \Big( f_C(S \cap C) - \sum_{i \in S, \, j \in C \setminus S} \phi_{i,j,C} + \sum_{j \in S, \, i \in C \setminus S} \phi_{i,j,C} \Big) \\
&= \sum_{i \in S} f_i + \sum_{i \notin S} f_{\bar{i}} + \sum_{C} f_C(S \cap C) - \sum_{i \in S} \Big( \phi_{i,t} + \sum_{C} \sum_{j \in C \setminus S} \phi_{i,j,C} \Big) + \sum_{i \notin S} \Big( -\phi_{s,i} + \sum_{C} \sum_{j \in S \cap C} \phi_{i,j,C} \Big) \\
&= f(S) - \phi(S, \overline{V} \setminus S) - \phi(\{s\}, V \setminus S) + \phi(V \setminus S, S)
\end{aligned}
\tag{5.10}
$$

where $\phi(A, B)$ is the total flow from a subset $A \subseteq \overline{V}$ to $B \subseteq \overline{V}$. We will use the fact that $\phi(A \cup B, C) = \phi(A, C) + \phi(B, C)$ when $A$, $B$ and $C$ are disjoint.

Because flow is conserved at all nodes $i \neq s, t$, the flow into $S$ equals the flow out of $S$, so $\phi(S, \overline{V} \setminus S) = \phi(\overline{V} \setminus S, S)$, and in particular, the last line above is

$$
\begin{aligned}
\overline{f}(S) &= f(S) - \phi(\overline{V} \setminus S, S) - \phi(\{s\}, V \setminus S) + \phi(V \setminus S, S) \\
&= f(S) - \big( \phi(\{s\}, S) + \phi(V \setminus S, S) \big) - \phi(\{s\}, V \setminus S) + \phi(V \setminus S, S) \\
&= f(S) - \phi(\{s\}, V) = f(S) - \nu(\phi)
\end{aligned}
\tag{5.11}
$$

5.1.3 The Max-Flow Min-Cut Theorem for SoS Functions

The key theorem relating SoS flow and SoS function minimization is a direct analogue of the max-flow min-cut theorem. [Footnote 2: All proofs, theorems and lemmas in this section are adapted from [53].] Recall that in standard max-flow on a graph, a flow is a maximum flow if and only if there are no augmenting paths from $s$ to $t$. Once we have found this maximum flow, the minimum cut is obtained by taking $S^*$ to be the set of nodes reachable from $s$ along arcs of positive residual capacity (which, by definition, can't include $t$, otherwise there would be an augmenting path from $s$ to $t$).

In SoS flow, we get a directly analogous theorem. Given a feasible SoS flow $\phi$, define the set of residual arcs $A_\phi$ to be all arcs $a$ with $c_a > 0$. An augmenting path is an $s$-$t$ path along arcs in $A_\phi$.
We will say that a feasible flow $\phi^*$ is maximal if there are no augmenting paths in $A_{\phi^*}$ (footnote 3).

Footnote 3: As in standard max-flow, any maximal flow is in fact a maximum flow (i.e., all maximal flows have the same value $\nu(\phi^*)$). Note that this follows immediately from Theorem 67, since all maximal flows have value $f(S^*)$, but to avoid being circular, we'll state the theorem in terms of maximal flows.

Theorem 67. Let $\phi^*$ be a maximal flow. Let $S^*$ be the set of all $i \in V$ reachable from $s$ along arcs in $A_{\phi^*}$. Then $f(S^*)$ is the minimum value of $f$ over all $S \subseteq V$, and $f(S^*) = \nu(\phi^*)$.

The simplest proof of this theorem uses the reparameterization result above (Lemma 65). We have mentioned that at a maximal flow $\phi^*$, the reparameterization $\bar f$ is particularly easy to minimize. In fact, its minimizer is the set $S^*$ defined above, which can be found by running a depth-first search from $s$ in the residual graph $A_{\phi^*}$.

Lemma 68. Let $\phi^*$ be a maximal flow, and let $S^*$ be the set of $i$ reachable from $s$ along arcs in $A_{\phi^*}$. Then

$$\bar f(S^*) = 0 \tag{5.12}$$

The theorem then follows immediately from this Lemma, Lemma 65 and Corollary 66, since $f(S) \ge \nu(\phi^*)$ for all sets $S \subseteq V$ and $f(S^*) = \bar f(S^*) + \nu(\phi^*) = \nu(\phi^*)$.

Proof of Lemma 68. First, note that the residual unary term of each $i \in S^*$ (the residual capacity $c_{i,t}$) is zero, otherwise $t$ would be reachable from some $i \in S^*$, which would give an augmenting path from $s$ to $t$ passing through $i$. Similarly, the residual unary term of each $i \notin S^*$ (the residual capacity $c_{s,i}$) is zero, otherwise $i$ would be reachable from $s$ along arcs in $A_{\phi^*}$.

Now, we just need to show that $\bar f_C(S^* \cap C) = 0$ for all cliques $C$. Fix a $C$, and let $T = S^* \cap C$. If $T = \emptyset$ or $T = C$ then we're done, since $\bar f_C(\emptyset) = \bar f_C(C) = 0$. So, we can assume that $T$ and $C \setminus T$ are both nonempty.

Pick any $i \in T$ and $j \in C \setminus T$. We know that $j$ is not reachable from $i$ along arcs of positive residual capacity, so in particular we must have $c_{i,j,C} = 0$. Since $c_{i,j,C} = \min_{S \subseteq C : i \in S,\, j \notin S} \bar f_C(S)$, there is some $T_{i,j}$ with $\bar f_C(T_{i,j}) = 0$ and $i \in T_{i,j}$, $j \notin T_{i,j}$.

Now, let $T_i = \bigcap_{j \in C \setminus T} T_{i,j}$.
We have that $i \in T_i$ and each $j \in C \setminus T$ is not in $T_i$ (since $j \notin T_{i,j} \supseteq T_i$). Let $T' = \bigcup_{i \in T} T_i$. We have that $T' = T$, since each $i \in T$ is in $T_i$, hence also in $T'$, and each $j \in C \setminus T$ is not in any of the $T_i$, hence not in $T'$.

Therefore, we can write $T = \bigcup_{i \in T} \bigcap_{j \in C \setminus T} T_{i,j}$, and $\bar f_C(T_{i,j}) = 0$ for all $i \in T$, $j \in C \setminus T$. So, we must have $\bar f_C(T) = 0$ since the zero sets of $\bar f_C$ form a lattice (i.e., they are closed under intersection and union), by Corollary 38.

The key idea of this proof, and the reason that this algorithm only works for submodular functions, is that the zero sets of nonnegative submodular functions form a particular structure called a lattice, meaning they are closed under intersections and unions. In fact, the flow values we're adding and subtracting are linear functions in $S$:

$$\begin{aligned}
\sum_{i \in S,\, j \notin S} \phi_{i,j,C} - \sum_{i \notin S,\, j \in S} \phi_{i,j,C}
&= \Big[ \sum_{i \in S,\, j \notin S} \phi_{i,j,C} + \sum_{i \in S,\, j \in S} \phi_{i,j,C} \Big] - \Big[ \sum_{i \in S,\, j \in S} \phi_{i,j,C} + \sum_{i \notin S,\, j \in S} \phi_{i,j,C} \Big] \\
&= \sum_{i \in S,\, j \in C} \phi_{i,j,C} - \sum_{i \in C,\, j \in S} \phi_{i,j,C} \\
&= \sum_{i \in S} \sum_{j \in C} (\phi_{i,j,C} - \phi_{j,i,C}) \\
&= \sum_{i \in S} \psi_i
\end{aligned} \tag{5.13}$$

where $\psi_i := \sum_{j \in C} (\phi_{i,j,C} - \phi_{j,i,C})$ is the net outflow of node $i$ along arcs in $C$. Then, by (5.5), we have that $\bar f_C(S) = f_C(S) - \psi(S)$. Since $\bar f_C \ge 0$ we have $\psi \le f_C$, and $\psi(C) = f_C(C)$, so that $\psi$ is actually a base of $f_C$ (see Section 2.3.3). So, the flow values we're adding and subtracting end up being a search over the base polytope of $f_C$. An arc $(i, j)_C$ becomes saturated exactly when there's a set $T_{i,j}$ with $i \in T_{i,j}$, $j \notin T_{i,j}$ which is tight. Then, since the tight sets of $\bar f_C$ form a lattice, we can take intersections and unions to get a single, consistent $S^*$ which includes all the nodes reachable from $s$, and which is itself a tight set.

5.2 IBFS for Submodular Flow

The max-flow min-cut theorem gives a simple algorithm for finding the minimizer of an SoS function: keep track of the current flow and residual graph $A_\phi$, and in each iteration find an augmenting path from $s$ to $t$ until no more exist.
It is easy to show that with integer cost functions this algorithm must terminate in finitely many steps. This augmenting path algorithm was used in Generic Cuts [3], which was the first application of SoS optimization to computer vision, and the first implementation of SoS flow.

For flow on graphs, the current state of the art for computer vision applications is Incremental Breadth First Search (IBFS) [30]. This algorithm is based on the Boykov-Kolmogorov algorithm of [10], with an additional guarantee of polynomial time complexity. In this section, we show how to modify IBFS to compute maximum SoS flows, giving a fast algorithm for sum-of-submodular optimization for typical computer vision inputs.

5.2.1 IBFS on Graphs

IBFS is an augmenting paths algorithm: at each step, it finds a path from $s$ to $t$ with positive residual capacity, and pushes flow along it. Additionally, each augmenting path found is a shortest $s$-$t$ path in $A_\phi$. To ensure that the paths found are shortest paths, we keep track of distances $d_s(i)$ and $d_t(i)$ from $s$ to $i$ and from $i$ to $t$, and search trees $S$ and $T$ containing all nodes of distance at most $D_s$ from $s$ or $D_t$ from $t$ respectively. Two invariants are maintained:

• For every $i$ in $S$, the unique path from $s$ to $i$ in $S$ is a shortest $s$-$i$ path in $A_\phi$.
• For every $i$ in $T$, the unique path from $i$ to $t$ in $T$ is a shortest $i$-$t$ path in $A_\phi$.

The algorithm proceeds by alternating between forward passes and reverse passes. In a forward pass, we attempt to grow the source tree $S$ by one layer (a reverse pass attempts to grow $T$, and is symmetric). To grow $S$, we scan through the vertices $i$ at distance $D_s$ away from $s$, and examine each out-arc $(i, j)$ that has positive residual capacity. If $j$ is not in $S$ or $T$, then we add $j$ to $S$ at distance level $D_s + 1$, and with parent $i$. If $j \in S$ then $(i, j)$ is a back-arc, and is not on the shortest path from $s$ to $j$. If $j$ is in $T$, then we have found an augmenting path from $s$ to $t$ via the arc $(i, j)$, so we can push flow on it.
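The forward pass just described can be sketched in a few lines of Python. This is only an illustration on an ordinary graph, under simplifying assumptions: the names `forward_pass`, `residual`, `dist_s`, `parent` and `in_t_tree` are hypothetical, the function stops at the first arc into the sink tree instead of pushing flow, and orphan adoption is not modeled.

```python
def forward_pass(residual, dist_s, parent, in_t_tree, layer):
    """Try to grow the source tree by one layer.

    residual[i][j] is the residual capacity of arc (i, j); dist_s maps each
    node of the source tree to its distance from s. Returns an arc (i, j)
    with j in the sink tree if one is found (an augmenting path passes
    through it); otherwise returns None after adding every newly reached
    node at distance layer + 1.
    """
    frontier = sorted(v for v, d in dist_s.items() if d == layer)
    for i in frontier:
        for j, cap in residual.get(i, {}).items():
            if cap <= 0:
                continue              # saturated arc: not in the residual graph
            if j in in_t_tree:
                return (i, j)         # arc from S to T: augmenting path found
            if j not in dist_s:
                dist_s[j] = layer + 1  # new node joins S one layer deeper
                parent[j] = i
            # otherwise j is already in S, so (i, j) is a back-arc and is skipped
    return None


# Toy graph: s -> a -> t, plus a saturated arc s -> b.
residual = {"s": {"a": 3, "b": 0}, "a": {"t": 2}}
dist_s, parent = {"s": 0}, {}
first = forward_pass(residual, dist_s, parent, {"t"}, 0)
second = forward_pass(residual, dist_s, parent, {"t"}, 1)
```

In this toy run the first pass adds `a` to the source tree and the second pass reports the arc `("a", "t")` into the sink tree.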
The operation of pushing flow may saturate some arcs (and cause previously saturated arcs to become unsaturated). If the parent arc of a node $i$ becomes saturated, then $i$ becomes an orphan. After each augmentation, we perform an adoption step, where each orphan finds a new parent. The details of the adoption step are similar to the relabel operation of the Push-Relabel algorithm [14], in that we search all potential parent arcs in $A_\phi$ for the neighbor with the lowest distance label, and make that node our new parent. If this increases the distance $d_s(i)$, then the children of $i$ also become orphans and are recursively adopted as well, to maintain the shortest-path invariant.

5.2.2 Modifying IBFS for SoS Flow

In order to apply IBFS to the SoS flow problem (instead of standard graph flow), all the basic data structures still make sense: we have a graph where the arcs $a$ have residual capacities $c_a$, and a maximum flow has been found if and only if there is no longer any augmenting path from $s$ to $t$.

The main change for the SoS flow problem is that when we increase flow on an edge $(i, j)_C$, instead of just affecting the residual capacity of that arc and the reverse arc, we may also change the residual capacities of other arcs $(i', j')_C$ for $i', j' \in C$. A problematic case would be where $(i', j')_C$ is saturated because a set $S$ has $\bar f_C(S) = 0$, with $i' \in S$, $j' \notin S$. If we push $\delta$ units of flow from $i$ to $j$ going into $S$ (meaning $j \in S$ and $i \notin S$) then $\bar f_C(S)$ will now be $\delta > 0$, so $(i', j')$ may no longer be saturated. If $d(j') > d(i') + 1$ before the push, then we would have created a shortcut between $i'$ and $j'$, which would violate our invariants on the trees $S$ and $T$.

However, the following result ensures that this is not a problem. Let $A_\phi$ be the set of arcs with positive residual capacity, according to the current flow.

Lemma 69.
If $(a, b)_C$ was previously saturated, but now has residual capacity as a result of increasing flow along $(c, d)$, then (1) either $a = d$ or there was an arc $(a, d) \in A_\phi$ and (2) either $b = c$ or there was an arc $(c, b) \in A_\phi$.

Corollary 70. Increasing flow on an edge never creates a shortcut between $s$ and $i$, or from $i$ to $t$.

These results are based on [26]; we will prove them in the next section.

Corollary 70 ensures that we never create any new shorter $s$-$i$ or $i$-$t$ paths not contained in $S$ or $T$. A push operation may cause some edges to become saturated, but this is the same problem as in the normal max-flow case, and any orphans so created will be fixed in the adoption step. Therefore, all invariants of the IBFS algorithm are maintained, even in the submodular flow case.

The final difference between IBFS and a standard augmenting paths algorithm is the "current arc heuristic", which is a mechanism for avoiding iterating through all possible potential parents when performing an adoption step. In the case of submodular flows, it is also the case that whenever we create new residual arcs we maintain all invariants related to this current arc heuristic, so the same speedup applies here. We cover this heuristic in Section 5.4.

5.2.3 Running Time

The asymptotic complexity of the standard IBFS algorithm is $O(n^2 m)$. In the submodular-flow case, we still perform the same number of basic operations. However, note that finding the residual capacity of an arc $(i, j)_C$ requires minimizing $\bar f_C(S)$ over sets $S$ separating $i$ and $j$. If $|C| = k$, this can be done in time $O(k^6)$ using [73]. However, for $k \ll n$, it will likely be much more efficient to use the $O(2^k)$ naive algorithm of searching through all values of $\bar f_C$. Overall, we add $O(2^k)$ work at each basic step of IBFS, so if we have $m$ cliques the total runtime is $O(n^2 m 2^k)$. This runtime is better than the augmenting paths algorithm of [3], which takes time $O(n m^2 2^k)$.
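As a rough illustration of the naive $O(2^k)$ search, the following Python sketch builds a residual clique function from a flow using (5.5) and computes the residual capacity of an interior arc by enumerating every separating set. The function names and the three-element example clique are hypothetical and chosen only for illustration; this is not the thesis implementation.

```python
from itertools import combinations


def subsets(elems):
    """Yield every subset of elems as a frozenset."""
    elems = list(elems)
    for r in range(len(elems) + 1):
        for combo in combinations(elems, r):
            yield frozenset(combo)


def residual_clique_function(f_C, flow):
    """Residual clique function of (5.5); flow maps arcs (i, j) to flow values."""
    def f_bar(S):
        S = frozenset(S)
        out_flow = sum(v for (i, j), v in flow.items() if i in S and j not in S)
        in_flow = sum(v for (i, j), v in flow.items() if i not in S and j in S)
        return f_C(S) - out_flow + in_flow
    return f_bar


def residual_capacity(f_bar, clique, i, j):
    """Minimum of f_bar over all S within the clique with i in S and j not in S."""
    others = [v for v in clique if v not in (i, j)]
    return min(f_bar(T | {i}) for T in subsets(others))


# Hypothetical clique function: cost 1 whenever the clique is split.
def split_cost(S):
    return 1 if 0 < len(S) < 3 else 0


clique = [0, 1, 2]
no_flow = residual_clique_function(split_cost, {})
pushed = residual_clique_function(split_cost, {(0, 1): 1})
cap_before = residual_capacity(no_flow, clique, 0, 1)
cap_after = residual_capacity(pushed, clique, 0, 1)
cap_reverse = residual_capacity(pushed, clique, 1, 0)
```

Pushing one unit on the arc from 0 to 1 saturates that arc and raises the capacity of the reverse arc, mirroring the behavior of ordinary residual graphs.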
Additionally, IBFS has been shown to be very fast on typical vision inputs, independent of its asymptotic complexity [30].

5.3 Proof of the "No Shortcuts" Lemma

Lemma 69. If $(a, b)_C$ was previously saturated, but now has residual capacity as a result of increasing flow along $(c, d)$, then (1) either $a = d$ or there was an arc $(a, d) \in A_\phi$ and (2) either $b = c$ or there was an arc $(c, b) \in A_\phi$.

Proof. The flow before the push on $(c, d)$ is denoted by $\phi$, and the set of all arcs with residual capacity for flow $\phi$ is $A_\phi$. Recall that when we increase flow on $(c, d)$ by $\delta$, we change the residual clique functions $\bar f_C$ to

$$\bar f'_C(S) = \begin{cases} \bar f_C(S) - \delta & c \in S,\ d \notin S \\ \bar f_C(S) + \delta & c \notin S,\ d \in S \\ \bar f_C(S) & \text{otherwise} \end{cases} \tag{5.14}$$

Additionally, we defined the residual capacity of the arc $(i, j)_C$ to be

$$c_{i,j,C} = \min_S \{ \bar f_C(S) \mid i \in S,\ j \notin S \}. \tag{5.15}$$

We'll say that a set $S \subseteq C$ is saturated if $\bar f_C(S) = 0$, and that $S$ separates $i$ from $j$ if $i \in S$, $j \notin S$. We're interested in which saturated sets separate $i$ from $j$, so denote the family of such sets by $\mathcal{S}_{i,j}$, defined as

$$\mathcal{S}_{i,j} = \{ S \subseteq C \mid i \in S,\ j \notin S,\ \bar f_C(S) = 0 \}. \tag{5.16}$$

By (5.15), $(i, j)_C$ is saturated if and only if $\mathcal{S}_{i,j} \ne \emptyset$. In particular, since $(a, b)_C$ is initially saturated, we know that there's some $S_{a,b} \in \mathcal{S}_{a,b}$.

Now, to prove (1): assume $a \ne d$ (otherwise, we're done). We want to show that the edge $(a, d)$ had nonzero residual capacity in the flow $\phi$, so by the above, we need to show that $\mathcal{S}_{a,d} = \emptyset$.

Assume by way of contradiction that there exists $S_{a,d} \in \mathcal{S}_{a,d}$. Let $S_a = S_{a,b} \cap S_{a,d}$. We know that $a \in S_a$ since it's in both $S_{a,b}$ and $S_{a,d}$. We also know that both $b$ and $d$ are not in $S_a$, since $b \notin S_{a,b}$ and $d \notin S_{a,d}$. Finally, $S_a$ is the intersection of two saturated sets, so it must also be saturated, by Corollary 38 (which says that the zero-valued sets of $f$ form a lattice).

Therefore, $S_a$ is a set containing $a$ but not $b$ or $d$, and with $\bar f_C(S_a) = 0$. When we change flow on $(c, d)$, according to (5.14), we only increase the capacity of sets containing $d$.
So after changing $\phi$, we still have $\bar f_C(S_a) = 0$. But then, $S_a$ separates $a$ from $b$, so $(a, b)_C$ continues to be saturated, a contradiction.

To prove (2), we use a similar argument. We have $b \ne c$ or else we're done. Assume by way of contradiction that there exists $S_{c,b} \in \mathcal{S}_{c,b}$. Let $S_c = S_{c,b} \cup S_{a,b}$. We know that $a, c \in S_c$ and $b \notin S_c$. Furthermore, since $S_c$ is the union of two zero-valued sets, it also has $\bar f_C(S_c) = 0$. But then, since $c \in S_c$, we know that the capacity of $S_c$ doesn't increase, and therefore after changing $\phi$ it continues to be a saturated set. But since $a \in S_c$, $b \notin S_c$ this means that $(a, b)_C$ continues to be saturated, a contradiction.

Corollary 70. Increasing flow on an edge never creates a shortcut between $s$ and $i$, or from $i$ to $t$.

Proof. Assume we just increased flow on arc $(c, d)_C$, causing $(a, b)_C$ to have positive residual capacity. We'll consider the case where $c \in S$; the case where $d \in T$ follows by symmetry.

Because we increased flow on $(c, d)_C$, it is either a tree arc or an arc from $S$ to $T$, so it lies along the shortest path from $s$ to $d$, and hence $d_s(d) = d_s(c) + 1$. Since either $d = a$ or $(a, d) \in A_\phi$, we have $d_s(d) \le d_s(a) + 1$. Similarly, since either $c = b$ or $(c, b) \in A_\phi$, we have $d_s(b) \le d_s(c) + 1$. Putting these inequalities together, we get that

$$d_s(b) \le d_s(c) + 1 = d_s(d) \le d_s(a) + 1 \tag{5.17}$$

Therefore, when we cause $(a, b)$ to become unsaturated, we don't create a new shortest path from $s$ to $b$. The argument that we don't create a new shortest path from $a$ to $t$ is symmetric.

5.4 The Current Arc Heuristic

Finally, we will describe the "current arc heuristic" of IBFS, and show how its invariants still hold in the submodular flow case. We will discuss the current-arc mechanism for the source tree $S$; the case regarding $T$ is symmetric.

Some useful terminology: an arc $(u, v)$ is admissible if $d_s(v) = d_s(u) + 1$ and $(u, v) \in A_\phi$. Only admissible arcs can be tree arcs.
For every $v \in S$, we maintain a list of potential parent arcs $(u, v)$ in an unspecified order, denoted $(u, v) \prec (u', v)$. The parent node is the parent of $v$ in the tree $S$, denoted $p(v)$. At any point in the algorithm, the current arc is the arc from this list which is currently the parent arc of $v$ (i.e., the current arc is always $(p(v), v)$).

We maintain the invariant that all arcs before the current arc (according to the order $\prec$) are not admissible. Therefore, when searching for a new parent for $v$ in an adoption step, we don't have to scan the entire list of potential arcs, but can instead increment the current arc until we find an admissible arc. If we get to the end of the list, we know that all arcs into $v$ are currently inadmissible, so we must increase the label of $v$. This scheme allows us to "charge" the scanning of edges to the relabels of $v$, and there are only $O(n)$ relabels per vertex.

In order to make our scheme work for the submodular flow case, we specify a particular order of in-arcs for each $v$. Put a linear ordering $<$ on $V$, and order the arcs so that if $u < w$ then $(u, v) \prec (w, v)$. We use the same linear ordering for doing our breadth-first search, so we scan through the nodes at distance $D_s$ in sorted order, according to $<$.

We need to maintain the invariant that arcs before the current arc $(p(v), v)$ are not admissible. Equivalently, if $w < p(v)$ then $(w, v)$ is not an admissible arc. When a vertex $v$ first enters $S$, IBFS sets the current arc to the first arc in the list (because we're scanning through nodes at distance $D_s$ in $<$-sorted order), so the invariant is initially true.

So, assume that $(u, v)$ is the current arc of $v$ at a particular stage in the algorithm, and consider all the ways we could violate the invariant. The first two cases are already covered by the proof of correctness of the standard IBFS algorithm, but we repeat them briefly for completeness.

• We could change the current arc (i.e., change the parent of $v$).
But we only do this during an adoption step, when we know that $(u, v)$ is inadmissible. All arcs before $(u, v)$ are also inadmissible, and we scan forward through the list until we find an admissible $(w, v)$ with $u < w$ (or reach the end), which maintains the invariant.
• We could relabel a vertex $w$ with $w < u$ and $(w, v) \in A_\phi$. But relabels only ever increase the label of $w$. If $d(w) < d(v)$ then $d(w) = d(v) - 1$ (if $d(w) < d(v) - 1$ then $(w, v)$ would already be on the shortest path to $v$). So, increasing the label of $w$ makes $(w, v)$ inadmissible.
• We could change the flow $\phi$ such that an arc $(w, v)$ which was previously saturated becomes unsaturated, and $d(v) = d(w) + 1$. This is only a problem if $w < u$, as the invariant has nothing to say about creating admissible arcs after the current arc. This is illustrated in Figure 5.1 (Left). By Lemma 69, the only way we could cause $(w, v)$ to become unsaturated is by pushing on some arc $(x, y)$ such that (1) $w = y$ or $(w, y) \in A_\phi$ and (2) $x = v$ or $(x, v) \in A_\phi$.

Figure 5.1: Illustrations of the flow network regarding the current arc heuristic. Saturated edges are dotted while unsaturated edges are solid; parent arcs are colored red. The linear order on nodes increases from left to right, and distance increases upward. (Left) A potential failure of the current arc heuristic. The node $w$ is less than $u$ and pushing on some other arc causes $(w, v)$ to become unsaturated. (Center) A potential arrangement of the nodes, where $(x, y)$ is the arc whose flow is increasing. Note that this configuration is impossible because the parent arc of $y$ is $(x, y)$ which comes after $(w, y)$. Similarly the parent arc of $v$ is $(u, v)$ which comes after $(x, v)$. (Right) The only possible configuration of these nodes. In this case, we don't create a violation of the current arc heuristic, since the new edge created has $u < w$.
Recall from our proof of Corollary 70 that this implies $d(v) \le d(x) + 1$ and $d(y) \le d(w) + 1$. Since we're pushing on $(x, y)$ we know that $(x, y)$ is a tree arc, so $x = p(y)$ and $d_s(y) = d_s(x) + 1$.

If $w = y$ then $d(w) = d(y) = d(x) + 1 \ge d(v)$ so we don't create an admissible arc. Similarly, if $x = v$ then $d(v) = d(x) = d(y) - 1 \le d(w)$ so we again don't create an admissible arc.

Otherwise, neither $w = y$ nor $x = v$ holds, and we're in the case of Figure 5.1 (Center), where there are arcs $(x, y)$, $(w, y)$ and $(x, v)$, all of which are in $A_\phi$. If we've created a new admissible arc $(w, v)$, then $d(v) = d(w) + 1$, and hence $d(v) = d(w) + 1 \ge d(y) = d(x) + 1 \ge d(v)$, so $d(v) = d(y) = d(w) + 1 = d(x) + 1$. Therefore, each of the three arcs $(x, y)$, $(w, y)$ and $(x, v)$ is also admissible.

Since $x = p(y)$ we know (since the invariant holds before we change $\phi$) that $x \le w$, as $(w, y)$ is admissible. Similarly, since $(u, v)$ is the current arc of $v$, we must have that $u \le x$ since $(x, v)$ is an admissible arc. Therefore, $u \le x \le w$, and hence whenever we create a new admissible arc, we create one after the current arc of $v$, so the invariant is maintained.

Since the current arc invariants are maintained, all the arguments regarding the runtime of IBFS hold in the submodular flow case, and we get an overall number of $O(n^2 m)$ basic steps, for a total of $O(n^2 m 2^k)$ time.

CHAPTER 6
SUBMODULAR UPPER BOUNDS FOR HIGHER ORDER ENERGY FUNCTIONS

6.1 Introduction

Now that we have a tool, Sum-of-Submodular flow, for optimizing binary submodular MRFs, we want to apply it in a graph-cuts algorithm for more general higher-order MRFs. The key challenge is that most higher-order vision MRFs are not submodular, including the denoising and stereo applications that popularized higher-order MRFs [77, 102].

Our approach for optimizing these non-submodular functions is to instead find a submodular function which is close to the original function, and then exactly optimize that submodular proxy.
QPBO [8, 51] takes this approach, as does its generalization GRD [43].

Given an arbitrary binary function $f$, the natural choice of submodular proxy $g$ is an upper bound on $f$. Since $g \ge f$, when we make $g$ small we also make $f$ small. We will call $g$ a submodular upper bound. Experimentally, minimizing submodular upper bounds has been successful at optimizing both pairwise binary functions with the local search algorithm LSA-AUX [31], as well as higher-order multilabel problems with Auxiliary Cuts [6] and SoSPD [22].

The main contribution of this chapter is to derive a principled submodular upper bound for higher-order MRFs. We show that we can minimize the distance between the submodular upper bound and the original function, and we give fast algorithms for finding approximately optimal upper bounds in practice.

6.2 Background and Related Work

For this chapter, the most useful definition of submodularity is the following equivalent condition from Theorem 32: $f$ is submodular if

$$f(S) + f(S + i + j) \le f(S + i) + f(S + j) \tag{6.1}$$

for all $S \subseteq V$ and $i, j \notin S$.

For higher-order functions, any submodular function can be minimized in $O(n^6)$ time [73]; however, this is impractical for vision-sized inputs. A more practical alternative is sum-of-submodular optimization, which fits between graph cuts and submodular optimization in complexity. The Submodular IBFS algorithm (Chapter 5) in particular is well-suited to vision inputs, as it generalizes the popular Boykov-Kolmogorov algorithm [10] (footnote 1) to higher-order inputs, and for fixed-size cliques runs in worst-case time $O(n^2 m)$ for $m$ cliques and $n$ variables (the same complexity as IBFS for pairwise inputs). However, it scales as $O(2^{|C|})$ as the clique size grows, so we will henceforth only consider higher-order inputs with a small constant for the size of the cliques $|C|$.

Currently, all known methods for optimizing general higher-order functions have this exponential (or at least $O(|C|^6)$) dependence on the size of the cliques.
Faster methods exist for certain special cases of energy functions (such as robust-$P^n$ [49], and other structured cost functions), but cannot handle arbitrary energies, as the current method does.

Footnote 1: The Boykov-Kolmogorov flow algorithm and IBFS [30] are very similar: the latter makes a small tweak to a subroutine which allows proving an $O(n^2 m)$ runtime. Despite this, the implementation of BK flow remains popular within the vision community.

For non-binary problems, graph cuts methods typically reduce the solution to a series of binary sub-problems, using move-making algorithms such as alpha-expansion [11] and fusion moves [66], as well as more sophisticated primal-dual algorithms such as FastPD [58], and its higher-order generalization SoSPD [22].

The major restriction in the above is that the binary problems are required to be submodular. If the functions are not submodular, then the problem is NP-hard [52]. Yet many optimization problems encountered in practice are not submodular, so we need a submodular function whose optimum is close to the original.

Submodular upper bounds have been used before in several different methods to approximate arbitrary functions. The FastPD algorithm [58] uses a very simple upper bound: if, during a particular expansion move, the cut-cost of an edge would be negative, the variant PD3a simply truncates that cost to 0, ensuring the expansion move problem is submodular (this approach originated in [80]). A more sophisticated upper bound was employed in the LSA-AUX method [31], which found a series of upper bounds $g_t$, each of which has $g_t(x_t) = f(x_t)$ at the current point $x_t$. Each move solves a pairwise submodular minimization problem $x_{t+1} = \arg\min_x g_t(x)$ with graph cuts, giving a fast local-search method.
However, for pairwise functions, submodular upper bounds are particularly simple: either the edge term $f_{i,j}$ is already submodular (in which case no upper bound is necessary) or it is supermodular, in which case the best upper bound is a linear function.

For higher-order functions, Auxiliary Cuts [6] uses convexity and other properties of image functionals to compute upper bounds which are pairwise submodular functions. These upper bounds are iteratively minimized and updated similarly to LSA-AUX. The Pseudo-Bound method [92] extends this idea by considering a parameterized family of functions which include a pairwise submodular upper bound, and finding all minimizers of the entire family using parametric max-flow. By looking at all minimizers, a greater decrease in energy per iteration is obtained. Note that for all these methods, the upper bounds are all pairwise functions, even if the functions they approximate are higher-order.

Our approach is based on a linear program for minimizing the distance between the submodular upper bound and the original function. We give new approximations to this LP that have nice theoretical properties, and which perform well in practice. For multilabel problems, these upper bounds allow approximate optimization of expansion and fusion moves, even when the move-making binary subproblem is non-submodular. We will show that by employing our upper bounds in a fusion-moves framework, better energy optimization is obtained.

6.2.1 Notation

Some notation we will use throughout the rest of the chapter: Recall that for binary problems, we are identifying set functions $f(S)$ with the vector notation $f(x)$, $x \in \{0, 1\}^n$. For a set $S$ and element $i$, we will write $S + i$ for $S \cup \{i\}$. For a set of coefficients $a_i$ for $i \in V$, we let $a(S) = \sum_{i \in S} a_i$. Note that $a$ is a linear function. For a set of coefficients $b_{i,j}$ on edges $\{i, j\}$, we will write

$$b(S) = \sum_{i < j :\ i, j \in S} b_{i,j} \tag{6.2}$$

Note that $b$ is a quadratic function.
Whenever we use a pair of indices $i, j$, this is always an unordered pair, so $b_{i,j} = b_{j,i}$ and $\delta_{S,i,j} = \delta_{S,j,i}$. For two set functions $f, g : 2^V \to \mathbb{R}$ we will define the 1-norm and $\infty$-norm distance between them by treating the functions as length-$2^{|V|}$ vectors, i.e., $\|g - f\|_1 = \sum_S |g(S) - f(S)|$ and $\|g - f\|_\infty = \max_S |g(S) - f(S)|$.

6.3 Submodular Upper Bounds

Recall that a function is submodular if it satisfies (6.1) for all $S \subseteq V$ and $i, j \notin S$. Given this definition, we can write an optimization problem minimizing the $p$-norm distance between an arbitrary function $f$ and a submodular upper bound $g$, over all such $g$:

$$\begin{aligned}
\min_g\ & \|g - f\|_p \\
\text{s.t.}\ & g(S) \ge f(S) && \forall S \subseteq V \\
& g(S) + g(S + i + j) \le g(S + i) + g(S + j) && \forall S \subseteq V,\ i, j \notin S
\end{aligned} \tag{6.3}$$

For $p = 1$ and $p = \infty$ the problem (6.3) is a linear program. There are $2^{|C|}$ variables; however, for small clique size this can be solved exactly using off-the-shelf LP solvers. Note that small clique size is also the restriction for when sum-of-submodular flow is tractable, so this is not an additional restriction. We apply this upper bound clique-by-clique, so that the final sum-of-submodular function $g$ is the sum $g = \sum_C g_C$ of upper bounds for each clique, $g_C \ge f_C$.

It's worth noting that distance measures other than the 1- and $\infty$-norm are possible here. For example, we also considered the 2-norm; however, this leads to a quadratic objective in (6.3), which is harder to solve. Experimentally, we found that the 2-norm was outperformed by the $\infty$-norm on the examples in Section 8.2, while taking longer to compute.

One reason for choosing the $\infty$-norm is that we can prove a global approximation bound on the minimization problem $\min_x f(x)$. Let $k_C = \|g_C - f_C\|_\infty$ be the $\infty$-norm objective for (6.3) for each clique $C$. Then, we always have $f_C(x_C) \ge g_C(x_C) - k_C$ for any $x_C$, and hence $f(x^*) \ge g(x^*) - \sum_C k_C$ for the minimizing assignment $x^*$. That is, $\sum_C k_C$ is an additive approximation bound relating the minimum of $g$ and the minimum of the original function $f$.
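One way to spell out this additive bound is the short chain of inequalities below. It is a sketch under the assumptions already stated in this section ($g = \sum_C g_C$, $f = \sum_C f_C$, $g_C \ge f_C$, and $k_C = \|g_C - f_C\|_\infty$); the symbols $x^g$ and $x^f$, denoting minimizers of $g$ and $f$ respectively, are introduced here only for illustration.

```latex
% Per-clique bounds f_C <= g_C <= f_C + k_C sum to
%   f(x) <= g(x) <= f(x) + \sum_C k_C   for every assignment x.
% Let x^g minimize g and x^f minimize f. Then
\begin{align*}
f(x^g) \;\le\; g(x^g) \;\le\; g(x^f) \;\le\; f(x^f) + \sum_C k_C .
\end{align*}
% The first step uses g >= f, the second uses optimality of x^g for g,
% and the third uses g_C <= f_C + k_C on each clique.
```

So the assignment found by minimizing $g$ has energy within $\sum_C k_C$ of the true minimum of $f$.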
In practice, this is a very weak bound; however, it does argue that improving the $\infty$-norm distance between $f$ and $g$ will result in better energy after minimizing $g$.

Finally, we'll note that the upper bound for pairwise MRFs proposed in the LSA-AUX method of [31] is a special case of (6.3). This upper bound is obtained by taking each quadratic edge term $f_{i,j}(x_i, x_j) = b_{i,j} x_i x_j$ of the objective: if $b_{i,j} \le 0$ then the term is submodular, in which case the upper bound is $g_{i,j} = f_{i,j}$. If $b_{i,j} > 0$ then the term is supermodular, in which case they give a linear upper bound $g_{i,j}(x_i, x_j) = b_{i,j} \frac{x_i + x_j}{2}$. It is simple to check that this linear upper bound minimizes (6.3) for both $p = 1$ and $p = \infty$. Essentially, in the pairwise case, there aren't enough degrees of freedom, so all nontrivial submodular upper bounds are linear functions. This is not the case for cliques with $|C| > 2$.

6.4 Upper Bound Approximations

In practice, even though we can solve (6.3) using linear programming, it is much more efficient to find approximate solutions (in some of our experiments, optimization using the LP upper bounds took upwards of an hour). In this section, we will present two alternative submodular upper bounds, both of which minimize the 1-norm objective of (6.3) under some additional constraints, and both of which are much easier to compute.

Our intuition for these upper bounds is to first consider the question: what simple families of submodular functions are there? In particular, there are two classes of functions which can easily be shown to be submodular. The first are quadratic (pairwise) functions, in which every quadratic term has non-positive coefficient. The second family are functions of the form $\varphi(S) = h(|S|)$ where $h$ is a concave function.

6.4.1 The Iterative Heuristic of SoSPD

The first upper bound method for SoS functions was the heuristic employed by the original paper for the SoSPD [22] algorithm (Chapter 7).
The basic intuition is to first look at the inequalities defining submodularity, $f(S) + f(S + i + j) \le f(S + i) + f(S + j)$. If this inequality is violated, we can either lower $f(S)$ or $f(S + i + j)$, or raise $f(S + i)$ or $f(S + j)$. Since we're finding an upper bound, we can only increase the values. Thus, our only options are to increase $f(S + i)$ or $f(S + j)$.

The SoSPD heuristic simply iterates through all the sets in decreasing size, and increases both $f(S + i)$ and $f(S + j)$ by the amount needed to fix the violated inequality. However, increasing $f(S + i)$ may cause some other inequality to be violated (e.g., $f(S - k) + f((S - k) + i + k) \le f((S - k) + i) + f((S - k) + k)$ for some $k$). Therefore, this method needs to repeatedly iterate through all the sets $S$ until all inequalities are satisfied.

The specifics of this method are given in the supplementary material of [22]; however, it was primarily intended as a simple heuristic for a subroutine of the SoSPD algorithm, and is both theoretically and experimentally inferior to the principled upper bounds below. This method is guaranteed to converge; however, there are no results concerning how close to the original function $f$ this upper bound will be.

6.4.2 Quadratic-Based Submodular Upper Bounds

Consider a quadratic function $q(x) = \sum_i a_i x_i + \sum_{i < j} b_{i,j} x_i x_j$. In set notation, $q(S) = a(S) + b(S)$, and $q$ is submodular if and only if $b_{i,j} \le 0$ for all $i, j$. Our goal is to find a submodular quadratic function $q$ such that $g_q(S) := f(S) + q(S)$ is submodular and $g_q \ge f$.

To do so, define $\delta_{S,i,j} = f(S) + f(S + i + j) - f(S + i) - f(S + j)$. That is, $\delta_{S,i,j}$ is the amount by which the submodularity inequality for $S, i, j$ is violated (possibly negative if the inequality is satisfied). For $i, j$, define $\Delta_{i,j} = \max\{\max_{S : i, j \notin S} \delta_{S,i,j},\ 0\}$, the maximum violation over all constraints containing $i, j$. Construct a quadratic submodular function $q^*$ by setting $b^*_{i,j} = -\Delta_{i,j}$ for all $i, j$ and $a^*_i = \frac{1}{2} \sum_j \Delta_{i,j}$ for all $i$.
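The construction of $q^*$ can be sketched by brute force over a single small clique. The Python below is an illustrative sketch, not the thesis implementation: the helper names are hypothetical, clique functions are assumed to be Python callables on frozensets, and the three-variable function $f(x) = 2x_1x_2 - x_1x_2x_3$ is used purely as a test case.

```python
from itertools import combinations


def subsets(elems):
    """Yield every subset of elems as a frozenset."""
    elems = list(elems)
    for r in range(len(elems) + 1):
        for combo in combinations(elems, r):
            yield frozenset(combo)


def violation(f, S, i, j):
    """delta_{S,i,j}: amount by which the submodularity inequality is violated."""
    return f(S) + f(S | {i, j}) - f(S | {i}) - f(S | {j})


def quadratic_upper_bound(f, clique):
    """Return (a, b, g) where q* has coefficients a, b and g = f + q*."""
    clique = list(clique)
    big_delta = {}
    for i, j in combinations(clique, 2):
        rest = [v for v in clique if v not in (i, j)]
        worst = max(violation(f, S, i, j) for S in subsets(rest))
        big_delta[frozenset((i, j))] = max(worst, 0)
    b = {pair: -d for pair, d in big_delta.items()}
    a = {i: sum(d for pair, d in big_delta.items() if i in pair) / 2
         for i in clique}

    def g(S):
        S = frozenset(S)
        q = sum(a[i] for i in S) + sum(v for pair, v in b.items() if pair <= S)
        return f(S) + q

    return a, b, g


def is_submodular(g, clique):
    """Check inequality (6.1) for every S and every pair i, j outside S."""
    clique = list(clique)
    for i, j in combinations(clique, 2):
        rest = [v for v in clique if v not in (i, j)]
        if any(violation(g, S, i, j) > 0 for S in subsets(rest)):
            return False
    return True


# Illustrative test function f(x) = 2*x1*x2 - x1*x2*x3 on the clique {1, 2, 3}.
def f_example(S):
    return 2 * (1 in S and 2 in S) - (1 in S and 2 in S and 3 in S)


a_star, b_star, g_star = quadratic_upper_bound(f_example, [1, 2, 3])
```

For this test function the only violated pair is $\{1, 2\}$, so the sketch yields $q^*(x) = x_1 + x_2 - 2x_1x_2$ and an upper bound $g$ that passes the brute-force submodularity check.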
We now show that q∗ is actually the best q which makes gq a submodular upper bound to f , under the 1-norm. More specifically, consider the program min gq − f 1 q s.t. gq = f + q gq ≥ f gq and q are submodular (6.4) 166 Theorem 71. q∗ is a minimizer of (6.4). Before we prove Theorem 71, we will note that even though q is quadratic, gq may not be. In particular, for the function f of three variables, f (x) = 2x1x2 − x1 x2 x3, we have q∗(x) = x1 + x2 − 2x1 x2 and gq∗(x) = x1 + x2 − x1 x2 x3, which is submodular, but not quadratic. Also, it’s worth noting that q∗ is an approximate solution to (6.3) in that we are minimizing over a smaller set, by restricting g to be f + q for a submodular, quadratic q. However, we can compute q∗ quickly, by iterating first over all pairs (i, j) and then over all S with i, j S and computing δS,i, j for each. For small clique sizes C this is operation is very efficient (though it is O(2|C|) as |C| grows). We now prove Theorem 71. First, we show that the constraints of (6.4) can be re-written as inequalities on the coefficients a and b of q. Proposition 72. (6.4) is equivalent to min{ q 1 : q ≥ 0, bi, j ≤ −∆i, j ∀i, j} q (6.5) Proof. The only non-trivial part of the equivalence is showing we can replace the submodularity constraints with bi, j ≤ −∆i, j for all i, j. So, begin with the inequalities defining submodularity for gq: For every S and i, j S we have a constraint gq(S ) + gq(S + i + j) − gq(S + i) − gq(S + j) ≤ 0. 
(6.6)

Substituting in $g_q = f + q$, and recalling the definition of $\delta_{S,i,j}$, we have

$$\begin{aligned} 0 &\ge g_q(S) + g_q(S+i+j) - g_q(S+i) - g_q(S+j) \\ &= f(S) + f(S+i+j) - f(S+i) - f(S+j) + q(S) + q(S+i+j) - q(S+i) - q(S+j) \\ &= \delta_{S,i,j} + q(S) + q(S+i+j) - q(S+i) - q(S+j). \end{aligned}$$ (6.7)

Then, we can expand out $q$ to get

$$\begin{aligned} &\delta_{S,i,j} + q(S) + q(S+i+j) - q(S+i) - q(S+j) \\ &\quad = \delta_{S,i,j} + a(S) + a(S+i+j) - a(S+i) - a(S+j) + b(S) + b(S+i+j) - b(S+i) - b(S+j) \\ &\quad = \delta_{S,i,j} + b_{i,j}, \end{aligned}$$ (6.8)

since the linear terms cancel and the only pairwise term that does not cancel is $b_{i,j}$. Therefore, $g_q$ is submodular if and only if $b_{i,j} \le -\delta_{S,i,j}$ for all $S$ with $i,j \notin S$. As for the constraint that $q$ is submodular, this happens if and only if $b_{i,j} \le 0$. Therefore, all these constraints can be replaced by $b_{i,j} \le -\Delta_{i,j}$ for all $i,j$.

Since $b^*_{i,j} = -\Delta_{i,j}$, to show that $q^*$ is feasible for (6.5) we just need to show that $q^* \ge 0$. Note that we can rearrange $q^*$ as follows:

$$\begin{aligned} q^*(x) &= \sum_i a^*_i x_i + \sum_{i<j} b^*_{i,j} x_i x_j && \text{(6.9)} \\ &= \sum_i \Big( \tfrac{1}{2} \sum_j \Delta_{i,j} \Big) x_i + \sum_{i<j} -\Delta_{i,j} x_i x_j && \text{(6.10)} \\ &= \sum_{i<j} \Big( \frac{x_i + x_j}{2} - x_i x_j \Big) \Delta_{i,j}. && \text{(6.11)} \end{aligned}$$

Then, we can check for the four assignments of $x_i, x_j \in \{0,1\}$ that $\frac{x_i + x_j}{2} - x_i x_j \ge 0$, so $q^* \ge 0$, and hence $q^*$ is feasible for (6.5).

Finally, to prove Theorem 71 we look at the objective $\|g_q - f\|_1 = \|q\|_1$. Since $q \ge 0$,

$$\begin{aligned} \|q\|_1 = \sum_S q(S) &= \sum_S \Big( \sum_{i \in S} a_i + \sum_{i<j,\ i,j \in S} b_{i,j} \Big) && \text{(6.12)} \\ &= \sum_i \sum_{S : i \in S} a_i + \sum_{i<j} \sum_{S : i,j \in S} b_{i,j} && \text{(6.13)} \\ &= 2^{|C|-1} \sum_i a_i + 2^{|C|-2} \sum_{i<j} b_{i,j}. && \text{(6.14)} \end{aligned}$$

Since for feasible $q$ we have $g_q(C) - f(C) \ge 0$, we must have $q(C) \ge 0$, and hence $\sum_i a_i \ge -\sum_{i<j} b_{i,j}$. Therefore, we get that $\|g_q - f\|_1 \ge -2^{|C|-2} \sum_{i<j} b_{i,j} \ge 2^{|C|-2} \sum_{i<j} \Delta_{i,j}$. On the other hand,

$$\|q^*\|_1 = 2^{|C|-1} \sum_i a^*_i + 2^{|C|-2} \sum_{i<j} b^*_{i,j} = 2^{|C|-1} \sum_i \Big( \tfrac{1}{2} \sum_j \Delta_{i,j} \Big) + 2^{|C|-2} \sum_{i<j} -\Delta_{i,j} = 2^{|C|-2} \sum_{i<j} \Delta_{i,j},$$ (6.15)

so $q^*$ is optimal.

6.4.3 Cardinality-Based Submodular Upper Bounds

In this section, we will focus on another family of functions, the concave cardinality-based functions $\phi(S) = h(|S|)$, where $h$ is concave.
To motivate this, recall that such a $\phi$ is submodular if and only if $h$ is concave. The goal of this section is to find a concave cardinality-based function $\phi$ such that $g_\phi(S) := f(S) + \phi(S)$ is submodular and $g_\phi \ge f$. We are going to solve the following program:

$$\min_\phi \ \|g_\phi - f\|_p \quad \text{s.t.} \quad g_\phi = f + \phi, \quad g_\phi \ge f, \quad g_\phi \text{ and } \phi \text{ are submodular.}$$ (6.16)

Let's introduce $n = |C|$, and $\psi_k = h(k)$ for $k = 0, 1, \ldots, n$ as a shorthand, so that $\phi(S) = h(|S|) = \psi_{|S|}$. We also let $\Delta_k = \max\{\max_{|S| = k-1,\ i,j \notin S} \delta_{S,i,j},\ 0\}$ for $k = 1, \ldots, n-1$, which is the maximum submodularity violation over all sets of size $k-1$ (recall that $\delta_{S,i,j}$ is defined as in Section 6.4.2, and gives the violation of the submodularity constraint for $S, i, j$). Note again that $\delta_{S,i,j}$ might be negative but $\Delta_k$ is always non-negative. Now, we can rewrite (6.16) in an equivalent form:

$$\min_\psi \ \|\psi\|_p \quad \text{s.t.} \quad \psi \ge 0, \quad 2\psi_k \ge \psi_{k-1} + \psi_{k+1} + \Delta_k, \quad k = 1, \ldots, n-1,$$ (6.17)

where we define

$$\|\psi\|_1 := \sum_{k=0}^{n} \binom{n}{k} \psi_k, \qquad \|\psi\|_\infty := \max_k \psi_k.$$ (6.18)

Lemma 73. (6.16) and (6.17) are equivalent.

Proof. Since $g_\phi(S) - f(S) = \psi_{|S|}$, the objective is $\|\psi\|_p = \|g_\phi - f\|_p$ by how we've just defined $\|\psi\|_p$. The constraint $\psi \ge 0$ is equivalent to $g_\phi \ge f$. Next, for $g_\phi$ to be submodular, we require

$$f(S+i) + \psi_{|S|+1} + f(S+j) + \psi_{|S|+1} \ge f(S) + \psi_{|S|} + f(S+i+j) + \psi_{|S|+2}$$ (6.19)

for all $S$ and $i,j \notin S$. Rearranging this, we get

$$2\psi_{|S|+1} - \psi_{|S|} - \psi_{|S|+2} \ge f(S) + f(S+i+j) - f(S+i) - f(S+j) = \delta_{S,i,j}.$$ (6.20)

This must hold for all $S$ and $i,j \notin S$. Equivalently, for every $k$ (where $k = |S| + 1$) we have $2\psi_k - \psi_{k-1} - \psi_{k+1} \ge \max_{S : |S| = k-1,\ i,j \notin S} \delta_{S,i,j}$. Also note that since $h$ is concave we always have $2\psi_k - \psi_{k-1} - \psi_{k+1} \ge 0$. Therefore, the submodularity of $g_\phi$ and $\phi$ is equivalent to the constraints $2\psi_k - \psi_{k-1} - \psi_{k+1} \ge \Delta_k$ for $k = 1, \ldots, n-1$.

Since $2\psi_k - \psi_{k-1} - \psi_{k+1}$ is widely referred to as the discrete Laplacian operator, we will refer to these constraints as Laplacian constraints.
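To make the Laplacian constraints concrete, the sketch below imposes them with equality under the boundary condition $\psi_0 = \psi_n = 0$ and solves the resulting tridiagonal system with the standard Thomas algorithm (forward elimination followed by back substitution). The function name and list-based interface are illustrative assumptions, not code from the thesis.

```python
def solve_laplacian_equations(delta):
    """Solve 2*psi[k] - psi[k-1] - psi[k+1] = Delta_k for k = 1..n-1
    with psi[0] = psi[n] = 0, in O(n) time.

    `delta` is the list [Delta_1, ..., Delta_{n-1}];
    the return value is [psi_0, ..., psi_n].
    """
    m = len(delta)
    if m == 0:                      # n = 1: only the two boundary values exist
        return [0.0, 0.0]

    # Forward elimination for the tridiagonal matrix with 2 on the
    # diagonal and -1 on both off-diagonals.
    c = [0.0] * m                   # modified super-diagonal
    d = [0.0] * m                   # modified right-hand side
    c[0] = -0.5
    d[0] = delta[0] / 2.0
    for k in range(1, m):
        denom = 2.0 + c[k - 1]
        c[k] = -1.0 / denom
        d[k] = (delta[k] + d[k - 1]) / denom

    # Back substitution.
    psi = [0.0] * m
    psi[m - 1] = d[m - 1]
    for k in range(m - 2, -1, -1):
        psi[k] = d[k] - c[k] * psi[k + 1]

    return [0.0] + psi + [0.0]
```

For example, with $n = 4$ and all $\Delta_k = 1$ the solution is $\psi = (0, 1.5, 2, 1.5, 0)$ up to floating-point rounding, which is concave and non-negative as expected.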
Furthermore, if we require all the inequalities in the Laplacian constraints to be satisfied with equality, these are called the inhomogeneous Laplacian equations. Our algorithm to find the cardinality-based upper bound is to solve these Laplacian equations, in a procedure detailed below. We will show later in this section that this algorithm gives us a feasible submodular upper bound. Additionally, the resulting solution $\psi^*$ is optimal under the 1-norm and a 2-approximation to (6.16) under the $\infty$-norm.

Without loss of generality, we can assume $\psi_0 = \psi_n = 0$: since $\psi_0$ and $\psi_n$ only appear on the right-hand side of the Laplacian constraints, for any feasible solution $\psi$ we could decrease $\psi_0, \psi_n$ to get $\psi_0 = \psi_n = 0$; none of the constraints will be violated and the objective value can only decrease. After fixing $\psi_0$ and $\psi_n$ to be 0, there are $n-1$ variables $\psi_1, \ldots, \psi_{n-1}$ and $n-1$ Laplacian constraints remaining. We can uniquely solve the corresponding Laplacian equations since the coefficient matrix has full rank. This system is also very efficient to solve: since the coefficient matrix $M$ is tridiagonal, we can easily factor it into an LU decomposition and, using forward and backward substitution, solve $M\psi = \Delta$ in $O(n)$ time. The bottleneck of the cardinality-based upper bound computation is still the computation of $\delta_{S,i,j}$ and $\Delta_k$, which takes exponential time in the clique size.

The following lemma proves the correctness of the algorithm.

Lemma 74. Solving the Laplacian equations gives us a feasible solution for (6.17), where we define $\psi^* = M^{-1}\Delta$.

Proof. Clearly, all the Laplacian constraints are satisfied since we treat them as equalities and solve the linear system for a point which satisfies all of them simultaneously. To show non-negativity, it's straightforward to show that the inverse of the Laplacian coefficient matrix is a positive matrix; for completeness, we include this in Appendix B.
Our definition of $\Delta_k$ is non-negative, hence our solution $\psi^* = M^{-1}\Delta$ is a non-negative matrix times a non-negative vector, and hence also non-negative.

Now, let's show some optimality properties of the cardinality-based upper bound. The following lemma is the key fact for the analysis.

Lemma 75. For all $k \le n/2$, all feasible $\psi$ have $\psi_k + \psi_{n-k} \ge L_k$, where $L_0 = 0$ and $L_k = L_{k-1} + \sum_{i=k}^{n-k} \Delta_i$. Solving the Laplacian equations gives us the $\psi$ which simultaneously minimizes the quantity $\psi_k + \psi_{n-k}$ for all $k$, meaning $\psi^*_k + \psi^*_{n-k} = L_k$.

Proof. We prove this by induction on $k$. The base case $k = 0$ is trivial since we enforce $\psi_0 = \psi_n = 0$ in our algorithm, and the non-negativity of $\psi$ ensures that $\psi_0 + \psi_n$ reaches its lower bound $L_0 = 0$.

For $k \ge 1$, we can sum up the $k$-th to the $(n-k)$-th Laplacian constraints, to get

$$\sum_{l=k}^{n-k} (2\psi_l - \psi_{l-1} - \psi_{l+1}) \ge \sum_{l=k}^{n-k} \Delta_l$$ (6.21)

and hence

$$\sum_{l=k}^{n-k} \Delta_l \le \sum_{l=k}^{n-k} (\psi_l - \psi_{l-1}) + \sum_{l=k}^{n-k} (\psi_l - \psi_{l+1}) = \psi_{n-k} - \psi_{k-1} - \psi_{n-k+1} + \psi_k,$$ (6.22)

and rearranging, we have

$$\psi_k + \psi_{n-k} \ge \psi_{k-1} + \psi_{n-k+1} + \sum_{i=k}^{n-k} \Delta_i \ge L_{k-1} + \sum_{i=k}^{n-k} \Delta_i = L_k.$$ (6.23)

Since we obtain $\psi^*$ by solving the Laplacian equations, all the above inequalities hold with equality. So inductively assuming $\psi^*_{k-1} + \psi^*_{n-k+1} = L_{k-1}$, we have

$$\psi^*_k + \psi^*_{n-k} = \psi^*_{k-1} + \psi^*_{n-k+1} + \sum_{i=k}^{n-k} \Delta_i = L_k.$$ (6.24)

Theorem 76. The cardinality-based upper bound is optimal under the $p = 1$ version of (6.17).

Proof. The 1-norm objective of (6.17) is $\sum_{k=0}^{n} \binom{n}{k} \psi_k$. Since the binomial coefficients are symmetric, this objective function is a positive linear combination of the sums $\psi_k + \psi_{n-k}$. For odd $n$ we have

$$\|\psi\|_1 := \sum_{k=0}^{n} \binom{n}{k} \psi_k = \sum_{k=0}^{(n-1)/2} \binom{n}{k} (\psi_k + \psi_{n-k})$$ (6.25)

and for even $n$:

$$\|\psi\|_1 = \sum_{k=0}^{n/2-1} \binom{n}{k} (\psi_k + \psi_{n-k}) + \frac{1}{2} \binom{n}{n/2} \left( \psi_{n/2} + \psi_{n/2} \right).$$ (6.26)

Since we separately minimize each sum $\psi_k + \psi_{n-k}$, due to Lemma 75, we minimize the whole objective as well.

Theorem 77. The cardinality-based upper bound gives a 2-approximation under the $p = \infty$ version of (6.17).

Proof. Under the $\infty$-norm, the objective function in (6.17) is $\max_k \psi_k$.
In Lemma 75, we have established the lower bound $\psi_k + \psi_{n-k} \ge L_k$, so that $\max\{\psi_k, \psi_{n-k}\} \ge L_k / 2$. Hence, the minimum of (6.17) is at least $\max_k L_k / 2$. We also know our choice of $\psi^*$ has $\psi^*_k + \psi^*_{n-k} = L_k$ for each $k$ and all the $\psi^*_k$ are non-negative, so in the worst case our algorithm gives a feasible solution with objective value $\max_k L_k$. Therefore, our algorithm is a 2-approximation under the $\infty$-norm.

Additionally, we can prove the following general approximation ratio for an arbitrary $p$-norm ($p \ge 1$), which contains the previous two theorems as special cases. The proof for the general case is analogous to the proof for the 1-norm, and we defer it to Appendix C.

Theorem 78. The cardinality-based upper bound gives a $2^{(1 - 1/p)}$-approximation for (6.17).

CHAPTER 7
A PRIMAL-DUAL ALGORITHM FOR HIGHER-ORDER MULTILABEL MARKOV RANDOM FIELDS

Now that we have adapted the state-of-the-art flow algorithm for vision problems to solve higher-order submodular MRFs, and also shown how to handle arbitrary non-submodular binary MRFs using upper bounds, we can turn to the problem of handling multi-label problems.

In this chapter we propose a new primal-dual energy minimization method for arbitrary higher-order multilabel MRFs. Primal-dual methods provide guaranteed approximation bounds, and can exploit information in the dual variables to improve their efficiency. Our algorithm generalizes the PD3 [57] technique for first-order MRFs, and relies on the SoS IBFS algorithm of Chapter 5 to optimize the binary MRFs at each step. We provide approximation bounds similar to PD3 [57], and the method is fast in practice. It can optimize non-submodular MRFs, and additionally can incorporate problem-specific knowledge in the form of fusion proposals.

7.1 Higher-order Multi-label MRFs

In multi-label problems, we now allow the label set $X_i$ for each variable $i$ to be larger than just $\{0, 1\}$. We minimize the cost of the labeling $f : X \to \mathbb{R}$ defined by

$$f(x) = \sum_i f_i(x_i) + \sum_{C \in \mathcal{C}} f_C(x_C).$$ (7.1)

$f_C$ is a function of just the variables $x_i$ with $i \in C$ (a subvector of $x$ which we denote $x_C$). We assume, without loss of generality, that $f_i, f_C \ge 0$ (by reparametrizing them using Lemma 19 of Section 2.1). Special cases include first-order MRFs where $|C| = 2$, and binary MRFs where $|X_i| = 2$; we are interested in the general case of higher-order multi-label MRFs, where we restrict neither $|C|$ nor $|X_i|$.

For multi-label MRFs, the most popular graph-cuts algorithms are based on alpha-expansion. Recall that alpha-expansion solves a series of binary problems, where each variable $i$ can either keep its current label $x_i$, or switch to a fixed label $\alpha$. Alpha-expansion cycles through all $\alpha \in X_i$ until no variable changes in an entire loop through all $\alpha$.

For certain classes of pairwise functions $f_{i,j}$, alpha-expansion has provable approximation guarantees. If the $f_{i,j}$ are all Potts terms, then alpha-expansion is a 2-approximation. When the $f_{i,j}$ are all the same, and form a metric, then the work of [47] showed that the approximation ratio is $2 f^{\max} / f^{\min}$, where $f^{\max}$ is the maximum value of $f_{i,j}$ and $f^{\min}$ is the minimum nonzero value of $f_{i,j}$.

The connection between graph cuts and primal-dual techniques was established by [57], who showed that α-expansion could be interpreted as simultaneously optimizing primal and dual solutions. [57] proposed several primal-dual algorithms that generalized α-expansion and provided both theoretical and practical advantages. These methods apply to much more general energy functions and extend the approximation bounds of [47]. Empirically, keeping track of the dual variables allows a number of implementation speedups compared to α-expansion, resulting in the very efficient algorithm FastPD [59].

7.1.1 Summary of Our Method

In this chapter, we provide a generalization of the primal-dual algorithm PD3 of [57] that can efficiently minimize an arbitrary higher-order multilabel energy function.
Briefly: PD3 relies on the max-flow / min-cut algorithm; the flow values update the dual variables and the min-cut updates the primal variables. Our method instead uses the SoS flow of Chapter 5, which can exactly minimize the class of Sum-of-Submodular functions, with a corresponding SoS max-flow.

Primal-dual methods rely on the optimality conditions for linear programming, in particular the complementary slackness conditions of Section 2.8. These conditions relate the slack in a constraint of the primal problem with the non-zeros of dual variables, and give necessary conditions for a pair of primal and dual solutions to be optimal.

Our algorithm begins with the Local Marginal Polytope (LMP) relaxation of Section 2.4. Recall that the LMP has two kinds of constraints, corresponding to the unary terms and clique-based terms in equation (7.1). We refer to the respective complementary slackness conditions as unary and clique slackness conditions. We will keep track of a primal solution $x$ and a (not necessarily feasible) dual solution $\lambda$. We will ensure that $x, \lambda$ always satisfy the clique slackness conditions, and at each step of the algorithm, we will try to move $x$ to be closer to satisfying unary slackness. The algorithm converges to a solution where both slackness conditions hold, but we generally lose feasibility. However, there exists some $\rho$ such that $\lambda / \rho$ is dual-feasible. This gives us a $\rho$-approximation algorithm for a class of functions we call weakly associative.

We review the related work in Section 7.2. Our algorithm is presented in Section 7.3, and we conclude with an experimental evaluation in Section 8.3.

7.2 Related Work

7.2.1 Graph Cut Methods and Higher-Order MRFs

The most popular graph cut methods for multilabel first-order MRFs rely on move-making techniques. Those methods, which notably include α-expansion [12] and fusion moves [66], reduce the multilabel problem to a series of binary subproblems which are then solved by max-flow [8, 52]. In α-expansion [12], the binary problem involves each pixel deciding whether to keep its current label or adopt a particular new label $\alpha$. The expansion move algorithm also provides a guaranteed approximation bound.

[57, 58] proposed a primal-dual framework that generalizes α-expansion. They interpreted this algorithm as simultaneously optimizing the primal and dual problems of the LP relaxation of the MRF energy function. In addition, the general primal-dual algorithm overcomes the most important limitation of the α-expansion algorithm, which is the requirement that the pairwise energy must be a metric [12]. The same approximation ratio still holds for a much broader class of energy functions. Furthermore, by tracking the dual variables to speed up the optimization, it can be 3-9 times faster in practice [59].

7.2.2 Linear Programming and Duality for MRFs

Much of the theory of MRF optimization algorithms revolves around a specific linear programming relaxation of (7.1) known as the Local Marginal Polytope formulation [86], which was extended to higher-order MRFs in [99]. Every linear program (LP) has a corresponding dual, and the dual program has resulted in efficient algorithms such as [56, 57, 59].

We derived the dual program for the Local Marginal Polytope in Section 2.9. Recall that the dual program has variables for each clique $C$, $i \in C$ and label $x_i$, denoted $\lambda_{C,i}(x_i)$, and is given by

$$\max_\lambda \ \sum_i \min_{x_i} h_i(x_i)$$ (7.2a)
$$h_i(x_i) = f_i(x_i) + \sum_{C \ni i} \lambda_{C,i}(x_i) \qquad \forall i$$ (7.2b)
$$\sum_{i \in C} \lambda_{C,i}(x_i) \le f_C(x_C) \qquad \forall C, x_C$$ (7.2c)

We can informally think of the dual variable $\lambda_{C,i}(x_i)$ as taking part of the cost $f_C(x_C)$, and redistributing it to the unary terms. Following [57], the function $h_i(x_i)$ will be called the "height" of label $x_i$ at variable $i$, and semantically can be thought of as the original cost $f_i(x_i)$, plus any redistribution $\lambda_{C,i}$ from the cliques to the unary terms at $i$. The dual is always a lower bound on the value $f(x)$ of any labeling.
7.2.3 Sum-of-Submodular Flow

We will summarize the most important features of SoS flow from Chapter 5. We have a set of vertices $V$ plus the source $s$ and sink $t$, and arcs $(s, i)$ and $(i, t)$ for each $i \in V$. We are also given a sum-of-submodular function:

$$g(S) = \sum_{C \in \mathcal{C}} g_C(S \cap C) + \sum_{i \in S} c_{i,t} + \sum_{i \notin S} c_{s,i}$$ (7.3)

where $C \in \mathcal{C}$ are called cliques in $V$, and each $g_C$ is a submodular function, called a clique function, with

$$g_C(\emptyset) = g_C(C) = \min_S g_C(S \cap C) = 0.$$ (7.4)

Intuitively, the difference between max flow and sum-of-submodular flow is that in addition to capacity and conservation constraints, we will also require that the flow out of any set $S$ is at most $g_C(S \cap C)$. To be precise, a sum-of-submodular flow has flow values $\phi_{s,i}$ and $\phi_{i,t}$ on the source and sink edges, as well as flow values $\phi_{C,i}$ for each clique $C$ and $i \in C$. Then, a maximum sum-of-submodular flow is a solution to the following LP:

$$\max_\phi \ \sum_i \phi_{s,i}$$ (7.5a)
$$\text{s.t.} \quad \phi_{s,i} \le c_{s,i}, \quad \phi_{i,t} \le c_{i,t} \qquad \forall i$$ (7.5b)
$$\phi_{s,i} - \phi_{i,t} - \sum_{C \ni i} \phi_{C,i} = 0 \qquad \forall i$$ (7.5c)
$$\sum_{i \in S} \phi_{C,i} \le g_C(S) \qquad \forall C,\ S \subseteq C$$ (7.5d)

Here, (7.5b) are the capacity constraints for source and sink edges, with capacities given by the unary terms $c_{s,i}, c_{i,t}$; (7.5c) are the flow-conservation constraints at $i$; and (7.5d) are the additional constraints that the total clique flow $\phi_{C,i}$ over a set $S$ is at most $g_C(S)$. [53] shows that this LP can be solved by a generalized flow algorithm.

Finally, we have a sum-of-submodular version of the min-cut max-flow theorem, originally from [53], and described in Section 5.1.3. If $\phi$ maximizes (7.5), and $S$ minimizes (7.3), then the objective value (7.5a) of $\phi$ is equal to $g(S)$. Furthermore, the notion of saturated edges extends to the clique function: (1) if $i \in S$ then $\phi_{i,t} = c_{i,t}$, (2) if $i \notin S$ then $\phi_{s,i} = c_{s,i}$, and most importantly (3) for every clique $C$, $g_C(S \cap C) = \sum_{i \in S \cap C} \phi_{C,i}$.

Algorithm 1: Our SoSPD algorithm.
    Initialize $x$ arbitrarily.
    Initialize $\lambda_{C,i}(x_i) = \frac{1}{|C|} f_C(x_C)$, and $\lambda_{C,i}(a) = 0$ for $a \ne x_i$.
    while unary slackness conditions are not satisfied do
        $y \leftarrow$ result of proposal generator
        PRE-EDIT-DUALS($x, y, \lambda$)
        $x', \lambda' \leftarrow$ UPDATE-DUALS-PRIMALS($x, y, \lambda$)
        POST-EDIT-DUALS($x', \lambda'$)
    end while
    return $x$

7.3 The SoS Primal Dual Algorithm

Our algorithm, which we will call SoSPD, is designed around ensuring that two main conditions are satisfied regarding the primal and dual solutions. These conditions give us our approximation bound, as well as help design the rest of the algorithm. The conditions are complementary slackness conditions (Section 2.8), in which the inequalities in the dual that correspond to a particular primal solution are actually satisfied with equality.

Definition 79. Given a labeling $x$ and dual solution $\lambda$, we say that $x, \lambda$ satisfy the clique slackness conditions if the constraints in (7.2c) corresponding to $x_C$ are satisfied with equality. That is, we have

$$\sum_{i \in C} \lambda_{C,i}(x_i) = f_C(x_C) \qquad \forall C.$$ (7.6)

Proposition 80. If $x, \lambda$ satisfy the clique slackness conditions, then $f(x) = \sum_i h_i(x_i)$.

Proof. Remembering our redistribution argument, this means we have exactly partitioned $f_C(x_C)$ among the $\lambda$, so the sum of the heights is the original cost $f(x)$. That is,

$$\begin{aligned} \sum_i h_i(x_i) &= \sum_i \Big[ f_i(x_i) + \sum_{C \ni i} \lambda_{C,i}(x_i) \Big] \\ &= \sum_i f_i(x_i) + \sum_C \Big[ \sum_{i \in C} \lambda_{C,i}(x_i) \Big] \\ &= \sum_i f_i(x_i) + \sum_C f_C(x_C) = f(x). \end{aligned}$$ (7.7)

Definition 81. $x, \lambda$ satisfy the unary slackness conditions if for each $i$ we have $h_i(x_i) = \min_a h_i(a)$.

Corollary 82. If $x, \lambda$ satisfy both the clique and unary slackness conditions, and $\lambda$ is feasible, then $x$ minimizes $f$.

Proof. From Proposition 80, the sum of heights $\sum_i h_i(x_i)$ is equal to $f(x)$, and by the definition of unary slackness, the sum of heights is also equal to the dual objective, which is a lower bound on all possible values of $f$.

Since our original problem is NP-hard we can't expect both slackness conditions to hold for a feasible dual $\lambda$ and integral primal $x$ (for any solution we can find in polynomial time).
We instead apply a technique called dual scaling [57], in which we allow our duals to become slightly infeasible, but in a way that they can be multiplied by a scalar to become feasible. More specifically, the structure of (7.2) always allows us to scale down $\lambda$ by a factor of $1/\rho$ for some $\rho \ge 1$ to get a feasible solution. This gives us approximate optimality.

Lemma 83. If $x$ and $\lambda$ satisfy the unary and clique slackness conditions, and $\lambda / \rho$ is dual feasible, then $f(x) \le \rho f(x^*)$, where $x^*$ is the true optimum.

Proof. Since $x, \lambda$ satisfy both slackness conditions, we know that $f(x) = \sum_i \min_{a_i} h_i(a_i)$, hence

$$f(x) = \rho \sum_i \min_{a_i} \frac{1}{\rho} \Big[ f_i(a_i) + \sum_C \lambda_{C,i}(a_i) \Big] \le \rho \sum_i \min_{a_i} \Big[ f_i(a_i) + \sum_C \frac{1}{\rho} \lambda_{C,i}(a_i) \Big] \le \rho f(x^*),$$

where the first inequality is because $f_i \ge 0$, and the second follows from $\lambda / \rho$ being dual-feasible.

Lemma 83 gives the basic motivation behind our algorithm. Between iterations, $x, \lambda$ will always satisfy the clique slackness conditions, and the goal of each iteration is to change $x$ to move to lower height labels. At the end of the algorithm, all the $x_i$ will be the lowest height labels for each $i$, and the unary slackness conditions are satisfied. Then, we'll prove that there exists some $\rho$ such that $\lambda / \rho$ is dual-feasible, and hence we have a $\rho$-approximation algorithm.

The difficult step in this algorithm is that when we change the labeling $x$ to decrease the height, we must still maintain the clique slackness conditions. We cannot simply set each $x_i$ to the lowest height label, lest the clique slackness conditions cease to hold. Instead we simultaneously pick a set of labels to change, and adjust the dual variables such that the new clique slackness conditions are tight. For the higher-order case, we can show that sum-of-submodular flow is exactly the tool we need to ensure the clique slackness conditions still hold when changing labels.

At a high level, the algorithm works as follows.
At each iteration, much like the α-expansion or fusion move algorithms, we have a current labeling $x$ and a proposed labeling $y$. We use sum-of-submodular flow to pick a set $S$ of variables that switch labels, and the max-flow min-cut theorem for sum-of-submodular flow will ensure that the new variables $x', \lambda'$ also satisfy the clique slackness conditions.

Our SoSPD technique is summarized in Algorithm 1, and each iteration has three subroutines. The main work of the algorithm occurs in UPDATE-DUALS-PRIMALS, which sets up the sum-of-submodular flow problem, and picks a set of variables to swap. We will describe this subroutine first, in Section 7.3.1, making some assumptions about $x, \lambda$ which may not hold in general. Then, it is the job of the other two subroutines, PRE-EDIT-DUALS and POST-EDIT-DUALS (Sections 7.3.2 and 7.3.3), to make sure these assumptions do hold, and that therefore the algorithm functions correctly.

7.3.1 Update-Duals-Primals

To begin with, we need notation for fusion moves [66]. If we have current and proposed labelings $x$ and $y$, and $S$ is the set of variables that change label, we'll denote the fused labeling by $x' = x[S \leftarrow y]$, which has $x'_i = y_i$ if $i \in S$, and $x'_i = x_i$ if $i \notin S$.

Given our current state $x, \lambda$, we're going to construct a sum-of-submodular flow network. The values $\phi_{C,i}$ will be the amount we add or subtract from $\lambda_{C,i}(y_i)$, and the source-sink flow $\phi_{s,i}, \phi_{i,t}$ will give the change in height of $h_i(y_i)$. We will only ever adjust the dual variables $\lambda_{C,i}(y_i)$ corresponding to the proposed labeling $y$. (Note that if $x_i = y_i$, we do not change $\lambda_{C,i}(x_i)$. We could accomplish this by simply removing such $i$ from the flow network. However, such vertices $i$ will have, by construction, no outgoing capacity in the network, so $\phi_{C,i}$ must always be 0.)

The easy part is defining the source-sink capacities. If $h_i(y_i) < h_i(x_i)$ then we can raise the height of label $y_i$ by the difference, and still prefer to switch labels.
Similarly, if $h_i(y_i) > h_i(x_i)$, we can lower the height of $y_i$ by the difference without creating a new label we'd prefer to swap to. We define source-sink capacities by $c_{s,i} = h_i(x_i) - h_i(y_i)$, $c_{i,t} = 0$ if $h_i(y_i) < h_i(x_i)$, and $c_{s,i} = 0$, $c_{i,t} = h_i(y_i) - h_i(x_i)$ otherwise, so that all capacities are non-negative.

In addition to decreasing the heights of the variables, our other main concern is making sure that the clique slackness conditions continue to hold. Consider an individual clique $C$ for now, and let us examine what our labeling $x'_C$ could look like after a fusion step. The possible labelings are $x_C[S \leftarrow y_C]$ for each subset $S$ of $C$. We want to make sure that after the swap, $\sum_i \lambda'_{C,i}(x'_i) = f_C(x'_C)$, so define a function $g_C$ equal to the difference:

$$g_C(S) := f_C(x_C[S \leftarrow y_C]) - \sum_{i \in S} \lambda_{C,i}(y_i) - \sum_{i \notin S} \lambda_{C,i}(x_i).$$ (7.8)

For now, we'll assume that (1) $g_C$ is a submodular function and (2) $g_C(\emptyset) = g_C(C) = 0$, $g_C(S) \ge 0$. These assumptions will end up being enforced by PRE-EDIT-DUALS, which we describe below.

Under these assumptions the capacities $c$ and functions $g_C$ define a sum-of-submodular flow network, so we can find a flow $\phi$ and cut $S$ such that $g_C(S \cap C) = \sum_{i \in S \cap C} \phi_{C,i}$ (by the sum-of-submodular version of the max-flow min-cut theorem [53], paraphrased at the end of Section 7.2.3). Then, we set $x' = x[S \leftarrow y]$, and $\lambda'_{C,i}(y_i) = \lambda_{C,i}(y_i) + \phi_{C,i}$. By definition of $g_C$, we have

$$\begin{aligned} f_C(x'_C) &= g_C(S \cap C) + \sum_{i \in S} \lambda_{C,i}(y_i) + \sum_{i \notin S} \lambda_{C,i}(x_i) \\ &= \sum_{i \in S} [\lambda_{C,i}(y_i) + \phi_{C,i}] + \sum_{i \notin S} \lambda_{C,i}(x_i) = \sum_i \lambda'_{C,i}(x'_i). \end{aligned}$$

Therefore, the new primal and dual solutions satisfy the clique slackness conditions, and our source-sink capacities were chosen so that $h'_i(x'_i) \le h_i(x_i)$. Finally, note that unless every edge out of $s$ gets saturated (and hence $S = \emptyset$), at least one height has strictly decreased.

7.3.2 Pre-Edit-Duals

The job of PRE-EDIT-DUALS is to ensure that the assumptions we made in UPDATE-DUALS-PRIMALS are actually true. Namely, we need (1) the function $g_C$ to be submodular and (2) $g_C(\emptyset) = g_C(C) = 0$ and $g_C(S) \ge 0$.
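As an illustration of definition (7.8) and the two assumptions on $g_C$, the sketch below tabulates $g_C$ over all $2^{|C|}$ subsets (encoded as bitmasks, matching the table representation of clique functions) and checks both properties by brute force. The function names, argument layout, and the Potts-style example clique are assumptions for this example only; they are not taken from the SoSPD implementation.

```python
def build_gC(fC, xC, yC, lam_x, lam_y):
    """Tabulate g_C(S) from (7.8) for every subset S of the clique.

    fC           : clique cost, takes a tuple of labels
    xC, yC       : current and proposed labels of the clique's variables
    lam_x, lam_y : lam_x[i] = lambda_{C,i}(x_i), lam_y[i] = lambda_{C,i}(y_i)
    Bit i of the table index S is set when variable i switches to y_i.
    """
    k = len(xC)
    table = []
    for S in range(1 << k):
        fused = tuple(yC[i] if (S >> i) & 1 else xC[i] for i in range(k))
        duals = sum(lam_y[i] if (S >> i) & 1 else lam_x[i] for i in range(k))
        table.append(fC(fused) - duals)
    return table

def is_submodular(table, k, tol=1e-12):
    """Check g(S) + g(S+i+j) <= g(S+i) + g(S+j) for all S and i, j not in S."""
    for S in range(1 << k):
        for i in range(k):
            if (S >> i) & 1:
                continue
            for j in range(i + 1, k):
                if (S >> j) & 1:
                    continue
                bi, bj = 1 << i, 1 << j
                if table[S] + table[S | bi | bj] > table[S | bi] + table[S | bj] + tol:
                    return False
    return True

def is_normalized(table):
    """Check g(empty) = g(C) = 0 and g(S) >= 0 for all S."""
    return table[0] == 0 and table[-1] == 0 and min(table) >= 0

# Hypothetical example: a 3-clique that costs 1 unless all labels agree,
# with an expansion from label 0 to label 1 and all duals equal to zero.
potts = lambda labels: 0 if len(set(labels)) == 1 else 1
gC = build_gC(potts, (0, 0, 0), (1, 1, 1), [0, 0, 0], [0, 0, 0])
```

In this example both assumptions already hold, so no editing of the duals would be needed; for a general clique function they can fail, which is the case PRE-EDIT-DUALS is designed to handle.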
For (1), first note that if $f_C(x_C[S \leftarrow y_C])$ is a submodular function of $S$, then so is $g_C$, since a submodular function plus a linear function is still submodular. Such functions were called expansion-submodular in [19].

To handle general energy functions, we need an approach for the case where the fusion move is not submodular. We take a similar approach to the PD3 variant PD3a [57], which finds an overestimate of the original energy function. For pairwise energies, finding a submodular overestimate simply consists of truncating negative capacities to 0. In our case, we must find a submodular upper bound, $\hat{f}_C(S)$, such that $\hat{f}_C(S) \ge f_C(x_C[S \leftarrow y_C])$. Our only other requirements are that $\hat{f}_C(\emptyset) = f_C(x_C)$, $\hat{f}_C(C) = f_C(y_C)$, and that $\hat{f}_C(\{i\}) \le \max_{x_C} f_C(x_C)$ for $i \in C$.

We will use the methods of Chapter 6 for finding submodular upper bounds of functions. We consider each of the choices presented there in the experiments. Having computed $\hat{f}_C$, we then substitute it for $f_C$, just for this iteration. To simplify the notation we will write $\hat{f}_C(x'_C)$ to mean $\hat{f}_C(S)$ wherever $x'_C = x_C[S \leftarrow y_C]$.

To establish assumption (2), we make use of Edmonds' algorithm [16], described by Lemma 35 from Section 2.3.3. This states that for any submodular function $g$ with $g(\emptyset) = 0$, there is a vector $\psi$ such that $g(S) + \psi(S) \ge 0$ and $g(C) = -\psi(C)$ (where we are using the standard notation $\psi(S) := \sum_{i \in S} \psi_i$). In fact, the vector defined by $\psi_i = g(\{1, \ldots, i-1\}) - g(\{1, \ldots, i\})$ will suffice.

To ensure (2) holds, we start with $g_C(S)$ defined as in (7.8). Note that we have $g_C(\emptyset) = f_C(x_C) - \sum_{i \in C} \lambda_{C,i}(x_i)$, which by the clique slackness condition we know is 0. We can therefore compute a $\psi$ as just described, and update $\lambda_{C,i}(y_i) \leftarrow \lambda_{C,i}(y_i) - \psi_i$. Since $g_C(S) + \psi(S) \ge 0$ and $g_C(C) + \psi(C) = 0$, when we update $g_C \leftarrow g_C + \psi$ with the new values of $\lambda$, we satisfy $g_C(S) \ge 0$ and $g_C(C) = 0$.

7.3.3 Post-Edit-Duals

Having run UPDATE-DUALS-PRIMALS, we know that $\hat{f}_C(x'_C) = \sum_i \lambda'_{C,i}(x'_i)$.
However, from PRE-EDIT-DUALS, $\hat{f}$ might be an overestimate of $f$. The subroutine POST-EDIT-DUALS enforces the clique slackness conditions with respect to the original $f$, by setting $\lambda'_{C,i}(x'_i) = \frac{1}{|C|} f_C(x'_C)$ for each clique $C$. Note that if $\hat{f}$ is an overestimate, this can only ever decrease the sum of heights $h_i(x_i)$ (since we first average, and then subtract the overestimate from $\lambda$).

One final property of POST-EDIT-DUALS: since $f_C(x_C) \ge 0$, we always know that $\lambda_{C,i}(x_i) \ge 0$. We will use this in the proof of the approximation ratio, momentarily.

7.3.4 Proof of Convergence

Much like the pairwise algorithms α-expansion and fusion move, we have monotonically decreasing energy.

Lemma 84. The objective value $f(x)$ is non-increasing.

Proof. First, recall that $x, \lambda$ satisfy the clique slackness conditions, so $f(x) = \sum_i h_i(x_i)$. We also know that PRE-EDIT-DUALS doesn't change any of the heights $h_i(x_i)$, UPDATE-DUALS-PRIMALS can only decrease $h_i(x_i)$ (by definition of the source-sink capacities) and POST-EDIT-DUALS also doesn't increase the sum of heights.

The convergence of our method is not guaranteed for arbitrarily bad fusion moves (for instance, we could have a bad proposal generator which always suggests labels which have greater height than $x_i$). For α-expansion proposals, however, convergence is guaranteed.

Proposition 85. With the proposal $y_i = \alpha$ for each $i$, at the end of the iteration either $f(x') < f(x)$ or $h_i(x_i) \le h_i(\alpha)$ for all $i$.

Proof. From the discussion of UPDATE-DUALS-PRIMALS, one of two things happens: (1) the height of at least one variable is strictly decreased, or (2) the minimum cut is $S = \emptyset$.

If (1), then neither of the other subroutines increases the sum of heights, so by Proposition 80 we have $f(x') < f(x)$. If (2), then all edges out of $s$ are saturated, so UPDATE-DUALS-PRIMALS increased $h_i(\alpha)$ to be at least $h_i(x_i)$. Furthermore, $x' = x$ and so neither PRE-EDIT-DUALS nor POST-EDIT-DUALS changes any of the $\lambda_{C,i}(x_i)$, and therefore $h_i(x_i) \le h_i(\alpha)$ holds at the end of the iteration.

Lemma 86.
If after running through iterations of α-expansion for every label $\alpha$, $f(x)$ does not strictly decrease, then the unary slackness conditions must hold, and the algorithm terminates.

Proof. Since every α-expansion iteration didn't change the objective $f(x)$, by Proposition 85 each such iteration ensures that $h_i(\alpha) \ge h_i(x_i)$. Also note that a β-expansion for $\beta \ne \alpha$ doesn't change any of the $h_i(\alpha)$. Therefore, the $x_i$ are all minimum height labels, and the unary slackness conditions are satisfied.

Overall, with integer costs, the objective decreases by at least 1 each outer iteration and the algorithm therefore eventually halts. The running time of each iteration is dominated by the SoS flow computation: we use SoS-IBFS [19], which has runtime $O(|V|^2 |\mathcal{C}| 2^k)$, where $k = \max |C|$. It is difficult to provide a non-trivial bound on the number of α-expansion iterations, but in practice we always observe convergence after 4 passes through the label set. Note that the per-iteration runtime is exponential in the clique size, since we represent submodular functions as tables of $2^k$ values. However, this is also true of other state-of-the-art methods for higher-order MRFs such as [18, 40, 55].

7.3.5 Approximation Bounds

Let $f^{\max} = \max f_C(x_C)$, $f^{\min} = \min f_C(x_C)$, where the max and min are over all cliques $C$ and all non-constant labelings $x_C$ (meaning there is no $a$ with $x_i = a$ for all $i \in C$). There is a natural class of MRFs where $f^{\min} > 0$ (i.e., all non-constant labelings have positive costs), and where constant labelings have zero cost. We call such MRFs weakly associative; they encourage all variables in a clique to have the same label, but are otherwise unrestricted on non-constant labelings. This generalizes what [57] calls non-metric energies.

Our approximation ratio will be $\rho = k \cdot \frac{f^{\max}}{f^{\min}}$. Note that $\rho$ is finite only for a weakly associative MRF. This generalizes the approximation ratio for PD3, which is $2 \frac{f^{\max}}{f^{\min}}$.

Theorem 87.
SoSPD with α-expansion for a weakly associative MRF $f$ is a $\rho$-approximation algorithm, i.e., the primal solution $x$ at the end will have $f(x) \le \rho f(x^*)$.

Proof. The first task is to show that $\lambda$ doesn't get too big. In particular, after any iteration, $\lambda_{C,i}(a_i) \le f^{\max}$ for all $a_i$. Note that after UPDATE-DUALS-PRIMALS, we have

$$\phi_{C,i} \le g_C(\{i\}) := \hat{f}_C(\{i\}) - \sum_{j \ne i} \lambda_{C,j}(x_j) - \lambda_{C,i}(y_i).$$

Since we constructed $\hat{f}$ to have $\hat{f}(\{i\}) \le f^{\max}$ and POST-EDIT-DUALS from the previous iteration makes sure $\lambda_{C,i}(x_i) \ge 0$, we get $\lambda'_{C,i}(y_i) = \lambda_{C,i}(y_i) + \phi_{C,i} \le f_C^{\max}$. If POST-EDIT-DUALS in the present iteration changes $\lambda_{C,i}(y_i)$, it sets it to $\frac{1}{|C|} f_C(x') \le f_C^{\max}$. Therefore, $\lambda_{C,i}(y_i) \le f_C^{\max}$, and we don't change $\lambda_{C,i}(a_i)$ for any $a_i \ne y_i$ in this iteration, so inductively, at the end of the algorithm $\lambda_{C,i}(a_i) \le f_C^{\max}$ for all labels $a_i$.

For feasibility, we need to show that (7.2c) holds for each clique $C$ and labeling $x_C$. For non-constant $x_C$ we have

$$\frac{1}{\rho} \sum_i \lambda_{C,i}(x_i) \le \frac{|C| f_C^{\max}}{\rho} \le f_C^{\min} \le f_C(x_C).$$

For the constant labeling $x_C = \alpha$, note that in the last α-expansion, PRE-EDIT-DUALS enforces that $\hat{f}_C(C) - \sum_i \lambda_{C,i}(\alpha) = g_C(C) = 0$, and neither of the other subroutines violates this. Therefore $\sum_i \frac{1}{\rho} \lambda_{C,i}(\alpha) = \frac{1}{\rho} \hat{f}_C(C) = \frac{1}{\rho} f_C(\alpha) = 0$, where the second equality is because we constructed $\hat{f}_C$ with $\hat{f}_C(C) = f_C(y_C)$, and the last is since $f$ is weakly associative.

Finally, Lemma 83 says that at convergence, $f(x)$ is no more than $\rho f(x^*)$.

CHAPTER 8
EXPERIMENTAL EVALUATION OF THE SOSPD ALGORITHM

The last two chapters presented an algorithm for optimizing general non-submodular higher-order MRFs, based on linear programming relaxations to the local marginal polytope, and using submodular flow and submodular upper bounds to solve the binary subproblems in each fusion or expansion move. In this chapter, we will give the experimental results for these algorithms, showing that they are also empirically faster than existing state-of-the-art algorithms.
For our benchmarks, we choose the two exemplar higher-order problems described in the introduction: the Field of Experts model from Section 1.5.2 and the curvature regularizing stereo model of Section 1.5.3. We describe the datasets and models used for the experiments in detail in Section 8.1.

There are two main questions we want to answer with these experiments. In Section 8.2, we explore which of the several proposed submodular upper bounds is the most effective for optimization on typical computer vision inputs. Then, in Section 8.3, we demonstrate the speedup of the primal-dual SoSPD algorithm over existing algorithms for multilabel higher-order inference.

8.1 Benchmarks and Datasets

For our experiments, we are interested in benchmarks which represent typical higher-order vision inputs, and which are difficult to optimize.

8.1.1 Field of Experts Denoising

The first benchmark, Fields of Experts denoising, we have already seen in the experiments for the reduction method of Chapter 4. This benchmark is based on the model of [77], and has been used in many higher-order optimization papers, including [40, 18, 43, 22]. The dataset consists of 100 grayscale images from the Berkeley Segmentation Database [70], with independent Gaussian noise added to each pixel.

Recall that the Field of Experts model has 255 labels for each pixel $x_i$, with labels corresponding to denoised 8-bit intensity values. The unary terms are an L2 data cost
\[ \frac{1}{2\sigma^2} \sum_i \| x_i - y_i \|^2 \qquad (8.1) \]
where $y_i$ is the observed, noisy image and σ is an estimate of the standard deviation of the Gaussian noise added to each pixel.

The Field of Experts prior is applied to each local patch of the image. We use each (overlapping) 2x2 patch, for cliques of size 4. The clique functions are given by
\[ f_C(x_C) = \sum_{i=1}^{k} \alpha_i \log(1 + J_i^T x_C) \qquad (8.2) \]
where the $J_i$ are learned linear filters passed through a nonlinear activation function $\log(1 + \cdot)$, and the $\alpha_i$ are learned mixture components between these activations.
The exact coefficients were trained by maximum-likelihood estimation on a training set. To allow reproducibility of results, we obtained the specific energy functions to be minimized from the OpenGM benchmark [45].¹

¹ Available at http://hci.iwr.uni-heidelberg.de/opengm2/.

8.1.2 Curvature Regularizing Stereo Reconstruction

The second benchmark is based on the second-order stereo model of [102], as described in Section 1.5.3. The stereo reconstruction algorithm of [102] encourages the disparity map to be piecewise smooth using a second-order prior, composed of all 1 × 3 and 3 × 1 patches in the image, each of which penalizes a robust function of the curvature of the disparity map.

A number of optimization methods are proposed in [102], which are composed to get the final result. The most important step consists of pre-generating a set of 14 piecewise-planar proposed disparity maps, and then using these as proposals to the fusion-move algorithm to improve the current disparity until convergence. This method is called SEGPLN in [102]. We use the truncated-quadratic cost, which penalizes the deviation of each 3 × 1 patch from a plane via the robust L2 cost
\[ f_{i,j,k}(x_i, x_j, x_k) = \min\left( \left( \Delta(x_i) - 2\Delta(x_j) + \Delta(x_k) \right)^2, \tau \right). \qquad (8.3) \]

To give a fair benchmark comparison, we use the simplified model used in [55] and later adapted by [22], which omits the binary occlusion labels for each pixel. Because the stereo reconstruction problem performs inference in a continuous domain (all possible disparity values, as real numbers), this model also discretizes the problem by giving each pixel 14 discrete labels, one for each pre-generated proposal. Experimentally, the fusion-move part of [102] (in which the algorithm repeatedly proposes and fuses these 14 proposals) accounts for the bulk of the time spent by the algorithm, and is also the most important part for energy reduction.
So, good performance of discrete optimizers for this step can dramatically improve the performance of the whole algorithm. Note that this discretization still allows estimating sub-pixel disparities (since the proposals can take any floating point value) using only 14 labels.

Data was obtained by running the code² for [102] and recording the proposed fusion moves and corresponding unary terms. The dataset consists of 3 stereo pairs, “cones”, “teddy” and “venus”, from the Middlebury Stereo Dataset [82, 83], obtained from http://vision.middlebury.edu/stereo/data/.

² http://www.robots.ox.ac.uk/~ojw/software.htm.

8.2 Comparison of Upper Bound Methods

8.2.1 Experimental Setup

Our first set of experiments tests the effectiveness of the various upper bound methods of Chapter 6. We are primarily interested in the performance of these upper bounds as a subroutine within the SoSPD algorithm, as SoSPD can handle multilabel higher-order problems, including the two benchmarks of Section 8.1. It is possible to apply submodular upper bounds directly to the optimization of non-submodular higher-order binary problems; however, these problems are typically much less interesting from an application perspective, as most computer vision problems are multilabel.

In addition to SoSPD, we also test the effectiveness of the upper bounds in a fusion move algorithm [66], by using the submodular upper bound to convert each (possibly non-submodular) higher-order binary fusion move into a submodular one, and then solving the resulting SoS optimization problem using the SoS IBFS algorithm of Chapter 5. Note that the fusion-move algorithm can be considered a pure-primal version of the primal-dual algorithm SoSPD, in that both algorithms will take the same sequence of fusion or expansion moves, with the same optimal binary labeling at each step. Thus, when both algorithms use the same upper bound, they will arrive at the same answer, though typically SoSPD is more efficient. We observed in our experiments that for a given upper bound, fusion moves had nearly identical³ final energy to the corresponding SoSPD result, but took a little over twice as long on both the stereo and denoising datasets, so we did not include them in the tables here.

Method        Energy      Time (s)    % Best
SOSPD-QUAD    30712.58      16.049     0.00%
SOSPD-HEUR    30748.19      20.212     0.00%
SOSPD-CARD    31190.61      22.179     0.00%
SOSPD-LP1     30706.60    2099.896    10.00%
SOSPD-LP∞     30704.49    5233.162    90.00%

Table 8.1: Comparison of upper bound methods for Fields of Experts denoising, averaged over 10 images. Results for SOSPD-LP1 and SOSPD-LP∞ computed with the Gurobi LP solver. Gradient descent proposals were used to generate fusion moves in SoSPD.

Method        Energy      Time (s)    % Best
SOSPD-QUAD    32593.23      15.881   100.00%
SOSPD-CARD    33074.61      21.908     0.00%
SOSPD-HEUR    32629.35      20.032     0.00%

Table 8.2: Comparison of upper bound methods for the full Fields of Experts denoising dataset, averaged over 100 images. Gradient descent proposals used for SOSPD and REDUCTION-FUSION.

Method        Energy         Time (s)    % Best
SOSPD-QUAD    8.958 × 10^9    104.421     0.00%
SOSPD-HEUR    8.952 × 10^9    102.311     0.00%
SOSPD-CARD    8.958 × 10^9    116.521     0.00%
SOSPD-LP1     8.953 × 10^9    115.281     0.00%
SOSPD-LP∞     8.948 × 10^9    119.012   100.00%

Table 8.3: Comparison of upper bound methods for the Stereo dataset, averaged across 3 stereo pairs, “cones”, “teddy” and “venus”. Results for SOSPD-LP1 and SOSPD-LP∞ were computed using our custom simplex implementation.

Both fusion moves and SoSPD allow a choice of fusion proposals at each iteration. For the stereo example, we simply cycle through the 14 labels, doing an expansion move on each one. For the Fields of Experts experiments, we use the gradient descent proposals of [38], which have been shown to be the most effective fusion proposals for this dataset.
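As a rough illustration of the moves discussed in this section, the following sketch computes an optimal fusion move by brute force on a toy problem. It is not the SoS-IBFS-based solver used in the experiments: the one-dimensional energy, the helper names `toy_energy` and `fuse`, the threshold `tau` and the observed values are all hypothetical, and enumeration over the 2^n binary choices merely stands in for the submodular flow computation. An α-expansion is the special case in which the proposal is the constant labeling α.

```python
from itertools import product

def toy_energy(labels, observed, tau=4):
    """Hypothetical 1-D energy: an L2 data term plus a truncated
    curvature penalty on every run of three consecutive variables."""
    data = sum((l - o) ** 2 for l, o in zip(labels, observed))
    smooth = sum(
        min((labels[i] - 2 * labels[i + 1] + labels[i + 2]) ** 2, tau)
        for i in range(len(labels) - 2)
    )
    return data + smooth

def fuse(current, proposal, energy):
    """Return the best labeling in which each variable either keeps its
    current label (bit 0) or switches to the proposed label (bit 1)."""
    best, best_val = list(current), energy(current)
    for bits in product((0, 1), repeat=len(current)):
        cand = [p if b else c for c, p, b in zip(current, proposal, bits)]
        val = energy(cand)
        if val < best_val:
            best, best_val = cand, val
    return best, best_val

observed = [0, 1, 2, 9, 4, 5]          # made-up noisy signal
energy = lambda labels: toy_energy(labels, observed)
x = [0] * len(observed)
for alpha in range(10):                # one pass of alpha-expansion over labels 0..9
    x, value = fuse(x, [alpha] * len(x), energy)
```

Because the current labeling is always among the candidates, the energy can never increase from one move to the next, which is the property the convergence arguments rely on.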
For the implementation, we used the publicly available code from [22] for sum-of-submodular flow, as well as for the SoSPD algorithm. The L1 and L∞ linear programs in (6.3) were solved with the linear programming package Gurobi. Additionally, for size 3 cliques in the stereo benchmark, the linear programs involved are very small (only 6 variables and 6 constraints), so we implemented a version of the simplex method with the constraints hardcoded, which was much faster than the general Gurobi solver. All code is in C++, and will be released under an open source license.⁴

³ With differences largely due to stopping conditions.
⁴ Available at www.cs.cornell.edu/~afix.

We compare five different submodular upper bounds. We give each an abbreviation in the tables and figures: the p-norm minimizing upper bounds of Section 6.3, equation (6.3), we denote as LP1 and LP∞; the quadratic-based approximation of Section 6.4.2 is QUAD; the cardinality approximation of Section 6.4.3 is CARD; and the baseline heuristic of [22] (in Section 6.4.1) we denote by HEUR. We also use, for example, SOSPD-QUAD for the SoSPD algorithm with the quadratic approximation, FUSION-LP1 for the fusion move algorithm with the 1-norm upper bound, and so on.

Figure 8.1: Visual results for Fields of Experts denoising with different upper bound methods. Top row, left to right: (a) SOSPD-HEUR. Bottom row: (b) SOSPD-QUAD (c) SOSPD-CARD.

8.2.2 Results

Our first experiment tests how close our proposed approximations come to the L1 and L∞ minimizing upper bounds. Because solving the LP for the L1 and L∞ upper bounds is slow, we ran these on only the first 10 images of the Field of Experts dataset. Results of this experiment are summarized in Table 8.1.

Figure 8.2: Reconstructed depth maps for stereo pair “cones”. Top row, left to right (with % of pixels within ±1 disparity): (a) REDUCTION-FUSION, 49.0% (b) SOSPD-QUAD, 49.9%. Center row: (c) SOSPD-CARD, 49.9% (d) SOSPD-HEUR, 49.7%. Bottom row: (e) SOSPD-LP1, 50.0% (f) SOSPD-LP∞, 49.7%.

Figure 8.3: Comparison of upper bound methods: Energy over time for the Fields of Experts denoising experiment, using the image in Figure 8.1. Reduction-Fusion, using the reduction method of Chapter 4, is provided for comparison.

Overall, the L∞ upper bound performed the best in the majority (90%) of instances, while the L1 upper bound had only slightly higher energy. The norm-minimizing upper bounds together perform better than all other methods, indicating that these norms are a good measure to minimize when picking upper bounds. Additionally, the quadratic (pairwise) approximation SOSPD-QUAD was very close to the L∞ result, with the energy gap between them less than 1/4 of the gap between SOSPD-QUAD and the next competitor, SOSPD-HEUR. This suggests that the proposed upper bound approximations can come close to the linear programming solution, while being more than 100 times faster.

Next, we ran all non-LP methods on the full denoising dataset, with results in Table 8.2. Notably, the quadratic approximation has the best energy for every image in the dataset, and is also the fastest overall.

For the stereo example, results for the 3 stereo pairs are summarized in Table 8.3. Since the cliques were of size 3, we were able to use the custom simplex method mentioned above for the L1 and L∞ upper bounds. The L∞ upper bound achieved the best energy for all 3 images, while taking only 16% more time than the fastest method (and running nearly 4x faster than the method without upper bounds, REDUCTION-FUSION).

Across both datasets, we find that the linear programming based upper bounds have the best energy performance, particularly the ∞-norm upper bound. However, for cliques of size four or more, computing the linear programming solution becomes expensive.
Thus, the quadratic-based approximation is also promising, as it is faster than the norm-based methods for both datasets, while still having very similar energy optimization performance.

As expected, the methods all achieve very similar final energies, and correspondingly the visual results for all upper bound algorithms are nearly indistinguishable, as seen in Figures 8.1 and 8.2.

8.3 Evaluation of SoSPD

Having identified the best upper bound algorithms for different problems, we now compare the performance of SoSPD against existing state-of-the-art algorithms for higher-order multilabel inference. In the following experiments, we use the custom simplex implementation of SOSPD-LP∞ for cliques of size 3, and SOSPD-QUAD for larger cliques.

“Teddy”         Pixels within ±1    Final energy       Time
FGBZ-Fusion     83.3%               9.320 × 10^9       468s
HOCR-Fusion     83.8%               9.298 × 10^9       210s
GRD-Fusion      84.9%               9.256 × 10^9       1116s
SoSPD-Fusion    84.8%               9.172 × 10^9       129s

“Cones”         Pixels within ±1    Final energy       Time
FGBZ-Fusion     74.9%               1.1765 × 10^10     340s
HOCR-Fusion     74.2%               1.1789 × 10^10     172s
GRD-Fusion      75.2%               1.1690 × 10^10     1138s
SoSPD-Fusion    75.2%               1.1664 × 10^10     133s

Table 8.4: Evaluation of SoSPD: Numerical results for stereo reconstruction, for the two images in Figure 8.4.

Method            Energy @ 10s     Final energy      Time
FGBZ-Gradient     4.17 × 10^8      2.353 × 10^8      86s
HOCR-Gradient     4.35 × 10^8      2.368 × 10^8      78s
GRD-Gradient      6.72 × 10^8      2.348 × 10^8      776s
SoSPD-Gradient    2.87 × 10^8      2.347 × 10^8      42s

Table 8.5: Evaluation of SoSPD: Numerical results for denoising, averaged over the 100 images in the test set. For the second column, we stop all methods after 10 seconds, and compare energy values.

For experimental comparisons, the method of [55] does not currently have publicly available code, so we are left with the class of fusion-reduction methods [18, 40, 43]. While the Generalized Roof Duality method of [43] can produce good solutions, it is typically much slower than [18, 40], and is restricted to cliques of size at most 4.
We observed that it obtains similar or slightly worse energy values compared to SoSPD, while taking at least 10x more time, even for the heuristic version of GRD. We therefore focus on FGBZ [18] and HOCR [40] due to their speed and generality.

Figure 8.4: Evaluation of SoSPD: visual results for stereo reconstruction. (a) Ground truth disparities, with results from (b) FGBZ-Fusion (c) SoSPD-Fusion and (d) SoSPD-Best-Fusion. Top row is the “teddy” image, bottom row is “cones”. Results for SoSPD have slightly more correct pixels, and converged much faster; see Table 8.4 for details.

8.3.1 Stereo reconstruction

We have two variants of SoSPD for this experiment, which differ only in the choice of proposed moves. The first, SoSPD-Fusion, cycles through the 14 labels, and successively chooses each to be the α-expansion proposal for that iteration. The second, SoSPD-Best-Fusion, uses an idea from [5] to pick the best α for each iteration. More specifically, we choose the α which will have the greatest total capacity leaving the source, in order to encourage as many nodes as possible to switch to lower-height labels.

We compared with two baselines: FGBZ-Fusion, using the reduction of [18], and HOCR-Fusion, using the reduction of [40]. Both methods cycle through the pre-generated proposals and perform fusion moves.

Numerical results are in Table 8.4 and images in Figure 8.4. Overall, the SoSPD variants and reduction methods reach similar energy and visual results; however, SoSPD is fastest overall (2.5x-3.5x vs FGBZ, 1.3x-1.5x vs HOCR).

Figure 8.5: Evaluation of SoSPD: Energy reduction over time for the stereo images (top) “teddy” (center) “cones”. (bottom) Energy reduction over time for the denoising image “penguin”. Note that, in addition to converging faster, for a fixed time budget we achieve much better energy than the baseline. [Plots of energy against time in seconds. Curves in the stereo plots: FGBZ-Fusion, HOCR-Fusion, SoSPD-Alpha, SoSPD-Best-Alpha. Curves in the denoising plot: FGBZ-Gradient, HOCR-Gradient, SoSPD-Alpha, SoSPD-Gradient.]

Figure 8.6: Evaluation of SoSPD: Visual results for Field of Experts denoising. (top left) noisy image (top right) SoSPD-α (center left) FGBZ-Gradient, 10 sec (center right) SoSPD-Gradient, 10 sec (bottom left) FGBZ-Gradient at convergence (bottom right) SoSPD-Gradient at convergence.

8.3.2 Field of Experts denoising

SoSPD with α-expansion decreases the energy quickly at first, but gets stuck in poor local optima, with flat images as seen in Figure 8.6. Fortunately, gradient descent proposals [38] have been shown to be very effective at optimizing FoE priors. We call the combination of SoSPD with these fusion proposals SoSPD-Gradient. We compare against fusion moves with the same proposals, using the reductions of [18] and [40].

Overall, when comparing SoSPD with [18] for the same proposal method, SoSPD is significantly faster, and achieves slightly lower energy at convergence. Additionally, given a fixed time budget of 10 seconds, both the energy and visual results of SoSPD are significantly better, as seen in Figure 8.6 and Table 8.5.

CHAPTER 9
STRUCTURED LEARNING OF SUM-OF-SUBMODULAR HIGHER ORDER ENERGY FUNCTIONS

Now that we've covered inference in MRFs, we turn to the second question of optimization: modeling. SoS functions can naturally express higher order priors involving, e.g., local image patches; however, it is difficult to fully exploit their expressive power because they have so many parameters.
Rather than trying to formulate existing higher order priors as an SoS function, we take a discriminative learning approach, effectively searching the space of SoS functions for a higher order prior that performs well on our training set. We adopt a structural SVM approach [41, 95] and formulate the training problem in terms of quadratic programming; as a result we can efficiently search the space of SoS priors via an extended cutting-plane algorithm. We also show how the state-of-the-art max flow method for vision problems [30] can be modified to efficiently solve the submodular flow problem. Experimental comparisons are made against the OpenCV implementation of the GrabCut interactive segmentation technique [79], which uses hand-tuned parameters instead of machine learning. On a standard dataset [32] our method learns higher order priors with hundreds of parameter values, and produces significantly better segmentations. While our focus is on binary labeling problems, we show that our techniques can be naturally generalized to handle more than two labels.

9.1 Introduction

Discrete optimization methods such as graph cuts [12, 52] have proven to be quite effective for many computer vision problems, including stereo [12], interactive segmentation [79] and texture synthesis [61]. The underlying optimization problem behind graph cuts is a special case of submodular function optimization that can be solved exactly using max flow [52]. Graph cut methods, however, are limited by their reliance on first-order priors involving pairs of pixels, and there is considerable interest in expressing priors that rely on local image patches, such as the popular Field of Experts model [77].

While SoS functions have more expressive power, they also involve a large number of parameters.
Rather than addressing the question of which existing higher order priors can be expressed as an SoS function, we take a discriminative learning approach and effectively search the space of SoS functions with the goal of finding a higher order prior that gives strong results on our training set.¹

¹ Since we are taking a discriminative approach, the higher-order energy function we learn does not have a natural probabilistic interpretation. We are using the word “prior” here somewhat loosely, as is common in computer vision papers that focus on energy minimization.

Our main contribution is to introduce the first learning method for training such SoS functions, and to demonstrate the effectiveness of this approach for interactive segmentation using learned higher order priors. Following a Structural SVM approach [41, 95], we show that the training problem can be cast as a quadratic optimization problem over an extended set of linear constraints. This generalizes large-margin training of pairwise submodular (a.k.a. regular [52]) MRFs [2, 91, 94], where submodularity corresponds to a simple non-negativity constraint. To solve the training problem, we show that an extended cutting-plane algorithm can efficiently search the space of SoS functions.

9.2 Related Work

Many learning problems in computer vision can be cast as structured output prediction, which allows learning outputs with spatial coherence. Among the most popular generic methods for structured output learning are Conditional Random Fields (CRFs) trained by maximum conditional likelihood [65], Maximum-Margin Markov Networks (M3N) [93], and Structural Support Vector Machines (SVM-struct) [95, 41]. A key advantage of M3N and SVM-struct over CRFs is that training does not require computation of the partition function. Among the two large-margin approaches M3N and SVM-struct, we follow the SVM-struct methodology since it allows the use of efficient inference procedures during training.
In this chapter, we will learn submodular discriminant functions. Prior work on learning submodular functions falls into three categories: submodular function regression [4], maximization of submodular discriminant functions, and minimization of submodular discriminant functions.

Learning of submodular discriminant functions where a prediction is computed through maximization has widespread use in information retrieval, where submodularity models diversity in the ranking of a search engine [103, 67] or in an automatically generated abstract [87]. While exact (monotone) submodular maximization is intractable, approximate inference using a simple greedy algorithm has approximation guarantees and generally excellent performance in practice.

The models considered in this chapter use submodular discriminant functions where a prediction is computed through minimization. The most popular such models are regular MRFs [52]. Traditionally, the parameters of these models have been tuned by hand, but several learning methods exist. Most closely related to the work in this chapter are Associative Markov Networks [94, 2], which take an M3N approach and exploit the fact that regular MRFs have an integral linear relaxation. These linear programs (LPs) are folded into the M3N quadratic program (QP), which is then solved as a monolithic QP. In contrast, SVM-struct training using cutting planes for regular MRFs [91] allows graph cut inference during training as well, and [17, 60] show that this approach has interesting approximation properties even for the multi-class case, where graph cut inference is only approximate. More complex models for learning spatially coherent priors include separate training for unary and pairwise potentials [64], learning MRFs with functional gradient boosting [71], and the P^n Potts models, all of which have had success on a variety of vision problems.
Note that our general approach for learning multi-label SoS functions, described in section 9.3.4, includes the P^n Potts model as a special case.

9.3 S3SVM: SoS Structured SVMs

In this section, we first review the structured SVM algorithm and its associated Quadratic Program (section 9.3.1). We then describe a general class of SoS discriminant functions which can be learned by SVM-struct (section 9.3.2) and explain this learning procedure (section 9.3.3). Finally, we generalize SoS functions to the multi-label case (section 9.3.4).

9.3.1 Structured SVMs

Structured output prediction describes the problem of learning a function $h : \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ is the space of inputs and $\mathcal{Y}$ is the space of (multivariate and structured) outputs for a given problem. To learn $h$, we assume that a training sample of input-output pairs $S = ((x_1, y_1), \ldots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$ is available and drawn i.i.d. from an unknown distribution. The goal is to find a function $h$ from some hypothesis space $\mathcal{H}$ that has low prediction error, relative to a loss function $\Delta(y, \bar{y})$. The function Δ quantifies the error associated with predicting $\bar{y}$ when $y$ is the correct output value. For example, for image segmentation, a natural loss function might be the Hamming distance between the true segmentation and the predicted labeling.

The mechanism by which Structural SVMs find a hypothesis $h$ is to learn a discriminant function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ over input/output pairs. One derives a prediction for a given input $x$ by minimizing $f$ over all $y \in \mathcal{Y}$.² We write this as $h_w(x) = \operatorname{argmin}_{y \in \mathcal{Y}} f_w(x, y)$. We assume $f_w(x, y)$ is linear in two quantities $w$ and Ψ,
\[ f_w(x, y) = w^T \Psi(x, y), \]
where $w \in \mathbb{R}^N$ is a parameter vector and $\Psi(x, y)$ is a feature vector relating input $x$ and output $y$. Intuitively, one can think of $f_w(x, y)$ as a cost function that measures how poorly the output $y$ matches the given input $x$.

Ideally, we would find weights $w$ such that the hypothesis $h_w$ always gives correct results on the training set.
Stated another way, for each example $x_i$, the correct prediction $y_i$ should have low discriminant value, while incorrect predictions $\bar{y}_i$ with large loss should have high discriminant values. We write this constraint as a linear inequality in $w$:
\[ w^T \Psi(x_i, \bar{y}_i) \ge w^T \Psi(x_i, y_i) + \Delta(y_i, \bar{y}_i) \quad \forall \bar{y}_i \in \mathcal{Y}. \qquad (9.1) \]

² Note that the use of minimization departs from the usual language of [95, 41], where the hypothesis is $\operatorname{argmax} f_w(x, y)$. However, because of the prevalence of cost functions throughout computer vision, we have replaced $f$ by $-f$ throughout.

It is convenient to define $\delta\Psi_i(\bar{y}) = \Psi(x_i, \bar{y}) - \Psi(x_i, y_i)$, so that the above inequality becomes $w^T \delta\Psi_i(\bar{y}_i) \ge \Delta(y_i, \bar{y}_i)$.

Since it may not be possible to satisfy all these conditions exactly, we also add a slack variable to the constraint for each example $i$. Intuitively, the slack variable $\xi_i$ represents the maximum misprediction loss on the $i$th example. Since we want to minimize the prediction error, we add an objective function which penalizes large slack. Finally, we also penalize $\|w\|^2$ to discourage overfitting, with a regularization parameter $C$ to trade off these costs.

Quadratic Program 1. n-SLACK STRUCTURAL SVM
\[ \min_{w,\, \xi \ge 0} \; \frac{1}{2} w^T w + \frac{C}{n} \sum_{i=1}^{n} \xi_i \]
\[ \text{s.t.} \quad w^T \delta\Psi_i(\bar{y}_i) \ge \Delta(y_i, \bar{y}_i) - \xi_i \quad \forall i,\ \forall \bar{y}_i \in \mathcal{Y} \]

9.3.2 Submodular Feature Encoding

We now apply the Structured SVM (SVM-struct) framework to the problem of learning SoS functions. For the moment, assume our prediction task is to assign a binary label to each element of a base set $V$. We will cover the multi-label case in section 9.3.4. Since the labels are binary, prediction consists of choosing a subset $S \subseteq V$ for each input (namely the set $S$ of pixels labeled 1).

Our goal is to construct a feature vector Ψ that, when used with the SVM-struct algorithm of section 9.3.1, will allow us to learn sum-of-submodular energy functions. Let's begin with the simplest case of learning a discriminant function $f_{C,w}(S) = w^T \Psi(S)$, defined only on a single clique and which does not depend on the input $x$.

Intuitively, our parameters $w$ will correspond to the table of values of the clique function $f_C$, and our feature vector Ψ will be chosen so that $w_S = f_C(S)$. We can accomplish this by letting Ψ and $w$ have $2^{|C|}$ entries, indexed by subsets $T \subseteq C$, and defining $\Psi_T(S) = \delta_T(S)$ (where $\delta_T(S)$ is 1 if $S = T$ and 0 otherwise). Note that, as we claimed,
\[ f_{C,w}(S) = w^T \Psi(S) = \sum_{T \subseteq C} w_T \delta_T(S) = w_S. \qquad (9.2) \]

If our parameters $w_T$ are allowed to vary over all of $\mathbb{R}^{2^{|C|}}$, then $f_C(S)$ may be an arbitrary function $2^C \to \mathbb{R}$, and not necessarily submodular. However, we can enforce submodularity by adding a number of linear inequalities. Recall that $f$ is submodular if and only if $f(A \cup B) + f(A \cap B) \le f(A) + f(B)$. Therefore, $f_{C,w}$ is submodular if and only if the parameters satisfy
\[ w_{A \cup B} + w_{A \cap B} \le w_A + w_B \quad \forall A, B \subseteq C. \qquad (9.3) \]
These are just linear constraints in $w$, so we can add them as additional constraints to Quadratic Program 1. There are $O(2^{|C|})$ of them, but each clique has $2^{|C|}$ parameters, so this does not increase the asymptotic size of the QP.

Theorem 88. By choosing feature vector $\Psi_T(S) = \delta_T(S)$ and adding the linear constraints (9.3) to Quadratic Program 1, the learned discriminant function $f_w(S)$ is the maximum margin function $f_C$, where $f_C$ is allowed to vary over all possible submodular functions $f : 2^C \to \mathbb{R}$.

Proof. By adding constraints (9.3) to the QP, we ensure that the optimal solution $w$ defines a submodular $f_w$. Conversely, for any submodular function $f_C$, there is a feasible $w$ defined by $w_T = f_C(T)$, so the optimal solution to the QP must be the maximum-margin such function.

To introduce a dependence on the data $x$, we can define $\Psi^{\text{data}}$ to be $\Psi^{\text{data}}_T(S, x) = \delta_T(S)\Phi(x)$ for an arbitrary nonnegative function $\Phi : \mathcal{X} \to \mathbb{R}_{\ge 0}$.

Corollary 89.
With feature vector $\Psi^{\text{data}}$ and adding linear constraints (9.3) to QP 1, the learned discriminant function is the maximum margin function $f_C(S)\Phi(x)$, where $f_C$ is allowed to vary over all possible submodular functions.

Proof. Because $\Phi(x)$ is nonnegative, constraints (9.3) ensure that the discriminant function is again submodular.

Finally, we can learn multiple clique potentials simultaneously. If we have a neighborhood structure $\mathcal{C}$ with $m$ cliques, each with a data-dependence $\Phi_C(x)$, we create a feature vector $\Psi^{\text{sos}}$ by concatenating the $m$ different features $\Psi^{\text{data}}_C$.

Corollary 90. With feature vector $\Psi^{\text{sos}}$, and adding a copy of the constraints (9.3) for each clique $C$, the learned $f_w$ is the maximum margin $f$ of the form
\[ f(x, S) = \sum_{C \in \mathcal{C}} f_C(S)\Phi_C(x) \qquad (9.4) \]
where the $f_C$ can vary over all possible submodular functions on the cliques $C$.

9.3.3 Solving the quadratic program

Algorithm 2: S3SVM via the 1-Slack Formulation.
 1: Input: $S = ((x_1, y_1), \ldots, (x_n, y_n))$, $C$, $\epsilon$
 2: $W \leftarrow \emptyset$
 3: repeat
 4:   Recompute the QP solution with the current constraint set:
        $(w, \xi) \leftarrow \operatorname{argmin}_{w,\, \xi \ge 0} \frac{1}{2} w^T w + C\xi$
        s.t. for all $(\bar{y}_1, \ldots, \bar{y}_n) \in W$: $\frac{1}{n} w^T \sum_{i=1}^{n} \delta\Psi_i(\bar{y}_i) \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \bar{y}_i) - \xi$
        s.t. for all $C \in \mathcal{C}$, $A, B \subseteq C$: $w_{C,A \cup B} + w_{C,A \cap B} \le w_{C,A} + w_{C,B}$
 5:   for $i = 1, \ldots, n$ do
 6:     Compute the maximum violated constraint: $\hat{y}_i \leftarrow \operatorname{argmin}_{\hat{y} \in \mathcal{Y}} \{ w^T \Psi(x_i, \hat{y}) - \Delta(y_i, \hat{y}) \}$, by using IBFS to minimize $f_w(x_i, \hat{y}) - \Delta(y_i, \hat{y})$.
 7:   end for
 8:   $W \leftarrow W \cup \{(\hat{y}_1, \ldots, \hat{y}_n)\}$
 9: until the slack of the max-violated constraint is $\le \xi + \epsilon$.
10: return $(w, \xi)$

The n-slack formulation for SSVMs (QP 1) makes intuitive sense, from the point of view of minimizing the misprediction error on the training set. However, in practice it is better to use the 1-slack reformulation of this QP from [41]. Compared to n-slack, the 1-slack QP can be solved several orders of magnitude faster in practice, as well as having asymptotically better complexity.
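Step 6 of Algorithm 2 can be illustrated on a toy instance. The sketch below finds the most violated constraint for a single training example by enumerating all binary labelings, which stands in for the IBFS-based minimization used in practice. The three-pixel problem, the feature map `psi`, and the weight vector `w` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from itertools import product

def hamming(y_true, y_pred):
    """Hamming loss: number of positions where the labelings differ."""
    return sum(a != b for a, b in zip(y_true, y_pred))

def psi(x, y):
    """Hypothetical feature map: total unary disagreement with the input,
    and the number of neighboring pairs with different labels."""
    unary = sum(abs(yi - xi) for xi, yi in zip(x, y))
    pairwise = sum(y[i] != y[i + 1] for i in range(len(y) - 1))
    return (unary, pairwise)

def discriminant(w, x, y):
    """f_w(x, y) = w^T Psi(x, y)."""
    return sum(wi * fi for wi, fi in zip(w, psi(x, y)))

def most_violated(w, x, y_true):
    """argmin over y_hat of f_w(x, y_hat) - Delta(y_true, y_hat)."""
    return min(
        product((0, 1), repeat=len(x)),
        key=lambda y_hat: discriminant(w, x, y_hat) - hamming(y_true, y_hat),
    )

w = (0.5, 0.25)      # current (made-up) weights
x = (1, 0, 1)        # noisy observation
y_true = (1, 1, 1)   # ground-truth labeling
y_hat = most_violated(w, x, y_true)

# Slack that this constraint would require under the current weights:
# Delta(y, y_hat) - w^T (Psi(x, y_hat) - Psi(x, y)).
violation = hamming(y_true, y_hat) - (
    discriminant(w, x, y_hat) - discriminant(w, x, y_true)
)
```

In a full cutting-plane loop, the tuple of such labelings over all training examples would be added to the working set W whenever the required slack exceeds the current ξ by more than the tolerance.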
The 1-slack formulation is an equivalent QP which replaces the $n$ slack variables $\xi_i$ with a single variable ξ. The loss constraints (9.1) are replaced with constraints penalizing the sum of losses across all training examples. We also include submodular constraints on $w$.

Quadratic Program 2. 1-SLACK STRUCTURAL SVM
\[ \min_{w,\, \xi \ge 0} \; \frac{1}{2} w^T w + C\xi \]
\[ \text{s.t.} \quad \frac{1}{n} w^T \sum_{i=1}^{n} \delta\Psi_i(\bar{y}_i) \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \bar{y}_i) - \xi \quad \forall (\bar{y}_1, \ldots, \bar{y}_n) \in \mathcal{Y}^n \]
\[ w_{C,A \cup B} + w_{C,A \cap B} \le w_{C,A} + w_{C,B} \quad \forall C \in \mathcal{C},\ A, B \subseteq C \qquad (9.5) \]

Note that we have a constraint for each tuple $(\bar{y}_1, \ldots, \bar{y}_n) \in \mathcal{Y}^n$, which is an exponentially large set. Despite the large set of constraints, we can solve this QP to any desired precision by using the cutting plane algorithm. This algorithm keeps track of a set $W$ of current constraints, solves the current QP with regard to those constraints, and then, given a solution $(w, \xi)$, finds the most violated constraint and adds it to $W$. Finding the most violated constraint consists of solving, for each example $x_i$, the problem
\[ \hat{y}_i = \operatorname{argmin}_{\hat{y} \in \mathcal{Y}} f_w(x_i, \hat{y}) - \Delta(y_i, \hat{y}). \qquad (9.6) \]
Since the features Ψ ensure that $f_w$ is SoS, as long as Δ factors as a sum over the cliques $\mathcal{C}$ (for instance, the Hamming loss is such a function), (9.6) can be solved with Submodular IBFS. Note that this also allows us to add arbitrary additional features for learning the unary potentials. Pseudocode for the entire S3SVM learning procedure is given in Algorithm 2.

9.3.4 Generalization to multi-label prediction

Submodular functions are intrinsically binary functions. In order to handle the multi-label case, we use expansion moves [12] to reduce the multi-label optimization problem to a series of binary subproblems, where each pixel may either switch to a given label α or keep its current label. If every binary subproblem of computing the optimal expansion move is an SoS problem, we call the original multi-label energy function an SoS expansion energy.

Let $L$ be our label set, with output space $\mathcal{Y} = L^V$.
Our learned function will have the form $f(y) = \sum_{C \in \mathcal{C}} f_C(y_C)$ where $f_C : L^C \to \mathbb{R}$. For a clique $C$ and label $\ell$, define $C_\ell = \{ i \in C \mid y_i = \ell \}$, i.e., the subset of $C$ taking label $\ell$.

Theorem 91. If all the clique functions are of the form
\[ f_C(y_C) = \sum_{\ell \in L} g_\ell(C_\ell) \tag{9.7} \]
where each $g_\ell$ is submodular, then any expansion move for the multi-label energy function $f$ will be SoS.

Proof. Fix a current labeling $y$, and let $B(S)$ be the energy when the set $S$ switches to label $\alpha$. We can write $B(S)$ in terms of the clique functions and sets $C_\ell$ as
\[ B(S) = \sum_{C \in \mathcal{C}} \Big( g_\alpha(C_\alpha \cup S) + \sum_{\ell \ne \alpha} g_\ell(C_\ell \setminus S) \Big) \tag{9.8} \]
We use a fact from the theory of submodular functions: if $f(S)$ is submodular, then for any fixed $T$ both $f(T \cup S)$ and $f(T \setminus S)$ are also submodular. Therefore, $B(S)$ is SoS.

Theorem 91 characterizes a large class of SoS expansion energies. These functions generalize commonly used multi-label clique functions, including the $P^n$ Potts model [48]. The $P^n$ model pays cost $\lambda_i$ when all pixels are equal to label $i$, and $\lambda_{max}$ otherwise. We can write this as an SoS expansion energy by letting $g_i(S) = \lambda_i - \lambda_{max}$ if $S = C$ and 0 otherwise. Then $\sum_{\ell} g_\ell(C_\ell)$ is equal to the $P^n$ Potts model, up to an additive constant. Generalizations such as the robust $P^n$ model [49] can be encoded in a similar fashion.

Figure 9.1: Example images from the binary segmentation results. From left to right, the columns are (a) the original image (b) the noisy input (c) results from Generic Cuts [3] (d) our results.

Finally, in order to learn these functions, we let $\Psi$ be composed of copies of $\Psi_{data}$, one for each $g_\ell$, and add corresponding copies of the constraints (9.3).

As a final note: even though the individual expansion moves can be computed optimally, α-expansion still may not find the global optimum for the multi-labeled energy. However, in practice α-expansion finds good local optima, and has been used for inference in Structural SVM with good results, as in [60].
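The $P^n$ Potts encoding and the expansion-move argument of Theorem 91 can be checked by brute force on a single small clique. The sketch below is only an illustration under stated assumptions: the helper names, the example costs, and the requirement that `lam_max` be at least every $\lambda_i$ are choices made here, not part of the thesis code.

```python
from itertools import product


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for mask in range(1 << len(items)):
        yield frozenset(x for k, x in enumerate(items) if (mask >> k) & 1)


def make_potts_g(lambdas, lam_max, clique):
    """g_l(S) = lambda_l - lam_max if S is the whole clique, else 0.

    Assumes lam_max >= every lambda_l, so that each g_l is submodular.
    """
    full = frozenset(clique)
    return {l: (lambda S, lam=lam: lam - lam_max if frozenset(S) == full else 0.0)
            for l, lam in lambdas.items()}


def clique_energy(g, clique, y):
    """f_C(y_C) = sum over labels l of g_l(C_l), with C_l = {i in C : y_i = l}."""
    return sum(g_l(frozenset(i for i in clique if y[i] == l))
               for l, g_l in g.items())


def potts(lambdas, lam_max, clique, y):
    """P^n Potts cost: lambda_l if the clique is uniformly l, else lam_max."""
    labels = {y[i] for i in clique}
    return lambdas[labels.pop()] if len(labels) == 1 else lam_max


def expansion_is_submodular(g, clique, y, alpha, tol=1e-12):
    """Brute-force check that B(S), the energy after S switches to alpha, is submodular."""
    def B(S):
        moved = {i: (alpha if i in S else y[i]) for i in clique}
        return clique_energy(g, clique, moved)
    subs = list(subsets(clique))
    return all(B(a | b) + B(a & b) <= B(a) + B(b) + tol
               for a in subs for b in subs)


# Hypothetical example: one clique of 4 pixels and 3 labels.
clique = (0, 1, 2, 3)
lambdas = {0: 1.0, 1: 2.0, 2: 0.5}
lam_max = 3.0
g = make_potts_g(lambdas, lam_max, clique)
```

In this example the additive constant is `lam_max`: the decomposed energy equals the Potts cost minus `lam_max` on every labeling of the clique.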
9.4 Experimental Results

In order to evaluate our algorithms, we focused on binary denoising and interactive segmentation. For binary denoising, Generic Cuts [3] provides the most natural comparison since it is a state-of-the-art method that uses SoS priors. For interactive segmentation the natural comparison is against GrabCut [79], where we used the OpenCV implementation. We ran our general S3SVM method, which can learn an arbitrary SoS function, and also considered the special case of only using pairwise priors. For both the denoising and segmentation applications, we significantly improve on the accuracy of the hand-tuned energy functions.

9.4.1 Binary denoising

Our binary denoising dataset consists of a set of 20 black and white images. Each image is 400 × 200 and is either a set of geometric lines or a hand-drawn sketch (see Figure 9.1). We were unable to obtain the original data used by [3], so we created our own similar data by adding independent Gaussian noise at each pixel.

For denoising, the hand-tuned Generic Cuts algorithm of [3] posed a simple MRF, with unary terms equal to the absolute distance from the noisy input, and an SoS prior, where each 2 × 2 clique penalizes the square root of the number of edges with differently labeled endpoints within that clique. There is a single parameter $\lambda$, which is the tradeoff between the unary energy and the smoothness term. The neighborhood structure $\mathcal{C}$ consists of all 2 × 2 patches of the image.

Our learned prior includes the same unary terms and clique structure, but instead of the square-root smoothness prior, we learn a clique function $g$ to get an MRF $E_{SVM}(y) = \sum_i |y_i - x_i| + \sum_{C \in \mathcal{C}} g(y_C)$. Note that every clique uses the same clique function, so this is analogous to a graph cuts prior where each pairwise edge has the same attractive potential. Our energy function has 16 total parameters (one for each possible value of $g$, which is defined on 2 × 2 patches).
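To make the form of the denoising energy concrete, here is a minimal sketch that evaluates $E(y) = \sum_i |y_i - x_i| + \sum_{C} g(y_C)$ with a single 16-entry table shared by all 2 × 2 patches. The bit ordering of the patch code, the function names, and the reading of "edges" as the four grid edges of a patch are assumptions made for this illustration; the square-root table only mimics the hand-tuned baseline prior described above and is not the learned function.

```python
import math


def patch_code(y, r, c):
    """4-bit code of the 2x2 patch with top-left corner (r, c).

    Bit order (an assumption of this sketch): top-left, top-right,
    bottom-left, bottom-right.
    """
    return (y[r][c]
            | (y[r][c + 1] << 1)
            | (y[r + 1][c] << 2)
            | (y[r + 1][c + 1] << 3))


def energy(y, x, g):
    """E(y) = sum_i |y_i - x_i| + sum over 2x2 patches C of g[code(y_C)]."""
    h, w = len(y), len(y[0])
    unary = sum(abs(y[r][c] - x[r][c]) for r in range(h) for c in range(w))
    prior = sum(g[patch_code(y, r, c)]
                for r in range(h - 1) for c in range(w - 1))
    return unary + prior


def sqrt_edge_table(lam):
    """Baseline-style table: lam * sqrt(number of patch edges whose endpoints differ)."""
    table = []
    for code in range(16):
        tl, tr, bl, br = ((code >> k) & 1 for k in range(4))
        cut = (tl != tr) + (bl != br) + (tl != bl) + (tr != br)
        table.append(lam * math.sqrt(cut))
    return table


# Hypothetical 3x3 example: a flat image and an image with one isolated pixel.
flat = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
dot = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
g_sqrt = sqrt_edge_table(1.0)
```

A learned prior would replace `g_sqrt` with 16 free values, subject to the submodularity constraints on the table.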
We randomly divided the 20 input images into 10 training images and 10 test images. The loss function was the Hamming distance between the correct, un-noisy image and the predicted image. To hand tune the value $\lambda$, we picked the value which gave the minimum pixel-wise error on the training set. S3SVM training took only 16 minutes.

Numerically, S3SVM performed significantly better than the hand-tuned method, with an average pixel-wise error of only 4.9% on the training set, compared to 28.6% for Generic Cuts. The time needed to do inference after training was similar for both methods: 0.82 sec/image for S3SVM vs. 0.76 sec/image for Generic Cuts. Visually, the S3SVM images are significantly cleaner looking, as shown in Figure 9.1.

9.4.2 Interactive segmentation

The input to interactive segmentation is a color image, together with a set of sparse foreground/background annotations provided by the user. See Figure 9.2 for examples. From the small set of labeled foreground and background pixels, the prediction task is to recover the ground-truth segmentation for the whole image.

Our baseline comparison is the GrabCut algorithm, which solves a pairwise CRF. The unary terms of the CRF are obtained by fitting a Gaussian Mixture Model to the histograms of pixels labeled as being definitely foreground or background. The pairwise terms are a standard contrast-sensitive Potts potential, where the cost of pixels $i$ and $j$ taking different labels is equal to $\lambda \cdot \exp(-\beta |x_i - x_j|)$ for some hand-coded parameters $\beta, \lambda$. Our primary comparison is against the OpenCV implementation of GrabCut, available at www.opencv.org.

Figure 9.2: Example images from binary segmentation results. Input with user annotations is shown at top, with results below. (Panels: Input, GrabCut, S3SVM-AMN, S3SVM.)

As a special case, our algorithm can be applied to pairwise-submodular energy functions, for which it solves the same optimization problem as in Associative Markov Networks (AMNs) [94, 2].
Automatically learning parameters allows us to add a large number of learned unary features to the CRF. As a result, in addition to the smoothness parameter $\lambda$, we also learn the relative weights of approximately 400 features describing the color values near a pixel, and relative distances to the nearest labeled foreground/background pixel. Further details on these features can be found in the Supplementary Material. We refer to this method as S3SVM-AMN.

Our general S3SVM method can incorporate higher-order priors instead of just pairwise ones. In addition to the unary features used in S3SVM-AMN, we add a sum-of-submodular higher-order CRF. Each 2 × 2 patch in the image has a learned submodular clique function. To obtain the benefits of the contrast-sensitive pairwise potentials for the higher-order case, we cluster (using k-means) the x and y gradient responses of each patch into 50 clusters, and learn one submodular potential for each cluster. Note that S3SVM automatically allows learning the entire energy function, including the clique potentials and unary potentials (which come from the data), simultaneously.

We use a standard interactive segmentation dataset from [32] of 151 images with annotations, together with pixel-level segmentations provided as ground truth. These images were randomly sorted into training, validation and testing sets, of size 75, 38 and 38 respectively. We trained both S3SVM-AMN and S3SVM on the training set for various values of the regularization parameter c, picked the value of c which gave the best accuracy on the validation set, and report the results for that value of c on the test set.

The overall performance is shown in the table below. Training time is measured in seconds, and testing time in seconds per image. Our implementation, which used the submodular flow algorithm based on IBFS discussed in section 5.2, will be made freely available under the MIT license.
Algorithm     Average error    Training (s)    Testing (s/image)
GrabCut       10.6 ± 1.4%      n/a             1.44
S3SVM-AMN     7.5 ± 0.5%       29000           0.99
S3SVM         7.3 ± 0.5%       92000           1.67

Learning and validation were performed 5 times with independently sampled training sets. The averages and standard deviations shown above are from these 5 samples.

Figure 9.3: A multi-label segmentation result, on data from [36]. The purple label represents vegetation, red is rhino/hippo and blue is ground. There are 7 labels in the input problem, though only 3 are present in the output we obtain on this particular image.

While our focus is on binary labeling problems, we have conducted some preliminary experiments with the multi-label version of our method described in section 9.3.4. A sample result is shown in Figure 9.3, using an image taken from the Corel dataset used in [36].

CHAPTER 10
CONCLUSION

In this thesis, we set out to expand the set of models for which fast inference is possible. As we’ve noted in the introduction, modeling and inference are tightly coupled: applications demand more accurate solutions, which require more sophisticated models; however, they are limited to the kinds of problems for which fast inference algorithms are known.

For computer vision problems, Markov Random Fields continue to be the tool of choice for encoding spatial relations between pixels in an image. We’ve shown that many useful properties of images cannot be encoded by first-order MRFs, and that many natural features of images, especially their local, patch-based statistics, are best encoded by higher-order MRFs.

When developing optimization algorithms for higher-order MRFs, we have been able to leverage existing graph-cuts methods, using reduction methods to turn higher-order MRFs into first-order problems. We’ve also seen that the basic ideas behind graph-cuts can be generalized to higher-order models, including alpha-expansion and other primal large-neighborhood search algorithms.
For designing higher-order versions of graph-cuts methods, the key concept appears to be submodularity. This has been known to be the necessary condition for first-order MRFs since [52]; however, the condition for first-order graphs is much simpler than the general case. In particular, we have found that Sum-of-Submodular inference is a natural middle ground between min-cut based inference and fully general submodular function minimization, with
\[ \text{MIN-CUT} \subseteq \text{SOS MINIMIZATION} \subseteq \text{SUBMODULAR MINIMIZATION} \tag{10.1} \]
In particular, we are able to take advantage of the clique structure by treating the cliques as a hypergraph over the variables. This allows a fairly straightforward generalization of augmenting-path based algorithms to the higher-order case (including the state-of-the-art for vision inputs, IBFS [30]). In this algorithm, we noted that submodularity is the key property for augmenting paths to find a globally optimal solution.

For multilabel problems, our key tool has been Linear Programming, and in particular, the Local Marginal Polytope relaxation of the MRF inference problem. The primary feature of linear programming based algorithms is that they can actually say something about the global behavior of the problem, and in particular, using duality we get a global lower bound on the optimal solution. This is in contrast with many popular primal-only algorithms such as alpha-expansion and fusion-moves, which make local choices (even if they are searching over a very large local neighborhood), and which cannot say anything about the global optimum.

Furthermore, we’ve seen that the LP dual is also useful to speed up inference algorithms, by guiding the binary subproblems within a fusion-move algorithm, as done by the primal-dual algorithm SoSPD. In particular, we have generalized the FastPD algorithm [59] for first-order MRFs to work on higher-order problems, with many of the same speedups over pure-primal algorithms.
Furthermore, we’ve used the dual LP to prove approximation ratios for our algorithm, giving a guaranteed bound on how far we can be from the optimum solution.

A major limitation of higher-order MRFs is that they are difficult to design by hand, as they have many more parameters than first-order models. We have explored one method for learning higher-order models, using a Structural SVM approach; however, many more learning algorithms are possible. A key feature of this and related learning algorithms, though, is that they require repeated application of inference. As a result, every time new inference algorithms allow new models to be efficiently optimized, we can do learning on these models as well.

Finally, we will note that higher-order MRFs are a heavyweight solution for many applications. The algorithms presented in this thesis have brought higher-order MRFs from being largely intractable to being reasonably fast to optimize. However, the algorithms involved still scale poorly with the clique size (typically $O(2^{|C|})$) and generally require minutes to run, compared to the milliseconds required for real-time performance. For achieving maximum accuracy, though, they offer much greater flexibility for modeling and encoding of constraints.

APPENDIX A
LOCAL COMPLETENESS

We are considering a general labeling problem, with variables $x_1, \ldots, x_n$, taking labels in sets $L_1, \ldots, L_n$. In an MRF the energy function can be written as a sum of clique energies: there is some set $\mathcal{C}$ of cliques, and clique functions $f_C$ such that
\[ E(x_1, \ldots, x_n) = \sum_{C \in \mathcal{C}} f_C(x_C) \tag{A.1} \]
To minimize this energy with respect to a fusion move, we have an input image $I$ and a proposed image $I'$, and for each pixel a binary variable $b_k$ that encodes whether the $k$-th pixel takes label $I_k$ or $I'_k$. The energy function is now a sum of clique energies over these $n$ binary variables.
For a clique $C$ of size $d$, we can write the clique energy $f_C$ as a sum of terms in the binary variables and their negations, by specifying the energy pointwise for each possible assignment of the $d$ binary variables in $C$:

• For each assignment $\gamma \in \mathbb{B}^d$, where $\mathbb{B} = \{0, 1\}$, let $f(\gamma)$ be the energy of the clique in the fused image according to $\gamma$. For instance, with $d = 4$ and $\gamma = (0, 1, 1, 0)$ we have $f(\gamma) = f_C(I_1, I'_2, I'_3, I_4)$.
• Let $b(\gamma)$ be the term whose $i$-th literal is $b_i$ or $\bar{b}_i$, according to whether $\gamma_i$ is 1 or 0 respectively. For $\gamma = (0, 1, 1, 0)$, we would have $b(\gamma) = \bar{b}_1 b_2 b_3 \bar{b}_4$.
• Note that the term $b(\gamma)$ is 1 exactly when the binary variables $(b_1, \ldots, b_d)$ take the assignment $\gamma$.

Thus, we can write the clique energy as:
\[ f_C(b_1, \ldots, b_d) = \sum_{\gamma \in \mathbb{B}^d} f(\gamma)\, b(\gamma) \tag{A.2} \]

The first step of our reduction is to transform this to a multilinear polynomial by substituting $1 - b_i$ for $\bar{b}_i$ each time a negated variable occurs. This could possibly result in terms with coefficients for each subset of $b_1, \ldots, b_d$.

For each subset $S \subseteq \{b_1, \ldots, b_d\}$, we can actually calculate the coefficient on the term $t_S = \prod_{j \in S} b_j$. Let $\Gamma_S$ be the set of assignments $\gamma \in \mathbb{B}^d$ with $\gamma_i = 0$ for $i \notin S$, and let
\[ \sigma(\gamma) = \begin{cases} 1 & \text{if the number of 0s in } \gamma \text{ is even} \\ -1 & \text{otherwise} \end{cases} \tag{A.3} \]
Then, after we substitute $(1 - b_i)$ for all occurrences of $\bar{b}_i$ and collect all terms with the same variables, the coefficient on the term $t_S$ is
\[ \mathrm{Coeff}(t_S) = \sum_{\gamma \in \Gamma_S} \sigma(\gamma)\, f(\gamma) \tag{A.4} \]
Therefore, to show that our energy function is locally dense, it suffices to show that $\mathrm{Coeff}(t_S)$ is never (or rarely) 0 for any subset $S \subsetneq \{b_1, \ldots, b_d\}$ (note that we don’t care if $t_{\{b_1, \ldots, b_d\}}$ has coefficient 0, since it is not a proper subset of any term).

We can obtain a general theorem about the binary energy functions corresponding to fusion moves, by moving to a continuous framework. We embed the original intensities in $\mathbb{R}$, and extend the clique energies $f_C$ to functions on $\mathbb{R}^d$.
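The expansion behind (A.2) to (A.4) can be checked numerically for a small clique. The following sketch computes the multilinear coefficients by summing signed values of $f$ over $\Gamma_S$ and verifies that the resulting polynomial reproduces $f$. It is an illustration only: the function names and the 0-based indexing are choices made here, and the sign is computed from the number of zeros of $\gamma$ inside $S$, which agrees with $\sigma(\gamma)$ up to a factor $(-1)^{d - |S|}$ that is constant for a fixed $S$ and therefore does not change whether a coefficient vanishes.

```python
from itertools import product


def multilinear_coeffs(f, d):
    """Coefficients of the multilinear polynomial equal to sum_gamma f(gamma) b(gamma).

    f maps a 0/1 tuple of length d to a number. Returns a dict from
    frozenset(S) (0-based variable indices) to the coefficient of prod_{j in S} b_j.
    """
    coeffs = {}
    for indicator in product((0, 1), repeat=d):
        S = [i for i in range(d) if indicator[i]]
        total = 0
        # Gamma_S: assignments that are 0 outside S, free inside S.
        for bits in product((0, 1), repeat=len(S)):
            gamma = [0] * d
            for i, bit in zip(S, bits):
                gamma[i] = bit
            zeros_in_S = len(S) - sum(bits)
            total += (-1) ** zeros_in_S * f(tuple(gamma))
        coeffs[frozenset(S)] = total
    return coeffs


def evaluate(coeffs, b):
    """Evaluate the multilinear polynomial at a 0/1 assignment b."""
    return sum(c for S, c in coeffs.items() if all(b[i] for i in S))


# Hypothetical clique energy on d = 4 binary variables.
def example_f(g):
    return g[0] + 2 * g[1] - 3 * g[2] * g[3] + 5 * g[0] * g[1] * g[2]


example_coeffs = multilinear_coeffs(example_f, 4)
```

In this example most coefficients are zero because `example_f` was built from only four monomials; the theorem that follows concerns energies for which such cancellations occur only on a set of measure 0.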
We need two assumptions: (1) $f_C$ is $d - 1$ times continuously differentiable and (2) each of the $d$ different mixed partials $\frac{\partial^{d-1} f}{\partial x_1 \cdots \widehat{\partial x_i} \cdots \partial x_d}$ (where $\widehat{\partial x_i}$ means to omit the $i$-th partial) takes its zeros in a set of measure 0.

Theorem 92. Under these two assumptions the set of proposed-current image pairs $(I, I')$ for which the fusion move binary energy function does not have local density 1 has measure 0 as a subset of $\mathbb{R}^n \times \mathbb{R}^n$.

Proof. By the above argument, it suffices to show that for a clique $C$ on variables $x_1, \ldots, x_d$, and for each $S \subsetneq \{x_1, \ldots, x_d\}$, the current and proposed images $(I_C, I'_C)$ which have $\mathrm{Coeff}(t_S) = 0$ have measure 0 in $\mathbb{R}^d \times \mathbb{R}^d$.

For every fusion move $(I_C, I'_C)$ and assignment $\gamma \in \mathbb{B}^d$, we get a point $v^{(\gamma)}$ in $\mathbb{R}^d$, the result of fusion on just the clique pixels:
\[ v^{(\gamma)}_k = \begin{cases} I_k & \gamma_k = 0 \\ I'_k & \gamma_k = 1 \end{cases} \]
If we list out these points, we get the set
\[ (I_1, \ldots, I_{d-1}, I_d),\ (I_1, \ldots, I_{d-1}, I'_d),\ (I_1, \ldots, I'_{d-1}, I_d),\ \ldots,\ (I'_1, \ldots, I'_{d-1}, I_d),\ (I'_1, \ldots, I'_d) \]
containing each possible fusion of $I_C$ and $I'_C$. Note that these $2^d$ points form an axis-aligned rectangular prism in $\mathbb{R}^d$. Denote these points as $\mathrm{Verts}(I_C, I'_C)$. Every fusion move $(I_C, I'_C)$ gives a rectangular prism in this fashion.

Now, fix $S$, and let the bad set of fusion moves, $B$, be those $(I_C, I'_C)$ for which $\mathrm{Coeff}(t_S) = 0$. To produce a contradiction, assume that $B$ has nonzero measure. It is a closed set, so it contains an open ball. So there is some fusion move $(x_0, y_0)$ and radius $\delta$ such that for all $\epsilon_x, \epsilon_y \in \mathbb{R}^d$ with $|\epsilon_x|, |\epsilon_y| < \delta$, the fusion move $(x_0 + \epsilon_x, y_0 + \epsilon_y)$ is still a bad fusion move.

Now, since $\mathrm{Coeff}(t_S) = 0$ for the fusion move $(x_0, y_0)$, we can manipulate equation (A.4) to get that
\[ f(v^{(1,1,\ldots,1)}) = - \sum_{\gamma \in \Gamma_S \setminus \{(1,\ldots,1)\}} \sigma(\gamma)\, f(v^{(\gamma)}) \tag{A.5} \]
Since there’s an open ball of bad fusion moves around $(x_0, y_0)$, for $\epsilon \in \mathbb{R}^d$ with $|\epsilon| < \delta$, the fusion move $(x_0, y_0 + \epsilon)$ is also bad. If we set $\epsilon(\gamma)$ equal to $(\gamma_1 \epsilon_1, \ldots, \gamma_d \epsilon_d)$ (i.e. the vector which is $\epsilon_i$ when $\gamma_i$ is 1, and 0 otherwise), then the fusion move $(x_0, y_0 + \epsilon)$ gives a rectangular prism with vertices $v^{(\gamma)} + \epsilon(\gamma)$ for $\gamma \in \mathbb{B}^d$. Then, since $(x_0, y_0 + \epsilon)$ is still a bad fusion move, we can again manipulate equation (A.4) to get
\[ f(v^{(1,1,\ldots,1)} + \epsilon) = - \sum_{\gamma \in \Gamma_S \setminus \{(1,\ldots,1)\}} \sigma(\gamma)\, f(v^{(\gamma)} + \epsilon(\gamma)) \tag{A.6} \]
Let $g_\gamma(\epsilon) = f(v^{(\gamma)} + \epsilon(\gamma))$. Notice that since $\epsilon(\gamma)_i = 0$ whenever $\gamma_i = 0$, this function only depends on the variables $\epsilon_i$ where $\gamma_i = 1$. Thus, for $\gamma \in \Gamma_S \setminus \{(1, \ldots, 1)\}$, the function $g_\gamma$ depends on at most $d - 2$ of the $\epsilon_i$. This is because $S$ is a proper subset of $\{x_1, \ldots, x_d\}$, so all the $\gamma$ have $\gamma_d = 0$, and then we remove the element $(1, \ldots, 1)$ which has $d - 1$ 1s.

Therefore, the mixed partial $\frac{\partial^{d-1} g_\gamma}{\partial x_1 \cdots \partial x_{d-1}}$ is 0 for all $|\epsilon| < \delta$. Since $f(v^{(1,\ldots,1)} + \epsilon)$ is a linear combination of the $g_\gamma$, it also has this mixed partial equal to 0. But then we have found a set of nonzero measure (the open ball of radius $\delta$ around $y_0$) with mixed partial 0, contradicting our hypothesis. Therefore, the set of bad fusion moves in fact must have measure 0.

APPENDIX B
LAPLACIAN EQUATIONS

In the proof of Lemma 74 in the main text, in order to show that the solution of the Laplacian equation $M\psi = \Delta$ is non-negative, we claimed it is straightforward to show that the inverse of the coefficient matrix, $M^{-1}$, is non-negative (meaning each component is non-negative), hence $\psi = M^{-1}\Delta$ is non-negative (since $\Delta$ is also non-negative). Now, we will show that $M^{-1}$ is non-negative for completeness.

It’s useful to have the following fact about the inverse of a tridiagonal matrix [96]:

Lemma 93. The inverse of a non-singular tridiagonal matrix
\[ M = \begin{pmatrix} a_1 & b_1 & & & \\ c_1 & a_2 & b_2 & & \\ & c_2 & \ddots & \ddots & \\ & & \ddots & \ddots & b_{n-1} \\ & & & c_{n-1} & a_n \end{pmatrix} \]
is given by
\[ (M^{-1})_{ij} = \begin{cases} (-1)^{i+j} \left( \prod_{k=i}^{j-1} b_k \right) \theta_{i-1}\, \phi_{j+1} / \theta_n, & \text{if } i \le j \\ (-1)^{i+j} \left( \prod_{k=j}^{i-1} c_k \right) \theta_{j-1}\, \phi_{i+1} / \theta_n, & \text{if } i > j \end{cases} \tag{B.1} \]
where
\[ \theta_i = a_i \theta_{i-1} - b_{i-1} c_{i-1} \theta_{i-2} \quad \text{for } i = 2, 3, \ldots, n \tag{B.2} \]
\[ \phi_i = a_i \phi_{i+1} - b_i c_i \phi_{i+2} \quad \text{for } i = n - 1, \ldots, 1 \tag{B.3} \]
with initial values $\theta_0 = 1$, $\theta_1 = a_1$, $\phi_{n+1} = 1$, $\phi_n = a_n$.

In our Laplacian equations, we have $a_i = 2$ for all $i$ and $b_j = c_j = -1$ for all $j$. Substituting these into Lemma 93, we have:
\[ (M^{-1})_{ij} = \begin{cases} \theta_{i-1}\, \phi_{j+1} / \theta_n, & \text{if } i \le j \\ \theta_{j-1}\, \phi_{i+1} / \theta_n, & \text{if } i > j \end{cases} \tag{B.4} \]
where
\[ \theta_i = 2\theta_{i-1} - \theta_{i-2} \ \text{ for } i = 2, 3, \ldots, n, \qquad \phi_i = 2\phi_{i+1} - \phi_{i+2} \ \text{ for } i = n - 1, \ldots, 1 \tag{B.5} \]
with initial values $\theta_0 = \phi_{n+1} = 1$, $\theta_1 = \phi_n = 2$.

It’s easy to use induction to show $\theta_i = i + 1$ and $\phi_i = n + 2 - i$ from their recursive definition in (B.5). Therefore, we have
\[ (M^{-1})_{ij} = \begin{cases} \dfrac{i\,(n + 1 - j)}{n + 1}, & \text{if } i \le j \\[2ex] \dfrac{j\,(n + 1 - i)}{n + 1}, & \text{if } i > j \end{cases} \tag{B.6} \]
Clearly, $M^{-1}$ is a positive matrix in our Laplacian equations.

APPENDIX C
APPROXIMATION RATIO FOR CARDINALITY UPPER BOUNDS

Theorem 79. The cardinality-based upper bound gives a $2^{(1 - \frac{1}{p})}$-approximation.

Proof. Similar to the proof of Theorem 76, we can rewrite the objective for odd $n$ as
\[ \|\psi\|_p := \left( \sum_{k=0}^{n} \binom{n}{k} \psi_k^p \right)^{\frac{1}{p}} = \left( \sum_{k=0}^{(n-1)/2} \binom{n}{k} \left( \psi_k^p + \psi_{n-k}^p \right) \right)^{\frac{1}{p}} \tag{C.1} \]
and for even $n$:
\[ \|\psi\|_p = \left( \sum_{k=0}^{n/2 - 1} \binom{n}{k} \left( \psi_k^p + \psi_{n-k}^p \right) + \frac{1}{2} \binom{n}{n/2} \left( \psi_{n/2}^p + \psi_{n/2}^p \right) \right)^{\frac{1}{p}} \tag{C.2} \]
Let’s use $\Psi_k = \psi_k^p + \psi_{n-k}^p$ as a shorthand. We can see the objective can be represented as
\[ \|\psi\|_p := \left( \sum_{k=0}^{\lfloor n/2 \rfloor} a_k \Psi_k \right)^{\frac{1}{p}} \tag{C.3} \]
with $a_k \ge 0$ for all $k$.

Consider $\psi^*$ and $\bar{\psi}$ as the true optimal solution and our approximate solution. Define $\Psi^*$ and $\bar{\Psi}$ accordingly. Considering each term $\Psi_k$, we must have $\Psi^*_k \ge \frac{L_k^p}{2^{p-1}}$, since the RHS is the solution of the following program:
\[ \min_{\psi} \ \Psi_k \quad \text{s.t.} \quad \Psi_k = \psi_k^p + \psi_{n-k}^p, \quad \psi_k + \psi_{n-k} \ge L_k, \quad \psi_k, \psi_{n-k} \ge 0 \tag{C.4} \]
where the minimizer is achieved by $\psi_k = \psi_{n-k} = \frac{L_k}{2}$. (Footnote 1: Recall $L_k$ is the lower bound of $\psi_k + \psi_{n-k}$, which only depends on $\Delta$.)

Meanwhile, we must have $\bar{\Psi}_k \le L_k^p$, since the RHS is the solution of the following program:
\[ \max_{\psi} \ \Psi_k \quad \text{s.t.} \quad \Psi_k = \psi_k^p + \psi_{n-k}^p, \quad \psi_k + \psi_{n-k} = L_k, \quad \psi_k, \psi_{n-k} \ge 0 \tag{C.5} \]
where the maximizer is achieved by either $\psi_k = 0, \psi_{n-k} = L_k$ or $\psi_k = L_k, \psi_{n-k} = 0$. (Footnote 2: Recall we proved in Lemma 5 in the main paper that our approximation $\bar{\psi}$ must make each $\bar{\psi}_k + \bar{\psi}_{n-k}$ achieve its lower bound $L_k$, hence we have the second constraint.)

Therefore, we must have $\frac{\bar{\Psi}_k}{\Psi^*_k} \le 2^{p-1}$ for all $k$. As a non-negative linear combination of non-negative numbers, we also have
\[ \sum_{k=0}^{\lfloor n/2 \rfloor} a_k \bar{\Psi}_k \le 2^{p-1} \sum_{k=0}^{\lfloor n/2 \rfloor} a_k \Psi^*_k \tag{C.6} \]
hence
\[ \left( \sum_{k=0}^{\lfloor n/2 \rfloor} a_k \bar{\Psi}_k \right)^{\frac{1}{p}} \le 2^{(1 - \frac{1}{p})} \left( \sum_{k=0}^{\lfloor n/2 \rfloor} a_k \Psi^*_k \right)^{\frac{1}{p}} \tag{C.7} \]

BIBLIOGRAPHY

[1] Björn Andres, Jörg H. Kappes, Ullrich Köthe, Christoph Schnörr, and Fred A. Hamprecht. An empirical comparison of inference algorithms for graphical models with higher order factors using OpenGM. In DAGM-Symposium, pages 353–362, 2010.
[2] Dragomir Anguelov, Benjamin Taskar, Vassil Chatalbashev, Daphne Koller, Dinkar Gupta, Geremy Heitz, and Andrew Y. Ng. Discriminative learning of Markov Random Fields for segmentation of 3D scan data. In CVPR, pages 169–176, 2005.
[3] Chetan Arora, Subhashis Banerjee, Prem Kalra, and S. N. Maheshwari. Generic cuts: an efficient algorithm for optimal inference in higher order MRF-MAP. In ECCV, 2012.
[4] Maria-Florina Balcan and Nicholas J. A. Harvey. Learning submodular functions. In ACM Symposium on Theory of Computing (STOC), pages 793–802, 2011.
[5] D. Batra and P. Kohli. Making the right moves: Guiding alpha-expansion using local primal-dual gaps. In CVPR, pages 1865–1872, 2011.
[6] I. Ben Ayed, L. Gorelick, and Y. Boykov. Auxiliary cuts for general classes of higher order functionals. In CVPR, 2013.
[7] J. Besag. On the statistical analysis of dirty pictures (with discussion). Journal of the Royal Statistical Society, Series B, 48(3):259–302, 1986.
[8] E. Boros and P. L. Hammer. Pseudo-boolean optimization. Discrete Applied Mathematics, 123(1-3), 2002.
[9] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[10] Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. TPAMI, 26(9):1124–1137, 2004. [11] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. In International Conference on Computer Vision (ICCV), pages 377–384, 1999. 235 [12] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. TPAMI, 23(11):1222–1239, 2001. [13] Qifeng Chen and Vladlen Koltun. Fast MRF optimization with application to depth reconstruction. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 3914–3921, 2014. [14] B. V. Cherkassky and A. V. Goldberg. On implementing push-relabel method for the maximum flow problem. Algorithmica, 19:390–410, 1997. [15] J. Edmonds and R. Giles. A min-max relation for submodular functions on graphs. Annals of Discrete Mathematics, 1:185–204, 1977. [16] Jack Edmonds. Submodular functions, matroids, and certain polyhedra. In Michael Jnger, Gerhard Reinelt, and Giovanni Rinaldi, editors, Combinatorial Optimization Eureka, You Shrink!, volume 2570 of Lecture Notes in Computer Science, pages 11–26. Springer Berlin Heidelberg, 2003. [17] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In International Conference on Machine Learning (ICML), pages 304–311, 2008. [18] A. Fix, A. Gruber, E. Boros, and R. Zabih. A graph cut algorithm for higher-order Markov Random Fields. In ICCV, 2011. [19] A. Fix, T. Joachims, S. Park, and R. Zabih. Structured learning of sum-ofsubmodular higher order energy functions. In ICCV, 2013. [20] Alexander Fix and Sameer Agarwal. Duality and the Continuous Graphical Model, pages 266–281. Springer International Publishing, Cham, 2014. [21] Alexander Fix, Joyce Chen, Endre Boros, and Ramin Zabih. 
Computer Vision – ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part I, chapter Approximate MRF Inference Using Bounded Treewidth Subgraphs, pages 385–398. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. [22] Alexander Fix, Chen Wang, and Ramin Zabih. A primal-dual algorithm for higher-order multilabel markov random fields. In CVPR, 2014. Supplemental Material at www.cs.cornell.edu/˜afix/. 236 [23] L. Ford and D. Fulkerson. Flows in Networks. Princeton University Press, 1962. [24] D. Freedman and P. Drineas. Energy minimization via graph cuts: Settling what is possible. In CVPR, 2005. [25] Brendan Frey and David MacKay. A revolution: Belief propagation in graphs with cycles. In Neural Information Processing Systems (NIPS), 1997. [26] S Fujishige and X Zhang. A push/relabel framework for submodular flows and its refinement for 0-1 submodular flows. Optimization, 38(2):133–154, 1996. [27] Andrew C Gallagher, Dhruv Batra, and Devi Parikh. Inference for order reduction in markov random fields. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1857–1864. IEEE, 2011. [28] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. TPAMI, 6:721–741, 1984. [29] Amir Globerson and Tommi S. Jaakkola. Fixing max-product: Convergent message passing algorithms for map lp-relaxations. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 553–560. Curran Associates, Inc., 2008. [30] Andrew V. Goldberg, Sagi Hed, Haim Kaplan, Robert E. Tarjan, and Renato F. Werneck. Maximum flows by incremental breadth-first search. In European Symposium on Algorithms, pages 457–468. Springer-Verlag, 2011. [31] Lena Gorelick, Yuri Boykov, Olga Veksler, Ismail Ben Ayed, and Andrew Delong. Submodularization for binary pairwise energies. In CVPR, 2014. [32] V. Gulshan, C. Rother, A. 
Criminisi, A. Blake, and A. Zisserman. Geodesic star convexity for interactive image segmentation. In CVPR, 2010. [33] P. L. Hammer, P. Hansen, and B. Simeone. Roof duality, complementation and persistency in quadratic 0-1 optimization. Mathematical Programming, 28:121–155, 1984. [34] P.L. Hammer and S. Rudeanu. Boolean Methods in Operations Research and Related Areas. Springer, 1968. 237 [35] J. M. Hammersley and P. E. Clifford. Markov random fields on finite graphs and lattices. Unpublished manuscript, 1971. [36] Xuming He, Richard S. Zemel, and Miguel A´ . Carreira-Perpin˜ a´n. Multiscale conditional random fields for image labeling. In CVPR, pages 695– 703, 2004. [37] H. Ishikawa. Higher-order clique reduction in binary graph cut. In CVPR, 2009. [38] H. Ishikawa. Higher-order gradient descent by fusion-move graph cut. In ICCV, 2009. [39] Hiroshi Ishikawa. Exact optimization for Markov Random Fields with convex priors. TPAMI, 25(10):1333–1336, 2003. [40] Hiroshi Ishikawa. Transformation of general binary MRF minimization to the first order case. TPAMI, 33(6), 2010. [41] T. Joachims, T. Finley, and Chun-Nam Yu. Cutting-plane training of structural svms. Machine Learning, 77(1):27–59, 2009. [42] Vladimir Jojic, Stephen Gould, and Daphne Koller. Accelerated dual decomposition for map inference. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 503–510, 2010. [43] F. Kahl and P. Strandmark. Generalized roof duality for pseudo-boolean optimization. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 255–262, 2011. [44] Fredrik Kahl and Petter Strandmark. Generalized roof duality. Discrete Applied Mathematics, 160(1617):2419 – 2434, 2012. [45] Jorg H Kappes, Bjoern Andres, Fred A Hamprecht, Christoph Schnorr, Sebastian Nowozin, Dhruv Batra, Sungwoong Kim, Bernhard X Kausler, Jan Lellmann, Nikos Komodakis, et al. A comparative study of modern inference techniques for discrete energy minimization problems. 
In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1328–1335. IEEE, 2013. [46] B. M. Kelm, N. Mueller, B. H. Menze, and F. A. Hamprecht. Bayesian estimation of smooth parameter maps for dynamic contrast-enhanced mr 238 images with block-icm. In Computer Vision and Pattern Recognition Workshop, 2006. CVPRW ’06. Conference on, pages 96–96, June 2006. [47] Jon Kleinberg and Eva Tardos. Approximation algorithms for classification problems with pairwise relationships: metric labeling and Markov Random Fields. J. ACM, 49(5):616–639, 2002. ACM Press. [48] Pushmeet Kohli, M. Pawan Kumar, and Philip H.S. Torr. P3 and beyond: Move making algorithms for solving higher order functions. TPAMI, 31(9):1645–1656, 2008. [49] Pushmeet Kohli, Lubor Ladick, and Philip Torr. Robust higher order potentials for enforcing label consistency. IJCV, 82:302–324, 2009. [50] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. In International Workshop on Artificial Intelligence and Statistics (AISTATS), 2005. [51] V. Kolmogorov and C. Rother. Minimizing nonsubmodular functions with graph cuts-a review. TPAMI, 29(7):1274–1279, July 2007. Earlier version appears as technical report MSR-TR-2006-100. [52] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? TPAMI, 26(2):147–59, 2004. [53] Vladimir Kolmogorov. Minimizing a sum of submodular functions. Discrete Appl. Math., 160(15):2246–2258, October 2012. [54] Vladimir Kolmogorov and Thomas Schoenemann. Generalized sequential tree-reweighted message passing. CoRR, abs/1205.6352, 2012. [55] N. Komodakis and N. Paragios. Beyond pairwise energies: Efficient optimization for higher-order MRFs. In CVPR, pages 2985–2992, 2009. [56] Nikos Komodakis, Nikos Paragios, and Georgios Tziritas. Mrf energy minimization and beyond via dual decomposition. In IN: IEEE PAMI., 2011. [57] Nikos Komodakis and Georgios Tziritas. 
A new framework for approximate labeling via graph cuts. In International Conference on Computer Vision (ICCV), 2005. 239 [58] Nikos Komodakis and Georgios Tziritas. Approximate labeling via graph cuts based on linear programming. TPAMI, 29(8):1436–1453, 2007. [59] Nikos Komodakis, Georgios Tziritas, and Nikos Paragios. Fast primaldual strategies for MRF optimization. Technical Report 0605, Ecole Centrale de Paris, 2006. [60] H. Koppula, A. Anand, T. Joachims, and A. Saxena. Semantic labeling of 3D point clouds for indoor scenes. In Conference on Neural Information Processing Systems (NIPS), 2011. [61] Vivek Kwatra, Arno Schodl, Irfan Essa, Greg Turk, and Aaron Bobick. Graphcut textures: Image and video synthesis using graph cuts. SIGGRAPH, 2003. [62] Dongjin Kwon, Kyong Joon Lee, Il Dong Yun, and Sang Uk Lee. Nonrigid image registration using dynamic higher-order mrf model. In ECCV, pages 373–386, 2008. [63] Lubor Ladicky, Chris Russell, Pushmeet Kohli, and Philip H. S. Torr. Computer Vision – ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part V, chapter Graph Cut Based Inference with Co-occurrence Statistics, pages 239–253. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010. [64] Lubor Ladicky, Christopher Russell, Pushmeet Kohli, and Philip H. S. Torr. Associative hierarchical CRFs for object class image segmentation. In ICCV, pages 739–746, 2009. [65] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001. [66] V Lempitsky, C Rother, S Roth, and A Blake. Fusion moves for Markov Random Field optimization. TPAMI, 32(8):1392–1405, Aug 2010. [67] Hui Lin and Jeff Bilmes. Learning mixtures of submodular shells with application to document summarization. In UAI, pages 479–490, 2012. [68] L. Lova´sz. Submodular functions and convexity, pages 235–257. Springer Berlin Heidelberg, Berlin, Heidelberg, 1983. 
[69] Carsten Lund and Mihalis Yannakakis. On the hardness of approximating minimization problems. J. ACM, 41(5):960–981, September 1994.
[70] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
[71] D. Munoz, J. A. Bagnell, N. Vandapel, and M. Hebert. Contextual classification with functional max-margin Markov networks. In CVPR, pages 975–982, 2009.
[72] Claudia Nieuwenhuis, Eno Töppe, Lena Gorelick, Olga Veksler, and Yuri Boykov. Efficient regularization of squared curvature. CoRR, abs/1311.1838, 2013.
[73] James B. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Math. Program., 118(2):237–251, January 2009.
[74] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988.
[75] R. Potts. Some generalized order-disorder transformations. Proceedings of the Cambridge Philosophical Society, 48:106–109, 1952.
[76] I. G. Rosenberg. Reduction of bivalent maximization to the quadratic case. Technical report, Centre d’Études de Recherche Opérationnelle, 1975.
[77] Stefan Roth and Michael Black. Fields of experts. IJCV, 82:205–229, 2009.
[78] C. Rother, P. Kohli, W. Feng, and J. Y. Jia. Minimizing sparse higher order energy functions of discrete variables. In CVPR, pages 1382–1389, 2009.
[79] C. Rother, V. Kolmogorov, and A. Blake. “GrabCut” - interactive foreground extraction using iterated graph cuts. SIGGRAPH, 23(3):309–314, 2004.
[80] C. Rother, S. Kumar, V. Kolmogorov, and A. Blake. Digital tapestry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[81] Bogdan Savchynskyy, Stefan Schmidt, Jörg Kappes, and Christoph Schnörr. Efficient MRF energy minimization via adaptive diminishing smoothing. Uncertainty in Artificial Intelligence, UAI-2012, pages 746–755, 2012.
[82] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3):7–42, 2002.
[83] Daniel Scharstein and Richard Szeliski. High-accuracy stereo depth maps using structured light. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, pages I–195. IEEE, 2003.
[84] Dmitrij Schlesinger. Exact solution of permuted submodular minsum problems. In Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 28–38. 2007.
[85] Alexander Shekhovtsov, Pushmeet Kohli, and Carsten Rother. Pattern Recognition: Joint 34th DAGM and 36th OAGM Symposium, Graz, Austria, August 28-31, 2012. Proceedings, chapter Curvature Prior for MRF-Based Segmentation and Shape Inpainting, pages 41–51. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[86] M. I. Shlezinger. Syntactic analysis of two-dimensional visual signals in the presence of noise. Cybernetics, 12(4):612–628, 1976.
[87] R. Sipos, P. Shivaswamy, and T. Joachims. Large-margin learning of submodular summarization models. In Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2012.
[88] Petter Strandmark and Fredrik Kahl. Energy Minimization Methods in Computer Vision and Pattern Recognition: 8th International Conference, EMMCVPR 2011, St. Petersburg, Russia, July 25-27, 2011. Proceedings, chapter Curvature Regularization for Curves and Surfaces in a Global Optimization Framework, pages 205–218. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
[89] Paul Swoboda, Bogdan Savchynskyy, Jörg H. Kappes, and Christoph Schnörr. Partial optimality by pruning for MAP-inference with general graphical models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[90] Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, and Carsten Rother. A comparative study of energy minimization methods for Markov Random Fields. TPAMI, 30(6):1068–1080, 2008.
[91] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In ECCV, pages II: 582–595, 2008.
[92] Meng Tang, Ismail Ben Ayed, and Yuri Boykov. Pseudo-bound optimization for binary energies. In ECCV, 2014.
[93] B. Taskar, C. Guestrin, and D. Koller. Maximum-margin Markov networks. In Advances in Neural Information Processing Systems (NIPS), 2003.
[94] Benjamin Taskar, Vassil Chatalbashev, and Daphne Koller. Learning associative Markov networks. In ICML, 2004.
[95] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In International Conference on Machine Learning (ICML), pages 104–112, 2004.
[96] Riaz A. Usmani. Inversion of a tridiagonal Jacobi matrix. Linear Algebra and Its Applications, 212:413–414, 1994.
[97] Chaohui Wang, Nikos Komodakis, and Nikos Paragios. Markov random field modeling, inference & learning in computer vision & image understanding: A survey. Computer Vision and Image Understanding, 117(11):1610–1627, 2013.
[98] Chaohui Wang, Olivier Teboul, Fabrice Michel, Salma Essafi, and Nikos Paragios. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2010: 13th International Conference, Beijing, China, September 20-24, 2010, Proceedings, Part III, chapter 3D Knowledge-Based Segmentation Using Pose-Invariant Higher-Order Graphs, pages 189–196. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
[99] Y. Weiss, C. Yanover, and T. Meltzer. MAP estimation, linear programming and belief propagation with convex free energies. In Uncertainty in AI, 2007.
[100] T. Werner. A linear programming approach to max-sum problem: A review.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(7):1165–1179, July 2007.
[101] Tomáš Werner. High-arity interactions, polyhedral relaxations, and cutting plane algorithm for soft constraint optimisation (MAP-MRF). In CVPR 2008: Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 109–116, Madison, USA, June 2008. IEEE Computer Society, Omnipress.
[102] Oliver Woodford, Philip Torr, Ian Reid, and Andrew Fitzgibbon. Global stereo reconstruction under second-order smoothness priors. TPAMI, 31:2115–2128, 2009.
[103] Yisong Yue and T. Joachims. Predicting diverse subsets using structural SVMs. In International Conference on Machine Learning (ICML), pages 271–278, 2008.
[104] Stanislav Zivny, David Cohen, and Peter Jeavons. The expressive power of binary submodular functions. In Mathematical Foundations of Computer Science 2009, volume 5734 of LNCS, pages 744–757. 2009.