GRAPH CUTS, SUM-OF-SUBMODULAR FLOW, AND LINEAR PROGRAMMING: EFFECTIVE INFERENCE IN HIGHER-ORDER MARKOV RANDOM FIELDS
A Dissertation Presented to the Faculty of the Graduate School
of Cornell University in Partial Fulﬁllment of the Requirements for the Degree of
Doctor of Philosophy
by Alexander Fix
May 2017

© 2017 Alexander Fix ALL RIGHTS RESERVED

GRAPH CUTS, SUM-OF-SUBMODULAR FLOW, AND LINEAR PROGRAMMING: EFFECTIVE INFERENCE IN HIGHER-ORDER MARKOV RANDOM FIELDS
Alexander Fix, Ph.D.
Cornell University 2017
Optimization algorithms have a long history of success in computer vision, providing effective algorithms for tasks as varied as segmentation, stereo estimation, image denoising and scene understanding. A notable example of this is Graph Cuts, in which the minimum-cut problem is used to solve a class of vision problems known as ﬁrst-order Markov Random Fields. Despite this success, ﬁrst-order MRFs have their limitations. They cannot encode correlations between groups of pixels larger than two or easily express higher-order statistics of images. In this thesis, we generalize graph cuts to higher-order MRFs, while still maintaining the properties that make graph cuts successful.
In particular, we will examine three different mathematical techniques which have combined to make previously intractable higher-order inference problems practical within the last few years. First, order-reducing reductions, which transform higher-order problems into familiar first-order MRFs. Second, a generalization of the min-cut problem to hypergraphs, called Sum-of-Submodular optimization. And finally, linear programming relaxations based on the Local Marginal Polytope, which together with Sum-of-Submodular flow result in the highly effective primal-dual algorithm SoSPD.
This thesis presents all mathematical background for these algorithms, as well as an implementation and experimental comparison with the state of the art.

BIOGRAPHICAL SKETCH Alexander Fix graduated with a Bachelor of Science in Computer Science and Mathematics from the University of Chicago in 2009. He is currently a PhD student at Cornell University, focusing on optimization algorithms with applications in Computer Vision. His advisor is Ramin Zabih. In the summer of 2013, he was a research intern at Google, advised by Sameer Agarwal. From 2013 to 2014, he completed his PhD research at Cornell Tech, in NYC. Since February 2015, he has been a researcher at Oculus Research in Redmond, WA.

ACKNOWLEDGEMENTS
First, I am most grateful to my PhD advisor, Ramin Zabih, who has supported me without fail for the last six years. There are too many things to list, but here’s a start: Thank you for giving me the perfect environment to grow as a researcher — for independence when I needed it, and guidance when independence failed. Thank you for your example of how to be a member of a research community, and all your many introductions — here’s to many more workshops in Italy. Thank you for bearing with me on the actual writing of this thesis — it’s been a trial, but I think it’s turned out in the end.
I would also like to thank my committee members David Williamson and David Shmoys, for teaching me everything I know about approximation algorithms and linear programming, and for all your questions along the way — this thesis wouldn’t be half so interesting without them. Endre Boros, without whom I would not have started on my ﬁrst project, and for being a continual fount of promising ideas and research ideas ever since. Sameer Agarwal, and the rest of Steve Seitz’s group at Google, for a truly wonderful internship — thank you for introducing me to the wonderful world of research in industry.
And ﬁnally, Loranne, for putting up with all the years, and all the travel. I couldn’t have done it without you.
The research in this thesis has been funded by NSF grants IIS-0803705, IIS-1161860/1161476, and IIS-1161282.

TABLE OF CONTENTS

Biographical Sketch
Acknowledgements
Table of Contents

1 Introduction
  1.1 Notation
  1.2 Optimization Basics
    1.2.1 Constrained Optimization
    1.2.2 Constraint Indicator Functions
    1.2.3 Minimizing Elements
    1.2.4 Relaxations
    1.2.5 Equivalence of Optimization Problems
    1.2.6 Common Equivalences Between Problems
  1.3 Example: Image Segmentation
    1.3.1 Binary Labeling Problems
    1.3.2 Per-Pixel Cost Functions
    1.3.3 Spatial Relations Between Pixels
    1.3.4 The Potts Model
    1.3.5 Reduction to Graph Cut
    1.3.6 Discussion
  1.4 Markov Random Fields
    1.4.1 Labeling Problems
    1.4.2 Maximum A-Posteriori (MAP) Inference
    1.4.3 Log-probabilities
    1.4.4 MAP inference in Foreground-Background Segmentation
    1.4.5 Conditional Dependence
    1.4.6 The Hammersley-Clifford Theorem
    1.4.7 The Potts Model as an MRF
  1.5 First-order and Higher-order MRFs
    1.5.1 Advantages of Higher-Order Models
    1.5.2 Image Denoising and Patch-Based Priors
    1.5.3 Curvature Regularizing Priors for Stereo
  1.6 Conclusion

2 Mathematical Background
  2.1 Reparameterization
  2.2 Pseudoboolean functions
    2.2.1 Representations of MRFs
    2.2.2 Set Functions
    2.2.3 Multilinear Polynomials
    2.2.4 Properties of Multilinear Polynomials
    2.2.5 Computational Complexity and Hardness of Approximation
  2.3 Submodular Functions
    2.3.1 Decreasing Marginal Gains
    2.3.2 Equivalent Definitions of Submodularity
    2.3.3 Properties of Submodular Functions
    2.3.4 Submodular First-order Pseudoboolean Functions
  2.4 Local and Marginal Polytopes for MRFs
    2.4.1 Weighted Averages as Linear Programs
    2.4.2 Marginal polytopes
  2.5 Linear Programming
    2.5.1 Linear Cone Programming
  2.6 Convex Sets and Convex Functions
  2.7 Duality
    2.7.1 Exchanging Minimization and Maximization
    2.7.2 Linear Programming Duality: An Example
    2.7.3 Conic Duality
  2.8 Optimality for Linear Programs
  2.9 Duality for the Local Marginal Polytope
  2.10 First-order Binary MRFs and Minimum Cut
    2.10.1 Solving First-order Submodular MRFs with Graph Cuts
    2.10.2 Linear Programs for Min-Cut
    2.10.3 Local Marginal Polytope for First-order Binary Problems

3 Related Work
  3.1 Higher-Order Models in Computer Vision
  3.2 Inference Algorithms for Binary MRFs
    3.2.1 First-Order Submodular MRFs
    3.2.2 First-Order Nonsubmodular MRFs
    3.2.3 Higher-Order Reductions
    3.2.4 Higher-order Submodular Functions
  3.3 Primal Algorithms
  3.4 Dual Algorithms
  3.5 Primal-Dual Algorithms

4 Higher order reductions
  4.1 Introduction
  4.2 Related work
    4.2.1 Reduction by substitution
    4.2.2 Reducing negative-coefficient terms
    4.2.3 Reducing positive-coefficient terms
    4.2.4 Generalized Roof Duality
  4.3 Reducing groups of higher-order terms
    4.3.1 Our method
  4.4 Worst case performance
  4.5 Local completeness
    4.5.1 Performance on locally complete problems
  4.6 Locally complete energy functions in vision
  4.7 Experimental results

5 Sum of Submodular Minimization
  5.1 Sum of Submodular Minimization via Submodular Flow
    5.1.1 Definitions and Graph Construction
    5.1.2 Flow as a Reparameterization
    5.1.3 The Max-Flow Min-Cut Theorem for SoS Functions
  5.2 IBFS for Submodular Flow
    5.2.1 IBFS on Graphs
    5.2.2 Modifying IBFS for SoS Flow
    5.2.3 Running Time
  5.3 Proof of the “No Shortcuts” Lemma
  5.4 The Current Arc Heuristic

6 Submodular Upper Bounds for Higher Order Energy Functions
  6.1 Introduction
  6.2 Background and Related Work
    6.2.1 Notation
  6.3 Submodular Upper Bounds
  6.4 Upper Bound Approximations
    6.4.1 The Iterative Heuristic of SoSPD
    6.4.2 Quadratic-Based Submodular Upper Bounds
    6.4.3 Cardinality-Based Submodular Upper Bounds

7 A Primal-Dual Algorithm for Higher-Order Multilabel Markov Random Fields
  7.1 Higher-order Multi-label MRFs
    7.1.1 Summary of Our Method
  7.2 Related Work
    7.2.1 Graph Cut Methods and Higher-Order MRFs
    7.2.2 Linear Programming and Duality for MRFs
    7.2.3 Sum-of-Submodular Flow
  7.3 The SoS Primal Dual Algorithm
    7.3.1 Update-Duals-Primals
    7.3.2 Pre-Edit-Duals
    7.3.3 Post-Edit-Duals
    7.3.4 Proof of Convergence
    7.3.5 Approximation Bounds

8 Experimental Evaluation of the SoSPD Algorithm
  8.1 Benchmarks and Datasets
    8.1.1 Field of Experts Denoising
    8.1.2 Curvature Regularizing Stereo Reconstruction
  8.2 Comparison of Upper Bound Methods
    8.2.1 Experimental Setup
    8.2.2 Results
  8.3 Evaluation of SoSPD
    8.3.1 Stereo reconstruction
    8.3.2 Field of Experts denoising

9 Structured learning of sum-of-submodular higher order energy functions
  9.1 Introduction
  9.2 Related Work
  9.3 S3SVM: SoS Structured SVMs
    9.3.1 Structured SVMs
    9.3.2 Submodular Feature Encoding
    9.3.3 Solving the quadratic program
    9.3.4 Generalization to multi-label prediction
  9.4 Experimental Results
    9.4.1 Binary denoising
    9.4.2 Interactive segmentation

10 Conclusion

A Local Completeness
B Laplacian Equations
C Approximation Ratio for Cardinality Upper Bounds

Bibliography

CHAPTER 1 INTRODUCTION
Optimization algorithms have a long history of success in computer vision, providing the basis for effective solutions to tasks as varied as segmentation, stereo estimation, image denoising and scene understanding. A particularly notable example of this is the method of Graph Cuts [11], in which minimum-cut algorithms are used to solve a class of vision problems known as first-order Markov Random Fields (MRFs). There are two main reasons for Graph Cuts' success. First, min-cut is already a well-studied problem with highly efficient algorithms (and the popularity of Graph Cuts has encouraged the development of even more efficient algorithms tuned specifically to computer vision problems). Second, the class of problems solved by Graph Cuts (first-order MRFs) encapsulates the fundamental idea of image locality, i.e., that pixels in an image are highly correlated with their neighbors. This property makes MRFs well-suited to solving a wide range of inference problems in computer vision as well as machine learning and other fields.
Despite this success, ﬁrst-order MRFs have their limitations. They cannot easily encode correlations between groups of pixels larger than two, and thus are unable to express higher-order statistics of images. In this thesis, we focus on removing this limitation. Our goal is to generalize graph cuts to a wider class of higher-order MRFs, greatly extending the class of models for which MRF inference can be applied, while keeping the fast algorithms that make graph cuts successful. In a broader sense, this thesis is about the interaction between modeling and inference: by applying new advances in algorithms, we can now optimize a new class of models which were previously intractable, allowing
much greater ﬂexibility and power in the kinds of problems we can solve.

Our goal for the ﬁrst two chapters is to cover the mathematical background for Markov Random Fields, and to introduce the main optimization problem considered in the later chapters, which is a minimization problem of the form:

\[
\min_{x} \; \sum_{i} f_i(x_i) + \sum_{C} f_C(x_C)
\tag{1.1}
\]

where the vector of variables x comes from a discrete label space x ∈ ∏_i X_i, and the function f to be optimized is a sum of unary functions f_i (each depending on a single variable x_i) and so-called clique functions f_C, each of which depends on a subset of the variables x_C from a clique C. The cliques are (possibly overlapping) subsets of the variables.
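As a concrete reading of the objective (1.1), the following minimal Python sketch evaluates the energy of one labeling. The function name `energy`, the dictionary-based representation, and the toy unary and clique costs are all assumptions made for illustration; they are not taken from the thesis.

```python
def energy(x, unary, cliques):
    """Evaluate sum_i f_i(x_i) + sum_C f_C(x_C) for a labeling x.

    x       : dict mapping variable index -> label
    unary   : dict mapping variable index i -> function f_i(label)
    cliques : dict mapping tuple of indices C -> function f_C(tuple of labels)
    """
    total = 0.0
    for i, f_i in unary.items():
        total += f_i(x[i])
    for C, f_C in cliques.items():
        total += f_C(tuple(x[i] for i in C))
    return total


# Toy instance (hypothetical costs): three binary variables and a single
# clique of size three that penalizes any disagreement among its members.
unary = {
    0: lambda a: 1.0 * a,        # variable 0 prefers label 0
    1: lambda a: 2.0 * (1 - a),  # variable 1 prefers label 1
    2: lambda a: 0.5 * a,        # variable 2 mildly prefers label 0
}
cliques = {
    (0, 1, 2): lambda xC: 0.0 if len(set(xC)) == 1 else 3.0,
}

print(energy({0: 1, 1: 1, 2: 1}, unary, cliques))  # 1.0 + 0.0 + 0.5 + 0.0 = 1.5
print(energy({0: 0, 1: 1, 2: 0}, unary, cliques))  # 0.0 + 0.0 + 0.0 + 3.0 = 3.0
```

A clique of size three, as used here, is exactly the kind of term a first-order (pairwise) model cannot express directly.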

In particular, the main results of this thesis rely on the concepts of Linear Programming, duality, and linear programming relaxations. This introduction and the following chapter discuss the necessary background for these topics. We begin with the basic concepts of optimization, which may be unnecessary for readers familiar with the subject. However, we wish to put the MRF inference problem in the context of probabilistic inference, which informs the types of models which are useful in computer vision.

In this chapter, we will give a brief introduction to the use of optimization algorithms in computer vision, along with an extended example of how ﬁrst order MRFs and graph cuts are applied to a simple but typical vision task of binary segmentation. We will also explain how ﬁrst order MRFs can be generalized to include interactions between more than just pairs of pixels — such MRFs are called higher-order. We will conclude with several applications where allowing these higher-order interactions is necessary for using more sophisticated models which cannot be expressed by simpler ﬁrst-order MRFs.

1.1 Notation
All special notation will be introduced at the point of ﬁrst use in the text, but is also repeated here for easy reference.
We will write vectors as bold lowercase symbols: x. For any sets X and S, X^S is the set of all vectors with components in X indexed by the elements of S. This allows, for instance, vectors not indexed by just the set {1, . . . , n}. We'll use X^n as shorthand for X^{1,...,n}.
For i ∈ S and x ∈ X^S, x_i is the i-th component of x. For any subset T ⊆ S and vector x ∈ X^S, we'll write x_T for the subvector of x corresponding to just the components in T.
We will always use V for the set of variable indices, so that the x_i are indexed by i ∈ V. When summing over variables, we will write Σ_i as shorthand for Σ_{i∈V}, as in (1.1). Similarly, we'll use 𝒞 to denote the set of cliques, and will use Σ_C as shorthand for Σ_{C∈𝒞}.
When the cliques are all pairs, |C| = 2, then we have a graph with unordered edges {i, j}. We'll write pairwise clique functions as f_{i,j}, which are similarly unordered, i.e., f_{i,j}(x_i, x_j) = f_{j,i}(x_j, x_i). Sums over pairs are also unordered, so Σ_{i,j} means Σ_{i<j}, including each unordered pair only once. In a graph with edge set E, we will use N(i) to denote the set of neighbors of i, N(i) = { j | {i, j} ∈ E}.
The minimum value of a minimization problem (or maximum value of a maximization problem) is denoted OPT, so the minimum value of (1.1) is OPT(1.1). Minimizers are denoted by x∗, and X∗ is the set of all minimizers.
For a finite set X, the set of probability distributions on X is P[X], i.e., the set of all p : X → R with p(x) ≥ 0 for all x ∈ X, and Σ_{x∈X} p(x) = 1. For any function f : X → R, the expectation of f under the probability distribution p is ⟨f, p⟩, which is given by ⟨f, p⟩ := Σ_{x∈X} f(x)p(x). Note that this is the inner product of f and p when treated as vectors in R^X, so we will use ⟨·, ·⟩ for inner products in general.
The normal distribution with mean µ and standard deviation σ is N(µ, σ). We write x ∼ N(µ, σ) to denote a random variable drawn from this distribution. We use ∝ to denote proportionality, so the probability density function of N(µ, σ) is p(x) ∝ exp(−(x − µ)^2 / (2σ^2)).
The Iverson bracket [P(x)] is 1 or 0 depending on whether the condition P(x) is true or false. So f(x) = [x is even] has f(3) = 0 and f(6) = 1.
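The expectation-as-inner-product and Iverson bracket conventions can be illustrated with a short Python sketch. The helper names `expectation` and `iverson`, and the example distribution, are assumptions chosen for illustration only.

```python
def iverson(condition):
    """Iverson bracket: 1 if the condition holds, 0 otherwise."""
    return 1 if condition else 0


def expectation(f, p):
    """Inner product <f, p> = sum_x f(x) p(x) for a distribution p on a finite set.

    p is a dict mapping each element x to its probability p(x).
    """
    assert all(px >= 0 for px in p.values()), "probabilities must be nonnegative"
    assert abs(sum(p.values()) - 1.0) < 1e-12, "probabilities must sum to 1"
    return sum(f(x) * px for x, px in p.items())


# f(x) = [x is even], as in the text
is_even = lambda x: iverson(x % 2 == 0)

# A hypothetical distribution on X = {1, 2, 3}
p = {1: 0.25, 2: 0.25, 3: 0.5}

print(is_even(3), is_even(6))   # 0 1
print(expectation(is_even, p))  # 0.25, the probability that x is even
```

The expectation of an Iverson bracket is just the probability of the bracketed condition, which is one reason the notation is convenient.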

1.2 Optimization Basics

Optimization, at its core, is a search problem: we have an exponentially large (or possibly infinite) set of choices, among which we want to find the “best” one. To be precise, the most general formulation of optimization is that we have a solution space X (also called a state space or feasible set) and some objective function (or cost function) f : X → R, saying how good a given solution is. Our goal is to find an x ∈ X which minimizes the objective value f(x).

Our standard notation for an optimization problem is:
\[
\begin{aligned}
\min_{x} \quad & f(x) \\
\text{s.t.} \quad & x \in X
\end{aligned}
\tag{1.2}
\]

For small problems which fit on one line, we will also write

\[
\min_{x} \{ f(x) \mid x \in X \}.
\tag{1.3}
\]

Most of the optimization problems we’ll consider are minimization problems, where the goal is to ﬁnd an x with f (x) as small as possible. We will write the optimum value (either minimum or maximum) of an optimization problem as OPT(1.2). Maximization problems will arise later, particularly when we come to the topic of duality. Note that we can convert back and forth between maximization and minimization problems by using the identity

\[
\max_{x \in X} f(x) = - \min_{x \in X} \, -f(x).
\tag{1.4}
\]

By itself, having an objective function isn’t much use — if we know nothing at all about the function f (i.e., we have a black-box which given an x ∈ X, evaluates and returns f (x)) then this general optimization problem is as hard as a totally unguided search problem — the best possible algorithm is to evaluate every single f (x) and return the best one. Since the set X is exponentially (or inﬁnitely) large for most interesting problems, this tells us there cannot be an efﬁcient optimization algorithm that doesn’t “look inside” the function f . Consequently, the study of optimization algorithms always involves taking advantage of problem structure, whether that structure comes from a particular form for the objective f (as in the clique structure of the MRF objective (1.1)), or from structure in the feasible set X.
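The cost of treating f as a black box can be seen in a small Python sketch of exhaustive search. The function `brute_force_min` and the toy objective are hypothetical, chosen only to show that the number of evaluations equals the size of the label space.

```python
from itertools import product


def brute_force_min(f, label_sets):
    """Minimize a black-box f by evaluating it on every element of the
    product of the given label sets. Returns (minimizer, value, #evaluations)."""
    best_x, best_val, evaluations = None, float("inf"), 0
    for x in product(*label_sets):
        val = f(x)
        evaluations += 1
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val, evaluations


# Hypothetical objective on n variables with 3 labels each.
def f(x):
    return sum((xi - 1) ** 2 for xi in x) + abs(x[0] - x[-1])


n = 4
x_star, opt, evals = brute_force_min(f, [range(3)] * n)
print(x_star, opt, evals)  # (1, 1, 1, 1) 0 81, i.e. 3**4 evaluations
```

With 3 labels and n variables the loop runs 3^n times, so even n = 40 is far out of reach; any practical method has to exploit structure in f or X.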

Figure 1.1: The constraints of the optimization problem (1.5) are graphed above. The feasible region is shaded grey. Note that any value which satisfies 2x1 + x2 ≥ 2 and x2 ≥ 0 also satisfies the inequality x1 + x2 ≥ 1.
1.2.1 Constrained Optimization

Very commonly, problem structure comes from the state space X being defined by some constraints. That is, the set X is specified by a set of equations or inequalities that the elements x ∈ X must satisfy. For example, in the simple optimization problem

\[
\begin{aligned}
\min_{x_1, x_2} \quad & 3x_1 + 4x_2 \\
\text{s.t.} \quad & 2x_1 + x_2 \ge 2 \\
& x_1 + x_2 \ge 1 \\
& x_1, x_2 \ge 0
\end{aligned}
\tag{1.5}
\]

we have 4 constraints, namely that x1, x2 satisfy each of the four inequalities: 2x1 + x2 ≥ 2, x1 + x2 ≥ 1, x1 ≥ 0 and x2 ≥ 0. Solutions which violate any of these inequalities are called infeasible. For example (x1, x2) = (1/2, 0) is infeasible (meaning (1/2, 0) ∉ X), because in the first inequality we would have 2 · (1/2) + 0 = 1 < 2.
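The feasibility check for (1.5) can be written out in a few lines of Python. The sketch below also finds the optimum by enumerating vertices of the feasible region, which is a generic textbook approach for tiny two-variable linear programs and not a method taken from the thesis; it relies on the assumption that this particular problem is bounded below (the objective coefficients are positive and x1, x2 ≥ 0), so its minimum is attained at a vertex. The names `constraints`, `is_feasible` and `vertices` are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Each constraint (a1, a2, b) encodes a1*x1 + a2*x2 >= b, as in (1.5).
constraints = [(2, 1, 2), (1, 1, 1), (1, 0, 0), (0, 1, 0)]
cost = (3, 4)  # objective 3*x1 + 4*x2


def is_feasible(x, cons=constraints):
    """True if x = (x1, x2) satisfies every inequality."""
    return all(a1 * x[0] + a2 * x[1] >= b for a1, a2, b in cons)


def vertices(cons=constraints):
    """Feasible intersection points of pairs of constraint boundaries."""
    points = set()
    for (a1, a2, b), (c1, c2, d) in combinations(cons, 2):
        det = a1 * c2 - a2 * c1
        if det == 0:
            continue  # parallel boundaries never intersect in a single point
        # Cramer's rule for the 2x2 system of boundary equations
        x1 = Fraction(b * c2 - a2 * d, det)
        x2 = Fraction(a1 * d - b * c1, det)
        if is_feasible((x1, x2), cons):
            points.add((x1, x2))
    return points


print(is_feasible((Fraction(1, 2), 0)))  # False: 2*(1/2) + 0 = 1 < 2

best = min(vertices(), key=lambda x: cost[0] * x[0] + cost[1] * x[1])
best_value = cost[0] * best[0] + cost[1] * best[1]
print(best == (1, 0), best_value == 3)   # True True
```

Exact rational arithmetic via `fractions.Fraction` avoids floating-point issues when testing constraints that hold with equality at a vertex.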

In general, in a constrained optimization problem we are given some functions g_j : 𝒳 → R indexed by j ∈ J, and our optimization problem is

\[
\begin{aligned}
\min_{x} \quad & f(x) \\
\text{s.t.} \quad & g_j(x) \ge 0 \quad \forall j \in J \\
& x \in \mathcal{X}
\end{aligned}
\tag{1.6}
\]

We call each inequality g_j(x) ≥ 0 a constraint, and the feasible set X is defined to be the subset of 𝒳 for which each constraint is satisfied, i.e., X = {x ∈ 𝒳 | g_j(x) ≥ 0, ∀ j ∈ J}. The set 𝒳 is called the ambient space, and is typically R^n for some n. Also note that we may have constraints of the form g_j(x) ≤ 0 or g_j(x) = 0. These can each be converted into the standard form of (1.6) by noting that g_j(x) ≤ 0 is equivalent to −g_j(x) ≥ 0, and g_j(x) = 0 can be replaced by the two inequalities g_j(x) ≥ 0 and −g_j(x) ≥ 0.

An optimization problem can usually be written in terms of constraints in many ways, and some constraints may be redundant. For example, in (1.5) the constraint x1 + x2 ≥ 1 is redundant, since any x1, x2 which satisfies the first inequality 2x1 + x2 ≥ 2 together with x2 ≥ 0 also satisfies x1 + x2 = (2x1 + x2)/2 + x2/2 ≥ 1 (see Figure 1.1 for illustration).

1.2.2 Constraint Indicator Functions
An important construction for turning constrained minimization problems into unconstrained minimization is the indicator function: for a minimization problem and constraint g_j(x) ≥ 0, the indicator function is

\[
I^{\min}_{g_j}(x) :=
\begin{cases}
0 & g_j(x) \ge 0 \\
\infty & \text{otherwise}
\end{cases}
\tag{1.7}
\]

That is, I^min_{g_j} is 0 whenever x satisfies the constraint, and is infinite whenever the constraint is violated. Using the indicator function, we can replace all our constraints with terms in the objective, to get an unconstrained minimization min_x F(x) where

\[
F(x) = f(x) + \sum_{j} I^{\min}_{g_j}(x)
\tag{1.8}
\]

Whenever x is feasible (i.e., satisfies all the inequalities g_j(x) ≥ 0) then Σ_j I^min_{g_j}(x) = 0, so F(x) = f(x). However, if x is infeasible, it violates at least one inequality, so we will have F(x) = ∞. In other words, we have ruled out the 'disallowed' solutions which violate the constraints by putting an infinite cost on those solutions. Therefore, minimizing the unconstrained F is the same as minimizing f with the constraints g_j.

Note that for maximization problems, we instead have that solutions with value −∞ are infeasible, so the indicator function for a maximization problem is

\[
I^{\max}_{g_j}(x) :=
\begin{cases}
0 & g_j(x) \ge 0 \\
-\infty & \text{otherwise}
\end{cases}
\tag{1.9}
\]
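The indicator-function construction of (1.7) and (1.8) can be sketched in Python using `math.inf` for the infinite cost. The example reuses the objective and constraints of (1.5); restricting the search to a small integer grid is an assumption made purely so that the unconstrained minimization can be done by enumeration.

```python
import math
from itertools import product


def indicator_min(g):
    """Indicator for a minimization problem: 0 where g(x) >= 0, +inf otherwise."""
    return lambda x: 0.0 if g(x) >= 0 else math.inf


# Objective and constraints of (1.5), written in the form g_j(x) >= 0.
f = lambda x: 3 * x[0] + 4 * x[1]
gs = [
    lambda x: 2 * x[0] + x[1] - 2,
    lambda x: x[0] + x[1] - 1,
    lambda x: x[0],
    lambda x: x[1],
]
indicators = [indicator_min(g) for g in gs]


def F(x):
    """Unconstrained objective: f plus one indicator term per constraint."""
    return f(x) + sum(I(x) for I in indicators)


# Minimize F over a grid of integer points (an illustrative assumption);
# no feasibility test is needed, since infeasible points have F = inf.
grid = list(product(range(-2, 4), repeat=2))
x_best = min(grid, key=F)
print(x_best, F(x_best))  # (1, 0) 3.0
print(F((0, 0)))          # inf, since (0, 0) violates 2*x1 + x2 >= 2
```

Points with negative coordinates have a smaller value of f, but their infinite penalty keeps them from ever being selected.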

1.2.3 Minimizing Elements

We will reserve the notation x∗ for elements x which are optimal, i.e., for which f(x∗) = min_x { f(x) | x ∈ X}. The set of all such x is denoted by argmin, so that

\[
\operatorname*{argmin}_{x \in X} \{ f(x) \} := \{ x \in X \mid f(x) \le f(x') \ \ \forall x' \in X \}.
\tag{1.10}
\]

We will use the shorthand X∗ to denote the set of minimizers, when the problem we are referring to is clear from context. Note that in general the set of minimizers may be empty or have more than one element. For example, argmin_{x∈R} {e^x} = ∅ and argmin_{x∈R} {(x − 1)^2 (x + 1)^2} = {−1, 1}.
Remark 1. For almost all problems in this thesis, it will be the case that the set of minimizers is non-empty. We will make special note of problems where this is not necessarily the case. Correspondingly, all proofs will make use of the existence of optimizers where useful.
There are various conditions which ensure that minimizers exist. One of the most powerful is the following:
Theorem 2. If X is topologically compact and f : X → R is continuous, then argmin_{x∈X} f(x) ≠ ∅.
In particular, if X is a ﬁnite set, then it is always compact and any function f : X → R is continuous. Therefore, in the common case of |X| ﬁnite (also called discrete X) minimizers always exist.
Another special case is when the feasible set X is a subset of Rn. A subset of Rn is compact if and only if it is closed and bounded (meaning there is some R with ||x|| < R for all x ∈ X), so in particular minimizers always exist on any closed, bounded subset of Rn.
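For a finite feasible set, the argmin set of (1.10) can be computed directly. The following Python sketch uses a hypothetical helper `argmin_set` and restricts the polynomial example from the text to a small range of integers, which is an assumption made so that the set is finite.

```python
def argmin_set(f, X):
    """Return the set of all minimizers of f over a finite, non-empty X."""
    X = list(X)
    opt = min(f(x) for x in X)
    return {x for x in X if f(x) == opt}


# (x - 1)^2 (x + 1)^2 restricted to the integers -3, ..., 3
f = lambda x: (x - 1) ** 2 * (x + 1) ** 2
minimizers = argmin_set(f, range(-3, 4))
print(minimizers == {-1, 1})  # True: two distinct minimizers, both with value 0
```

On a finite set a minimizer always exists, in line with the discussion above; the empty-argmin example e^x only arises because R is not compact.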
1.2.4 Relaxations
An important question is how to relate two optimization problems. The most common relation we will deal with is the notion of relaxation. The basic idea behind relaxations comes from a seemingly trivial observation: if we minimize over a larger set, the minimum value can only go down (or stay the same). For example, it's clear that min{3, 8, 6} ≥ min({3, 8, 6} ∪ {1, 5}), since the latter, larger set contains the minimum value 3 of the smaller set, plus possibly some other elements which may be smaller (like 1). This idea is the basic intuition for relaxations, in which we take a constrained optimization problem and we ignore, or relax, some constraints. For example, from our example (1.5) above, we can get a relaxation by removing the third constraint (that x1 ≥ 0) to get

\[
\begin{aligned}
\min_{x_1, x_2} \quad & 3x_1 + 4x_2 \\
\text{s.t.} \quad & 2x_1 + x_2 \ge 2 \\
& x_1 + x_2 \ge 1 \\
& x_2 \ge 0
\end{aligned}
\tag{1.11}
\]

In this case, we know that OPT(1.5) ≥ OPT(1.11), since the latter problem is minimizing over a larger set.

To cover all the cases we'll use later on, we'll extend this notion to work not just with constrained optimization. We'll also allow renaming of elements of X, by a function g from X to some other set X′.
Definition 3. An optimization problem (f, X) embeds into another problem (f′, X′) if there is a function g : X → X′ with f(x) = f′(g(x)) for all x ∈ X.

The most important special case of embeddings is relaxation, where X ⊆ X′ and the mapping g is just the inclusion map: g(x) = x. From our example above, we should expect the minimum value of the relaxed problem to be no larger than that of the original problem.
First, we have a very basic fact about lower bounds:

Figure 1.2: Illustration of the proof of Lemma 5. As long as the function g preserves objective values, then whatever the minimizer of (f′, X′) is, it is at least as good as g(x∗).
Proposition 4. If L is a lower bound of a set of real numbers A ⊆ R, meaning L ≤ a for all a ∈ A, then L ≤ min A.
Lemma 5. If (f, X) embeds into (f′, X′) then OPT(f, X) ≥ OPT(f′, X′).
Proof. See Figure 1.2 for illustration.
We'll show that OPT(f′, X′) is a lower bound to { f(x) | x ∈ X}. The Lemma then follows immediately from Proposition 4.
Let x ∈ X. We have that g(x) ∈ X′ and f′(g(x)) = f(x) since g is an embedding. Then, since g(x) is feasible for (f′, X′) we have f′(g(x)) ≥ OPT(f′, X′) and therefore f(x) ≥ OPT(f′, X′) as well.
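Lemma 5 can be sanity-checked on the small sets used to motivate relaxations. In the Python sketch below, the helper `is_embedding` and the variable names are assumptions for illustration; the embedding g is the inclusion map, so the larger problem is a relaxation of the smaller one.

```python
def is_embedding(f, X, f_prime, X_prime, g):
    """Check that g maps X into X' and preserves objective values."""
    return all(g(x) in X_prime and f(x) == f_prime(g(x)) for x in X)


X = {3, 8, 6}
X_prime = X | {1, 5}       # a larger feasible set
f = lambda x: x            # objective on X
f_prime = lambda x: x      # objective on X'
g = lambda x: x            # inclusion map, so (f', X') is a relaxation

print(is_embedding(f, X, f_prime, X_prime, g))  # True

opt = min(f(x) for x in X)
opt_prime = min(f_prime(x) for x in X_prime)
print(opt, opt_prime, opt >= opt_prime)         # 3 1 True
```

The check only confirms the inequality on one instance; the proof above is what guarantees it for every embedding.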
1.2.5 Equivalence of Optimization Problems
Another question we might want to ask is: under what conditions can we use one problem to solve another. We can think of such problems as equivalent — a
solution to one gives us a solution to the other, and vice-versa. There are many ways of formalizing this notion of equivalence, but the following will be the most helpful for our purposes.1

Definition 6. Two optimization problems (f, X) and (f′, X′) are equivalent if there is a bijection g : X → X′ which is order-preserving, i.e.,

\[
f(x) \le f(y) \quad \text{implies} \quad f'(g(x)) \le f'(g(y))
\tag{1.12}
\]

If g, g^{−1} are poly-time computable, then we'll say (f, X) and (f′, X′) are poly-time equivalent.

Given our identity (1.4) converting maximization problems to minimization problems, namely min_x f(x) = −max_x −f(x), we will say that a maximization problem (f, X) and minimization problem (f′, X′) are equivalent if there is an order-reversing bijection g between them, meaning f(x) ≤ f(y) implies f′(g(x)) ≥ f′(g(y)).
As we claimed, given equivalent problems we can convert solutions of one to solutions of the other. Note that in the case of a poly-time equivalence, this is a reduction in the usual NP-completeness sense.
Lemma 7. If (f, X) and (f′, X′) are equivalent (with mappings g : X → X′ and g^{−1} : X′ → X) the functions g and g^{−1} send minimizers to minimizers: g(X∗) ⊆ X′∗ and g^{−1}(X′∗) ⊆ X∗.

Proof. Let x∗ be a minimizer of (f, X). We want to show that g(x∗) is a minimizer of (f′, X′), so let x′ be any other element of X′. Since g is a bijection, we have that g^{−1}(x′) ∈ X, and since x∗ is a minimizer of f, we must have

\[
f(x^*) \le f(g^{-1}(x')).
\tag{1.13}
\]

Then, since g is order preserving, we have

\[
f'(g(x^*)) \le f'(g(g^{-1}(x'))) = f'(x').
\tag{1.14}
\]

Therefore, f′(g(x∗)) is no greater than f′(x′) for every x′ ∈ X′, so g(x∗) is a minimizer of (f′, X′). The reverse claim follows by symmetry.

1 In particular, we define problem equivalence so that we can convert from optimal solutions of one problem to optimal solutions of another problem. When investigating probabilistic inference, this definition is most suited to taking the log-probability of a Gibbs energy to get an energy function of the form (1.1). However, it does not preserve approximation algorithms, and two equivalent problems (according to this definition) may have very different best-possible approximation ratios.

1.2.6 Common Equivalences Between Problems

There are several common transformations we will apply to problems, that all deal with manipulating the objective function to get an equivalent problem. In particular, adding a constant to the objective function or multiplying the objective by a ﬁxed positive constant leads to an equivalent optimization problem. These are both special cases of a general rule: applying a monotonic transformation to the objective is an equivalence.
Lemma 8. If g : R → R is monotonically increasing (i.e., x ≤ y implies g(x) ≤ g(y)) then the optimization problems ( f, X) and (g ◦ f, X) are equivalent.

Proof. The identity function $\mathrm{id} : X \to X$ is of course a bijection. Furthermore, whenever $g$ is monotonic, id is order-preserving, in the sense of Definition 6. Indeed, writing $f'$ for the transformed objective, we need to show that

$$ f(x) \le f(y) \text{ implies } f'(\mathrm{id}(x)) \le f'(\mathrm{id}(y)). \tag{1.15} $$

But plugging in $f' = g \circ f$, this is just

$$ f(x) \le f(y) \text{ implies } g(f(x)) \le g(f(y)), \tag{1.16} $$

which follows from $g$ being monotonic.

Corollary 9. For any constant $b$ and positive constant $a > 0$, the optimization problems $(f, X)$ and $(af + b, X)$ are equivalent.
Proof. The function ax + b is monotonic for a > 0.
In particular, we can always ignore constant terms in any optimization problem. For example in (1.5) the objective 7 + x1 + 2x2 can be replaced with x1 + 2x2, which is an equivalent problem according to Lemma 8.
Other useful monotonic functions include log and exp which we can combine with Lemma 8 to convert products to sums and vice-versa.
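To make Lemma 8 and Corollary 9 concrete, the following is a minimal sketch that checks, by enumeration, that several increasing transformations of an objective leave the set of minimizers unchanged. The feasible set `X` (two binary variables) and the helper `argmin_set` are assumptions made for illustration; only the objective 7 + x1 + 2x2 is taken from the text.

```python
import math

def argmin_set(f, X):
    """Return the set of all minimizers of f over the finite feasible set X."""
    best = min(f(x) for x in X)
    return {x for x in X if f(x) == best}

# Hypothetical toy problem: X = {0,1}^2 with the objective 7 + x1 + 2*x2.
X = [(a, b) for a in (0, 1) for b in (0, 1)]
f = lambda x: 7 + x[0] + 2 * x[1]

# Three monotonically increasing transformations of the objective.
shifted = lambda x: f(x) - 7          # drop the constant term
scaled = lambda x: 3.0 * f(x) + 1.0   # a*f + b with a > 0 (Corollary 9)
exped = lambda x: math.exp(f(x))      # exp is increasing (Lemma 8)

minimizers = argmin_set(f, X)
same_minimizers = all(argmin_set(h, X) == minimizers
                      for h in (shifted, scaled, exped))
```

The objective values differ between the four problems, but the minimizing labeling is the same in each, which is all that equivalence promises.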
1.3 Example: Image Segmentation
In this section, we will illustrate all of the above concepts by way of an extended example. In computer vision, a prototypical use of optimization is to compute a foreground-background segmentation of an image — most commonly solved by reduction to minimum-cut in a graph. This example is likely familiar to readers familiar with graph cuts; however, it illustrates the key concepts of reduction, gadgets, and the tradeoff between more complex models and efﬁcient optimization, all of which are main themes of this thesis.
The foreground-background segmentation problem is this: for each pixel, we want to give a binary label indicating that this pixel is either part of the foreground object or part of the background. With $n$ pixels, there are $2^n$ possible segmentations, so brute-force searching for the best one is impractical. However, we haven't yet specified $f$ or $X$, which together will give us the structure we need to solve the problem.
1.3.1 Binary Labeling Problems
First, we define the feasible set $X$. Binary segmentation is an example of a labeling problem, where we are trying to give every pixel of an image some discrete label. We formalize this by giving each pixel a corresponding variable. That is, if we let $V$ be the set of pixel indices, then for every $i \in V$ there is a corresponding variable $x_i$. These variables take values (also called labels) from $\{0, 1\}$, where a label of $x_i = 0$ indicates that pixel $i$ is assigned to the background, and $x_i = 1$ indicates that the pixel is in the foreground. Therefore, the feasible set (called a label space in a labeling problem) is $X = \{0, 1\}^V$. In particular, because there are only two labels, this problem is called a binary labeling problem.
1.3.2 Per-Pixel Cost Functions
Figure 1.3: (Left) An example input to a binary segmentation problem. The foreground object we want to segment out is the banana. (Right) The desired segmentation mask for the foreground object.

There are many possible choices for the cost function $f$. We will start with the simplest one first, where we assume that for every pixel, we have some (not necessarily accurate) idea of whether it is likely to be in the foreground or the background. For example, in the image in Figure 1.3, we are trying to segment out the banana as the foreground object, with the remainder of the image as the background. If (for example) a machine learning algorithm has seen many examples of segmented bananas, it could learn that yellow and brown pixels are likely to be foreground, while other colors are likely to be background. These rough ideas of likelihood are formalized as a cost function $f_i$ for each pixel: $f_i(0)$ gives the cost for the pixel being in the background, and $f_i(1)$ that of the pixel being in the foreground. We then have an objective function

$$ f(x) = \sum_i f_i(x_i). \tag{1.17} $$

If our costs are constructed such that a low cost $f(x)$ equates to a high likelihood of $x$, then choosing the minimizing $x$ gives the most likely solution.

The function (1.17) has a particularly simple structure to optimize: a function is called separable when it can be written as a sum of functions $f_i(x_i)$, each of which is a function of a single variable $x_i$, with no shared variables between them. That is, we can write $f(x) = \sum_i f_i(x_i)$. In this case, it is clear that we can find the minimizing $x$ by setting $x_i$ to be 0 if $f_i(0) \le f_i(1)$, and 1 otherwise. For separable objectives, making locally good choices gives a globally optimal solution, so these are among the simplest objectives to optimize.
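The per-variable rule for separable objectives can be sketched in a few lines. This is an illustrative sketch only: the list-of-pairs representation `unary[i] = (f_i(0), f_i(1))` and the numeric costs are assumptions, and the brute-force search is included just to confirm that the local rule reaches the global minimum on this toy input.

```python
from itertools import product

def minimize_separable(unary):
    """unary[i] = (f_i(0), f_i(1)); choose each label independently."""
    return [0 if f0 <= f1 else 1 for (f0, f1) in unary]

def energy(unary, x):
    """Separable objective (1.17): the sum of the per-pixel costs."""
    return sum(unary[i][xi] for i, xi in enumerate(x))

# Hypothetical unary costs for four pixels.
unary = [(0.2, 1.5), (2.0, 0.3), (1.0, 1.0), (0.9, 0.1)]

x_star = minimize_separable(unary)

# Exhaustive search over all 2^n labelings, for comparison only.
brute_force_value = min(energy(unary, x)
                        for x in product((0, 1), repeat=len(unary)))
```

The local rule takes time linear in the number of pixels, whereas the exhaustive search visits all $2^n$ labelings.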


Figure 1.4: (Left) The resulting segmentation using a unary-only model, as in (1.17). (Right) The resulting segmentation after adding edge terms of the form (1.19). Note that the segmentation boundaries much more closely follow the actual boundaries of the foreground object.
1.3.3 Spatial Relations Between Pixels
This simple, separable model unfortunately does not give good results: even with very informative functions $f_i$, the segmentations obtained tend to be very noisy, as shown in Figure 1.4. The problem is that our model is missing important knowledge about the problem: it makes the (implicit) assumption that the label of a pixel is unconnected with the label of its neighbors. We can see this by noting that the minimizing $x_i$ is found independently of the others (i.e., without reference to $f_j$ for $j \ne i$). However, we actually know a lot about the relations between pixels in an image. For example, the foreground labels tend to form a connected region in the image, and in general, a given pixel being in the foreground is good evidence that its neighbors are likely to be foreground pixels as well (and similarly for background pixels).
Figure 1.5: (Left) Typical 4-connected neighborhood on a pixel grid. (Right) 8-connected neighborhood. Neighbors $N(i)$ of pixel $i$ are bolded.

The intuition we want to incorporate into improving (1.17) is that we should take advantage of the spatial relations between pixels in an image. To do so, we make use of the fact that the pixels $V$ are arranged in a grid, and we will say that pixels $i, j$ are neighbors if they are nearby in this grid. Typical neighbor sets are the 4-connected grid (where $i$ has up to 4 neighbors up, down, left and right of it) and the 8-connected grid (which includes the 4-connected neighbors, as well as the 4 immediately diagonal neighbors). See Figure 1.5 for illustration. We will let $N(i)$ be the set of pixels $j$ which are neighbors of $i$, and let $E = \{\{i, j\} \mid j \in N(i)\}$ be the set of all pairs of neighbors, called edges.
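As an illustration of these definitions, here is a minimal sketch that builds $N(i)$ and $E$ for a small grid. Representing a pixel as a `(row, col)` tuple and an edge as a `frozenset` of two pixels are choices made for this example, not conventions from the text.

```python
def neighbors(i, height, width, connectivity=4):
    """Neighbors N(i) of pixel i = (row, col) on a height x width grid."""
    if connectivity not in (4, 8):
        raise ValueError("connectivity must be 4 or 8")
    r, c = i
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if connectivity == 8:
        # Add the four immediately diagonal neighbors.
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    return [(r + dr, c + dc) for dr, dc in offsets
            if 0 <= r + dr < height and 0 <= c + dc < width]

def edges(height, width, connectivity=4):
    """E = {{i, j} | j in N(i)} as a set of unordered pairs."""
    pixels = [(r, c) for r in range(height) for c in range(width)]
    return {frozenset((i, j))
            for i in pixels
            for j in neighbors(i, height, width, connectivity)}

corner_neighbors = sorted(neighbors((0, 0), 3, 3))
num_edges_4 = len(edges(3, 3, connectivity=4))
num_edges_8 = len(edges(3, 3, connectivity=8))
```

Using unordered pairs means each edge is stored once even though it is discovered from both endpoints.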

1.3.4 The Potts Model

With this neighborhood structure, a simple model which encourages neighbors to take the same labels is the Potts model [75]. This model has the unary costs $f_i$ from (1.17), as well as pairwise costs $f_{i,j}$ between pairs of neighbors $\{i, j\} \in E$. The pairwise costs are of a particularly simple form: we pay a flat cost every time the endpoints of an edge $\{i, j\}$ have different labels. That is,

$$ f_{i,j}(x_i, x_j) = \begin{cases} 1 & x_i \ne x_j \\ 0 & \text{otherwise.} \end{cases} \tag{1.18} $$

Note that we'll index $f_{i,j}$ using unordered pairs, meaning $f_{i,j}(x_i, x_j) = f_{j,i}(x_j, x_i)$.

We incorporate these pairwise terms into (1.17) by letting

$$ f(x) = \sum_i f_i(x_i) + \lambda \sum_{\{i,j\} \in E} f_{i,j}(x_i, x_j). \tag{1.19} $$

Because we now pay a cost for neighbors with different labels, the minimum cost solution will trade off choosing likely foreground-background assignments against having too many discontinuities in the labeling. We can adjust the parameter $\lambda$ to change the relative weights in this tradeoff. Higher $\lambda$ encourages more cohesiveness in the segmentation: as $\lambda \to \infty$ the segmentation will eventually become a single all-foreground or all-background region; as $\lambda \to 0$ the objective becomes the same as the unary-only model of (1.17), and we get the noisy segmentations seen in Figure 1.4. The best segmentations will be obtained for $\lambda$ somewhere in the middle, and the best $\lambda$ must be tuned as a hyperparameter of the model (typically by evaluating many such $\lambda$ on a validation set, and choosing the value with the best segmentation results on this set of held-out images).
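The effect of $\lambda$ can be seen on a tiny instance. The sketch below evaluates the objective (1.19) on a hypothetical row of five pixels (the unary costs and the chain of edges are invented for illustration) and finds the minimizer by exhaustive search, which is only feasible because the example is so small.

```python
from itertools import product

def potts_energy(x, unary, edge_list, lam):
    """Objective (1.19): unary costs plus lam per edge with differing labels."""
    data = sum(unary[i][xi] for i, xi in enumerate(x))
    discontinuities = sum(1 for i, j in edge_list if x[i] != x[j])
    return data + lam * discontinuities

def brute_force_min(unary, edge_list, lam):
    """Exhaustive search over all 2^n labelings (toy sizes only)."""
    return min(product((0, 1), repeat=len(unary)),
               key=lambda x: potts_energy(x, unary, edge_list, lam))

# Five pixels in a row; pixel 2 has a "noisy" unary term preferring label 0.
unary = [(2.0, 0.0), (2.0, 0.0), (0.0, 1.0), (2.0, 0.0), (2.0, 0.0)]
edge_list = [(0, 1), (1, 2), (2, 3), (3, 4)]

noisy = brute_force_min(unary, edge_list, lam=0.0)     # unary-only model
smoothed = brute_force_min(unary, edge_list, lam=1.0)  # with Potts terms
```

With $\lambda = 0$ the isolated background label at pixel 2 survives; with $\lambda = 1$ the two discontinuities it creates cost more than overriding its unary preference, so the labeling becomes uniform.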

1.3.5 Reduction to Graph Cut
However, unlike (1.17), it is not immediately obvious how to find a minimum energy solution $x$ to (1.19). We cannot simply set each variable $x_i$ to the label with smallest $f_i(x_i)$ as before, as this might cause a large penalty from the pairwise terms $f_{i,j}(x_i, x_j)$.
To solve this minimization problem we will use the standard optimization technique of reduction, whereby we transform (or reduce) some seemingly difficult problem into another problem which we already know how to solve. Minimizing (1.19) can be reduced to the well-known optimization problem called minimum cut.
The min-cut problem is the following: we are given a graph with nodes $N$ and arcs² $A \subseteq N \times N$, along with (nonnegative) capacities $c_{i,j}$ on each arc $(i, j) \in A$. There are two special nodes $s, t \in N$ called respectively the source and sink. A cut is a set $S \subseteq N$ that contains $s$ but not $t$ (i.e., $s \in S$ and $t \notin S$). The cost of a cut is the sum of all capacities of arcs with their first endpoint in $S$ and the other outside $S$, so $c(S) = \sum_{i \in S} \sum_{j \notin S} c_{i,j}$. Our goal is to find a cut $S$ minimizing $c(S)$.
There exist many efficient algorithms for solving these problems, including Augmenting Paths [23], Push-Relabel [14], as well as the current state-of-the-art for inputs typical of vision problems: IBFS [30]. If we can transform our actual problem (1.19) into an input for the min-cut problem, then we can apply any of these algorithms to find the solution. Fortunately, the min-cut problem has many similarities to the binary segmentation problem. If we think of membership-or-not in $S$ as a binary label, then the cost function $c(S)$ looks very much like the Potts terms $f_{i,j}$: we pay a cost $c_{i,j}$ whenever neighbors $i$ and $j$ take different labels (i.e., $i$ is in $S$ and $j$ is not in $S$). The only difficulty is handling the unary terms $f_i$.
To transform our problem into a min-cut input, we introduce a graph construction (sometimes called a gadget), which is a method for constructing a particular graph whose vertices and edges are chosen in such a way that finding the minimum cut in this graph gives a solution to the binary segmentation problem. For binary segmentation, our constructed graph has a vertex for every pixel of the original problem, as well as two special nodes $s$ and $t$, called a super-source and super-sink, so $N = V \cup \{s, t\}$.
² We distinguish edges, which are undirected (i.e., are unordered pairs $\{i, j\}$), from arcs, which are directed pairs $(i, j)$.

For every undirected edge e = {i, j} in the neighborhood structure, we get two directed arcs (i, j) and ( j, i), both of which have capacity λ. Finally, we account for the unary terms by adding arcs (s, i) and (i, t) for every pixel i, with capacities cs,i = fi(0) and ci,t = fi(1).

This is now a valid input to the min-cut problem, so the only remaining question is whether this is actually an equivalent optimization problem to our original objective (1.19). According to Definition 6, we need an order-preserving bijection between these two problems. We've already noted a natural way of identifying binary vectors $x$ with cuts $S$, namely, taking membership in $S$ as a binary label. To ensure that $s$ is always in the cut $S$ and $t$ is not in $S$, we define $g$ from binary labelings to cuts by $g(x) = \{s\} \cup \{i \in V \mid x_i = 1\}$. It is clear that this is a bijection, with inverse $g^{-1}(S)$ being the binary vector with $x_i = 1$ for $i \in S$ and $x_i = 0$ for $i \notin S$.

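Finally, $g$ is not just order preserving, but actually leaves the objective value the same. Indeed, for any $x$ we have

$$ f(x) = \sum_{i : x_i = 1} f_i(1) + \sum_{i : x_i = 0} f_i(0) + \sum_{\substack{(i,j) :\, j \in N(i) \\ x_i = 1,\ x_j = 0}} \lambda \tag{1.20} $$

whereas the cost of the corresponding cut $S$ is

$$ c(S) = \sum_{i \in S \cap V} f_i(1) + \sum_{i \in V \setminus S} f_i(0) + \sum_{\substack{(i,j) :\, j \in N(i) \\ i \in S,\ j \notin S}} \lambda. \tag{1.21} $$

Then, remembering that $x_i = 1$ if and only if $i \in S$, we see that these two equations are the same, so $f(x) = c(S)$.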
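The identity $f(x) = c(S)$ can be checked numerically on a small instance. The following is a sketch under stated assumptions: the five-pixel chain, its unary costs, and the dictionary-of-arcs representation are all invented for illustration, and the minimum cut is found by enumerating every cut rather than by one of the max-flow algorithms cited above.

```python
from itertools import product

def potts_energy(x, unary, edge_list, lam):
    """Objective (1.19): unary costs plus lam per edge with differing labels."""
    data = sum(unary[i][xi] for i, xi in enumerate(x))
    return data + lam * sum(1 for i, j in edge_list if x[i] != x[j])

def build_capacities(unary, edge_list, lam):
    """Arc capacities of the constructed graph on N = V + {s, t}."""
    cap = {}
    for i, (f0, f1) in enumerate(unary):
        cap[("s", i)] = f0  # arc (s, i) leaves S exactly when x_i = 0
        cap[(i, "t")] = f1  # arc (i, t) leaves S exactly when x_i = 1
    for i, j in edge_list:
        cap[(i, j)] = lam   # one arc in each direction per edge {i, j}
        cap[(j, i)] = lam
    return cap

def cut_cost(cap, S):
    """c(S): total capacity of arcs from inside S to outside S."""
    return sum(c for (u, v), c in cap.items() if u in S and v not in S)

def g(x):
    """Map a labeling to a cut: S = {s} + {i : x_i = 1}."""
    return {"s"} | {i for i, xi in enumerate(x) if xi == 1}

def g_inv(S, n):
    """Map a cut back to a labeling."""
    return tuple(1 if i in S else 0 for i in range(n))

# Hypothetical instance: five pixels in a row with nonnegative unary costs.
unary = [(2.0, 0.0), (2.0, 0.0), (0.0, 1.0), (2.0, 0.0), (2.0, 0.0)]
edge_list = [(0, 1), (1, 2), (2, 3), (3, 4)]
lam = 1.0
cap = build_capacities(unary, edge_list, lam)
labelings = list(product((0, 1), repeat=len(unary)))

# f(x) = c(g(x)) for every labeling x.
values_agree = all(
    abs(potts_energy(x, unary, edge_list, lam) - cut_cost(cap, g(x))) < 1e-9
    for x in labelings)

# Enumerate all cuts (toy sizes only) and map the best one back to a labeling.
best_cut = min((g(x) for x in labelings), key=lambda S: cut_cost(cap, S))
x_star = g_inv(best_cut, len(unary))
```

In practice the enumeration would be replaced by a max-flow/min-cut solver; the mapping $g^{-1}$ applied to its output is unchanged.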


1.3.6 Discussion
It’s important to not let these details obscure the overall plan at work — because the binary segmentation problem is equivalent to the min-cut problem, Lemma 7 says we can take any optimal solution to the min-cut problem and get back an optimal solution to the segmentation problem. In fact, this solution is as simple as taking the min-cut S ∗, and then applying g−1 to get a binary map x∗ out of it.
A final observation: the min-cut problem was already close enough to our segmentation problem that we could actually construct a gadget showing them to be equivalent. This is no coincidence: only by choosing to model the binary segmentation problem in this particular way (with an objective function of the form (1.19)) is this connection obvious.
In fact, the existence of this graph construction is somewhat fragile: as we add more detail to the model, with more elaborate cost functions, it may break. For example, if we want to modify (1.19) to allow having a different cost $\lambda_{i,j}$ on each edge, then we are still fine: the constructed min-cut problem would now have a different capacity $c_{i,j} = \lambda_{i,j}$ on each internal arc $(i, j)$, and min-cut will still find a solution. However, if we want to solve more general segmentation problems, for example semantic segmentation, where the label space is a larger set of semantic categories (e.g., sky, grass, building, etc.), then it is not obvious how to repair the construction, since min-cut is a binary problem, and multi-label generalizations of min-cut (such as multi-way cut) are generally NP-hard. Another change we could consider making is to allow some pairs of neighboring pixels to have a negative cost for taking different labels (i.e., allowing $\lambda_{i,j} < 0$). In this case, our construction would have negative capacities $c_{i,j} < 0$, and the minimization problem is again NP-hard (by reduction from max-cut).
This difﬁculty is the heart of the tension between modeling and inference — the limits of our inference algorithms constrain the expressiveness of how we can model problems from applications.
1.4 Markov Random Fields
Having seen how these ideas work in a practical example, we can now formalize all of the concepts that make foreground-background segmentation work. The following deﬁnitions are all building towards the main topic of this thesis: optimization for Markov Random Fields.
1.4.1 Labeling Problems
In our segmentation example, our optimization problem had a variable $x_i$ for each pixel $i$, and each variable could take one of two labels, representing a choice of foreground or background for that pixel. Many problems have this general form, frequently with larger label sets than just $\{0, 1\}$. For example, in a semantic segmentation problem, the pixels take labels from a pre-defined set of semantic categories, such as Sky, Ground, Tree, etc. The label sets can be quite large in practice: in optical flow each pixel is labeled with a two-dimensional displacement, labeling every pixel with its corresponding pixel in the previous video frame, so that the two pixels belong to the same physical point. In this case the size of the label space is potentially the total number of pixels in the image (and even larger if subpixel displacements are desired).

Definition 10. A labeling problem is an optimization problem with a set of variables $x_i$ indexed by $i \in V$, each taking values from a label set $X_i$. That is, $X = \prod_{i \in V} X_i$.
We will use vector notation $x = (x_1, \ldots, x_n)$ to indicate a feasible state, which we will call a labeling. We will write $x_C$ for the subvector of $x$ restricted to the indices $i \in C$, so that $x_C = (x_i)_{i \in C}$. Similarly, define $X_C = \prod_{i \in C} X_i$ to be the subspace of $X$ corresponding to $C$.
1.4.2 Maximum A-Posteriori (MAP) Inference
In deﬁning the unary terms for equation (1.17), we indicated that the fi were chosen to correspond to how likely a particular pixel was to be background or foreground, given the information we observed in the image (i.e., that a particular yellow pixel was likely to be part of the banana, and therefore foreground). This rough idea is formalized by the notion of probabilistic inference.
For probabilistic inference, in addition to our unknown state x ∈ X, we also have a set of observations y ∈ Y. For example, in binary segmentation our observations include all the color values yi of each pixel i that form the image.
Probabilistic inference further assumes that there is some joint probability distribution $p \in P[X \times Y]$ over all possible observations and their associated labelings. We don't assume we have knowledge of this joint probability distribution, but it does allow us to consider the conditional probability $p[x|y]$: the probability of an unknown state $x \in X$, given that we observed $y \in Y$. This conditional distribution is known as the posterior probability (since it is the remaining probability after having observed a definite $y$). The Maximum A-Posteriori (MAP) problem is to find the state $x$ with highest posterior probability

$$ \max_{x \in X} p[x|y]. \tag{1.22} $$

That is, having seen (i.e., conditioning on) the values of the observable $y$, we choose the most probable $x$ as our answer. Note that MAP is not the only probabilistic decision method, but it is empirically quite successful, as well as amenable to an optimization approach.³

In order to compute this posterior probability, it is frequently easier to reason about another conditional distribution: p[y|x]. This is known as the likelihood function, and gives the probability of seeing an observation given that the true state was x. Likelihood functions are useful whenever we have a forward model of how our observations are formed from the hidden latent variables x.

Given this likelihood function, we can use Bayes' rule to compute the posterior distribution

$$ p[x|y] = \frac{p[y|x]\, p[x]}{p[y]}. \tag{1.23} $$

There are two other terms to explain. The probability $p[x]$ is the unconditional probability that the true state is $x$. Since this probability does not depend at all on our observation, it is known as the prior probability of the state $x$. The probability $p[y]$ is the unconditional probability of a particular observation $y$, regardless of the true state $x$. Since in MAP inference we are optimizing over $x$, and $y$ is fixed, $1/p[y]$ is a fixed positive constant multiplying our objective, so given Corollary 9 (on problem equivalence), we are free to ignore it. In contexts where $y$ can vary, this probability $p[y]$ is usually denoted $Z$ and is called the partition function.

³ It's worth noting that MAP inference is a Bayes Optimal decision procedure only when the loss function is 0-1, meaning we only care if we have found the exact correct solution and not any nearby point. In practice, we often have a loss function which allows some amount of inaccuracy; however, the decision procedure in this case often requires more difficult (and frequently intractable) inference algorithms, such as finding the posterior marginal distributions for each variable.
An important point is that for almost all applications, the true underlying distribution p[x, y] by which observations and hidden states are generated is both unknown and likely unknowable, and in any case much too complicated an object to perform any computation with. However, we may have a reasonable model, which approximates this probability distribution, and which is close enough to the true distribution to give reasonable answers.

1.4.3 Log-probabilities

A useful transformation for probabilistic inference is to consider the negative log of the probability, converting the product in (1.23) into a sum

$$ -\log(p[x|y]) = -\log(p[y|x]) - \log(p[x]) + \log(Z). \tag{1.24} $$

Because $-\log$ is monotonically decreasing on $(0, \infty)$, it is an order-reversing bijection, so the minimization problem

$$ \min_x -\log(p[x|y]) = \min_x -\log(p[y|x]) - \log(p[x]) + \log(Z) \tag{1.25} $$

is equivalent to our original probabilistic inference problem (1.22), by the same argument as Lemma 8 with the order reversed. This form has a number of advantages over (1.22). Because optimization problems are invariant up to addition of constants, we can ignore the term $\log Z$ entirely. Then, define

$$ f_{\text{data}}(x) := -\log(p[y|x]) \tag{1.26} $$

$$ f_{\text{prior}}(x) := -\log(p[x]). \tag{1.27} $$

Then the MAP problem is equivalent to

$$ \min_x f_{\text{data}}(x) + f_{\text{prior}}(x). \tag{1.28} $$

The negative log transformation therefore lets us separate the minimization into two parts, one dealing with how well a state $x$ fits the observed data $y$, and another dealing only with the prior probability of $x$.
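A tiny numerical example may help here. The sketch below uses a single binary variable with made-up prior and likelihood values (all numbers are assumptions) and checks that maximizing the posterior and minimizing the sum of the data and prior terms select the same state, and that the two objectives differ only by the constant $\log Z$.

```python
import math

# Hypothetical model for one binary variable x and one fixed observation y.
prior = {0: 0.7, 1: 0.3}        # p[x]
likelihood = {0: 0.1, 1: 0.6}   # p[y | x] for the observed y

# Bayes' rule (1.23); Z = p[y] normalizes the posterior.
Z = sum(likelihood[x] * prior[x] for x in prior)
posterior = {x: likelihood[x] * prior[x] / Z for x in prior}

# Energy terms obtained from the negative log transformation.
f_data = {x: -math.log(likelihood[x]) for x in prior}
f_prior = {x: -math.log(prior[x]) for x in prior}

x_map = max(prior, key=lambda x: posterior[x])            # problem (1.22)
x_min = min(prior, key=lambda x: f_data[x] + f_prior[x])  # problem (1.28)

# -log posterior equals the energy plus the constant log Z, as in (1.24).
constant_offset_holds = all(
    abs(-math.log(posterior[x]) - (f_data[x] + f_prior[x] + math.log(Z))) < 1e-12
    for x in prior)
```

Here the prior favors state 0 but the likelihood favors state 1 strongly enough that the MAP state is 1.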

If some of our probabilities are independent of each other, we can simplify this even further. Many models assume that each observation $y_i$ is generated independently from the other $y_j$, and in fact only depends on a single unknown $x_i$. In this case, we have

$$ p[y|x] = \prod_i p[y_i|x_i]. \tag{1.29} $$

We will say that such a model has separable likelihood, because we can take the negative log of these probabilities, and define $f_i(x_i) := -\log(p[y_i|x_i])$, to get

$$ f_{\text{data}}(x) = \sum_i f_i(x_i). \tag{1.30} $$

This is a separable function of the variables $x_i$. In this case, the functions $f_i$ are called unary data terms, since they are functions of a single variable, depending only on a single piece of data.

1.4.4 MAP inference in Foreground-Background Segmentation
Returning to our example of Foreground-Background Segmentation, we can now be explicit about the unary terms $f_i$ in (1.17), and how they relate to the likelihood of pixels taking certain labels. As we have seen, we can use Bayes' rule to calculate the posterior probability, but we need a likelihood function $p[y|x]$ and prior $p[x]$.

In reality, there is no simple rule which takes a foreground-background labeling x and gives a probability of different images y with that particular segmentation. The space of “all possible images” is much too large to specify a distribution over, and moreover requires a distribution over images collected from the real world.

We can make a reasonable approximation by choosing a much simpler, but still plausible choice for our likelihood function. In foreground-background segmentation, one common choice is to use a color model, where we assume that the color of objects is drawn from a population, and we have distinct populations for the foreground and background objects.

Speciﬁcally, we will choose for our unary likelihood terms a Gaussian Mixture Model (GMM), which models the individual pixels yi as independently chosen, and picked according to weighted sums of gaussians. The gaussians are different depending on whether the hidden state xi is foreground (xi = 1) or background (xi = 0).

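Definition 11. A Gaussian Mixture Model is a distribution over $\mathbb{R}^n$ which is a weighted sum of gaussians. A GMM has probability density function

$$ \mathrm{GMM}(c) = \sum_{i=1}^{k} w_i\, N_{\mu_i, \Sigma_i}(c) \tag{1.31} $$

for a point $c \in \mathbb{R}^n$ (in our application, a color $c \in \mathbb{R}^3$), where $w_1, \ldots, w_k$ are nonnegative weights summing to 1, and $N_{\mu_i, \Sigma_i}$ are normal distributions with mean $\mu_i$ and covariance $\Sigma_i$.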

Note that RGB colors come from $\mathbb{R}^3$, so we are using 3-dimensional gaussians. The GMM model for segmentation has 2 different distributions, $\mathrm{GMM}_{FG}$ and $\mathrm{GMM}_{BG}$. Each pixel $y_i$ in the image (part of our observations) is generated independently of the other pixels: if the corresponding label $x_i$ is foreground, then it is drawn from the distribution $\mathrm{GMM}_{FG}$; if $x_i$ is background, it is instead drawn from $\mathrm{GMM}_{BG}$. Therefore, the likelihood function is

$$ p[y_i|x_i] = \begin{cases} \mathrm{GMM}_{FG}(y_i) & x_i = 1 \\ \mathrm{GMM}_{BG}(y_i) & x_i = 0. \end{cases} \tag{1.32} $$

Taking negative log probabilities, we can define unary terms

$$ f_i(x_i) := -\log(p[y_i|x_i]) \tag{1.33} $$

$$ = \begin{cases} -\log(\mathrm{GMM}_{FG}(y_i)) & x_i = 1 \\ -\log(\mathrm{GMM}_{BG}(y_i)) & x_i = 0 \end{cases} \tag{1.34} $$

$$ = -\log(\mathrm{GMM}_{FG}(y_i))\,\delta_1(x_i) - \log(\mathrm{GMM}_{BG}(y_i))\,\delta_0(x_i) \tag{1.35} $$

where $\delta_z(x_i)$ is the delta function, defined to be 1 whenever $x_i = z$ and 0 otherwise. Then, the probabilistic inference problem with this choice for the likelihood function becomes

$$ \min_x \sum_i f_i(x_i) - \log(p[x]). \tag{1.36} $$

GMMs are an effective choice for the likelihood function since they capture the intuition of images being composed of a few like-colored objects. For example, in a scene composed of a white cow on a green ﬁeld with a blue sky, we will have 2 clusters of colors for background pixels, centered around blue and green respectively, and a single cluster of foreground pixels centered around white.
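The unary terms (1.33)-(1.35) can be sketched directly from Definition 11. To stay within the standard library, this sketch assumes diagonal covariance matrices (a simplification not made in the text), and the mixture parameters below are invented to mimic the white-cow example: one white foreground cluster, and blue and green background clusters.

```python
import math

def gaussian_diag(c, mean, var):
    """Density at c of a Gaussian with diagonal covariance diag(var)."""
    log_p = 0.0
    for ck, mk, vk in zip(c, mean, var):
        log_p += -0.5 * math.log(2 * math.pi * vk) - (ck - mk) ** 2 / (2 * vk)
    return math.exp(log_p)

def gmm_density(c, components):
    """Equation (1.31); components is a list of (weight, mean, var)."""
    return sum(w * gaussian_diag(c, m, v) for w, m, v in components)

def unary_terms(color, gmm_fg, gmm_bg):
    """Return (f_i(0), f_i(1)) for a pixel with the given color, as in (1.34)."""
    return (-math.log(gmm_density(color, gmm_bg)),
            -math.log(gmm_density(color, gmm_fg)))

# Hypothetical color models with RGB values in [0, 1].
GMM_FG = [(1.0, (0.9, 0.9, 0.9), (0.01, 0.01, 0.01))]    # white
GMM_BG = [(0.5, (0.1, 0.2, 0.9), (0.01, 0.01, 0.01)),    # blue sky
          (0.5, (0.1, 0.8, 0.1), (0.01, 0.01, 0.01))]    # green field

f0_white, f1_white = unary_terms((0.85, 0.9, 0.95), GMM_FG, GMM_BG)
f0_green, f1_green = unary_terms((0.1, 0.8, 0.1), GMM_FG, GMM_BG)
```

A near-white pixel gets a lower foreground cost and a green pixel a lower background cost. Because densities can exceed 1, these costs may be negative; adding the same constant to $f_i(0)$ and $f_i(1)$ changes $f$ only by a constant (Corollary 9), which is one way to obtain nonnegative capacities for the construction of Section 1.3.5.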

1.4.5 Conditional Dependence
One feature we have not discussed is the choice of prior $p[x]$. The first thing to note is that if $p[x]$ is the uniform prior over all labelings, then $\log(p[x])$ is a constant, so it can be removed from (1.36). Doing so, we get back the unary-only model of (1.17). Therefore, the separable optimization problem we discussed earlier is exactly the case where we assume no prior over possible segmentations $x$: every possible labeling is assumed to be equally likely.

Of course, this is not a very realistic prior — as we have discussed, pixels share a lot of information with their neighbors. Using this information, in the form of a prior p[x], leads us to our next topic, which is conditional dependence between variables.

In computer vision, conditional dependence between variables is most closely related to spatial locality in images. That is, information at a particular pixel i is strongly correlated with that of its neighbors j ∈ N(i). In particular, we are interested in probability distributions for which variables xi depend only on their neighbors x j for j ∈ N(i). Such distributions are called Markov [35].

The Markov property relates two different conditional distributions. The first is the probability of a given variable $x_i$ taking the label $a_i \in X_i$, conditioned on all other variables $x_{V-i}$ having some already determined labeling $a_{V-i}$. The second conditional probability is that of $x_i$ taking $a_i$, conditioned on just the variables neighboring $i$, $x_{N(i)}$, having labels $a_{N(i)}$.

Definition 12. A probability distribution $p \in P[X]$ is Markov (with respect to the neighbor structure $N$) if (1) we have $p[x] > 0$ for all $x \in X$ and (2) for each $i$, and every labeling $a_V \in X$, we have

$$ p[x_i = a_i \mid x_{V-i} = a_{V-i}] = p[x_i = a_i \mid x_{N(i)} = a_{N(i)}]. \tag{1.37} $$

Such a distribution is called a Markov Random Field.

That is, if we specify the labels $x_j = a_j$ for all the neighbors $j \in N(i)$, then the probability that $x_i = a_i$ is the same, regardless of the labels of any non-neighbor. The variables $x_i$ and $x_k$ for $k \notin N(i)$ (with $k \ne i$) are conditionally independent, since after conditioning on the neighbors $x_j$ for $j \in N(i)$, the variables $x_i$ and $x_k$ are independent. The Markov property is a particularly strong form of image locality, in that it says non-neighbors have no direct effect on a variable $x_i$.

1.4.6 The Hammersley-Clifford Theorem

While we are interested in the Markov property because it captures our intuition of spatial locality, we get lucky: all such distributions can be written in a particularly simple form. In particular, the probability is a product over subsets of the variables called cliques.

Definition 13. For a neighborhood graph $N$, the set of (Markov) cliques $\mathcal{C}$ consists of all fully-connected subgraphs

$$ \mathcal{C} := \{ C \subseteq V \mid \{i, j\} \in E \ \ \forall i, j \in C,\ i \ne j \}. \tag{1.38} $$

When we have a function fC(xC) that depends only on xC, (i.e., fC : XC → R) we will call fC a clique function.

Theorem 14 (Hammersley-Clifford [35]). A probability distribution $p$ is a Markov Random Field if and only if it can be written in the form

$$ p[x] = \prod_{C \in \mathcal{C}} e^{-f_C(x_C)} \tag{1.39} $$

where each $f_C$ is a clique function depending only on $x_C$. Such a distribution is called a Gibbs distribution.
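One direction of the theorem can be checked numerically on a small case. The sketch below builds a Gibbs distribution on a hypothetical three-variable chain 0 - 1 - 2 (the energies and the weight `LAM` are invented), normalizes it by enumeration, and verifies the Markov property (1.37) for variable 0, whose only neighbor is variable 1.

```python
import math
from itertools import product

# Chain 0 - 1 - 2 with binary labels: unary cliques plus the two edges.
LAM = 0.8
unary = [(0.0, 0.5), (0.3, 0.0), (1.0, 0.2)]
edge_list = [(0, 1), (1, 2)]

def energy(x):
    """Sum of clique functions: unary terms plus Potts terms on the edges."""
    e = sum(unary[i][x[i]] for i in range(3))
    return e + sum(LAM for i, j in edge_list if x[i] != x[j])

# Gibbs distribution, normalized by brute-force enumeration.
states = list(product((0, 1), repeat=3))
Z = sum(math.exp(-energy(x)) for x in states)
p = {x: math.exp(-energy(x)) / Z for x in states}

def conditional(i, a_i, given):
    """p[x_i = a_i | x_k = given[k] for every k in given]."""
    match = [x for x in states if all(x[k] == v for k, v in given.items())]
    return sum(p[x] for x in match if x[i] == a_i) / sum(p[x] for x in match)

# Markov property: conditioning on x_2 as well as x_1 changes nothing.
max_gap = max(abs(conditional(0, a[0], {1: a[1], 2: a[2]})
                  - conditional(0, a[0], {1: a[1]}))
              for a in states)

# Without conditioning on the neighbor x_1, x_0 does depend on x_2.
marginal_dependence = abs(conditional(0, 1, {2: 0}) - conditional(0, 1, {2: 1}))
```

The last quantity is a reminder that the Markov property is about conditional independence given the neighbors; non-neighbors are still correlated through the chain.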


Therefore, given a posterior distribution $p[x|y]$ which is Markov, we can take negative log-probabilities to get a simple form for the MAP problem:

$$ -\max_x \log p[x] = \min_x -\log \prod_{C \in \mathcal{C}} e^{-f_C(x_C)} = \min_x \sum_{C \in \mathcal{C}} f_C(x_C). \tag{1.40} $$

Since the MAP problem for any Markov Random Field can be written in the form (1.40) (and vice versa), we will henceforth take this as our basic definition of an MRF.

Definition 15. Let $\mathcal{C}$ be a subset of $2^V$, i.e., any collection of subsets of $V$, let $X = \prod_i X_i$ be a label space, and let $f : X \to \mathbb{R}$ be of the form

$$ f(x) = \sum_{C \in \mathcal{C}} f_C(x_C). \tag{1.41} $$

Then minimizing $f$ is called the MAP problem for the MRF $f$, or an MRF inference problem with clique structure $\mathcal{C}$ and clique functions $f_C$.

Note that we don't require the cliques $\mathcal{C}$ to line up with any particular set of fully connected subsets, or to necessarily come from the probability of any probabilistic inference problem. Because the inference methods described in later chapters apply to any such function, we will refer to any function of the form (1.41) as an MRF.
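As a data structure, an MRF in the sense of Definition 15 is just a collection of cliques, each with a function of the corresponding subvector. The following minimal sketch evaluates (1.41); the dictionary representation and the three example clique functions are assumptions made for illustration.

```python
def mrf_energy(x, clique_functions):
    """Evaluate f(x) as the sum over cliques C of f_C(x_C), as in (1.41).

    clique_functions maps a tuple of variable indices C to a function that
    takes the subvector x_C (a tuple of labels) and returns a cost.
    """
    return sum(f_C(tuple(x[i] for i in C))
               for C, f_C in clique_functions.items())

# Hypothetical MRF on four binary variables with cliques of size 1, 2 and 3.
clique_functions = {
    (0,): lambda xC: 0.5 * xC[0],                       # unary term
    (0, 1): lambda xC: 1.0 if xC[0] != xC[1] else 0.0,  # Potts-style pair
    (1, 2, 3): lambda xC: 2.0 if len(set(xC)) > 1 else 0.0,  # triple
}

energy_a = mrf_energy((1, 0, 0, 0), clique_functions)
energy_b = mrf_energy((0, 0, 0, 1), clique_functions)
```

Nothing here requires the cliques to come from a neighborhood graph, which matches the remark that any function of the form (1.41) will be called an MRF.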

Because many of the problems we deal with have a separable likelihood, we will generally separate out the unary terms in (1.41), and write $f$ as

$$ f(x) = \sum_{i \in V} f_i(x_i) + \sum_{C \in \mathcal{C}} f_C(x_C). \tag{1.42} $$

Remark 16. We will generally assume that the cliques are small relative to the number of variables: $|C| \ll n$. In this case, the cliques represent local structure, in that they are small collections of non-independent variables.

1.4.7 The Potts Model as an MRF

Returning to our example of foreground-background segmentation, we can now fit the Potts model of Section 1.3.4 into this framework. Recall equation (1.19) for the binary segmentation energy, written here with the weight $\lambda$ absorbed into the pairwise terms $f_{i,j}$:

$$ f(x) = \sum_i f_i(x_i) + \sum_{\{i,j\} \in E} f_{i,j}(x_i, x_j). \tag{1.43} $$

As we've already seen, the unary terms $f_i$ come from our probabilistic inference framework via the likelihood functions $p[y_i|x_i]$. The pairwise terms $f_{i,j}$, however, do not involve the observations $y$, and are therefore part of the prior, $p[x]$.

Note that (1.43) is a particular instance of (1.41), i.e., (1.43) defines an MRF. Here, the clique structure includes the neighbor pairs $\{i, j\} \in E$ for the pairwise terms and the singletons $\{i\}$ for the unary terms:

$$ \mathcal{C} = \{\{i, j\} \mid i \in N(j)\} \cup \{\{i\} \mid i \in V\}. \tag{1.44} $$

We can even give this particular choice of pairwise functions $f_{i,j}$ a probabilistic interpretation, by reversing the negative-log transformation we used to get (1.40):

$$ p[x|y] \propto \prod_i e^{-f_i(x_i)} \prod_{\{i,j\} \in E} e^{-f_{i,j}(x_i, x_j)}. \tag{1.45} $$

Note that we only get the probability up to proportionality: since the minimization problem is the same up to an additive constant, reversing the negative-log transformation only gives the probability up to a positive multiplicative constant.

Recalling that (with the weight $\lambda$ absorbed into the pairwise term) $f_{i,j}(x_i, x_j) = \lambda$ if $x_i \ne x_j$ and 0 otherwise, we can simplify the part of (1.45) dealing with the prior $p[x]$ to

$$ p[x] \propto \prod_{\{i,j\} \in E} e^{-f_{i,j}(x_i, x_j)} = \prod_{\{i,j\} \in E :\, x_i \ne x_j} e^{-\lambda}. \tag{1.46} $$

This distribution $p[x]$ is also known as the Ising model, which has been independently studied in physical models of spin states. In general, such Gibbs distributions [28] (distributions which are products of exponentials) are widely studied in statistical physics and other fields.

1.5 First-order and Higher-order MRFs

One of the most important factors for the complexity of solving an MRF inference problem is the order of the MRF.
Deﬁnition 17. The order of an MRF is the maximum size of any clique C ∈ C, minus one. We distinguish two cases: ﬁrst-order MRFs, which have maximum clique size two, and higher-order MRFs, which have cliques of size three or greater.
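Definition 17 translates directly into code. In this short sketch the two clique structures are hypothetical: one mimics the Potts model on a three-pixel chain, the other uses overlapping four-variable cliques of the kind a patch-based model would produce.

```python
def mrf_order(cliques):
    """Order of an MRF: the maximum clique size minus one (Definition 17)."""
    return max(len(C) for C in cliques) - 1

def is_higher_order(cliques):
    """True when some clique has size three or greater."""
    return mrf_order(cliques) >= 2

# Hypothetical clique structures.
potts_cliques = [{0}, {1}, {2}, {0, 1}, {1, 2}]   # unary and pairwise terms
patch_cliques = [{0, 1, 2, 3}, {2, 3, 4, 5}]      # overlapping 4-cliques

potts_order = mrf_order(potts_cliques)
patch_order = mrf_order(patch_cliques)
```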

First-order MRFs were historically the first to have efficient inference algorithms. According to the definition of a first-order MRF, every clique has size at most 2, so is either a unary term, $|C| = 1$, or involves a pair of variables, $|C| = 2$. Therefore, for first-order MRFs, we can form a graph $G = (V, E)$ with (unordered) edges $e = \{i, j\}$, such that

$$ f(x) = \sum_i f_i(x_i) + \sum_{\{i,j\} \in E} f_{i,j}(x_i, x_j). \tag{1.47} $$

Because the clique structure $\mathcal{C}$ forms a graph, we can employ existing graph algorithms for this problem. As seen in our example of binary segmentation, (1.47) is closely related to the min-cut problem in a graph, and can be solved exactly for binary problems like the one in that example. For multi-label problems, we can approximately solve the problem by repeated application of min-cut using algorithms such as alpha-expansion, which we will describe later.
In contrast to first-order MRFs, it is harder to develop efficient optimization algorithms for higher-order MRFs. The first difficulty is that even specifying the values of a clique function $f_C(x_C)$ for each of the labelings $x_C \in X_C$ requires storing $\ell^{|C|}$ values (where $\ell = |X_i|$). Additionally, whereas a first-order MRF naturally forms a graph, a higher-order MRF forms a hypergraph, where instead of edges consisting of pairs of vertices, we have hyperedges which are subsets of the vertices, of size possibly larger than 2. While many graph algorithms can be extended to work on hypergraphs, it is not always completely obvious how to do so. In particular, a generalized version of the min-cut algorithm for hypergraphs was only very recently applied to higher-order MRF optimization [53].
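To give a sense of scale for the storage cost, the sketch below counts the entries of an explicit table for a single clique function. The specific label counts and clique sizes are illustrative assumptions, not figures from the text.

```python
def table_size(num_labels, clique_size):
    """Number of entries needed to tabulate one clique function f_C."""
    return num_labels ** clique_size

# Hypothetical cases: a binary pairwise term, a binary 2x2 patch,
# and a 2x2 patch over 256 intensity labels.
binary_pair = table_size(2, 2)
binary_patch = table_size(2, 4)
intensity_patch = table_size(256, 4)
```

The last case already needs $2^{32}$ entries for one clique, which is why higher-order clique functions are usually represented implicitly rather than as tables.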
1.5.1 Advantages of Higher-Order Models
Despite these difﬁculties, higher-order MRFs have a number of advantages, the most important of which is that ﬁrst-order MRFs are limited in expressiveness compared to general MRFs.
The primary advantage of using higher-order MRFs is that they allow greater ﬂexibility in coming up with models that better match the true statistics of images. Probabilistic inference frequently makes simplifying assumptions about distributions in order to end up at a tractable optimization problem. If we are ultimately interested in more accurate answers, then we require more complicated models that better represent the true prior and likelihood functions.
Compare the different segmentation results from the unary-only model of (1.17) versus the pairwise Potts model of (1.19), as seen in Figure 1.4. In our probabilistic inference framework, the unary-only model corresponds to a MAP problem with a totally uniform prior, $p[x] = \frac{1}{|X|}$. This prior does not represent real-world segmentations well, and consequently the results we see in Figure 1.4 are noisy, with boundaries that do not closely match the actual object. By adding in the pairwise Potts terms, we have complicated the model (requiring a reduction to min-cut to solve, rather than the simple separable optimization of (1.17)); however, the observed segmentations correspond much more closely to what we expect real objects to look like. This was achieved by adding a prior p[x] which explicitly prefers segmentations with a short boundary (a quality which is shared with the true distribution of objects in images).

Compared to ﬁrst-order models, higher-order MRFs allow even more ﬂexibility to match the underlying distribution, and in many cases can express properties of images that are inexpressible in ﬁrst-order models. We will consider two examples, patch-based priors for denoising and curvature regularizing priors for stereo.

1.5.2 Image Denoising and Patch-Based Priors
A common problem in photography is that images suffer from noise — low light images in particular suffer from photon noise due to the quantized nature of light, thermal noise, and read-noise from imperfect electronics.
In the image denoising problem, we attempt to remove this noise from an image, in order to reconstruct the underlying scene being imaged. We have a forward model of image formation, where we observe a noisy image y, which is obtained from an underlying (noise-free) image x by the addition of a noise function η to each pixel:

$$y_i = x_i + \eta. \tag{1.48}$$

In our forward model, we will assume that η is independent for each pixel i, and that η is unbiased Gaussian noise, η ∼ N(0, σ). For concreteness, we’ll assume that the label space Xi is discretized to 256 intensity values Xi = {0, . . . , 255}.

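Our likelihood function is determined by our forward model. Since $y_i = x_i + N(0, \sigma)$, we have, dropping the normalizing constant (which does not depend on $x_i$),

$$p[y_i \mid x_i] = e^{-\frac{(y_i - x_i)^2}{2\sigma^2}}. \tag{1.49}$$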

Taking negative-log probabilities (as in Section 1.4.3) we get unary terms

$$f_i(x_i) = -\log p[y_i \mid x_i] = \frac{1}{2\sigma^2}(x_i - y_i)^2. \tag{1.50}$$

Note that this is an example of a separable likelihood function.

Without any prior on our problem, the optimization problem is just

$$\min_x \sum_i \frac{1}{2\sigma^2}(x_i - y_i)^2 \tag{1.51}$$

so the minimizing answer is just to set x = y — that is, if we have unbiased noise, and no prior beliefs about noise-free images, the MAP answer is to say that the original noise-free image is whatever we observed.
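The separable problem (1.51) can be checked numerically. The sketch below is an illustration under assumed inputs: it minimizes each unary term independently over the 256 intensity labels, which returns the label nearest to each observation (the observation itself whenever it is already a valid label). The function name and the sample observations are hypothetical.

```python
def denoise_unary_only(y, sigma=1.0, labels=range(256)):
    """Minimize sum_i (x_i - y_i)^2 / (2 sigma^2) one pixel at a time."""
    return [
        min(labels, key=lambda x: (x - yi) ** 2 / (2 * sigma ** 2))
        for yi in y
    ]

# Hypothetical noisy observations; the last two fall outside the label range.
observed = [12.0, 200.3, -4.0, 260.0]
restored = denoise_unary_only(observed)  # [12, 200, 0, 255]
```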

Of course, noisy images are quite different from noise-free images. In particular, we expect images to be composed of connected regions formed by individual objects (which may be actual objects, or patches of similar color within an object such as the spots on a cow) where the variation of intensity within each object is roughly constant, and with a few sharp transitions where objects meet.


We can capture this intuition with an edge-preserving prior, which is a first-order MRF model. For our clique structure C, we'll use the edges of either the 4- or 8-connected neighborhood model of Figure 1.5, so that C = E. On each edge {i, j} we will have what is called a robust distance function $f_{i,j}$ of the form

$$f_{i,j}(x_i, x_j) = \min\{|x_i - x_j|, \tau\}. \tag{1.52}$$

This speciﬁc cost is called the truncated L1 cost. To explain this choice, note that it has minimum cost when xi = x j, with gradually increasing cost as xi and x j have different intensities. This represents the part of our intuitive description that within an object, neighboring pixels should have similar intensities. Then, because we expect there to be edges between objects where the intensity can jump arbitrarily, we cap the maximum cost at τ.
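A direct transcription of (1.52) may help make the truncation visible. The threshold τ = 20 and the sample intensities are assumptions chosen for illustration.

```python
def truncated_l1(xi, xj, tau):
    """Truncated L1 cost of (1.52): grows with |xi - xj| but is capped at tau."""
    return min(abs(xi - xj), tau)

same_intensity = truncated_l1(10, 10, tau=20)   # 0: identical neighbors are free
small_change = truncated_l1(10, 13, tau=20)     # 3: small differences cost a little
object_boundary = truncated_l1(10, 200, tau=20) # 20: large jumps are capped at tau
```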

These edge-preserving priors do a reasonable job matching the observed statistics of neighboring pixel differences, and form a ﬁrst-order MRF, so we have fast algorithms for optimizing them. However, they fail to take into account more complicated statistics of images, especially those related to image texture. In particular, the truncated L1 cost always prefers neighbors to have identical intensity values, so totally ﬂat regions will always be preferred to regions of small texture variations.

It is difﬁcult to come up with a texture-aware ﬁrst-order MRF. However, with higher-order MRFs it is relatively easy. Instead of putting a cost fi, j on pairs of neighboring pixels, we will put a cost fC on patches of the image (such priors are called patch-based priors). These patch functions fC are designed to match the statistics of noise-free image patches, and can thus account for local intensity variation due to texture (among other features). We will focus on a particular

prior known as Field of Experts [77], which consists of a set of k linear patch filters $J_i$ run through a non-linear response function to give a cost function

$$f_C(x_C) = \sum_{i=1}^{k} \alpha_i \log(1 + J_i^T x_C). \tag{1.53}$$

Figure 1.6: From left to right: (a) Original image, before noise. (b) Image after adding independent Gaussian noise to each pixel. (c) Denoised result using the pairwise edge-preserving prior of (1.52). (d) Denoised result using the Field of Experts prior of (1.53).

This cost function is not as clearly intuitive as the edge-preserving truncated L1 cost; however, we can use machine learning techniques so that the ﬁlters Ji and weights αi are chosen such that the resulting prior p[x] matches the observed distribution of image patches as closely as possible.

The results of optimizing a denoising problem with a first-order MRF and a higher-order Field of Experts MRF are shown in Figure 1.6. As expected, the more complicated higher-order model is better able to capture the local variations due to shading gradations and texture, and leads to a less "flattened" result. Of course, this complexity is not free: optimizing these higher-order models is more challenging. Consequently, we will use this particular denoising model as a benchmark for evaluating various higher-order optimization methods in later sections.

1.5.3 Curvature Regularizing Priors for Stereo

In some cases, certain image properties are simply impossible to express in a ﬁrst-order MRF, as we will see from this example of stereo reconstruction. In the stereo reconstruction problem, we are given a left and right image taken by two cameras separated by a baseline. The cameras are calibrated so that rows in one image correspond to rows in the other image, and if a point in space maps to a pixel at column i of the right image, and column i + δ of the left image, then that point must be at a depth ∆/δ, where ∆ is the separation of the cameras, known as the baseline. The difference δ is known as the disparity. We will turn stereo reconstruction into a labeling problem by discretizing the disparity into a ﬁnite set of pixel differences {0, 1, . . . , D}. Each variable xi gives the disparity for each pixel i in the right image, from which we can infer the depth (given the baseline ∆).

We are interested in what kinds of priors are appropriate for stereo reconstruction. As with image denoising, we expect the depth image to consist of connected regions, so some sort of edge-preserving prior will be required. This prior should reﬂect the variation we expect within each region.

A simple prior is to put a cost on neighboring pixels whenever their disparities are different. There are other choices for this, but one possibility is the truncated L1 cost that we used for denoising

$$f_{i,j}(x_i, x_j) = \min\{|x_i - x_j|, \tau\}. \tag{1.54}$$

This function has minimum cost when the disparity values for i and j are the same. Consequently, this prior prefers the depth image to be composed of regions which are all at the same depth, i.e., regions which each lie on a plane parallel to the camera image plane. For this reason, such priors are called fronto-parallel priors.

However, even simple depth images are not fronto-parallel: for example the depth image of a flat wall taken from a 45 degree angle is a slanting plane. Even with the simple assumption that real scenes are formed of totally flat objects, we would expect each region to be a slanting plane, but not necessarily fronto-parallel. An example of a prior based on this intuition (from the work of [102]) is to penalize the curvature of the depth map, as flat planes have no curvature. We do this by putting a cost on 3 × 1 windows of the image, with a cost function $f_{i,j,k}$ of the three disparities $x_i$, $x_j$ and $x_k$ (with corresponding depths $\frac{\Delta}{x_i}$, $\frac{\Delta}{x_j}$, $\frac{\Delta}{x_k}$):

$$f_{i,j,k}(x_i, x_j, x_k) = \min\left\{ \left| \frac{\Delta}{x_i} - 2\frac{\Delta}{x_j} + \frac{\Delta}{x_k} \right|, \tau \right\}. \tag{1.55}$$

The quantity $\frac{\Delta}{x_i} - 2\frac{\Delta}{x_j} + \frac{\Delta}{x_k}$ is a discrete approximation to the curvature of the depth map at pixel j, and as with the truncated L1 cost, we cap this cost at τ to not overly penalize discontinuities between objects.

As shown in Figure 1.7, the results achieved by including a curvature regularizing prior are much more realistic, and do not arbitrarily chop the result into fronto-parallel planes, as in the ﬁrst-order prior.

Figure 1.7: (Left) Example synthetic input to a stereo reconstruction problem. (Center) Reconstruction using a first-order prior, resulting in many fronto-parallel planes. (Right) Reconstruction using a second-order curvature regularizing prior (cliques of size three), resulting in smooth planes and curves. All images from [102].

It is not just more natural to express curvature priors as a higher-order MRF; it is actually impossible to express these constraints using only neighboring pixel differences. To see this, in Figure 1.8, we have two possible sets of depths for a group of three pixels. The first possibility is planar, and should have low cost, while the other has a corner, and should have higher cost. A higher-order MRF with cliques of size 3 can distinguish between these two cases, and assign a higher cost to the case with a corner. However, a first-order MRF sees 4 pairs of pixels, each with the same distance $|x_i - x_j|$ in the disparities. By symmetry, we shouldn't penalize pairs of pixels sloping back left-to-right versus right-to-left, so we must assign the same cost to the plane and the corner.
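The plane-versus-corner argument can be checked on a toy example. The sketch below is an illustration only: it treats the three values directly as depths (an assumption, skipping the disparity-to-depth conversion), and the specific triples and τ = 5 are hypothetical. The two configurations have identical absolute neighbor differences, so no symmetric pairwise cost can separate them, while the second difference does.

```python
def pairwise_differences(z):
    """Sorted absolute differences between horizontally adjacent values."""
    return sorted(abs(a - b) for a, b in zip(z, z[1:]))

def curvature_cost(z, tau):
    """Truncated absolute second difference of three consecutive depth values."""
    zi, zj, zk = z
    return min(abs(zi - 2 * zj + zk), tau)

plane = (1, 2, 3)   # constant slope
corner = (1, 2, 1)  # slope changes sign at the middle pixel

same_pairwise = pairwise_differences(plane) == pairwise_differences(corner)
plane_cost = curvature_cost(plane, tau=5)
corner_cost = curvature_cost(corner, tau=5)
```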
Despite the extra (but necessary) expressiveness of these higher-order priors, they are challenging for existing optimization methods to handle. Ad-hoc methods have been proposed, including a reduction of these second-order cliques to a ﬁrst-order model, as proposed in the paper which introduced this model [102]. However, it is our goal to show how to optimize such models in general. As with patch-based image denoising, this model is an important benchmark of higher-order inference algorithms.

Figure 1.8: (Left) Two possible disparities for a set of three pixels. (Center) A higher-order model with size 3 cliques can see all three pixels, and assign different costs accordingly. (Right) A ﬁrst-order model sees 4 different pairs of pixel differences, each with the same difference between the two pixels (with some slanting left to right, and some the opposite direction).
1.6 Conclusion

There are three main points from this introduction, which will hopefully frame the problem of inference in higher-order MRFs.

First, we have seen that MRFs arise from probabilistic inference, in particular in probabilistic inference problems in which the variables have spatial locality, as in the conditional dependence of nearby pixels in an image.

Second, not all MRFs are probabilistic inference problems; many are hand constructed without reference to forward models or prior distributions. However, all of the optimization techniques developed in further chapters will handle equations of the form

$$\min_x \sum_{C} f_C(x_C) \tag{1.56}$$

regardless of how the clique functions are chosen. That is, this equation abstracts away the issues of probabilistic inference, to a form which can be directly handled by optimization algorithms.

Finally, we have seen how modeling more sophisticated priors leads to better answers, but can also result in harder inference problems. In particular, higher-order models allow much greater flexibility for modeling than first-order priors, sometimes including constraints on the solution that cannot be expressed using only first-order terms.

CHAPTER 2 MATHEMATICAL BACKGROUND

In this chapter, we cover the mathematical tools necessary to present both the related work, as well as the main results of this thesis. To recall: our main problem is the MAP problem in MRFs, in which we are trying to minimize a function of the form

$$f(x) = \sum_{C \in \mathcal{C}} f_C(x_C) \tag{2.1}$$

where the variables $x = (x_1, \ldots, x_n)$ come from a label space $X = \prod_i X_i$, and the functions $f_C$ are clique functions, each depending on a corresponding subset C of the variables, for each C in the clique set $\mathcal{C}$.

We will begin by looking at transformations of (2.1) called reparameterizations in Section 2.1, which will be a useful tool in many optimization algorithms. The special case of MRFs with binary labels in Section 2.2 provides most of the known theoretical results on hardness of optimization and approximation. Section 2.3 covers submodular functions, which are a class of MRFs for which exact optimization is tractable. Convex relaxations of the MRF optimization problem, called the Marginal Polytopes, are presented in Section 2.4. The Marginal Polytope, and Linear Programming relaxations in general, form the basis for the main algorithms of this thesis, so we give an overview of Linear Programming in Section 2.5, along with the related concepts of convex functions (Section 2.6), linear programming duality (Section 2.7), and optimality conditions (Section 2.8). Finally, we give specific applications of these linear programming concepts to MRF inference, in particular the dual of the Local Marginal Polytope in Section 2.9. We conclude with a description of graph cuts algorithms, and how max-flow min-cut algorithms fit into our linear programming framework in Section 2.10.
2.1 Reparameterization

A simple but useful fact is that any particular function may have many ways of being written, and that different forms may reveal useful information about an optimization problem, or be otherwise more convenient for algorithms. These multiple representations of the same function are called reparameterizations.

Definition 18. Let $f, f' : X \to \mathbb{R}$ be MRFs on the same label space, where

$$f(x) = \sum_{C \in \mathcal{C}} f_C(x_C), \qquad f'(x) = \sum_{C \in \mathcal{C}'} f'_C(x_C). \tag{2.2}$$

Then $f'$ is a reparameterization of $f$ if $f'(x) = f(x)$ for every $x \in X$. That is, they are equal as functions $X \to \mathbb{R}$.

Note that the two functions $f$ and $f'$ may have very different values for the clique functions $f_C$ and $f'_C$, and may even have different clique structures $\mathcal{C}$ and $\mathcal{C}'$.

A common use of reparameterization is to put an MRF into a particular normal form as a starting point for optimization. For example, if we don’t want to deal with negative values, we can always re-write an MRF so that each clique function is nonnegative (except for a constant term which may be negative).

Lemma 19. Any MRF $f : X \to \mathbb{R}$ can be reparameterized to a form

$$f(x) = f'_\emptyset + \sum_{C \in \mathcal{C}} f'_C(x_C) \tag{2.3}$$

where for every $C \neq \emptyset$ we have $f'_C(x_C) \geq 0$ for all $x_C \in X_C$.

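Proof. For every $C \in \mathcal{C}$, let $\delta_C = \min_{x_C} f_C(x_C)$. We'll set $f'_C = f_C - \delta_C$ for $C \neq \emptyset$, and

$$f'_\emptyset = f_\emptyset + \sum_{C \neq \emptyset} \delta_C. \tag{2.4}$$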

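First, we have achieved our goal that the clique functions $f'_C$ are nonnegative, since $f'_C(x_C) = f_C(x_C) - \delta_C$ and $f_C(x_C) \geq \delta_C$ by choice of $\delta_C$, hence $f'_C(x_C) \geq 0$.

Now, to see that $f'$ is a reparameterization of $f$, we just expand out the terms to see that we have added and subtracted constants in a way that cancels out. For any x we have

$$\begin{aligned}
f'(x) &= f'_\emptyset + \sum_{C \neq \emptyset} f'_C(x_C) \\
&= f_\emptyset + \sum_{C \neq \emptyset} \delta_C + \sum_{C \neq \emptyset} \left( f_C(x_C) - \delta_C \right) \\
&= f_\emptyset + \sum_{C \neq \emptyset} f_C(x_C) = f(x)
\end{aligned} \tag{2.5}$$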

Most of the reparameterizations used later follow this same basic form: we take some of the cost from one clique function $f_C$ and move it to another clique function $f_{C'}$, in such a way that the addition and subtraction balance out.
Even for the simple reparameterization in Lemma 19, we can prove useful facts about the original MRF f, in this case giving an easy lower bound on the value of the optimal solution.
Corollary 20. If f is reparameterized as in Lemma 19, then $f(x) \geq f'_\emptyset$ for all $x \in X$. In particular, $\mathrm{OPT}(f) \geq f'_\emptyset$.
Proof. Since every term of $f'$ has $f'_C(x_C) \geq 0$ (except for possibly $f'_\emptyset$), we have $f(x) = f'(x) \geq f'_\emptyset$.
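The construction in Lemma 19 and the bound of Corollary 20 can be sketched for an MRF in the explicit (table) representation. The data layout (a constant plus a dictionary mapping each clique, given as a tuple of variable indices, to a table keyed by label tuples), the function names, and the two-variable example are assumptions made for illustration.

```python
from itertools import product

def evaluate(const, cliques, x):
    """Evaluate const + sum_C f_C(x_C) for a labeling x."""
    return const + sum(table[tuple(x[i] for i in C)]
                       for C, table in cliques.items())

def make_nonnegative(const, cliques):
    """Subtract each clique's minimum delta_C and add it to the constant term."""
    new_const = const
    new_cliques = {}
    for C, table in cliques.items():
        delta = min(table.values())
        new_const += delta
        new_cliques[C] = {labels: v - delta for labels, v in table.items()}
    return new_const, new_cliques

# Hypothetical binary MRF with one unary and one pairwise clique.
cliques = {
    (0,): {(0,): -1.0, (1,): 2.0},
    (0, 1): {(0, 0): 3.0, (0, 1): -4.0, (1, 0): 0.5, (1, 1): 1.0},
}
const = 0.0
new_const, new_cliques = make_nonnegative(const, cliques)

all_x = list(product([0, 1], repeat=2))
same_function = all(
    abs(evaluate(const, cliques, x) - evaluate(new_const, new_cliques, x)) < 1e-12
    for x in all_x
)
nonnegative = all(v >= 0 for t in new_cliques.values() for v in t.values())
lower_bound_holds = all(evaluate(const, cliques, x) >= new_const for x in all_x)
```

In this example the new constant is −5, which happens to equal the true minimum; in general it is only a lower bound.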

Another useful reparameterization that moves cost from higher-order terms to unary terms is the pencil reparameterization, from the Min-sum diffusion algorithm of [100]. In the terminology of [100], a pencil¹ consists of a label a for a variable $x_i$, together with all the values $f_C(x_C)$ of a single clique C, for just the labelings $x_C$ with $x_i = a$. We can get a reparameterization by subtracting a value δ from all of these $f_C(x_C)$, and adding the same δ to the unary term $f_i(a)$:

$$f'_C(x_C) = \begin{cases} f_C(x_C) - \delta & x_i = a \\ f_C(x_C) & \text{otherwise} \end{cases} \qquad f'_i(x_i) = \begin{cases} f_i(x_i) + \delta & x_i = a \\ f_i(x_i) & \text{otherwise} \end{cases} \tag{2.6}$$

This transformation doesn't change the cost: if x happens to have $x_i = a$ then whatever the rest of the labels in $x_C$ are, we subtracted δ from $f_C(x_C)$, while adding δ to $f_i(x_i)$, which cancel out. And if $x_i \neq a$ then we didn't change the value of $f_C$ or $f_i$.
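A small sketch of the pencil update (2.6) on explicit tables follows. The table layout, the parameter `position` (the index of variable i inside the clique), and the example values are assumptions for illustration; this is not the min-sum diffusion algorithm itself, only the single reparameterization step it is built from.

```python
from itertools import product

def pencil_reparameterize(unary_i, clique_table, position, a, delta):
    """Move delta from the clique entries with x_i = a onto the unary term f_i(a)."""
    new_unary = dict(unary_i)
    new_unary[a] += delta
    new_table = {
        labels: (value - delta if labels[position] == a else value)
        for labels, value in clique_table.items()
    }
    return new_unary, new_table

# Hypothetical clique on three binary variables; variable i sits at position 0.
clique_table = {labels: float(sum(labels) ** 2)
                for labels in product([0, 1], repeat=3)}
unary_0 = {0: 1.0, 1: 2.0}

new_unary_0, new_table = pencil_reparameterize(
    unary_0, clique_table, position=0, a=1, delta=0.5)

# The combined cost f_i(x_i) + f_C(x_C) is the same before and after.
unchanged = all(
    abs((unary_0[x[0]] + clique_table[x]) - (new_unary_0[x[0]] + new_table[x])) < 1e-12
    for x in clique_table
)
```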

2.2 Pseudoboolean functions
In the following two sections, we will restrict our attention to binary problems; that is, MRFs where the label set for each variable is {0, 1}. We have already seen that binary MRFs are of particular importance for graph-cuts methods, since the min-cut problem is itself binary (each vertex is either on the s or t side of the cut). Consequently, such functions have attracted a good deal of research in the combinatorial optimization literature, as far back as [34], where they are known as pseudoboolean functions.

¹ Presumably so-called because in the graph construction of [100] a pencil consists of edges from nodes representing all the $f_C(x_C)$ to the single node representing $x_i$; these edges come to a point, thus resembling a pencil tip.

Deﬁnition 21. A pseudoboolean function (of n variables) is a function f : {0, 1}V → R.
As with MRFs, many pseudoboolean functions have local structure in which they can be written as a sum of clique functions, so that $f(x) = \sum_C f_C(x_C)$. Each $f_C : \{0,1\}^C \to \mathbb{R}$ is also a pseudoboolean function, defined just on the clique C. A pseudoboolean function f is higher-order if f has cliques C of size 3 or greater, and f is first-order if |C| ≤ 2 for all cliques.
2.2.1 Representations of MRFs
For all MRFs (including multi-label MRFs), there are two main representations worth mentioning.
• In the explicit representation, the MRF f is given as a table of values for each clique function fC. That is, for every C ∈ C and xC ∈ XC, the value fC(xC) is given as part of the input to the optimization algorithm.
• In the implicit (or black box) representation, the clique structure C is given explicitly (as a collection of subsets C ⊆ V) but each clique function fC is given as an oracle. That is, there is a piece of code or an algorithm which computes fC given xC.
Generally, we assume that MRFs are given in the implicit representation, since this is the more general form², but both make sense in different scenarios. However, for the binary case, there are additional representations which make use of the combinatorial structure of binary labels: set functions and multilinear polynomials.

² The black box is more general from the point of view of the algorithm, since a black box algorithm can be used on any other representation.

2.2.2 Set Functions

In the binary segmentation example (Section 1.3), we used a mapping between binary labelings $\{0,1\}^V$ and subsets S ⊆ V in order to convert between minimum cuts in a graph and segmentations. In general, we can use this conversion to map any pseudoboolean function $f : \{0,1\}^V \to \mathbb{R}$ to a function taking subsets S ⊆ V as input. Such functions are called set functions.
Definition 22. A set function with base set V is a function $f : 2^V \to \mathbb{R}$.

We can convert from boolean vectors x ∈ {0, 1}V to subsets S ⊆ V by deﬁning

S (x) = {i ∈ V | xi = 1}

(2.7)

In the other direction, we get a boolean vector x from a subset S by setting xi = 1 if i ∈ S and 0 otherwise. We’ll denote this map x(S ). It is easily checked that these maps are inverses of each other.
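The two maps S(x) and x(S) are easy to write down and test. In this sketch the base set is assumed to be V = {0, ..., n−1} (zero-based, unlike the one-based indexing of the text), and the function names are hypothetical.

```python
from itertools import product

def set_from_vector(x):
    """S(x): the set of indices whose label is 1, as in (2.7)."""
    return frozenset(i for i, xi in enumerate(x) if xi == 1)

def vector_from_set(S, n):
    """x(S): the binary vector with x_i = 1 exactly when i is in S."""
    return tuple(1 if i in S else 0 for i in range(n))

n = 3
# Converting a vector to a set and back recovers the original vector.
round_trip_ok = all(
    vector_from_set(set_from_vector(x), n) == x
    for x in product([0, 1], repeat=n)
)
```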
From this bijection, it is clear that we can just as easily define pseudoboolean functions to be functions $f : 2^V \to \mathbb{R}$, i.e., set functions. In many of the following results, it will be convenient to use either the boolean vector representation or the set function representation depending on the situation, so we will switch between them freely.
Finally, note that we have chosen the convention that $x_i = 1$ if and only if i ∈ S. The other choice of $x_i = 0$ if and only if i ∈ S is also valid, but less common.

2.2.3 Multilinear Polynomials

A distinguishing feature of the set {0, 1} is that for both x = 0 and x = 1 we have $x^2 = x$. This lets us write polynomial functions of x very simply: higher powers $x_i^d$ collapse to linear terms $x_i$. For example, for $x \in \{0,1\}^V$ we have $7x_1^3 x_3^2 x_4^4 = 7x_1 x_3 x_4$. That is, all polynomials over $\{0,1\}^V$ are multilinear.

Therefore, to specify a monomial, we don't need the powers on each variable, only the subset H ⊆ V containing the variables in the monomial, and a coefficient $a_H$. In the example $7x_1^3 x_3^2 x_4^4$, we have H = {1, 3, 4} and $a_H = 7$. With this notation, every monomial can be written $a_H \prod_{i \in H} x_i$ for some H ⊆ V and $a_H \in \mathbb{R}$.
Definition 23. A multilinear polynomial is a pseudoboolean function of the form

$$f(x) = \sum_{H \in \mathcal{H}} a_H \prod_{i \in H} x_i \tag{2.8}$$

where $\mathcal{H} \subseteq 2^V$ is the collection of variable sets appearing in the monomials, and $a_H \in \mathbb{R}$ are the coefficients for each term.

2.2.4 Properties of Multilinear Polynomials
A non-obvious fact is that every pseudoboolean function can be written as a multilinear polynomial. In fact, this multilinear representation is unique, up to ignoring terms with coefficient $a_H = 0$. Since adding terms with $a_H = 0$ doesn't change the polynomial, we get a canonical form by including a term for every $H \subseteq V$ (setting $a_H = 0$ for $H \in 2^V \setminus \mathcal{H}$).

Lemma 24. The multilinear polynomial representation of a pseudoboolean function f is unique. In particular, the coefficients $a_H$ of f are given by

$$a_H = \sum_{S \subseteq H} (-1)^{|H \setminus S|} f(x(S)) \tag{2.9}$$

where x(S) is the binary vector corresponding to the set S (see Section 2.2.2).

We can simplify the proof of this lemma by noting the following formula for f(x) in terms of the coefficients $a_H$:
Proposition 25. For a multilinear polynomial f, we have

$$f(x(S)) = \sum_{H \subseteq S} a_H. \tag{2.10}$$

Proof. For $H \in \mathcal{H}$, we have two cases:

• If $H \subseteq S$, then in the binary vector x(S), $x_i = 1$ for all i ∈ H, so $a_H \prod_{i \in H} x_i = a_H$.
• If $H \not\subseteq S$ then there is some j ∈ H with j ∉ S. So, in x(S), we have $x_j = 0$, and hence $a_H \prod_{i \in H} x_i = 0$.

Therefore, we have that $a_H \prod_{i \in H} x_i = a_H [H \subseteq S]$, where $[H \subseteq S]$ is 1 if $H \subseteq S$ and 0 otherwise, so

$$f(x(S)) = \sum_{H \in \mathcal{H}} a_H \prod_{i \in H} x_i = \sum_{H \in \mathcal{H}} a_H [H \subseteq S] = \sum_{H \subseteq S} a_H. \tag{2.11}$$

Using this fact, the lemma follows:

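Proof of Lemma 24. We prove this by induction on |H|. For H = ∅, Proposition 25 gives

$$f(x(\emptyset)) = \sum_{H \subseteq \emptyset} a_H = a_\emptyset \tag{2.12}$$

so $a_\emptyset = f(x(\emptyset)) = (-1)^{|\emptyset|} f(x(\emptyset)) = \sum_{S \subseteq \emptyset} (-1)^{|\emptyset \setminus S|} f(x(S))$.

Then, for general H we have

$$f(x(H)) = \sum_{H' \subseteq H} a_{H'} = a_H + \sum_{H' \subsetneq H} a_{H'}. \tag{2.13}$$

By induction we expand out $a_{H'}$ to get

$$\begin{aligned}
f(x(H)) &= a_H + \sum_{H' \subsetneq H} \sum_{S \subseteq H'} (-1)^{|H' \setminus S|} f(x(S)) && \text{(2.14)} \\
&= a_H + \sum_{S \subseteq H} \; \sum_{H' : S \subseteq H' \subsetneq H} (-1)^{|H' \setminus S|} f(x(S)) && \text{(2.15)} \\
&= a_H + \sum_{S \subseteq H} f(x(S)) \sum_{H' : S \subseteq H' \subsetneq H} (-1)^{|H' \setminus S|} && \text{(2.16)} \\
&= a_H + \sum_{S \subsetneq H} -(-1)^{|H \setminus S|} f(x(S)) && \text{(2.17)}
\end{aligned}$$

The last step uses the fact that, for $S \subsetneq H$, the alternating sum $\sum_{H' : S \subseteq H' \subseteq H} (-1)^{|H' \setminus S|}$ over all intermediate sets is zero, so leaving out the term $H' = H$ gives $-(-1)^{|H \setminus S|}$; for S = H the inner sum is empty. Rearranging, we have

$$\begin{aligned}
a_H &= f(x(H)) + \sum_{S \subsetneq H} (-1)^{|H \setminus S|} f(x(S)) && \text{(2.18)} \\
&= \sum_{S \subseteq H} (-1)^{|H \setminus S|} f(x(S)) && \text{(2.19)}
\end{aligned}$$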
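Formula (2.9) and Proposition 25 can be verified by brute force on a small function. The sketch below is illustrative only: it assumes V = {0, 1, 2}, uses a hypothetical example function, and enumerates all subsets, so it is exponential in n.

```python
from itertools import combinations, product

def subsets(H):
    """Yield every subset of H as a frozenset."""
    items = sorted(H)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def vector_from_set(S, n):
    return tuple(1 if i in S else 0 for i in range(n))

def multilinear_coefficients(f, n):
    """Compute a_H = sum_{S subset of H} (-1)^{|H \\ S|} f(x(S)), as in (2.9)."""
    coeffs = {}
    for H in subsets(range(n)):
        a_H = sum((-1) ** (len(H) - len(S)) * f(vector_from_set(S, n))
                  for S in subsets(H))
        if a_H != 0:
            coeffs[H] = a_H
    return coeffs

def evaluate_polynomial(coeffs, x):
    """Sum the coefficients of monomials whose variables are all 1 (Proposition 25)."""
    return sum(a_H for H, a_H in coeffs.items() if all(x[i] == 1 for i in H))

def f(x):
    # Hypothetical pseudoboolean function: OR of three bits plus twice the first bit.
    return max(x) + 2 * x[0]

coeffs = multilinear_coefficients(f, 3)
polynomial_matches = all(evaluate_polynomial(coeffs, x) == f(x)
                         for x in product([0, 1], repeat=3))
```

For this example the recovered polynomial is $3x_0 + x_1 + x_2 - x_0x_1 - x_0x_2 - x_1x_2 + x_0x_1x_2$.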

2.2.5 Computational Complexity and Hardness of Approximation
Pseudoboolean functions are closely linked with the family of NP-complete problems relating to boolean satisfiability. In particular, it is trivial to give a reduction from the maximum satisfiability problem (MAX-SAT) to the pseudoboolean optimization problem.
Recall that in boolean satisﬁability problems, the literals (the boolean variables xi, along with their negations x¯i) are combined into a set of logical formulas by the operators conjunction, ∧, and disjunction, ∨. In the MAX-SAT problem, we have a set of clauses, each of which is a disjunction of a subset of the literals, for example, C = x1 ∨ x¯3 ∨ x¯9. MAX-SAT is an optimization problem: our objective is to ﬁnd a setting of the variables which maximizes the total number of satisﬁed clauses.
It is clear that this objective is already a pseudoboolean function: f(x) = |{C | C is satisfied}| is a function $\{0,1\}^n \to \mathbb{R}$. We can make this a minimization problem by minimizing −f, hence we have given a reduction from MAX-SAT to minimization of pseudoboolean functions, and therefore we have:
Theorem 26. Minimization of pseudoboolean functions is NP-hard.

Even restricting ourselves just to the simplest case of first-order pseudoboolean functions doesn't make things any easier. Like MRFs, satisfiability problems are also distinguished by their order: MAX-2SAT restricts all clauses to contain at most 2 literals, and is still NP-hard. In this case, the objective function f can be written as a first-order pseudoboolean function. Denote the set of clauses as $\mathcal{C}$. For a clause C, we get a clique function³

$$f_C(x_i, x_j) = \begin{cases} 1 & C \text{ satisfied by } x_i, x_j \\ 0 & \text{otherwise} \end{cases} \tag{2.20}$$

³ We are abusing notation to use C for both the clause and corresponding clique C, since both are just a collection of variables.

In this case, the MAX-2SAT objective is

$$f(x) = \sum_{C \in \mathcal{C}} f_C(x_i, x_j). \tag{2.21}$$

Maximizing this objective is the same as minimizing −f, which is a first-order pseudoboolean minimization problem, and therefore we have:

Theorem 27. Minimizing ﬁrst-order pseudoboolean functions is NP-hard.
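As an illustration of the reduction, the sketch below writes a tiny MAX-2SAT instance as a sum of two-variable clique functions and minimizes −f by exhaustive search. The clause encoding (pairs of `(variable, is_positive)` literals), the function names, and the instance are assumptions for the example; exhaustive search is used only because the instance has two variables.

```python
from itertools import product

def clause_satisfied(clause, x):
    """f_C of (2.20): True when at least one literal of the clause holds under x."""
    return any(x[var] == (1 if positive else 0) for var, positive in clause)

def num_satisfied(clauses, x):
    """The MAX-2SAT objective (2.21): number of satisfied clauses."""
    return sum(1 for clause in clauses if clause_satisfied(clause, x))

# Hypothetical instance: all four possible clauses on two variables.
clauses = [
    ((0, True), (1, True)),    # x0 or x1
    ((0, False), (1, True)),   # not x0 or x1
    ((0, True), (1, False)),   # x0 or not x1
    ((0, False), (1, False)),  # not x0 or not x1
]

# Minimizing -f is the same as maximizing the number of satisfied clauses.
best = min(product([0, 1], repeat=2), key=lambda x: -num_satisfied(clauses, x))
best_count = num_satisfied(clauses, best)
```

Every assignment falsifies exactly one of these four clauses, so the optimum satisfies three of them.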

Having established that even simple pseudoboolean functions are NP-hard to optimize, we can ask the secondary question of whether any approximation algorithm is possible.
Deﬁnition 28. For a number α ≥ 1, we say that a set of minimization problems { f } can be α-approximated if there is a polynomial time algorithm which, for any instance f returns an assignment of the variables x with f (x) ≤ αOPT( f ). Such an algorithm is called an α-approximation for { f }.

In other words, we may not be able to ﬁnd an optimal solution x∗ in polynomial time, but we can at least ﬁnd a solution x with objective cost at most α times the optimal cost. Clearly, we want α as close to 1 as possible, so that the cost of our polynomial-time-computable solutions will be not far from the optimum.
For pseudoboolean functions, the question of approximability is somewhat complicated by the fact that the optimal value may be negative. We cannot have $f(x) \leq \alpha\,\mathrm{OPT}(f)$ for α > 1 and OPT(f) < 0, since this would mean f(x) < OPT(f), which is impossible. So the above definition is meaningless for pseudoboolean functions which can be negative. To get around issues with negative objectives, we define $P^+$ to be the set of all strictly positive pseudoboolean functions, and $P^+_1$ to be all the strictly positive first-order pseudoboolean functions.⁴ That is:

$$\begin{aligned}
P^+ &= \{ f : \{0,1\}^n \to \mathbb{R} \mid f(x) > 0,\ \forall x \in \{0,1\}^n \} \\
P^+_1 &= \{ f : \{0,1\}^n \to \mathbb{R} \mid f(x) > 0,\ \forall x \in \{0,1\}^n,\ f \text{ is first order} \}
\end{aligned} \tag{2.22}$$

Even with this assurance that the function is always positive, we still cannot approximately optimize ﬁrst-order pseudoboolean functions.
Theorem 29. There is no α-approximation for $P^+_1$ unless P = NP.

Proof. We will give a reduction from graph coloring, since we have the following theorem from [69].
Theorem 30 (Lund, Yannakakis). There is an ε > 0 such that graph coloring cannot be $n^\varepsilon$-approximated, unless P = NP.

Recall that in the graph-coloring problem, we are given a graph, and we must assign colors to each node such that no neighboring nodes share a color. The minimization problem for graph-coloring is to ﬁnd the minimal number of colors for which there is a valid coloring.

Let G = (V, E) be the given graph. Let n = |V|. Any graph is n-colorable, since we can just give every node its own distinct color. So, colorings of G are functions $c : V \to \{1, \ldots, n\}$ with $c(i) \neq c(j)$ for {i, j} ∈ E. The graph-coloring problem is to minimize the number of used colors:

$$\min_c |c(V)| \tag{2.23}$$

We denote the number of colors used by a coloring c as χ(c), and the minimum possible number of colors as χ(G).

⁴ Finding an algorithm to optimize $P^+$ or $P^+_1$ is known as a promise problem, since we are given the additional promise that the function is always positive.

First, we'll write graph coloring as a higher-order pseudoboolean function. Graph coloring is a multi-label problem (each node can take a label from {1, . . . , n}), however we can make it a binary problem by introducing variables $x_{i,j}$ where $x_{i,j} = 1$ if node i takes color j, and 0 otherwise. The binary vector x defines a valid coloring if exactly one $x_{i,j} = 1$ for each i, and if whenever $i, i'$ are neighbors we never have $x_{i,j} = x_{i',j} = 1$. The corresponding coloring c is c(i) = j for the unique j with $x_{i,j} = 1$.

Then, the graph-coloring objective is a single higher-order term

$$f(x) = \begin{cases} \chi(c) & x \text{ gives a valid coloring } c \\ n + 1 & \text{otherwise} \end{cases} \tag{2.24}$$
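The binary encoding and the objective (2.24) can be sketched as follows. Nodes and colors are assumed to be numbered from 0 rather than 1, x is stored as a list of rows with `x[i][j] == 1` meaning node i takes color j, and the path graph used as an example is hypothetical.

```python
def decode_coloring(x, edges, n):
    """Return the coloring encoded by x, or None if x is not a valid coloring."""
    coloring = []
    for i in range(n):
        chosen = [j for j in range(n) if x[i][j] == 1]
        if len(chosen) != 1:          # each node needs exactly one color
            return None
        coloring.append(chosen[0])
    if any(coloring[i] == coloring[k] for i, k in edges):
        return None                   # two neighbors share a color
    return coloring

def coloring_objective(x, edges, n):
    """f(x) of (2.24): colors used by a valid coloring, n + 1 otherwise."""
    coloring = decode_coloring(x, edges, n)
    return n + 1 if coloring is None else len(set(coloring))

# Hypothetical path graph 0 - 1 - 2.
n = 3
edges = [(0, 1), (1, 2)]
valid_x = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]    # colors 0, 1, 0
invalid_x = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]  # nodes 0 and 1 clash

valid_cost = coloring_objective(valid_x, edges, n)
invalid_cost = coloring_objective(invalid_x, edges, n)
```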

Finally, we can reduce this higher-order pseudoboolean function to first order, using the results from later in this thesis (Chapter 4). In particular, there exists a function g (which we can compute in polynomial time) such that g is a function of x and some auxiliary variables y with $\min_y g(x, y) = f(x)$.

The minimum value of g is $\min_{x,y} g(x, y) = \min_x f(x) = \chi(G)$, the same as the minimum value of the original coloring problem. The function g is in $P^+_1$, so if there exists an α-approximation algorithm for this class, then it will compute a solution $(x', y')$ in polynomial time, with $g(x', y') \leq \alpha\chi(G)$.

Throwing away the auxiliary variables $y'$, we can consider the cost of $x'$ alone. We have $f(x') = \min_y g(x', y) \leq g(x', y') \leq \alpha\chi(G)$. It's possible that $x'$ is not a valid coloring, but in this case $f(x') = n + 1$, so we replace $x'$ with the vector corresponding to the coloring giving every node a distinct color (i.e., c(i) = i for i = 1, . . . , n). This can only reduce $f(x')$, so now we have a valid coloring c with $\chi(c) = f(x') \leq \alpha\chi(G)$, and thus we would have an α-approximation for graph coloring.

Therefore, an α-approximation algorithm for $P^+_1$ would give an α-approximation algorithm for graph coloring, which we know cannot happen unless P = NP.
2.3 Submodular Functions
Despite the hardness of optimizing general pseudoboolean functions, there is an important subclass where we can efﬁciently do exact minimization. These functions, called submodular, are the basis for generalizing the min-cut problem to higher-order MRFs.
2.3.1 Decreasing Marginal Gains
Submodular functions are usually deﬁned as set functions f : 2V → R. Since we have already noted the equivalence between set functions and pseudoboolean functions (Section 2.2.2), all of these deﬁnitions will translate to pseudoboolean functions as well.
Intuitively, submodular functions capture the notion of diminishing marginal gains – speciﬁcally, diminishing marginal gains over a set of discrete binary choices.

As an example,⁵ consider the two (non-exclusive) choices of whether or not to have cake, and whether or not to have cookies. Having either cake or cookies is certainly better than having nothing, and having both cake and cookies is better than having either alone. However, having both is perhaps a bit too much, and the added benefit of having “cookies with cake” over “just having cake” isn’t as large as that of having cookies over nothing. That is, the marginal gain of adding cookies has decreased. An example table of utilities is in Table 2.1.

              No Cake   Cake
No Cookies       0        5
Cookies          3        7

Table 2.1: Utilities for various dessert options. The marginal gain for adding cookies to nothing is 3 − 0 = 3, whereas the marginal gain of adding cookies to {Cake} is 7 − 5 = 2.

To formalize this notion, we have a ground set V of choices, of which we may pick any subset. Each subset is assigned a value, or objective, f (S ) ∈ R. The marginal beneﬁt of adding i ∈ V to a subset S ⊆ V is the difference in values f (S ∪ {i}) − f (S ). To simplify notation, we will write S ∪ {i} as S + i, so this difference is f (S + i) − f (S ). To say we have decreasing marginal gain means that if we take a larger set T (where larger means T ⊇ S ), then the marginal gain f (T + i) − f (T ) is smaller.

Definition 31. A function f : 2^V → R is submodular if for every S ⊆ T ⊆ V and i ∈ V \ T, we have

f(S + i) − f(S) ≥ f(T + i) − f(T)    (2.25)
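As a quick sanity check of Definition 31, the following Python sketch tests inequality (2.25) by brute force on the dessert utilities of Table 2.1. The helper names (`subsets`, `is_submodular`) and the `synergy` counterexample are illustrative assumptions, not part of the thesis, and the enumeration is exponential in |V|, so it is only meant for tiny ground sets.

```python
from itertools import combinations

def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = list(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(f, ground):
    """Brute-force test of (2.25): for all S ⊆ T ⊆ V and i not in T,
    f(S + i) - f(S) >= f(T + i) - f(T)."""
    ground = frozenset(ground)
    for T in subsets(ground):
        for S in subsets(T):
            for i in ground - T:
                if f(S | {i}) - f(S) < f(T | {i}) - f(T):
                    return False
    return True

# Utilities from Table 2.1 (the set function is stored as a dict).
utility = {
    frozenset(): 0,
    frozenset({"cake"}): 5,
    frozenset({"cookies"}): 3,
    frozenset({"cake", "cookies"}): 7,
}
dessert_ok = is_submodular(utility.__getitem__, {"cake", "cookies"})

# A made-up counterexample: the two items are worth more together than apart.
synergy = dict(utility)
synergy[frozenset({"cake", "cookies"})] = 10
synergy_ok = is_submodular(synergy.__getitem__, {"cake", "cookies"})
```

Here `dessert_ok` is true because both marginal gains shrink once the other dessert is present, while `synergy_ok` is false because adding cookies to {Cake} would gain 5, more than the gain of 3 from adding cookies to nothing.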

⁵ Admittedly whimsical.

2.3.2 Equivalent Deﬁnitions of Submodularity

There are many different properties equivalent to Definition 31, each of which is useful in different situations.
Theorem 32. The following are equivalent:

1. f is submodular.
2. For every S, T ⊆ V we have
   f(S ∪ T) + f(S ∩ T) ≤ f(S) + f(T)    (2.26)
3. For every S ⊆ V and i, j ∈ V \ S with i ≠ j we have
   f(S) + f(S + i + j) ≤ f(S + i) + f(S + j)    (2.27)

Frequently, the second condition above is taken as the primary definition; however, since they are equivalent, we could have taken any of them as the definition of submodularity.
The third condition is notable for using many fewer constraints than the others (only O(n^2 2^n), as opposed to O(2^{2n}) for (1) and (2)).
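To make the difference in constraint counts concrete, the sketch below checks conditions (2) and (3) by brute force and counts how many inequalities each one evaluates. The function names and the two test functions (`concave`, `convex`) are assumptions chosen for illustration; any set function on a small ground set could be substituted.

```python
from itertools import combinations

def subsets(ground):
    """All subsets of `ground`, as a list of frozensets."""
    items = sorted(ground)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def check_lattice(f, ground):
    """Condition (2): f(S ∪ T) + f(S ∩ T) <= f(S) + f(T) for all S, T."""
    all_sets = subsets(ground)
    checks = 0
    for S in all_sets:
        for T in all_sets:
            checks += 1
            if f(S | T) + f(S & T) > f(S) + f(T):
                return False, checks
    return True, checks

def check_pairwise(f, ground):
    """Condition (3): f(S) + f(S+i+j) <= f(S+i) + f(S+j) for i != j outside S."""
    ground = frozenset(ground)
    checks = 0
    for S in subsets(ground):
        for i, j in combinations(sorted(ground - S), 2):
            checks += 1
            if f(S) + f(S | {i, j}) > f(S | {i}) + f(S | {j}):
                return False, checks
    return True, checks

V = range(4)
concave = lambda S: min(len(S), 2)   # concave in |S|, hence submodular
convex = lambda S: len(S) ** 2       # strictly convex in |S|, not submodular

lattice_result = check_lattice(concave, V)
pairwise_result = check_pairwise(concave, V)
convex_results = (check_lattice(convex, V)[0], check_pairwise(convex, V)[0])
```

On this four-element ground set, condition (2) evaluates 256 inequalities for the submodular function while condition (3) needs only 24, and both conditions reject the convex function.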

Proof. For (1) ⇒ (3), note that S ⊆ S + j, so from (2.25) we have

f(S + i) − f(S) ≥ f(S + i + j) − f(S + j)    (2.28)

which can be rearranged to get (2.27) for every S ⊆ V and i, j ∉ S.

For (3) ⇒ (2), let S, T be any subsets of V. If S ⊆ T then S ∩ T = S and S ∪ T = T, so the inequality (2.26) trivially holds. The same is true if T ⊆ S, so we will now consider the case where S ⊈ T and T ⊈ S.

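Enumerate the elements of T \ S as T \ S = {i_1, . . . , i_{k_1}} and the elements of S \ T as S \ T = {j_1, . . . , j_{k_2}}. Starting from the common part S ∩ T, consider the following sum, over all pairs i_k, j_{k′}:

\[
\begin{aligned}
\sum_{k=1}^{k_1} \sum_{k'=1}^{k_2} \Big[ \; & f\big((S \cap T) + i_1 + \cdots + i_k + j_1 + \cdots + j_{k'}\big) - f\big((S \cap T) + i_1 + \cdots + i_{k-1} + j_1 + \cdots + j_{k'}\big) \\
& - f\big((S \cap T) + i_1 + \cdots + i_k + j_1 + \cdots + j_{k'-1}\big) + f\big((S \cap T) + i_1 + \cdots + i_{k-1} + j_1 + \cdots + j_{k'-1}\big) \Big]
\end{aligned}
\tag{2.29}
\]

Simplify the above by letting S_{k,k′} = (S ∩ T) + i_1 + · · · + i_{k−1} + j_1 + · · · + j_{k′−1}, and we get that (2.29) is

\[
\sum_{k=1}^{k_1} \sum_{k'=1}^{k_2} \Big[ f(S_{k,k'} + i_k + j_{k'}) - f(S_{k,k'} + j_{k'}) - f(S_{k,k'} + i_k) + f(S_{k,k'}) \Big]
\tag{2.30}
\]

Each term of this sum is a rearrangement of (2.27), applied to S_{k,k′} (note that i_k and j_{k′} are distinct and neither lies in S_{k,k′}), so the whole sum is ≤ 0. We can split up this sum into 4 separate summations, using S_{k,k′} + i_k + j_{k′} = S_{k+1,k′+1}, S_{k,k′} + i_k = S_{k+1,k′} and S_{k,k′} + j_{k′} = S_{k,k′+1}. Reindexing, we get that (2.30) is equal to

\[
\begin{aligned}
& \sum_{k=1}^{k_1} \sum_{k'=1}^{k_2} f(S_{k+1,k'+1}) - \sum_{k=1}^{k_1} \sum_{k'=1}^{k_2} f(S_{k+1,k'}) - \sum_{k=1}^{k_1} \sum_{k'=1}^{k_2} f(S_{k,k'+1}) + \sum_{k=1}^{k_1} \sum_{k'=1}^{k_2} f(S_{k,k'}) \\
&= \sum_{k=2}^{k_1+1} \sum_{k'=2}^{k_2+1} f(S_{k,k'}) - \sum_{k=2}^{k_1+1} \sum_{k'=1}^{k_2} f(S_{k,k'}) - \sum_{k=1}^{k_1} \sum_{k'=2}^{k_2+1} f(S_{k,k'}) + \sum_{k=1}^{k_1} \sum_{k'=1}^{k_2} f(S_{k,k'}) \\
&= f(S_{k_1+1,k_2+1}) - f(S_{k_1+1,1}) - f(S_{1,k_2+1}) + f(S_{1,1}) \\
&= f(S \cup T) - f(T) - f(S) + f(S \cap T)
\end{aligned}
\tag{2.31}
\]

The third line holds because every other term appears twice with a plus sign and twice with a minus sign, and the last line uses S_{k_1+1,k_2+1} = S ∪ T, S_{k_1+1,1} = T, S_{1,k_2+1} = S and S_{1,1} = S ∩ T.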

Therefore, f (S ∪ T ) + f (S ∩ T ) ≤ f (S ) + f (T ) for all S , T ⊆ V, so we have that (3) ⇒ (2).

Finally, for (2) ⇒ (1), assume that (2.26) holds, and take any S ⊆ T ⊆ V and i ∈ V \ T. Then (S + i) ∪ T = T + i and (S + i) ∩ T = S, so we have

f(T + i) + f(S) = f((S + i) ∪ T) + f((S + i) ∩ T) ≤ f(S + i) + f(T)    (2.32)

Rearranging, we get that (2.25) holds, so f is submodular.

2.3.3 Properties of Submodular Functions

Submodular functions are notable for sharing many properties with convex functions, and consequently they ﬁll a similar role in discrete optimization problems as convex functions do for continuous optimization. See [68] for an excellent summary of the connections between convex and submodular functions.
The basic calculus of submodular functions is that they are closed under addition and multiplication by positive constants (but not subtraction or multiplication by negative constants).
Lemma 33. Submodular functions are closed under positive linear combinations. That is, if f1, . . . , fk are submodular and a1, . . . , ak ∈ R are nonnegative, then a1 f1 + · · · + ak fk is submodular.

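Proof. Since each f_i is submodular and a_i ≥ 0, we have a_i f_i(S ∩ T) + a_i f_i(S ∪ T) ≤ a_i f_i(S) + a_i f_i(T) for every S, T ⊆ V. Summing these inequalities over i, we get

\[
\sum_i a_i f_i(S \cap T) + \sum_i a_i f_i(S \cup T) \le \sum_i a_i f_i(S) + \sum_i a_i f_i(T)
\tag{2.33}
\]

which is exactly condition (2.26) for the function a_1 f_1 + · · · + a_k f_k.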

A powerful tool in the analysis of convex functions is the subgradient: a linear function ℓ(x) which is also a lower bound, ℓ(x) ≤ f(x). Submodular functions similarly have linear lower bounds, called subbases.
For binary labels, linear functions can be defined by a vector ψ ∈ R^V. For each such ψ, we get a linear function (also called ψ by abuse of notation) with ψ(S) = Σ_{i∈S} ψ_i.

Deﬁnition 34. For a submodular function f , a subbase is a linear function ψ with f (S ) ≥ ψ(S ) for all S ⊆ V. A base of f is a subbase with ψ(V) = f (V).

There is a simple algorithm to compute a base for any submodular function f, as long as f(∅) ≥ 0.⁶ In fact, we can greedily construct this vector ψ (this is known as Edmonds’ algorithm [16]). We let ψ_1 = f({1}) and for i = 2, . . . , n we set ψ_i = f({1, . . . , i}) − f({1, . . . , i − 1}).
Lemma 35. The vector ψ deﬁned above satisﬁes f (S ) ≥ ψ(S ) for all S ⊆ V and f (V) = ψ(V ).

Proof. We prove the lemma by induction on the size of S. For S = ∅ we have f(∅) ≥ 0 = ψ(∅).

For S ≠ ∅ let i be the largest element of S. If i = 1 then S = {1} and f(S) = ψ_1 = ψ(S). Otherwise, since f is submodular, it has decreasing marginal gains, and S − i ⊆ {1, . . . , i − 1}, so

\[
\begin{aligned}
f(S) &= f(S) - f(S - i) + f(S - i) \\
&\ge f(\{1, \ldots, i-1\} + i) - f(\{1, \ldots, i-1\}) + f(S - i) = \psi_i + f(S - i) \\
&\ge \psi_i + \sum_{i' \in S - i} \psi_{i'} = \psi(S)
\end{aligned}
\tag{2.34}
\]

where the last inequality is the induction hypothesis applied to S − i. For the second part of the claim, the sum telescopes: ψ(V) = f({1}) + Σ_{i=2}^{n} [f({1, . . . , i}) − f({1, . . . , i − 1})] = f(V).
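The greedy construction above is short enough to run directly. The sketch below computes ψ for a small example and checks both claims of Lemma 35 by enumeration. The choice of the path-graph cut function as f is an assumption made for illustration (cut functions are submodular with f(∅) = 0), as are the names `greedy_base` and `cut_value`.

```python
from itertools import combinations

def greedy_base(f, n):
    """psi_1 = f({1}) and psi_i = f({1..i}) - f({1..i-1}) for i >= 2."""
    psi = {1: f(frozenset({1}))}
    for i in range(2, n + 1):
        psi[i] = f(frozenset(range(1, i + 1))) - f(frozenset(range(1, i)))
    return psi

def cut_value(S):
    """Cut function of the path 1 - 2 - 3 - 4 with unit edge weights."""
    edges = [(1, 2), (2, 3), (3, 4)]
    return sum(1 for u, v in edges if (u in S) != (v in S))

n = 4
psi = greedy_base(cut_value, n)

# Enumerate all subsets and compare f(S) with psi(S) = sum of psi_i over S.
all_sets = [frozenset(c) for r in range(n + 1)
            for c in combinations(range(1, n + 1), r)]
is_lower_bound = all(cut_value(S) >= sum(psi[i] for i in S) for S in all_sets)
is_tight_on_V = sum(psi.values()) == cut_value(frozenset(range(1, n + 1)))
```

For this example the greedy vector is ψ = (1, 0, 0, −1): it lies below the cut function on every subset and sums to f(V) = 0, so it is a base.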

⁶ Note that if f(∅) < 0 then there are no subbases of f at all, since ψ(∅) = 0 by definition, and we require f(∅) ≥ ψ(∅).

Submodular functions have additional structure beyond their similarity to convex functions. In particular, the set of all minimizers of a submodular function is closed under intersections and unions. This follows easily from one of the equivalent conditions for submodularity, Theorem 32. If S∗ and T∗ are both minimizers, then

f(S∗ ∩ T∗) + f(S∗ ∪ T∗) ≤ f(S∗) + f(T∗)    (2.35)

and since S∗ and T∗ are minimizers, neither term on the left can be smaller than f(S∗) = f(T∗), so we must have f(S∗ ∩ T∗) = f(S∗ ∪ T∗) = f(S∗).

In fact, we can generalize this to the sets where a submodular function is equal to a given subbase.
Deﬁnition 36. If ψ is a subbase of a submodular function f , the tight sets T ( f, ψ) are all S for which f (S ) = ψ(S ). That is, T ( f, ψ) = {S ⊆ V | f (S ) = ψ(S )}.
Lemma 37. For a submodular function f and subbase ψ, T ( f, ψ) is a lattice, meaning it is closed under intersection and union.

Proof. Let S , T ∈ T ( f, ψ). So, in particular we have f (S ) = ψ(S ) and f (T ) = ψ(T ). We want to show that S ∩ T and S ∪ T are in T ( f, ψ), so we want f (S ∩ T ) = ψ(S ∩ T ) and f (S ∪ T ) = ψ(S ∪ T ). Since f ≥ ψ we have f (S ∩ T ) ≥ ψ(S ∩ T ) and f (S ∪ T ) ≥ ψ(S ∪ T ). Now, since f is submodular, we have
ψ(S ∩ T ) + ψ(S ∪ T ) ≤ f (S ∩ T ) + f (S ∪ T ) ≤ f (S ) + f (T ) = ψ(S ) + ψ(T ) (2.36)

Because ψ is linear, we have ψ(S) + ψ(T) = ψ(S ∩ T) + ψ(S ∪ T), by inclusion-exclusion. Therefore, the inequalities in (2.36) are all equalities, and we have

f (S ∪ T ) + f (S ∩ T ) = ψ(S ∪ T ) + ψ(S ∩ T )

(2.37)

Finally, since f (S ∪ T ) ≥ ψ(S ∪ T ) and f (S ∩ T ) ≥ ψ(S ∩ T ), we have that f (S ∪ T ) = ψ(S ∪ T ) and f (S ∩ T ) = ψ(S ∩ T ).


A particularly useful special case of the above lemma is when f is nonnegative, meaning f (S ) ≥ 0 for all S ⊆ V. In this case, 0 is a sub-base of f , and the tight sets T ( f, 0) are the zero-valued sets, Z( f ) = {S ⊆ V | f (S ) = 0}.
Corollary 38. If f is a non-negative submodular function, then the zero sets Z( f ) form a lattice.
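Corollary 38 is easy to observe on a small example. The sketch below uses the cut function of a graph with two connected components, an assumed example that is non-negative and submodular, and checks that its zero sets are closed under intersection and union. The graph and all names are illustrative.

```python
from itertools import combinations

# Assumed example graph: two components, {0, 1} and {2, 3}.
EDGES = [(0, 1), (2, 3)]
V = range(4)

def cut_value(S):
    """Number of edges with exactly one endpoint in S."""
    return sum(1 for u, v in EDGES if (u in S) != (v in S))

all_sets = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]
zero_sets = {S for S in all_sets if cut_value(S) == 0}

# Lattice property: intersections and unions of zero sets are zero sets.
closed = all((S & T) in zero_sets and (S | T) in zero_sets
             for S in zero_sets for T in zero_sets)
```

The zero sets here are exactly the unions of connected components (∅, {0, 1}, {2, 3} and V), which visibly form a lattice.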

2.3.4 Submodular First-order Pseudoboolean Functions

For checking if an arbitrary set function is submodular, the above definitions all have at least O(n^2 2^n) inequalities to verify. There is some redundancy between these inequalities, but determining whether a function is submodular is NP-hard in general [104]. For first-order pseudoboolean functions, however, submodularity is particularly easy to verify: all we need to check is that each of the pairwise coefficients (in the polynomial representation) is non-positive.

Lemma 39. A first-order pseudoboolean function, given as a multilinear polynomial

\[
f(x) = a_\emptyset + \sum_i a_i x_i + \sum_{i,j} a_{i,j} x_i x_j
\tag{2.38}
\]

is submodular if and only if a_{i,j} ≤ 0 for all i, j ∈ V.

Proof. Throughout this proof, we will treat f as a set function, with f (S ) := f (x(S )), where x(S ) is the corresponding binary vector for the set S .

We use the explicit representation for the function values f(S) given by Prop 25:

\[
f(S) = \sum_{H \subseteq S} a_H.
\tag{2.39}
\]

We can relate this to one of our conditions for submodularity (Theorem 32) by expanding out f(S) + f(S + i + j) − f(S + i) − f(S + j) ≤ 0:

\begin{align}
0 &\ge f(S) + f(S + i + j) - f(S + i) - f(S + j) \tag{2.40} \\
&= \sum_{H \subseteq S} a_H + \sum_{H \subseteq S+i+j} a_H - \sum_{H \subseteq S+i} a_H - \sum_{H \subseteq S+j} a_H \tag{2.41} \\
&= \Big( \sum_{H \subseteq S+i+j} a_H - \sum_{H \subseteq S+j} a_H \Big) - \Big( \sum_{H \subseteq S+i} a_H - \sum_{H \subseteq S} a_H \Big) \tag{2.42}
\end{align}

To complete the proof, we’ll apply a simple proposition simplifying these sums over subsets:
Proposition 40. For any set of coefficients a_H defined for subsets H ⊆ V, and any S ⊆ V and i ∉ S, we have

\[
\sum_{H \subseteq S+i} a_H - \sum_{H \subseteq S} a_H = \sum_{H \subseteq S} a_{H+i}
\tag{2.43}
\]

To see this, note that 2^{S+i} \ 2^S = {H + i | H ∈ 2^S}. Using this fact, we have

\begin{align}
& f(S) + f(S + i + j) - f(S + i) - f(S + j) \tag{2.44} \\
&= \sum_{H \subseteq S+j} a_{H+i} - \sum_{H \subseteq S} a_{H+i} \tag{2.45} \\
&= \sum_{H \subseteq S} a_{H+i+j} = a_{i,j} \tag{2.46}
\end{align}

where the last line follows from a_H = 0 for |H| > 2 (because f is first-order). Therefore, we have that f(S) + f(S + i + j) − f(S + i) − f(S + j) ≤ 0 if and only if a_{i,j} ≤ 0, hence f is submodular if and only if a_{i,j} ≤ 0 for all i, j.
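The sign test of Lemma 39 can be compared against a brute-force check of condition (2.27). The following sketch does this for two hypothetical coefficient sets on three variables. The dictionary layout (`const`, `unary`, `pairwise`) and all function names are assumptions made for this example.

```python
from itertools import combinations, product

def evaluate(const, unary, pairwise, x):
    """f(x) = a_0 + sum_i a_i x_i + sum_{i<j} a_ij x_i x_j for binary x."""
    value = const + sum(a * x[i] for i, a in unary.items())
    return value + sum(a * x[i] * x[j] for (i, j), a in pairwise.items())

def submodular_by_signs(pairwise):
    """Lemma 39: every pairwise coefficient must be non-positive."""
    return all(a <= 0 for a in pairwise.values())

def submodular_by_brute_force(const, unary, pairwise, n):
    """Check (2.27) for every S (encoded as x) and every pair i != j outside S."""
    f = lambda z: evaluate(const, unary, pairwise, z)
    for x in product([0, 1], repeat=n):
        free = [i for i in range(n) if x[i] == 0]
        for i, j in combinations(free, 2):
            xi = tuple(1 if k == i else x[k] for k in range(n))
            xj = tuple(1 if k == j else x[k] for k in range(n))
            xij = tuple(1 if k in (i, j) else x[k] for k in range(n))
            if f(x) + f(xij) > f(xi) + f(xj):
                return False
    return True

# Hypothetical coefficients; `bad` differs only in one positive pairwise term.
good = dict(const=1, unary={0: 2, 1: -1, 2: 3},
            pairwise={(0, 1): -2, (1, 2): 0, (0, 2): -1})
bad = dict(const=1, unary={0: 2, 1: -1, 2: 3},
           pairwise={(0, 1): -2, (1, 2): 1, (0, 2): -1})

good_agree = (submodular_by_signs(good["pairwise"]),
              submodular_by_brute_force(n=3, **good))
bad_agree = (submodular_by_signs(bad["pairwise"]),
             submodular_by_brute_force(n=3, **bad))
```

Both tests accept `good` and reject `bad`, and the unary and constant coefficients play no role in either outcome, as the proof predicts.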

Another way of recognizing ﬁrst order submodular functions is if each individual pairwise term is submodular. First, note that we have that fi, j(xi, x j) is submodular if and only if a single inequality holds:

fi, j(0, 0) + fi, j(1, 1) ≤ fi, j(1, 0) + fi, j(0, 1)

(2.47)

Then, since sums of submodular functions are submodular (Lemma 33), and unary terms are always submodular, we have

Lemma 41. A first-order pseudoboolean function

\[
f(x) = \sum_i f_i(x_i) + \sum_{i,j} f_{i,j}(x_i, x_j)
\tag{2.48}
\]

is submodular if for each i, j we have f_{i,j} is submodular (i.e., it satisfies (2.47)).

Note that this is a sufﬁcient but not a necessary condition.

2.4 Local and Marginal Polytopes for MRFs
We will now begin to consider multilabel problems, i.e., those with more than just two labels. The next several sections are building to a key theoretical tool called the Local Marginal Polytope, which is a linear programming formulation of the MRF inference problem. In particular, we want to introduce this linear programming relaxation, as well as the major tools for dealing with linear programs, including duality and complementary slackness.
A major difficulty in optimizing MRFs is that they are discrete problems: there is a combinatorial space ∏_i X_i of possible states for x, and as we have seen, it is difficult to make global statements about the function f (since efficiently finding either the minimum or any constant-factor approximation to it would imply P = NP).
Contrast this with optimizing a convex, continuous function f : R^n → R. A major feature of such functions is that every local minimum is also a global minimum, so simple algorithms (including gradient descent) that converge to a local minimum also reach a global optimum. In this section we will show how MRF optimization can be cast as a particular kind of convex minimization problem called Linear Programming.

2.4.1 Weighted Averages as Linear Programs

To motivate Linear Programming, we will consider a simple example: finding the minimum element over a finite set {a_1, . . . , a_k}. The brute-force approach is of course to simply examine each element in turn, and remember the smallest. If we want to turn this into a continuous minimization problem instead, one way we can do this is to consider weighted averages of the elements a_i. That is, we have non-negative weights µ_i for each a_i with the weights summing to 1 (i.e., Σ_i µ_i = 1), where the resulting weighted average, a_µ, is equal to the sum

\[
a_\mu = \sum_i \mu_i a_i.
\tag{2.49}
\]

We know that the weighted average has to be between the minimum and maximum elements, so a_µ ≥ min_i a_i. And, if we put all the weight on the minimizing index i∗ (i.e., µ∗_{i∗} = 1 and µ∗_j = 0 for j ≠ i∗), then we have a_{µ∗} = a_{i∗} = min_i a_i.
In other words, the minimum over all weighted averages of the a_i is exactly the minimum element:

\[
\min_{\mu \,:\, \mu \ge 0,\ \sum_i \mu_i = 1} \ \sum_i \mu_i a_i = \min_i a_i
\tag{2.50}
\]

This is a continuous optimization problem with a linear objective Σ_i µ_i a_i and linear constraints Σ_i µ_i = 1 and µ_i ≥ 0 for i = 1, . . . , k, which makes (2.50) an example of a linear program.
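The two halves of the argument behind (2.50) can be observed numerically. The sketch below samples random feasible weight vectors and confirms that none of them beats the minimum element, then evaluates the vertex that puts all weight on the minimizer. The data values, the sampling scheme and the helper names are arbitrary assumptions, and random sampling only illustrates the bound rather than proving it.

```python
import random

def weighted_average(weights, values):
    """The objective of (2.50): sum of mu_i * a_i."""
    return sum(w * a for w, a in zip(weights, values))

def random_simplex_point(k, rng):
    """A random vector of k non-negative weights summing to 1."""
    raw = [rng.random() for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]

values = [4.0, 1.5, 7.0, 3.0]   # arbitrary example data
rng = random.Random(0)

# Every sampled feasible mu gives an average no smaller than the minimum...
samples = [weighted_average(random_simplex_point(len(values), rng), values)
           for _ in range(1000)]
never_below_min = all(s >= min(values) - 1e-12 for s in samples)

# ...and the vertex that puts all weight on the minimizing index attains it.
i_star = values.index(min(values))
mu_star = [1.0 if i == i_star else 0.0 for i in range(len(values))]
attained = weighted_average(mu_star, values)
```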


2.4.2 Marginal polytopes

Recall that an MRF is called discrete if the state space X is ﬁnite. Equivalently, each variable xi has a ﬁnite label set |Xi| < ∞.
Let f(x) = Σ_C f_C(x_C) be a discrete MRF, with X = ∏_i X_i. For simplicity, we’ll assume that |X_i| = ℓ for all i, although nothing in the following breaks if the label sets have different sizes.

Given the discussion in the previous section, we can convert this discrete, combinatorial optimization problem to a continuous linear program by minimizing the weighted average of all solutions. We have a weight for every state x ∈ X, which we will write as µ(x), and get a minimization problem

\[
\begin{aligned}
\min_{\mu} \quad & \sum_x \mu(x) f(x) \\
\text{s.t.} \quad & \sum_x \mu(x) = 1 \\
& \mu \ge 0
\end{aligned}
\tag{2.51}
\]

Note that in the above, x is no longer a free variable: we are instead summing the weights µ(x) over all possible states x. In fact, if we treat the weights as a probability distribution over the states x, then the objective Σ_x µ(x) f(x) is exactly the expectation of f(x) when each state is chosen with probability µ(x). This explains our notation of µ(·) as a probability mass function on X.

Thinking in terms of probability distributions gives us a pre-existing toolbox with which to reason about our problem. For example, it’s clear that the expectation of f under the probability distribution µ is bounded between the minimum and maximum values of f(x), and that if we’re optimizing over all probability distributions, the best we can do is to choose the minimizing x∗ with probability 1 (i.e., set µ(x∗) = 1 and µ(x) = 0 for x ≠ x∗).

For any discrete set X, let P[X] be the set of all probability distributions on X. That is, P[X] = {µ ∈ R^X : µ ≥ 0, Σ_x µ(x) = 1}. We will use ⟨µ, f⟩ to denote the expectation of f with respect to µ, so that ⟨µ, f⟩ = Σ_x µ(x) f(x). With this notation, we get a very simple version of (2.51):

\[
\min_{\mu \in P[X]} \langle \mu, f \rangle
\tag{2.52}
\]

A major drawback in optimizing over all probability distributions in P[X] is that we have exploded the number of variables from |V| to ℓ^{|V|}. As a result, this linear program is much too large to solve efficiently even for small MRFs.

However, we have not yet used the clique structure of the MRF f. We can use the clique structure to get a much smaller number of variables: our eventual goal is to specify a probability distribution µ_C ∈ P[X_C] just for each clique C separately. For a clique of size k, this requires only ℓ^k variables, which is much smaller than the ℓ^n variables in P[X] (assuming, per Remark 16, that k << n). Our first step towards this goal is to use linearity of expectation to expand out f in the objective of (2.52):

\[
\langle \mu, f \rangle = \Big\langle \mu, \sum_C f_C \Big\rangle = \sum_C \langle \mu, f_C \rangle
\tag{2.53}
\]

Each of the terms ⟨µ, f_C⟩ is an expectation of a function f_C which only depends on a few variables, namely the subset of variables in C. Therefore, we only care about the probability that a clique labeling x_C is chosen. That is, we are interested in the probability that x restricted to C is x_C. We can compute this by summing µ over all assignments of the remaining variables, x_{V\C}. To explain the notation, in the sum we get the combined vector x = (x_C, x_{V\C}). We will write µ|_C for the marginal distribution of µ on the subset of variables in C. This marginal probability is

\[
\mu|_C(x_C) = \sum_{x_{V \setminus C}} \mu(x)
\tag{2.54}
\]

A useful fact about marginalization is that it preserves expectations, as long as we marginalize onto the set of variables that a function depends on.

Proposition 42. If µ ∈ P[X] and f_C is a clique function, f_C : X_C → R, then ⟨µ, f_C⟩ = ⟨µ|_C, f_C⟩.

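Proof. This equality is just a re-grouping of the sums in the definition of expectation:

\[
\begin{aligned}
\langle \mu, f_C \rangle &= \sum_x \mu(x) f_C(x_C) \\
&= \sum_{x_C} \sum_{x_{V \setminus C}} \mu(x) f_C(x_C) \\
&= \sum_{x_C} \mu|_C(x_C) f_C(x_C) = \langle \mu|_C, f_C \rangle
\end{aligned}
\tag{2.55}
\]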
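Proposition 42 can be checked exactly on a small instance. The sketch below builds an arbitrary joint distribution over three ternary variables, marginalizes it onto a two-variable clique, and compares the two expectations using exact rational arithmetic. The distribution, the clique and the clique function `f_C` are all made up for illustration.

```python
from fractions import Fraction
from itertools import product

labels = [0, 1, 2]
n = 3
states = list(product(labels, repeat=n))

# An arbitrary joint distribution mu over all 27 states (weights 1..27, normalized).
total = sum(range(1, len(states) + 1))
mu = {x: Fraction(w, total) for w, x in enumerate(states, start=1)}

C = (0, 2)                               # a clique on the first and third variables
f_C = lambda xC: (xC[0] - xC[1]) ** 2    # illustrative clique function

def marginal(mu, C):
    """mu|_C(x_C): sum mu(x) over all assignments to the variables outside C."""
    out = {}
    for x, p in mu.items():
        xC = tuple(x[i] for i in C)
        out[xC] = out.get(xC, 0) + p
    return out

mu_C = marginal(mu, C)
expect_joint = sum(p * f_C(tuple(x[i] for i in C)) for x, p in mu.items())
expect_marginal = sum(p * f_C(xC) for xC, p in mu_C.items())
```

The marginal has only 9 entries instead of 27, yet it yields the same expectation of `f_C`, which is what makes the clique-wise reformulation possible.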

This lets us re-write the expectation over µ in (2.52) as a sum over expectations on each clique:

\[
\min_{\mu \in P[X]} \sum_C \langle \mu|_C, f_C \rangle
\tag{2.56}
\]

This particular linear program is known as the Marginal Polytope, because the variables are marginal probabilities of the joint distribution µ. This problem is still equivalent to our original MRF optimization problem, but it also still has an exponentially large set of variables. We can fix the latter problem, but to do so, we must move to a relaxation of our original problem. Instead of ensuring we have a global probability distribution µ, we will instead have a separate probability distribution µ_C on each clique C, and these distributions are required to only locally agree, meaning that if C and C′ share a variable i, then they agree on the marginal distribution on that single variable: µ_C|_i = µ_{C′}|_i. We’ll denote this single distribution for x_i by µ_i, which we constrain to be equal to µ_C|_i for all C containing i.

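\[
\begin{aligned}
\min_{\{\mu_C \in P[X_C]\}} \quad & \sum_C \langle \mu_C, f_C \rangle \\
\text{s.t.} \quad & \mu_C|_i = \mu_i \quad \forall C,\ i \in C
\end{aligned}
\tag{2.57}
\]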

This linear program is known as the Local Marginal Polytope, as we have replaced marginalization from a global joint probability distribution µ with local constraints on the consistency of the distributions µC.

This linear program is a great simpliﬁcation compared to the full marginal polytope, and still provides a global lower-bound on the optimum of the original (integral) problem. However, we have moved to a relaxation, so it is possible that the optimal linear programming solution may not have any corresponding integral solution with as-good a value. In particular, whenever the clique functions are non-submodular, or whenever there are cycles in the graph, then the local marginal polytope may not be a tight relaxation.
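A standard small example of a non-tight relaxation is a cycle of three binary variables in which every pair is penalized for agreeing. The sketch below is an assumed example, not taken from the thesis: it computes the integral optimum by brute force and exhibits a locally consistent fractional point for (2.57) with strictly smaller cost.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)
nodes = [0, 1, 2]
cliques = [(0, 1), (1, 2), (0, 2)]

def f_pair(a, b):
    """Cost 1 when the two labels agree, 0 otherwise (not submodular)."""
    return 1 if a == b else 0

# Integral optimum by brute force over the 8 binary labelings.
integral_min = min(sum(f_pair(x[i], x[j]) for i, j in cliques)
                   for x in product([0, 1], repeat=3))

# Candidate fractional point: each edge splits its mass over the two
# disagreeing labelings, and each node is uniform.
mu_node = {i: {0: half, 1: half} for i in nodes}
mu_edge = {c: {(0, 1): half, (1, 0): half, (0, 0): 0, (1, 1): 0} for c in cliques}

def locally_consistent(mu_node, mu_edge):
    """Check that every edge distribution marginalizes to its node distributions."""
    for (i, j), table in mu_edge.items():
        for a in (0, 1):
            if sum(p for (xi, xj), p in table.items() if xi == a) != mu_node[i][a]:
                return False
            if sum(p for (xi, xj), p in table.items() if xj == a) != mu_node[j][a]:
                return False
    return True

feasible = locally_consistent(mu_node, mu_edge)
lp_value = sum(p * f_pair(a, b) for t in mu_edge.values() for (a, b), p in t.items())
```

Any binary labeling of a triangle must make at least one pair agree, so the integral minimum is 1, while the fractional point is feasible with cost 0. No single joint distribution has these three edge marginals, which is exactly the gap between the local and the full marginal polytope.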

2.5 Linear Programming
The marginal polytopes above are particular examples of a general class of problems called Linear Programs (LPs). A Linear Program is a constrained optimization problem, with variables x_i ∈ R, a linear objective, and linear constraints. For now, we will only consider the case where we have finitely many variables and constraints, although generalizations to infinitely many variables are possible.⁷

The constraints in a Linear Program may be equality constraints or inequality constraints (either greater-than-or-equal or less-than-or-equal), and additionally, we may require that some of the variables are either non-negative or non-positive. As a result, without some unifying notation (which we will present shortly) there are many cases to consider to write down a “general LP”. As an example, the following LP with 3 variables and 3 constraints has each of these possibilities:

\[
\begin{aligned}
\min_{x_1, x_2, x_3} \quad & 3x_1 - 2x_2 + 7x_3 \\
\text{s.t.} \quad & x_1 - x_2 = 3 \\
& 2x_2 + 3x_3 \ge 2 \\
& -3x_1 + x_3 \le 4 \\
& x_1 \ge 0 \\
& x_2 \le 0 \\
& x_3 \in \mathbb{R}
\end{aligned}
\tag{2.58}
\]

The objective is linear in the 3 variables, as are each of the three constraints, with one each of an =, ≥ and ≤ constraint. Additionally, x1 and x2 are respectively required to be non-negative and non-positive, while x3 may be positive or negative.

⁷ In particular, Linear Programming relaxations for continuous MRFs (where the label space is a continuous interval [a, b] ⊆ R) are infinite dimensional. See the author’s paper [20] for an example of how to generalize the marginal polytope to continuous MRFs.

In general, an LP has m1 greater-than constraints, m2 less-than constraints and m3 equality constraints (with m1 + m2 + m3 = m), as well as n1 non-negative variables, n2 non-positive variables, and n3 variables which can be any real number (with n1 + n2 + n3 = n). Partitioning the variable indices by I1 = {1, . . . , n1}, I2 = {n1 + 1, . . . , n1 + n2} and I3 = {n1 + n2 + 1, . . . , n} (and similarly partitioning the constraint indices into J1, J2, J3) we get the general form of an LP:

\[
\begin{aligned}
\min_{x} \quad & \sum_{i=1}^{n} c_i x_i \\
\text{s.t.} \quad & \sum_{i=1}^{n} a_{j,i} x_i \ge b_j && \forall j \in J_1 \\
& \sum_{i=1}^{n} a_{j,i} x_i \le b_j && \forall j \in J_2 \\
& \sum_{i=1}^{n} a_{j,i} x_i = b_j && \forall j \in J_3 \\
& x_i \ge 0 && \forall i \in I_1 \\
& x_i \le 0 && \forall i \in I_2
\end{aligned}
\tag{2.59}
\]

We can more easily organize these linear functions by writing them as dot-products and matrix-vector products: let b = (b_1, . . . , b_m) and c = (c_1, . . . , c_n) be vectors, and A be the m × n matrix with entries a_{j,i}. To handle the different types of constraints, let A_{J_k} be the submatrix of A with just the rows corresponding to J_k (for k = 1, 2, 3) and similarly b_{J_k} the subvector of b with rows from J_k. Then we can more compactly write (2.59) as

\[
\begin{aligned}
\min_{x} \quad & c^T x \\
\text{s.t.} \quad & A_{J_1} x \ge b_{J_1} \\
& A_{J_2} x \le b_{J_2} \\
& A_{J_3} x = b_{J_3} \\
& x_{I_1} \ge 0 \\
& x_{I_2} \le 0
\end{aligned}
\tag{2.60}
\]

Note that inequalities regarding vectors (such as x ≥ 0) are always treated componentwise (i.e., x_i ≥ 0 for all i).

2.5.1 Linear Cone Programming
Keeping track of which variables are positive or negative, and which constraints are equality vs. inequalities is tedious and can complicate equations. Simplifying this notational complexity gives us a good excuse to move to a slight generalization of Linear Programming called Cone Programming. Fortunately, the main theorems concerning Linear Programming (especially regarding duality) are most naturally stated using the theory of cones, so we will solve two problems at once by considering conic problems here. In Cone Programming, we require the variables and constraints to lie in a type of convex subset of R^n called a cone.
Definition 43. A subset K ⊆ R^n is a cone if K is closed under addition and multiplication by nonnegative scalars c ∈ R, c ≥ 0. That is, for x, y ∈ K we have x + y ∈ K and for c ≥ 0 we have cx ∈ K.
The following 4 subsets of R are particularly useful cones:
• K0 := {0}
• KR := R
• K≥ := {x ∈ R | x ≥ 0}
• K≤ := {x ∈ R | x ≤ 0}
It is trivial to verify that these are each closed under addition and multiplication by nonnegative scalars.
Using these cones, we can re-write the constraints of an LP: for example, the constraint x_i ≥ 0 is the same as x_i ∈ K≥, and the constraint Σ_i a_{j,i} x_i = b_j is the same as Σ_i a_{j,i} x_i − b_j ∈ K0. This lets us re-write all our constraints as membership in a cone. Given the partition (I1, I2, I3) of the variables into x_i ≥ 0, x_i ≤ 0 and unconstrained x_i, we get a cone K defined by

\[
K = \prod_{i \in I_1} K_{\ge} \times \prod_{i \in I_2} K_{\le} \times \prod_{i \in I_3} K_{\mathbb{R}}
\tag{2.61}
\]

Then, the relation x ∈ K is identical to the intersection of the various constraints that x_i ≥ 0 for i ∈ I1, x_i ≤ 0 for i ∈ I2 and x_i ∈ R for i ∈ I3.

Similarly, given the partition (J1, J2, J3) of the constraints into ≥, ≤ and = relations, we can define

\[
K' = \prod_{j \in J_1} K_{\ge} \times \prod_{j \in J_2} K_{\le} \times \prod_{j \in J_3} K_{0}
\tag{2.62}
\]

so that Ax − b ∈ K′ is identical to the original constraints A_{J_1} x ≥ b_{J_1}, A_{J_2} x ≤ b_{J_2} and A_{J_3} x = b_{J_3}.

Therefore, the general form of an LP (2.60) is much more simply expressed as

\[
\begin{aligned}
\min_{x} \quad & c^T x \\
\text{s.t.} \quad & Ax - b \in K' \\
& x \in K
\end{aligned}
\tag{2.63}
\]
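The conic form makes feasibility checking uniform: every constraint becomes a membership test. The sketch below encodes the four one-dimensional cones as predicates and tests two candidate points against the example LP (2.58). The string labels for the cones and the function names are assumptions of this example.

```python
# Each one-dimensional cone is represented by a membership predicate.
CONES = {
    "zero": lambda t: t == 0,     # {0}
    "free": lambda t: True,       # all of R
    "nonneg": lambda t: t >= 0,   # non-negative reals
    "nonpos": lambda t: t <= 0,   # non-positive reals
}

def in_product_cone(vector, kinds):
    """Membership in a product of the one-dimensional cones named in `kinds`."""
    return all(CONES[k](t) for t, k in zip(vector, kinds))

def is_feasible(A, b, x, row_kinds, var_kinds):
    """Check that Ax - b lies in the constraint cone and x in the variable cone."""
    residual = [sum(a * xi for a, xi in zip(row, x)) - bj
                for row, bj in zip(A, b)]
    return in_product_cone(residual, row_kinds) and in_product_cone(x, var_kinds)

# The example LP (2.58): rows are x1 - x2 = 3, 2*x2 + 3*x3 >= 2, -3*x1 + x3 <= 4.
A = [[1, -1, 0], [0, 2, 3], [-3, 0, 1]]
b = [3, 2, 4]
row_kinds = ["zero", "nonneg", "nonpos"]
var_kinds = ["nonneg", "nonpos", "free"]

feasible_point = is_feasible(A, b, [3, 0, 1], row_kinds, var_kinds)
infeasible_point = is_feasible(A, b, [3, 0, 0], row_kinds, var_kinds)
```

The point (3, 0, 1) satisfies every constraint, while (3, 0, 0) violates 2x2 + 3x3 ≥ 2, and the code distinguishes them without any case analysis on the constraint type.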

2.6 Convex Sets and Convex Functions

Much of the theory of Linear Programming comes from convex optimization. Specifically, since the feasible set for an LP is a convex set, and the linear objective is likewise convex, an LP is an instance of a convex program. In this section, we will review the basics of convexity and convex programs; a full treatment of even the basics of convex optimization is large enough to fill a book [9].

The basic definition of convexity is that any line segment joining two elements of a convex set is also contained in the set.
Definition 44. A set Ω ⊆ R^n is convex if for every a, b ∈ Ω and t ∈ [0, 1] we have ta + (1 − t)b ∈ Ω.
For functions, convexity says that the set of points lying above the graph of f is convex. This set is called the epigraph.
Definition 45. A function f : R^n → R is convex if for every a, b ∈ R^n and t ∈ [0, 1] we have f(ta + (1 − t)b) ≤ t f(a) + (1 − t) f(b). Equivalently, f is convex if and only if the epigraph of f (i.e., the set S ⊆ R^{n+1}, where S = {(x, z) | z ≥ f(x)}) is a convex set.
One of the most useful theorems regarding convex functions (for the purposes of optimization) is that all local minima of a convex function are also global minima.
Lemma 46. If f : R^n → R is convex, and x∗ is a local minimum of f (meaning f(x∗) ≤ f(x) for all x ∈ U, where U is an open neighborhood of x∗) then x∗ is a global minimum of f, meaning f(x∗) ≤ f(x) for all x ∈ R^n.
Therefore, for convex functions, algorithms which converge to local optima (such as gradient descent methods) also find a global optimum.
For proving results relating to convexity, one of the most powerful theorems is the hyperplane separation theorem. Recall that a hyperplane in R^n is defined by an equation v_1 x_1 + · · · + v_n x_n = c, or more compactly, v^T x = c. This divides R^n in half, with those x for which v^T x ≥ c on one side, and those with v^T x < c on the other.

The hyperplane separation theorem says that for any closed convex set Ω, if we have any element x outside Ω, then there is a hyperplane such that Ω is on one side of the hyperplane, and x is on the other.
Theorem 47. For any closed convex set Ω ⊆ R^n, and for any point x not in Ω, there is a hyperplane separating x from Ω. That is, there is a vector v and scalar c ∈ R such that for any y ∈ Ω we have v^T y > c but v^T x < c.
This theorem may not seem especially important; however, in practice it holds a similar place to the Intermediate Value Theorem in one-dimensional calculus, in that many other more useful theorems follow directly from it.
Finally, we’ll note that convexity immediately applies to our results of the previous section:
Lemma 48. A cone K ⊆ R^n is convex.
Proof. The cone K is closed under addition and multiplication by non-negative scalars, so let a, b ∈ K, and t ∈ [0, 1]. Then ta and (1 − t)b are in K, and hence ta + (1 − t)b ∈ K as well.
2.7 Duality
One of the most important tools for working with linear programs is the theory of duality. Every linear program has another linear program associated to it, called the dual program. This dual program can be obtained by a purely mechanical process (which we will describe shortly), and provides a great deal of information regarding solutions to our original linear program (henceforth called the primal problem).
At a high level, for a minimization problem (as in (2.60)) the dual program uses the constraints of the original problem to give a lower bound on the optimal objective. In fact, the dual program has a variable corresponding to every constraint of the primal program, and similarly a constraint corresponding to every variable of the primal. That is, dualization exchanges constraints and variables — this can be very helpful for problems with large numbers of constraints but few variables, or vice-versa.

2.7.1 Exchanging Minimization and Maximization

The simple observation at the heart of duality is that exchanging nested minimizations and maximizations gives a lower bound.

Lemma 49. If f : X × Y → R is any function (not necessarily continuous) and X, Y are any sets, then

\[
\min_{x \in X} \max_{y \in Y} f(x, y) \ge \max_{y \in Y} \min_{x \in X} f(x, y)
\tag{2.64}
\]

Proof. We’ll define two helper functions, g(x) = max_{y∈Y} f(x, y) and h(y) = min_{x∈X} f(x, y). Note that, by definition, min_{x∈X} max_{y∈Y} f(x, y) = min_{x∈X} g(x), and similarly max_{y∈Y} min_{x∈X} f(x, y) = max_{y∈Y} h(y).
Let x be any element of X, and y any element of Y. Since g(x) is the maximum over all y′ of f(x, y′), we have g(x) ≥ f(x, y), and similarly since h(y) is the minimum over all x′ of f(x′, y) we have h(y) ≤ f(x, y). In particular, for any x and y we always have

g(x) ≥ f(x, y) ≥ h(y)    (2.65)

A general fact regarding maxima and minima is that if we have two sets A, B ⊆ R with a ≥ b for any a ∈ A and b ∈ B, then min A ≥ max B. Therefore, we have

\[
\min_{x \in X} \max_{y \in Y} f(x, y) = \min_{x \in X} g(x) \ge \max_{y \in Y} h(y) = \max_{y \in Y} \min_{x \in X} f(x, y)
\tag{2.66}
\]
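When X and Y are finite, f is just a table, and Lemma 49 can be checked directly. The sketch below tests the inequality on random integer tables (rows indexed by x, columns by y) and then shows a small table where it is strict. The function names and the random test setup are illustrative assumptions.

```python
import random

def min_max(table):
    """min over rows x of max over columns y of f(x, y)."""
    return min(max(row) for row in table)

def max_min(table):
    """max over columns y of min over rows x of f(x, y)."""
    return max(min(col) for col in zip(*table))

rng = random.Random(1)
always_holds = True
for _ in range(200):
    table = [[rng.randint(-9, 9) for _ in range(4)] for _ in range(5)]
    always_holds = always_holds and min_max(table) >= max_min(table)

# The inequality can be strict: here min-max is 1 but max-min is 0.
strict_example = [[1, 0],
                  [0, 1]]
gap = (min_max(strict_example), max_min(strict_example))
```

The strict example is a reminder that exchanging the order generally loses something. For linear programs the two sides coincide under mild conditions, which is what makes duality so useful there.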

2.7.2 Linear Programming Duality: An Example

The notion of duality just presented seems quite trivial: it is just a rule for interchanging minimization and maximization. However, in the context of Linear Programming, it becomes quite powerful. To see this in action, let’s consider the following simple LP:

\[
\begin{aligned}
\min_{x_1, x_2} \quad & 3x_1 + 4x_2 \\
\text{s.t.} \quad & 2x_1 + x_2 \ge 2 \\
& x_1, x_2 \ge 0
\end{aligned}
\tag{2.67}
\]

The feasible set of this LP is illustrated in Figure 2.1. We might guess that the minimizer of (2.67) is achieved at (1, 0) with value 3. However, how can we prove this? With only 2 variables and 3 constraints, it’s not especially difﬁcult to prove that this is indeed the minimizer, for example by geometric arguments. With many variables and constraints, though, this becomes a much more difﬁcult task.

Figure 2.1: The feasible set of (2.67), shown in grey.

If we can find some easily obtainable lower bound to the objective of (2.67), then it may be possible to prove something about the minimum value. In particular, if we could prove that the objective of (2.67) is always at least 3, then our proposed solution (1, 0) with objective value 3 has to be the global minimizer (since any other solution can’t have value less than 3). The duality lemma, Lemma 49, is our method for finding this lower bound.
The general recipe for constructing the dual is:
1. Take the constraints of the original LP and move them into the objective, using indicator functions (Section 1.2.2).
2. Write these indicator functions as a maximization over a new variable (called a Lagrange multiplier) times the original constraint.
3. Exchange minimization and maximization, using Lemma 49.
4. Take any terms that look like indicator functions out of the objective, and make them constraints.
Let’s examine each of these steps in turn, for our example LP.

Re-writing constraints as indicator functions

Recall from Section 1.2.2 that we can convert any constrained optimization problem to an unconstrained problem by using indicator functions. If g_j(x) ≥ 0 is a constraint, then the indicator function in a minimization problem for g_j takes value ∞ for all infeasible solutions:

\[
I^{\min}_{g_j(x) \ge 0}(x) =
\begin{cases}
0 & g_j(x) \ge 0 \\
\infty & \text{otherwise}
\end{cases}
\tag{2.68}
\]

For a maximization problem, the indicator function takes value −∞ for infeasible solutions:

\[
I^{\max}_{g_j(x) \ge 0}(x) =
\begin{cases}
0 & g_j(x) \ge 0 \\
-\infty & \text{otherwise}
\end{cases}
\tag{2.69}
\]

We can take any constrained optimization problem and get an equivalent problem by removing the constraint g_j(x) ≥ 0 and adding I_{g_j(x)≥0}(x) to the objective.

For our example problem, we’ll use this to eliminate the constraint 2x1 + x2 ≥ 2. We get a new objective which is equal to the original objective, plus the indicator function for this constraint:

\[
F(x_1, x_2) = 3x_1 + 4x_2 + I^{\min}_{2x_1 + x_2 \ge 2}(x_1, x_2)
\tag{2.70}
\]

As we’ve already noted, replacing constraints by indicator functions gives an unchanged optimization problem, so our original minimization (2.67) is equal to

\[
\min_{x_1, x_2 \ge 0} \ 3x_1 + 4x_2 + I^{\min}_{2x_1 + x_2 \ge 2}(x_1, x_2)
\tag{2.71}
\]

Lagrange Multipliers

For the case of linear constraints, there is a simple way to re-write the indicator function which preserves the linear structure. To do so, let’s look at what happens when we multiply the original constraints by a new, auxiliary variable (called a Lagrange Multiplier).

For the constraint 2x1 + x2 ≥ 2, we define the residual to be the quantity 2 − (2x1 + x2). When the residual is positive, (x1, x2) is infeasible, and the residual is the amount by which the constraint has been violated. Conversely, when the residual is non-positive, the constraint is satisfied.

Consider multiplying the residual by a non-negative variable y ≥ 0:

\[
y(2 - 2x_1 - x_2)
\tag{2.72}
\]

If the residual is positive, then we can send (2.72) to +∞ by making y arbitrarily large. However, if the residual is non-positive then the biggest we can make (2.72) is 0, by setting y = 0. In other words, we have



\[
\max_{y \ge 0} \; y(2 - 2x_1 - x_2) =
\begin{cases}
0 & 2x_1 + x_2 \ge 2 \\
\infty & \text{otherwise}
\end{cases}
= I^{\min}_{2x_1 + x_2 \ge 2}(x_1, x_2)
\tag{2.73}
\]

Therefore, we have found a way to re-write the indicator function as a maximization over this Lagrange Multiplier. Applied to our example problem, we have that our original primal problem is equal to

\[
\min_{x_1, x_2 \ge 0} \; 3x_1 + 4x_2 + \max_{y \ge 0} \; y(2 - 2x_1 - x_2)
\tag{2.74}
\]
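The behavior of the Lagrangian term in (2.73) can be checked numerically. The following is a minimal sketch, not part of the original derivation: the finite grid `Y_GRID` is a hypothetical stand-in for the maximization over all y ≥ 0, and the two test points are chosen only for illustration.

```python
from fractions import Fraction

def lagrangian_term(x1, x2, y):
    """Multiplier y >= 0 times the residual of the constraint 2*x1 + x2 >= 2."""
    return y * (2 - 2 * x1 - x2)

def max_over_grid(x1, x2, y_grid):
    """Largest value of the term over a finite grid standing in for all y >= 0."""
    return max(lagrangian_term(x1, x2, y) for y in y_grid)

Y_GRID = [0, 1, 10, 100, 1000]  # hypothetical grid; (2.73) maximizes over every y >= 0

# Feasible point: 2*1 + 1/2 >= 2, so the residual is negative and y = 0 is best.
feasible_max = max_over_grid(Fraction(1), Fraction(1, 2), Y_GRID)

# Infeasible point: 2*(1/2) + 1/2 < 2, so the value grows with the largest y in the grid.
infeasible_max = max_over_grid(Fraction(1, 2), Fraction(1, 2), Y_GRID)
```

For the feasible point the maximum is attained at y = 0, while for the infeasible point it is attained at the largest grid value, which is the finite-grid shadow of the value ∞ in (2.73).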


Rearrange and exchange minimization and maximization

The next step of the process is purely algebraic. We want to exchange maximization and minimization, using Lemma 49, so we will group together all the terms containing x1 and x2, and then apply our lemma.

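\[
\begin{aligned}
\min_{x_1, x_2 \ge 0} \; 3x_1 + 4x_2 + \max_{y \ge 0} \; y(2 - 2x_1 - x_2)
&= \min_{x_1, x_2 \ge 0} \max_{y \ge 0} \; x_1(3 - 2y) + x_2(4 - y) + 2y \\
&\ge \max_{y \ge 0} \min_{x_1, x_2 \ge 0} \; x_1(3 - 2y) + x_2(4 - y) + 2y \\
&= \max_{y \ge 0} \; 2y + \min_{x_1 \ge 0} x_1(3 - 2y) + \min_{x_2 \ge 0} x_2(4 - y)
\end{aligned}
\tag{2.75}
\]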

Transform indicator functions to constraints

Now we notice something interesting: each inner minimization is a product of a variable xi and a linear function of the variable y. This is exactly the situation we had for the indicator function in (2.73), except that the roles of xi and y have been reversed. In fact, we have

\[
\min_{x_1 \ge 0} x_1(3 - 2y) =
\begin{cases}
0 & 2y \le 3 \\
-\infty & \text{otherwise}
\end{cases}
\qquad
\min_{x_2 \ge 0} x_2(4 - y) =
\begin{cases}
0 & y \le 4 \\
-\infty & \text{otherwise}
\end{cases}
\tag{2.76}
\]

These are the indicator functions for the constraints 2y ≤ 3 and y ≤ 4, for the maximization problem over y. We can replace these indicator functions by the corresponding constraints to get a linear program:

\[
\begin{aligned}
\max_{y \ge 0} \quad & 2y \\
\text{s.t.} \quad & 2y \le 3 \\
& y \le 4
\end{aligned}
\tag{2.77}
\]

This problem is called the dual of our original problem (2.67).

Obtaining a proof of optimality

We now return to our original question: how can we prove that the solution (1, 0) with value 3 is optimal for the primal problem?

Note that the dual program (2.77) is a lower bound on our primal problem (2.67), because we exchanged minimization with maximization (and all other steps maintained equality).

Let’s consider a potential dual solution: y = 3/2. It’s much easier to see that this is optimal for the dual problem because there’s just a single variable, and we’ve raised it as high as possible without violating the constraint 2y ≤ 3. Furthermore, the value of this solution is 3.

Since the dual is a lower bound on the primal problem, we now know that any primal solution must have value at least 3 — therefore, our proposed solution of (1, 0) is indeed optimal. The process by which we arrived at this proof was somewhat complicated; however, all the steps we performed were purely mechanical. Using the language of cone programming, we can give a simple recipe for obtaining the dual program of any LP.
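The optimality certificate for the example can be verified mechanically. The sketch below only restates the primal (2.67) and dual (2.77) as small functions (the function names are illustrative choices, not notation from the text) and checks that the proposed pair is feasible with zero duality gap.

```python
from fractions import Fraction

def primal_value(x1, x2):
    """Objective of the example primal: 3*x1 + 4*x2."""
    return 3 * x1 + 4 * x2

def primal_feasible(x1, x2):
    """Constraints of the example primal: x >= 0 and 2*x1 + x2 >= 2."""
    return x1 >= 0 and x2 >= 0 and 2 * x1 + x2 >= 2

def dual_value(y):
    """Objective of the dual (2.77): 2*y."""
    return 2 * y

def dual_feasible(y):
    """Constraints of the dual (2.77): y >= 0, 2*y <= 3 and y <= 4."""
    return y >= 0 and 2 * y <= 3 and y <= 4

x_star = (Fraction(1), Fraction(0))
y_star = Fraction(3, 2)

both_feasible = primal_feasible(*x_star) and dual_feasible(y_star)
# A zero gap between a feasible primal and a feasible dual value certifies both as optimal.
gap = primal_value(*x_star) - dual_value(y_star)
```

Because every feasible dual value is a lower bound on every feasible primal value, a gap of zero leaves no room for a better solution on either side.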


2.7.3 Conic Duality

The example above shows how to obtain the dual LP for a single linear constraint. We’ll now show how to ﬁnd the dual for any Linear Program, using concepts from cone programming. The key deﬁnition to make this work is that of a dual cone.

Deﬁnition 50. For a cone K ⊆ Rn, its dual cone K∗ is the set of all y ∈ Rn whose dot product with all elements of K is nonnegative. So,

K∗ := {y ∈ Rn | yT x ≥ 0, ∀x ∈ K}

(2.78)

Using this deﬁnition, we can get a general formula for the dual of any linear cone program.

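Theorem 51. The linear cone program

\[
\min_{x} \; c^T x \quad \text{s.t.} \quad Ax - b \in K', \quad x \in K
\tag{2.79}
\]

has a dual program

\[
\max_{y} \; b^T y \quad \text{s.t.} \quad A^T y - c \in -K^*, \quad y \in (K')^*
\tag{2.80}
\]

In particular, weak duality holds, so that OPT(2.79) ≥ OPT(2.80).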

Provided we can actually compute the dual cones K∗ and (K′)∗, this gives a simple formula for computing the dual of any LP. Let’s see how this works for the cones that define our basic equality and inequality constraints.

For K0, KR, K≥ and K≤, we can compute their duals easily. Note that for x, y ∈ R, xT y is just the product xy.

• KR∗ = K0: according to the definition, for y to be in KR∗ we would need yx ≥ 0 for all x ∈ R. Consider 3 cases for y: either y > 0, y < 0 or y = 0. If y > 0, set x = −1 (which has x ∈ KR). Then yx < 0 for some x ∈ KR, so y ∉ KR∗. If y < 0, set x = 1, and again yx < 0 for some x ∈ KR, so y ∉ KR∗. Finally, if y = 0 then yx = 0 for all x, so y ∈ KR∗.

• K0∗ = KR, since there is only one x ∈ K0 and it is x = 0. Then y · 0 = 0 ≥ 0 for all y ∈ R, and hence K0∗ = R.

• K≥∗ = K≥. We have two cases. If y ≥ 0 then yx ≥ 0 for any x ≥ 0, and hence y ∈ K≥∗. If y < 0 then set x = 1 and we have y · 1 < 0, and hence y ∉ K≥∗.

• K≤∗ = K≤ by a similar argument.
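The four dual cones above can be sanity-checked against Definition 50 on a handful of points. This is only an illustrative sketch: the string names for the cones and the finite sample `SAMPLES` are assumptions made for the example, and a finite sample can of course only spot-check the definition, not prove it.

```python
# Membership tests for the four one-dimensional cones used in the text.
CONES = {
    "K0": lambda x: x == 0,   # the single point {0}
    "KR": lambda x: True,     # all of R
    "K>=": lambda x: x >= 0,  # nonnegative reals
    "K<=": lambda x: x <= 0,  # nonpositive reals
}

# Dual cones as derived in the text.
DUAL = {"K0": "KR", "KR": "K0", "K>=": "K>=", "K<=": "K<="}

SAMPLES = [-2, -1, 0, 1, 2]  # hypothetical finite sample of R

def in_dual_by_definition(name, y):
    """y is in K* iff y*x >= 0 for every x in K (checked on SAMPLES only)."""
    return all(y * x >= 0 for x in SAMPLES if CONES[name](x))

def matches_claimed_dual(name):
    """Compare the definition with the claimed dual cone on every sample y."""
    return all(in_dual_by_definition(name, y) == CONES[DUAL[name]](y) for y in SAMPLES)

results = {name: matches_claimed_dual(name) for name in CONES}
```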

When we have multiple constraints, the cone K is a cartesian product of cones, according to (2.62), so we need to know how taking the dual relates to cartesian products. Fortunately, the dual of a product is the product of the duals.
Proposition 52. If K1, . . . , Kn are cones, then (K1 × · · · × Kn)∗ = K1∗ × · · · × Kn∗.

Proof. Let K = K1 × · · · × Kn. To see that K1∗ × · · · × Kn∗ is the dual of K, let x ∈ K. We’ll write x as the concatenation of subvectors xi ∈ Ki so that x = (x1, . . . , xn).
We know that for any i and yi ∈ Ki∗ that yTi xi ≥ 0, so in particular, letting y = (y1, . . . , yn) we have yT x ≥ 0. Therefore, K1∗ × · · · × Kn∗ ⊆ K∗.
In the other direction, if y ∈ K∗ then for each i, we want to show that yi ∈ Ki∗. Indeed, for any xi ∈ Ki let x be (0, . . . , xi, . . . , 0), and we have yT x = yixi must be ≥ 0. This is true for any xi ∈ Ki, so yi ∈ Ki∗, and hence K∗ ⊆ K1∗ × · · · × Kn∗.

In the remainder of the section, we prove Theorem 51. First, we can extend our observation about Lagrange variables resulting in indicator functions to the conic programming case.

Lemma 53. If x ∈ K then $\max_{y \in K^*} -y^T x = 0$. If x ∉ K, then $\max_{y \in K^*} -y^T x = \infty$. That is,

\[
I^{\min}_{x \in K}(x) = \max_{y \in K^*} -y^T x
\tag{2.81}
\]

Proof. If x ∈ K, then by deﬁnition of K∗, since y ∈ K∗ we have yT x ≥ 0 and hence −yT x ≤ 0. By choosing y = 0 we get that the maximum of −yT x is equal to 0.

Now, let x ∉ K. Then, since K is closed and convex, by the Separating Hyperplane Theorem (Theorem 47) there exist $\tilde{y}$ and c with $\tilde{y}^T x < c$ but $\tilde{y}^T z > c$ for all z ∈ K. In particular, since 0 ∈ K we have $\tilde{y}^T 0 > c$, so c < 0.

It turns out that this $\tilde{y}$ defining the separating hyperplane is actually an element of the dual cone, $\tilde{y} \in K^*$. To show this, assume by way of contradiction that there’s some $\tilde{z} \in K$ with $\tilde{y}^T \tilde{z} = \epsilon < 0$. Since K is a cone and $c/\epsilon > 0$, we have $(c/\epsilon)\tilde{z} \in K$. However, then $\tilde{y}^T((c/\epsilon)\tilde{z}) = (c/\epsilon)\epsilon = c$, which contradicts $\tilde{y}^T z > c$ for all z ∈ K. Therefore, there is no z ∈ K with $\tilde{y}^T z < 0$, and hence $\tilde{y} \in K^*$.

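Now, for our x which is not in K we have that $\tilde{y}^T x < c$, so $-\tilde{y}^T x > -c > 0$. Since K∗ is a cone, we have that $\lambda\tilde{y} \in K^*$ for all λ ≥ 0, so $-(\lambda\tilde{y})^T x \ge -\lambda c$ and hence

\[
\max_{y \in K^*} -y^T x \;\ge\; \max_{\lambda \ge 0} -(\lambda\tilde{y})^T x \;\ge\; \max_{\lambda \ge 0} -\lambda c = \infty
\tag{2.82}
\]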

Note that this Lemma gives some intuition for separating hyperplanes, namely that they are directions along which we can send $y^T x$ to −∞ for x ∉ K, and furthermore, separating hyperplanes of cones are elements of the dual cone.

Corollary 54.

\[
I^{\max}_{x \in K}(x) = \min_{y \in K^*} y^T x
\tag{2.83}
\]

Proof. This follows from $I^{\max}_{x \in K}(x) = -I^{\min}_{x \in K}(x)$.

With this Lemma, we can prove the main theorem of this section.

Proof of Theorem 51. Let’s start with our primal problem

\[
\min_{x} \; c^T x \quad \text{s.t.} \quad Ax - b \in K', \quad x \in K
\tag{2.84}
\]

We then follow the recipe for our single variable example: (1) replace the constraint Ax − b ∈ K′ with an indicator function, and then use Lemma 53 to write this as a maximization over a new variable y. (2) Exchange maximization with minimization. (3) Replace terms that look like indicator functions with constraints:

\[
\begin{aligned}
\min\,(2.84)
&= \min_{x \in K} \; c^T x + I^{\min}_{Ax - b \in K'}(x) && \text{Replace constraints with indicators} \\
&= \min_{x \in K} \; c^T x + \max_{y \in (K')^*} -y^T(Ax - b) && \text{Lemma 53} \\
&= \min_{x \in K} \max_{y \in (K')^*} \; y^T b + c^T x - y^T A x \\
&\ge \max_{y \in (K')^*} \min_{x \in K} \; y^T b + (c - A^T y)^T x && \text{Exchange max with min} \\
&= \max_{y \in (K')^*} \; y^T b + \min_{x \in K} x^T(c - A^T y) \\
&= \max_{y} \{ b^T y \mid y \in (K')^*, \; c - A^T y \in K^* \} && \text{Replace indicators with constraints}
\end{aligned}
\tag{2.85}
\]

Finally, the last line is equal to our dual problem

\[
\max_{y} \; b^T y \quad \text{s.t.} \quad A^T y - c \in -K^*, \quad y \in (K')^*
\tag{2.86}
\]
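For linear programs, where every cone is one of K0, KR, K≥ or K≤, the formula of Theorem 51 can be applied mechanically. The following is a minimal sketch under that assumption. The cone labels (`"zero"`, `"free"`, `"nonneg"`, `"nonpos"`), the function name `conic_dual` and the dictionary layout of the result are all hypothetical choices for this example.

```python
# Dual of each one-dimensional cone: K0* = KR, KR* = K0, K>=* = K>=, K<=* = K<=.
DUAL_CONE = {"zero": "free", "free": "zero", "nonneg": "nonneg", "nonpos": "nonpos"}
# Negation of each cone, needed for the constraint A^T y - c in -K*.
NEGATE = {"zero": "zero", "free": "free", "nonneg": "nonpos", "nonpos": "nonneg"}

def conic_dual(A, b, c, constraint_cones, variable_cones):
    """Data of the dual (2.80) for: min c^T x  s.t.  (Ax - b)_j in K'_j,  x_i in K_i."""
    A_transpose = [list(column) for column in zip(*A)]
    return {
        "maximize": list(b),        # dual objective b^T y
        "At": A_transpose,          # rows of A^T, one per primal variable
        "c": list(c),
        # (A^T y - c)_i must lie in -(K_i)*
        "row_cones": [NEGATE[DUAL_CONE[k]] for k in variable_cones],
        # y_j must lie in (K'_j)*
        "y_cones": [DUAL_CONE[k] for k in constraint_cones],
    }

# The running example: min 3*x1 + 4*x2  s.t.  2*x1 + x2 - 2 >= 0,  x1, x2 >= 0.
dual = conic_dual(A=[[2, 1]], b=[2], c=[3, 4],
                  constraint_cones=["nonneg"],
                  variable_cones=["nonneg", "nonneg"])
```

Read back, the result says: maximize 2y subject to 2y − 3 ≤ 0, y − 4 ≤ 0 and y ≥ 0, which is the dual (2.77) obtained by hand.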

2.8 Optimality for Linear Programs
The dual program is not just useful as a lower bound to the primal program — in fact, if we have a pair of solutions (x∗, y∗) which are optimal for the primal and dual problems respectively, then these solutions obey a property called complementary slackness.
The main idea behind complementary slackness is that whenever a variable in the optimal solution xi∗ is nonzero, then the constraint corresponding to that variable in the dual program must be satisﬁed with equality (we say that such a constraint is tight). Conversely, whenever a dual constraint is not tight, the variable xi must be 0, so complementary slackness is useful for proving facts about the sparseness of optimal solutions (i.e., that only certain variables xi are nonzero).
Let’s start with a linear conic program, where we have expanded out K and K′ into a product of cones (as in (2.62)), with primal program

\[
\min_{x} \; c^T x \quad \text{s.t.} \quad Ax - b \in \prod_j K'_j, \quad x \in \prod_i K_i
\tag{2.87}
\]

and dual program

\[
\max_{y} \; b^T y \quad \text{s.t.} \quad A^T y - c \in -\prod_i K_i^*, \quad y \in \prod_j (K'_j)^*
\tag{2.88}
\]

For each constraint j, the slack is the value of the linear constraint, (Ax − b) j.

Complementary slackness says that for each i, the product of the primal variable xi and the dual slack (AT y − c)i must be zero. Similarly, the product of the dual variable y j and the primal slack (Ax − b) j is also zero.
Theorem 55. Given the linear conic programs above, if there exist optimal solutions x∗ and y∗, then these satisfy $(x_i^*)^T (A^T y^* - c)_i = 0$ for all i and $(y_j^*)^T (A x^* - b)_j = 0$ for all j.

We can specialize this to linear programs, where each of the cones Ki, K′j is just a subset of R.
Corollary 56. If x∗ and y∗ are respectively primal and dual optimal for a Linear Program, and $s_j$, $t_i$ are the primal and dual slack variables, given by $s_j = (A x^* - b)_j$ and $t_i = (A^T y^* - c)_i$, then $x_i^* t_i = 0$ for all i and $y_j^* s_j = 0$ for all j.
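Complementary slackness can be checked directly on the running example, with primal solution (1, 0) and dual solution y = 3/2. This is only an illustrative sketch; the variable names are chosen for the example and the data are those of (2.67) and (2.77).

```python
from fractions import Fraction

A = [[2, 1]]   # single constraint 2*x1 + x2 >= 2
b = [2]
c = [3, 4]
x = [Fraction(1), Fraction(0)]   # primal optimum from the text
y = [Fraction(3, 2)]             # dual optimum from the text

# Primal slack (Ax - b)_j, one entry per constraint.
primal_slack = [sum(A[j][i] * x[i] for i in range(len(x))) - b[j]
                for j in range(len(b))]
# Dual slack (A^T y - c)_i, one entry per primal variable.
dual_slack = [sum(A[j][i] * y[j] for j in range(len(y))) - c[i]
              for i in range(len(c))]

# Each primal variable times its dual slack, and each dual variable times its primal slack.
x_times_dual_slack = [x[i] * dual_slack[i] for i in range(len(x))]
y_times_primal_slack = [y[j] * primal_slack[j] for j in range(len(y))]
```

Here the second dual constraint y ≤ 4 is not tight, and correspondingly the second primal variable is zero, as the discussion of sparseness above predicts.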

Finally, we conclude with the Strong Duality theorem, which says that for LPs, we do not actually lose anything by exchanging maximization and minimization.

Theorem 57 (Strong Duality). For a Linear Program, if both the primal and dual have feasible solutions, then their optimal values are equal.

2.9 Duality for the Local Marginal Polytope

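Recall the Local Marginal Polytope of Section 2.4:

\[
\begin{aligned}
\min_{\mu} \quad & \sum_i \sum_{x_i} \mu_i(x_i) f_i(x_i) + \sum_C \sum_{x_C} \mu_C(x_C) f_C(x_C) \\
\text{s.t.} \quad & \sum_{x_{C \setminus i}} \mu_C(x_C) = \mu_i(x_i) && \forall C,\ i \in C,\ x_i \in X_i \\
& \sum_{x_i} \mu_i(x_i) = 1 && \forall i \\
& \mu \ge 0
\end{aligned}
\tag{2.89}
\]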

Since we are interested in this linear program (as a relaxation of our MRF optimization problem), we should examine its dual, to see if there is any structure to the dual LP that can be exploited. As a bonus, we will also use this as an example of how to take the dual of an LP in practice.

The language of cones and dual cones is convenient for stating the main theorems of Linear Programming, but isn’t necessarily the most convenient for doing calculations. However, the preceding sections have given us a recipe for computing the dual: (1) multiply the constraints by new Lagrange variables, and bring them into the objective as indicator functions, (2) exchange minimization with maximization, (3) group terms containing the original variables, and replace terms that look like indicator functions with constraints.

For the Local Marginal Polytope, we have two types of constraints, and correspondingly, two types of dual variables. The first set of constraints are for each clique C, each i ∈ C and each label xi ∈ Xi, so we’ll denote the corresponding dual variable as λC,i(xi).⁸ The other constraints are for each i, with corresponding dual variable κi.

The indicator functions for these constraints are the max of the Lagrange variables times the residual, so we have $\lambda_{C,i}(x_i) \cdot \big( \mu_i(x_i) - \sum_{x_{C \setminus i}} \mu_C(x_C) \big)$ for the first set of constraints and $\kappa_i \big( 1 - \sum_{x_i} \mu_i(x_i) \big)$ for the second set of constraints. Since these are equality constraints, these maximizations are over the dual cone (K0)∗ = R. Therefore, we get that (2.89) is equal to

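\[
\begin{aligned}
\min_{\mu \ge 0} \max_{\lambda, \kappa} \quad
& \sum_i \sum_{x_i} \mu_i(x_i) f_i(x_i) + \sum_C \sum_{x_C} \mu_C(x_C) f_C(x_C) && (2.90) \\
& + \sum_{C,\, i \in C} \sum_{x_i} \lambda_{C,i}(x_i) \Big( \mu_i(x_i) - \sum_{x_{C \setminus i}} \mu_C(x_C) \Big) + \sum_i \kappa_i \Big( 1 - \sum_{x_i} \mu_i(x_i) \Big) && (2.91) \\
= \min_{\mu \ge 0} \max_{\lambda, \kappa} \quad
& \sum_i \kappa_i + \sum_i \sum_{x_i} \mu_i(x_i) \Big[ f_i(x_i) - \kappa_i + \sum_{C : i \in C} \lambda_{C,i}(x_i) \Big] && (2.92) \\
& + \sum_C \sum_{x_C} \mu_C(x_C) \Big[ f_C(x_C) - \sum_{i \in C} \lambda_{C,i}(x_i) \Big] && (2.93)
\end{aligned}
\]

and hence

\[
\begin{aligned}
\ge \max_{\lambda, \kappa} \quad
& \sum_i \kappa_i + \sum_C \sum_{x_C} \min_{\mu_C(x_C) \ge 0} \mu_C(x_C) \Big[ f_C(x_C) - \sum_{i \in C} \lambda_{C,i}(x_i) \Big] && (2.94) \\
& + \sum_i \sum_{x_i} \min_{\mu_i(x_i) \ge 0} \mu_i(x_i) \Big[ f_i(x_i) - \kappa_i + \sum_{C : i \in C} \lambda_{C,i}(x_i) \Big] && (2.95)
\end{aligned}
\]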

We have two expressions of the form $\min_{a \ge 0} a \cdot b$, which is −∞ when b < 0 and 0 otherwise, so this is the same as the indicator for the constraint that b ≥ 0. Therefore, we can replace these expressions with the corresponding constraints to get the linear program

\[
\begin{aligned}
\max_{\kappa, \lambda} \quad & \sum_i \kappa_i \\
\text{s.t.} \quad & \sum_{i \in C} \lambda_{C,i}(x_i) \le f_C(x_C) && \forall C,\ x_C \\
& \kappa_i \le f_i(x_i) + \sum_{C : i \in C} \lambda_{C,i}(x_i) && \forall i,\ x_i
\end{aligned}
\tag{2.96}
\]

⁸ We could have denoted this variable as $\lambda_{C,i,x_i}$, but as we’ll see shortly, these variables have a natural interpretation as functions of xi.

Note that we are maximizing over κ, and the only constraints involving κ are all of the form κi ≤ hi(xi) for some functions hi. Specifically, let $h_i(x_i) = f_i(x_i) + \sum_{C : i \in C} \lambda_{C,i}(x_i)$, which we call the height of label xi at variable i (following the language of [57]). Since we’re maximizing over κ, we will always have $\kappa_i = \min_{x_i} h_i(x_i)$.
We can informally think of the dual variable λC,i(xi) as taking part of the cost fC(xC), and redistributing it to the unary terms. The height functions hi(xi) can be thought of as the original cost fi(xi), plus any redistribution λC,i from the cliques to the unary terms at i. The dual is always a lower bound on the value f (x) of any labeling.
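To make the height functions and the lower bound concrete, here is a small sketch on a two-variable binary problem with a single clique. All numbers, and the particular choice of λ, are hypothetical and were picked only so that the dual constraints of (2.96) hold; the brute-force minimum is included to compare against the bound.

```python
from itertools import product

# Hypothetical problem: two binary variables and one clique C = (0, 1).
unary = {0: [1, 3], 1: [2, 0]}                                     # f_i(x_i)
cliques = {(0, 1): {(0, 0): 0, (0, 1): 2, (1, 0): 2, (1, 1): 0}}   # f_C(x_C)

# lam[C][i][x_i]: the part of f_C redistributed to the unary term at i.
lam = {(0, 1): {0: [1, -1], 1: [-1, 1]}}

def is_dual_feasible(cliques, lam):
    """First constraint of (2.96): sum_i lam_{C,i}(x_i) <= f_C(x_C) for all C, x_C."""
    return all(sum(lam[C][i][xi] for i, xi in zip(C, xC)) <= cost
               for C, table in cliques.items() for xC, cost in table.items())

def heights(unary, cliques, lam):
    """h_i(x_i) = f_i(x_i) + sum over cliques containing i of lam_{C,i}(x_i)."""
    h = {i: list(costs) for i, costs in unary.items()}
    for C in cliques:
        for i in C:
            for xi in range(len(h[i])):
                h[i][xi] += lam[C][i][xi]
    return h

def dual_value(unary, cliques, lam):
    """Dual objective with kappa_i = min over x_i of h_i(x_i)."""
    return sum(min(h) for h in heights(unary, cliques, lam).values())

def energy(x):
    """f(x): unary terms plus clique terms for a full labeling x."""
    return (sum(unary[i][x[i]] for i in unary)
            + sum(table[tuple(x[i] for i in C)] for C, table in cliques.items()))

bound = dual_value(unary, cliques, lam)
best = min(energy(x) for x in product((0, 1), repeat=len(unary)))
```

With λ = 0 the bound is just the sum of the unary minima; moving cost from the clique into the unary terms, as this λ does, raises the bound without ever exceeding the true minimum energy.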

2.10 First-order Binary MRFs and Minimum Cut
We will conclude the chapter by returning to binary first-order problems, and in particular how we can use max-flow/min-cut to solve them. First, we will give a general solution for solving any submodular first-order MRF with min-cut. Furthermore, we will also show how the Local Marginal Polytope for these problems is itself equivalent to a cut polytope; that is, max-flow/min-cut is not just a convenient algorithm for this purpose, but actually is the same problem from a Linear Programming perspective.

2.10.1 Solving First-order Submodular MRFs with Graph Cuts

We will start with a binary first-order submodular MRF, that is, a minimization problem of the form

\[
\min_{x} \; \sum_i f_i(x_i) + \sum_{i,j} f_{i,j}(x_i, x_j)
\tag{2.97}
\]

where the variables xi are in {0, 1} and each $f_{i,j}$ is submodular, meaning it satisfies

\[
f_{i,j}(0, 0) + f_{i,j}(1, 1) \le f_{i,j}(1, 0) + f_{i,j}(0, 1).
\tag{2.98}
\]

In computer vision, Graph Cuts refers to the various algorithms which solve MRF minimization problems using the minimum-cut/maximum-flow problem. For the simplest case of first-order submodular MRFs, we can reduce (2.97) directly to a minimum cut problem.
To do so, it is easiest to first reparameterize the problem. We have already seen in Section 2.1 that there are several ways to change the values of the $f_i$ and $f_{i,j}$ while keeping the total energy function the same for any labeling x. In this case, we want to find a reparameterization of (2.97) so that $f_{i,j}(x_i, x_j) \ge 0$ for all xi, xj ∈ {0, 1}, and so that $f_{i,j}(0, 0) = f_{i,j}(0, 1) = f_{i,j}(1, 1) = 0$.⁹ That is, $f_{i,j}$ is nonzero only for the single label pair (1, 0).
We use a series of reparameterizations to make this happen. For each $f_{i,j}$ we first subtract $f_{i,j}(0, 0)$ from all values $f_{i,j}(x_i, x_j)$, and add the corresponding amount to the constant term of f. Next, we use the pencil reparameterization (2.6) of Section 2.1 to subtract $\delta = f_{i,j}(1, 1) - f_{i,j}(0, 1)$ from $f_{i,j}(1, x_j)$ for xj ∈ {0, 1} and add δ to $f_i(1)$. Similarly, we subtract $\delta = f_{i,j}(0, 1) - f_{i,j}(0, 0)$ from $f_{i,j}(x_i, 1)$ for xi ∈ {0, 1} and add δ to $f_j(1)$.

⁹ Note that we’ve broken the symmetry between i and j with these requirements, so we’ll assume that each unordered edge {i, j} has i < j.

Overall, we have a new energy function $f'_{i,j}$ given by:

\[
\begin{aligned}
f'_{i,j}(0, 0) &= f_{i,j}(0, 0) - f_{i,j}(0, 0) = 0 \\
f'_{i,j}(0, 1) &= f_{i,j}(0, 1) - f_{i,j}(0, 0) - \big( f_{i,j}(0, 1) - f_{i,j}(0, 0) \big) = 0 \\
f'_{i,j}(1, 0) &= f_{i,j}(1, 0) - f_{i,j}(0, 0) - \big( f_{i,j}(1, 1) - f_{i,j}(0, 1) \big) \\
f'_{i,j}(1, 1) &= f_{i,j}(1, 1) - f_{i,j}(0, 0) - \big( f_{i,j}(1, 1) - f_{i,j}(0, 1) \big) - \big( f_{i,j}(0, 1) - f_{i,j}(0, 0) \big) = 0
\end{aligned}
\tag{2.99}
\]

Note that, because $f_{i,j}$ is submodular, we have that $f'_{i,j}(1, 0) \ge 0$, as the expression above for $f'_{i,j}(1, 0)$ is just a rearrangement of the inequality defining submodularity (2.98).

For the unary terms, we sum up the added δ for each edge {i, j} to get

\[
f'_i(1) = f_i(1) + \sum_{j : i < j} \big( f_{i,j}(1, 1) - f_{i,j}(0, 1) \big) + \sum_{j : j < i} \big( f_{i,j}(1, 0) - f_{i,j}(0, 0) \big)
\tag{2.100}
\]

where, for j < i, $f_{i,j}(x_i, x_j)$ denotes the edge term $f_{j,i}(x_j, x_i)$.
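The reparameterization of a single edge term can be sketched as follows. The function name, the return convention and the example table are hypothetical; the arithmetic follows the three steps described above (subtract the (0, 0) value, then move one δ to the unary term of i and one to the unary term of j).

```python
def reparameterize_edge(f):
    """Reparameterize a pairwise table f[(xi, xj)] as in (2.99).

    Returns (f_new, constant, delta_i, delta_j), where `constant` goes to the
    constant term of the energy, delta_i is added to f_i(1) and delta_j to f_j(1).
    """
    constant = f[0, 0]
    delta_i = f[1, 1] - f[0, 1]   # subtracted from f(1, xj), added to f_i(1)
    delta_j = f[0, 1] - f[0, 0]   # subtracted from f(xi, 1), added to f_j(1)
    f_new = {(xi, xj): f[xi, xj] - constant - delta_i * xi - delta_j * xj
             for xi in (0, 1) for xj in (0, 1)}
    return f_new, constant, delta_i, delta_j

# Hypothetical submodular table: f(0,0) + f(1,1) = 3 <= f(1,0) + f(0,1) = 9.
f = {(0, 0): 1, (0, 1): 4, (1, 0): 5, (1, 1): 2}
f_new, constant, delta_i, delta_j = reparameterize_edge(f)

# The total cost of every labeling of this edge is unchanged.
preserved = all(
    f[xi, xj] == f_new[xi, xj] + constant + delta_i * xi + delta_j * xj
    for xi in (0, 1) for xj in (0, 1))
```

The only surviving entry, `f_new[1, 0]`, equals f(0,1) + f(1,0) − f(0,0) − f(1,1), which is nonnegative exactly when the original table is submodular.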

Now, recall that in a minimum cut problem, we have a graph with a source s and sink t, and vertices V, and for each edge e we have a capacity ce. A cut S ⊆ V + s + t has s ∈ S but t ∉ S. A directed edge (u, v) is cut if u ∈ S but v ∉ S, and the total cost of a cut S is the sum of capacities of all cut edges.

In particular, if we identify cuts S with binary labels x where xi = 1 if and only if i ∈ S , then an edge (i, j) is cut if and only if xi = 1 and x j = 0, and hence only if we would pay the cost fi, j(1, 0) in the objective f .
Therefore, it’s easy to construct a flow network from f′: for each undirected edge {i, j} with i < j, we add a directed edge (i, j) with capacity $f'_{i,j}(1, 0)$. We also add edges between s, t and the original nodes i ∈ V according to the same construction we used in our binary segmentation example of Section 1.3: $c_{s,i} = f'_i(0)$ and $c_{i,t} = f'_i(1)$.
Using the bijection between binary labels x and cuts S , we have that the cost of a cut is exactly equal to f (x), and hence we have reduced the MRF inference problem to the min-cut problem in a graph.
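The reduction can be illustrated end to end on a tiny instance. The sketch below makes several assumptions that are not in the text: the energy is taken to be already reparameterized, with nonnegative unary costs and nonnegative edge weights $f'_{i,j}(1, 0)$; max-flow is computed with a plain Edmonds-Karp search chosen for brevity (not one of the specialized vision algorithms); and the example numbers are made up. The result is compared against brute-force minimization.

```python
from collections import deque
from itertools import product

def build_network(unary, edges):
    """unary[i] = (cost of x_i = 0, cost of x_i = 1); edges[(i, j)] = f'_ij(1, 0)."""
    cap = {"s": {}, "t": {}}
    for i, (cost0, cost1) in unary.items():
        cap["s"][i] = cost0        # cut when i is on the sink side (x_i = 0)
        cap[i] = {"t": cost1}      # cut when i is on the source side (x_i = 1)
    for (i, j), weight in edges.items():
        cap[i][j] = weight         # cut when x_i = 1 and x_j = 0
    return cap

def max_flow(cap, s="s", t="t"):
    """Edmonds-Karp: repeatedly augment along shortest residual paths."""
    res = {u: dict(vs) for u, vs in cap.items()}
    for u, vs in cap.items():
        for v in vs:
            res[v].setdefault(u, 0)            # reverse residual edges
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:       # breadth-first search from s
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow                        # no augmenting path remains
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push
        flow += push

def energy(x, unary, edges):
    """Energy of a labeling: unary costs plus edge weights where x_i = 1, x_j = 0."""
    return (sum(unary[i][x[i]] for i in unary)
            + sum(w for (i, j), w in edges.items() if x[i] == 1 and x[j] == 0))

unary = {0: (4, 1), 1: (2, 3), 2: (0, 5)}      # hypothetical costs
edges = {(0, 1): 3, (1, 2): 2}
flow = max_flow(build_network(unary, edges))
best = min(energy(x, unary, edges) for x in product((0, 1), repeat=len(unary)))
```

By the max-flow/min-cut theorem the flow value equals the cost of the cheapest cut, which under the bijection between cuts and labelings is the minimum energy.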

2.10.2 Linear Programs for Min-Cut

We can get a Linear Program for minimum cut in a graph by ﬁrst starting with an integer programming formulation of the problem. This integer program has a set of binary variables pi ∈ {0, 1} which are 1 whenever i is in the cut S and 0 otherwise. For each edge, we will have an auxiliary variable ai, j which is 1 if the edge (i, j) is cut. We can enforce that ai, j behaves appropriately by adding the constraint ai, j ≥ pi − p j. Note that whenever pi = 1 and p j = 0 then the edge (i, j) is cut, and this constraint forces ai, j ≥ 1 − 0 = 1; for all other values of pi, p j though, we can set ai, j to 0 and still satisfy the constraint.

Therefore, we have an integer programming version of the minimum cut problem:

\[
\begin{aligned}
\min_{p, a} \quad & \sum_i c_{i,t}\, p_i + \sum_i c_{s,i}(1 - p_i) + \sum_{i,j} a_{i,j} c_{i,j} \\
\text{s.t.} \quad & a_{i,j} \ge p_i - p_j \\
& a_{i,j} \ge 0 \\
& p_i \in \{0, 1\}
\end{aligned}
\tag{2.101}
\]

We get a linear program by relaxing the integrality constraint pi ∈ {0, 1} to the linear constraint pi ∈ [0, 1].

2.10.3 Local Marginal Polytope for First-order Binary Problems

Let’s consider the Local Marginal Polytope, in the case where we have only first-order clique functions $f_{i,j}$, and the labels are binary. In this section, we’ll show that this LP is exactly equal to a min-cut LP, in the case where all the $f_{i,j}$ are submodular. To start, let’s look at the part of the Local Marginal Polytope dealing with a single edge {i, j}: all objective terms and constraints that involve the variables $\mu_{i,j}$. This is

\[
\begin{aligned}
\min_{\mu_{i,j}} \quad & \langle \mu_{i,j}, f_{i,j} \rangle \\
\text{s.t.} \quad & \sum_{x_i} \mu_{i,j}(x_i, x_j) = \mu_j(x_j) && \forall x_j \\
& \sum_{x_j} \mu_{i,j}(x_i, x_j) = \mu_i(x_i) && \forall x_i \\
& \mu_{i,j} \ge 0
\end{aligned}
\tag{2.102}
\]

We’ll expand out all the sums over labels to be totally explicit:

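\[
\begin{aligned}
\min_{\mu_{i,j}} \quad & \mu_{i,j}(0, 0) f_{i,j}(0, 0) + \mu_{i,j}(0, 1) f_{i,j}(0, 1) + \mu_{i,j}(1, 0) f_{i,j}(1, 0) + \mu_{i,j}(1, 1) f_{i,j}(1, 1) \\
\text{s.t.} \quad & \mu_{i,j}(0, 0) + \mu_{i,j}(0, 1) = \mu_i(0) \\
& \mu_{i,j}(1, 0) + \mu_{i,j}(1, 1) = \mu_i(1) \\
& \mu_{i,j}(0, 0) + \mu_{i,j}(1, 0) = \mu_j(0) \\
& \mu_{i,j}(0, 1) + \mu_{i,j}(1, 1) = \mu_j(1) \\
& \mu_{i,j}(0, 0),\ \mu_{i,j}(0, 1),\ \mu_{i,j}(1, 0),\ \mu_{i,j}(1, 1) \ge 0
\end{aligned}
\tag{2.103}
\]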

This subset of the LP has 4 variables and 4 equality constraints. Just counting constraints, we’d expect there to be just a single value of µi, j satisfying these constraints (the solution of the 4 × 4 linear system given by these equality constraints). However, the constraints are not linearly independent (for example, add the ﬁrst two and subtract the second two to get 0 = 0), so there are only 3 independent constraints. Thus, there is a single variable which parameterizes solutions to this linear system.

We’ll let this single parameter be $a_{i,j}$, which we’ll (somewhat arbitrarily, for now) set to $a_{i,j} = \mu_{i,j}(1, 0)$. To simplify notation, let $p_i = \mu_i(1)$, $p_j = \mu_j(1)$. Note that $\mu_i(0) = 1 - p_i$ and $\mu_j(0) = 1 - p_j$. Then, manipulating the above equalities we have

\[
\begin{aligned}
\mu_{i,j}(1, 1) &= \mu_i(1) - \mu_{i,j}(1, 0) = p_i - a_{i,j} \\
\mu_{i,j}(0, 0) &= \mu_j(0) - \mu_{i,j}(1, 0) = 1 - p_j - a_{i,j} \\
\mu_{i,j}(0, 1) &= \mu_j(1) - \mu_{i,j}(1, 1) = p_j - p_i + a_{i,j}
\end{aligned}
\tag{2.104}
\]

From the nonnegativity constraints on the $\mu_{i,j}$ we get 4 inequalities: $a_{i,j} \ge 0$, $p_i - a_{i,j} \ge 0$, $1 - p_j - a_{i,j} \ge 0$ and $p_j - p_i + a_{i,j} \ge 0$. Additionally, the objective is a linear function of $a_{i,j}$ and $p_i$, $p_j$:

\[
\begin{aligned}
\langle \mu_{i,j}, f_{i,j} \rangle
&= (1 - p_j - a_{i,j}) f_{i,j}(0, 0) + (p_j - p_i + a_{i,j}) f_{i,j}(0, 1) + a_{i,j} f_{i,j}(1, 0) + (p_i - a_{i,j}) f_{i,j}(1, 1) \\
&= \big( -f_{i,j}(0, 0) + f_{i,j}(0, 1) + f_{i,j}(1, 0) - f_{i,j}(1, 1) \big)\, a_{i,j} \\
&\quad + \big( f_{i,j}(1, 1) - f_{i,j}(0, 1) \big)\, p_i + \big( f_{i,j}(0, 1) - f_{i,j}(0, 0) \big)\, p_j + f_{i,j}(0, 0)
\end{aligned}
\tag{2.105}
\]
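The single-parameter description of the edge marginals, and the rewritten objective, can be checked numerically. The values of pi, pj, a and the cost table below are hypothetical; exact fractions are used so that the equalities hold without rounding.

```python
from fractions import Fraction

def edge_marginals(p_i, p_j, a):
    """Pairwise pseudo-marginals from (2.104), with a = mu_ij(1, 0)."""
    return {(1, 0): a,
            (1, 1): p_i - a,
            (0, 0): 1 - p_j - a,
            (0, 1): p_j - p_i + a}

def objective_direct(mu, f):
    """The inner product <mu_ij, f_ij> written out term by term."""
    return sum(mu[labels] * f[labels] for labels in mu)

def objective_reduced(p_i, p_j, a, f):
    """The same objective in the variables (a, p_i, p_j), as in (2.105)."""
    delta = -f[0, 0] + f[0, 1] + f[1, 0] - f[1, 1]
    return (delta * a + (f[1, 1] - f[0, 1]) * p_i
            + (f[0, 1] - f[0, 0]) * p_j + f[0, 0])

f = {(0, 0): 1, (0, 1): 4, (1, 0): 5, (1, 1): 2}   # hypothetical submodular costs
p_i, p_j = Fraction(3, 4), Fraction(1, 2)
a = max(Fraction(0), p_i - p_j)                     # smallest a allowed by the lower bounds
mu = edge_marginals(p_i, p_j, a)

# The marginalization constraints of (2.103) hold by construction.
marginals_ok = (mu[0, 0] + mu[0, 1] == 1 - p_i and mu[1, 0] + mu[1, 1] == p_i
                and mu[0, 0] + mu[1, 0] == 1 - p_j and mu[0, 1] + mu[1, 1] == p_j)
```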

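We’ll write the coefficient of $a_{i,j}$ as $\delta_{i,j} := -f_{i,j}(0, 0) + f_{i,j}(0, 1) + f_{i,j}(1, 0) - f_{i,j}(1, 1)$. Note that $\delta_{i,j} \ge 0$ is exactly the inequality defining submodularity (2.98), so $f_{i,j}$ is submodular if and only if $\delta_{i,j} \ge 0$.
We have assumed that $f_{i,j}$ is submodular, so $\delta_{i,j} \ge 0$. Since we are minimizing the objective, we want $a_{i,j}\delta_{i,j}$ to be minimized, and hence $a_{i,j}$ should be as small as possible. Thus, the constraints that matter are the lower bounds $a_{i,j} \ge 0$ and $p_j - p_i + a_{i,j} \ge 0$. The upper bounds $a_{i,j} \le p_i$ and $a_{i,j} \le 1 - p_j$ are automatically satisfied by the smallest value the lower bounds allow, $\max(0, p_i - p_j)$, whenever $0 \le p_i, p_j \le 1$.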

We can therefore take every pairwise term and replace the four variables $\mu_{i,j}(0, 0), \ldots, \mu_{i,j}(1, 1)$ with the single variable $a_{i,j}$ and the constraints $a_{i,j} \ge 0$, $a_{i,j} \ge p_i - p_j$. For each pairwise term, we get terms in the objective for $\delta_{i,j} a_{i,j}$ as well as $(f_{i,j}(1, 1) - f_{i,j}(0, 1))\,p_i$ and $(f_{i,j}(0, 1) - f_{i,j}(0, 0))\,p_j$. Collecting all the terms for the $p_i$ across all pairwise terms, together with the original value of $(f_i(1) - f_i(0))\,p_i$, we get a linear term

\[
\Big[ f_i(1) - f_i(0) + \sum_{j \in N(i),\, i < j} \big( f_{i,j}(1, 1) - f_{i,j}(0, 1) \big) + \sum_{j \in N(i),\, j < i} \big( f_{i,j}(1, 0) - f_{i,j}(0, 0) \big) \Big]\, p_i
\tag{2.106}
\]

where, as in (2.100), for j < i the notation $f_{i,j}(x_i, x_j)$ denotes the edge term $f_{j,i}(x_j, x_i)$.

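Letting $\delta_i$ be the coefficient of $p_i$ in the above equation, we get a linear program:

\[
\begin{aligned}
\min_{p, a} \quad & \sum_i \delta_i p_i + \sum_{i,j} \delta_{i,j} a_{i,j} \\
\text{s.t.} \quad & a_{i,j} \ge p_i - p_j \\
& a_{i,j} \ge 0 \\
& 0 \le p_i \le 1
\end{aligned}
\tag{2.107}
\]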

Finally, note that this is exactly the same linear program we derived for the minimum-cut problem, under the same reparameterization which sets fi, j(0, 0) = fi, j(0, 1) = fi, j(1, 1) = 0.


CHAPTER 3 RELATED WORK
There is considerable existing literature on higher-order MRFs — as we have seen, MRFs are a useful framework for probabilistic inference in computer vision, and higher-order MRFs allow more expressiveness for modeling priors that reﬂect the properties of real images. For an extensive survey on MRFs (both higher-order and ﬁrst-order) in computer vision, see [97]. Here, we will cover in detail previous methods most relevant to optimizing higher-order MRFs.
In Section 3.1, we’ll examine the various examples using higher-order MRFs to model computer vision problems. In the rest of the chapter, we’ll break down inference algorithms into a rough categorization. As seen in Section 2.2, binary MRFs have special structure, and are closely related to combinatorial problems like min-cut. Consequently, there is a large diversity of approaches for this special case, which we’ll look at in Section 3.2.
For multi-label MRFs, we’ll categorize algorithms by how they use the Marginal Polytope of Section 2.4. Methods which either do not use Linear Programming at all (or which use only the primal LP) we’ll call Primal methods. Many of these are move-making algorithms, meaning they are local-search algorithms in the space of labels X. Methods using the dual LP are largely message passing algorithms — they optimize the dual program iteratively by performing local updates around each variable, and then sending messages to their neighbors. Finally, Primal-Dual algorithms maintain both a primal and dual solution, and use the information contained in the dual to signiﬁcantly accelerate the convergence of the primal solution. We will consider Primal, Dual and Primal-Dual methods in Sections 3.3, 3.4 and 3.5 respectively.

3.1 Higher-Order Models in Computer Vision

Higher-order MRFs have become increasingly important in computer vision, as they can incorporate sophisticated priors which better reﬂect natural images than ﬁrst-order models.

The first major category of higher-order MRFs are patch-based priors, which model the statistics of small patches in an image. In Section 1.5.2 we discussed the Fields of Experts model of [77], which uses clique functions on each k × k patch of the image to create a generative model of image textures. This model consists of a set of k linear patch filters Ji, which are each run through a nonlinear response function. The cost function is given by

\[
f_C(x_C) = \sum_{i=1}^{k} \alpha_i \log(1 + J_i^T x_C).
\tag{3.1}
\]

We use machine learning approaches to find the best filters Ji and weights αi, so that the resulting prior p[x] matches the observed distribution of image patches as closely as possible. As we observed in Section 1.5.2, using patch based priors like Fields of Experts for denoising leads to much less over-smoothing than the best available first-order models.

Due to (3.1) being continuously differentiable (treating the labels as continuous intensity values), Fields of Experts models are usually optimized using gradient descent or other continuous non-linear solvers. However, because of the large number of discrete labels for denoising problems (typically 256 discrete intensity levels), this model has been used as a benchmark in many comparisons of higher-order discrete solvers.

Other examples of patch-based priors include robust 3×3 Potts priors for image denoising [55], 3×1 vertical patches used for geometric surface labeling [27], and size 3 learned patch priors for 3D object segmentation [98].
Related to patch-based priors like Fields of Experts, segment-based priors ﬁrst segment the image into small, compact regions of similarly colored or textured pixels, and then add a prior that pixels within a segment should take similar labels. The Robust Potts (or Pn) model of [48] enforces consistency between the pixels in a segment by adding a cost for each differently labeled pixel within a segment, which is truncated to a ﬁxed cost after enough differently labeled pixels occur. Cost functions on even larger groups of pixels than segments include the global costs such as the co-occurrence priors of [63], which improve semantic segmentation results by putting high costs on unlikely semantic categories both appearing in an image (for example, “cow” and “sofa”), while not penalizing common pairings (e.g., “cow” and “grass”).
The second major category of higher-order priors uses multiple variables to estimate higher-order derivatives of the image. We have already seen the curvature-based stereo model of [102], in Section 1.5.3. In order to remove the fronto-parallel bias of pairwise stereo priors, this model uses 3 × 1 cliques in the image, with a cost which uses a discrete approximation to the second-derivative of the depth map. By penalizing the second-derivative, this prior thus encourages the result to be locally planar. As we saw in Section 1.5.3, higher-order cliques are necessary to allow non-fronto-parallel planes in the depth image, as ﬁrst-order priors can only penalize differences in neighboring depth values.
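The difference between a first-order prior and a second-derivative prior on a 3 × 1 clique can be seen on three depth values. This is a hedged sketch only: the exact penalty used in [102] is not given here, so the truncated absolute value and the default `truncation=4` are assumptions made for illustration.

```python
def first_order_cost(d_left, d_mid, d_right):
    """Pairwise prior: penalizes any difference between neighboring depths."""
    return abs(d_left - d_mid) + abs(d_mid - d_right)

def second_order_cost(d_left, d_mid, d_right, truncation=4):
    """3x1 clique prior: truncated absolute discrete second derivative (assumed form)."""
    curvature = d_left - 2 * d_mid + d_right
    return min(abs(curvature), truncation)

fronto_parallel = (2, 2, 2)   # constant depth
slanted_plane = (1, 2, 3)     # planar, but not fronto-parallel
bump = (1, 3, 1)              # not planar

costs = {
    "fronto_parallel": (first_order_cost(*fronto_parallel), second_order_cost(*fronto_parallel)),
    "slanted_plane": (first_order_cost(*slanted_plane), second_order_cost(*slanted_plane)),
    "bump": (first_order_cost(*bump), second_order_cost(*bump)),
}
```

The slanted plane is penalized by the pairwise prior but costs nothing under the second-derivative prior, which is the fronto-parallel bias described above.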
In general, natural models for penalizing the kth derivative require cliques of size k + 1. Other higher-order derivative priors include the second-derivative based model of [62] for non-rigid image registration, a 3rd derivative regularizer for 1D signal processing [55], and many curvature-based models, including for binary segmentation [72], 2D and 3D surfaces [88] and segmentation and shape inpainting [85].
3.2 Inference Algorithms for Binary MRFs
Among computer vision inference methods, many of the most successful algorithms have been graph cuts methods. Large comparisons of inference algorithms across diverse datasets, including the Middlebury dataset [90], and OpenGM2 dataset [45], have found graph-cuts to be highly efﬁcient for many problems in computer vision. In particular, the space of binary MRFs is particularly dominated by Graph Cut algorithms.
3.2.1 First-Order Submodular MRFs
For binary submodular problems (as we saw in Section 2.10), the max-flow/min-cut algorithm finds the exact global optimum in polynomial time. This efficient optimization was first used for vision problems by [11, 12], with several applications including image restoration and stereo problems. The connection between graph cuts and submodularity was explored in [52], where binary first-order polynomials with non-positive quadratic terms, ai,j ≤ 0, were termed regular.
A major reason for the empirical success of graph cuts is the set of specialized max-flow algorithms designed to have good performance on inputs typical of computer vision problems. For general graphs, flow algorithms based on Push-Relabel [14] tend to have the best performance. However, computer vision graphs are generally constructed from priors which enforce locality between neighboring groups of pixels, and are typically on grids with edges only in a local neighborhood. Thus long augmenting paths tend to be rare. The Boykov-Kolmogorov flow algorithm [10] exploits this by maintaining two search trees in the residual graph, which grow towards each other from the source and sink. With short augmenting paths, these search trees can do incremental path finding from source to sink with minimal additional overhead. A small change to the Boykov-Kolmogorov algorithm, called Incremental Breadth-First Search [30], results in theoretically guaranteed runtime for the algorithm, as well as slightly improved (and currently state-of-the-art) empirical performance.
3.2.2 First-Order Nonsubmodular MRFs
For non-submodular first-order MRFs, graph cuts no longer find the optimal solution. In fact, the problem is NP-complete (see [52] and Section 2.2.5). Despite the difficulty of finding optimal solutions, there are useful approximate inference algorithms. The most widespread of these is the Roof-Dual construction of [33], which maximizes a linear lower-bound to the discrete minimization problem. The Roof-Dual method was made practical for computer vision problems by [51], which used the graph construction of [8] in an algorithm called QPBO (for Quadratic Pseudo-Boolean Optimization). Like Graph Cuts for regular functions, QPBO uses the max-flow/min-cut algorithm. However, the underlying graph is twice as large.
The most important feature of QPBO is that it does not always return an optimal labeling, but instead gives a partial labeling of the variables, meaning that some variables are labeled 0 or 1, while others are unlabeled (which we will denote as “?”).
The Roof-Dual construction ensures that this partial labeling satisﬁes a property called persistency [33]. Persistency describes what happens when we fuse a partial labeling with another labeling. Let x ∈ {0, 1, ?}V be a partial labeling, and y ∈ {0, 1}V a complete labeling. We can stamp x onto y by replacing yi with xi for all the variables i where xi is 0 or 1. That is, the fused labeling z has zi = xi whenever xi = 0, 1 and zi = yi otherwise. Then, a partial labeling x ∈ {0, 1, ?}V is persistent if for any complete labeling y, the fused result z has weakly decreasing cost: f (z) ≤ f (y).
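The fusion operation and the definition of persistency can be written down directly. The sketch below checks persistency by brute force on a small, made-up, non-submodular energy; it illustrates the definition only and is not the QPBO algorithm, which finds such partial labelings with max-flow. `None` stands in for the unlabeled value “?”.

```python
from itertools import product

def fuse(x, y):
    """Stamp the partial labeling x (entries 0, 1 or None for '?') onto y."""
    return tuple(yi if xi is None else xi for xi, yi in zip(x, y))

def is_persistent(x, f, n):
    """Brute-force check that f(fuse(x, y)) <= f(y) for every complete labeling y."""
    return all(f(fuse(x, y)) <= f(y) for y in product((0, 1), repeat=n))

def energy(y):
    """Hypothetical energy on 3 binary variables; the last term is non-submodular."""
    return 5 * y[0] + 2 * (y[0] != y[1]) - (y[1] != y[2])

label_first_as_0 = is_persistent((0, None, None), energy, 3)
label_first_as_1 = is_persistent((1, None, None), energy, 3)
```

In this example labeling the first variable 0 never increases the energy, so that partial labeling is persistent, while labeling it 1 is not.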
Because the partial labeling from QPBO is persistent, we can derive several useful facts: First, there must be some optimal labeling y∗ which includes the labeled variables of x. Indeed, let y∗ be any optimal labeling of f , and let z be the fusion of y∗ with x. Since f (z) ≤ f (y∗), we know that z is also optimal, and it includes all the labeled variables from x. Second, given our partial labeling, we can simplify our problem — since we know that there is an optimal solution containing the labeled variables from x, we can now set these variables to be constant in f . This can eliminate unary and pairwise terms from f , and may separate the problem into disjoint connected components. More persistencies can be found recursively to get even more labeled variables, for example in QPBO with Probing [51] or Partial Optimality by Pruning [89].
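The stamping operation and the persistency condition can be illustrated with a small brute-force sketch in Python. The names stamp and is_persistent and the toy energy f are hypothetical, chosen only for illustration; the exhaustive check is exponential in |V| and is not how QPBO establishes persistency.

```python
from itertools import product

def stamp(x_partial, y):
    """Fuse a partial labeling (entries 0, 1, or None for '?') onto a complete labeling y."""
    return tuple(yi if xi is None else xi for xi, yi in zip(x_partial, y))

def is_persistent(f, x_partial):
    """Check f(z) <= f(y) for every complete labeling y, where z is x stamped onto y."""
    n = len(x_partial)
    return all(f(stamp(x_partial, y)) <= f(y) for y in product((0, 1), repeat=n))

def f(z):
    """Hypothetical toy energy on three binary variables (illustrative only)."""
    x1, x2, x3 = z
    return 2 * x1 - 3 * x2 + x3 + 2 * x1 * x2 - x2 * x3
```

On this toy energy the partial labeling (0, 1, ?) passes the check, while (1, ?, ?) does not.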
Aside from QPBO, there are other approximate inference methods, which tend to be fast in practice, and which produce full (not partial) solutions, although without as many theoretical guarantees regarding the solution. A popular approach is to modify the energy function to remove the non-submodular edges. The simplest option, used first in [80], is to truncate positive quadratic terms to 0. This results in a submodular function which can be exactly optimized by graph cuts, and if there are not too many non-submodular edges, then the optimal solution to the truncated energy may be close to optimal for the original function (however, there is no guarantee that it will be close).
A more sophisticated choice is to use an actual upper bound to the original function, as in the LSA-AUX method [31]. This method iteratively finds a series of upper bounds g_t, each of which has g_t(x_t) = f(x_t) at the current point x_t. Each move solves a pairwise submodular minimization problem x_{t+1} = argmin_x g_t(x) with graph cuts, giving a fast local-search method.
A ﬁnal approach modiﬁes the topology of the graph, instead of modifying the costs of each edge. If the graph is a tree (or, more generally, has a property called low treewidth), then we can ﬁnd the exact optimal solution in linear time using dynamic programming. The method of [21] removes edges to get a low treewidth subgraph, while hopefully removing as few edges as possible.
3.2.3 Higher-Order Reductions
A popular strategy for using Graph Cuts with higher-order priors is to reduce the problem to a first-order one. A reduction to first order takes the original function f(x), and via some transformations, including adding new variables y_i and edges, produces a function g(x, y) which satisfies f(x) = min_y g(x, y). That is, if we minimize g over the auxiliary variables y, then we get back our original function f. If we are trying to solve the minimization problem min_x f(x), then solving the joint minimization problem min_{x,y} g(x, y) will give the same answer, and the same minimizer for x.
For higher-order clique functions with particular forms, there are specialized reductions, which take advantage of structure in the cost function to make reducing higher-order to ﬁrst-order relatively straightforward. These include concave cost functions [48], label consistency priors [49], cost functions which are sparse (only a small number of nonzero values) [78], and curvature regularization (including the stereo example of [102] discussed in Section 1.5.3).
For general higher-order functions, for a long time the only option was the method of reduction by substitution [76]. This method does not perform well in practice, as the resulting ﬁrst-order MRF is particularly difﬁcult to optimize. More recent reductions could handle multilinear polynomials with all negative coefﬁcients [52, 24]. The ﬁrst practical method for handling general higher-order functions is the Higher Order Clique Reduction (HOCR) algorithm of [37, 40]. This method produces a ﬁrst order binary MRF for which QPBO can label large numbers of variables on typical vision inputs.
A different approach to finding higher-order reductions is Generalized Roof Duality (GRD) [43, 44], which proposed a class of submodular relaxations for an arbitrary higher-order MRF with clique size at most 4. GRD finds the best such relaxation by solving a linear program which searches over all possibilities from a characterization of quadratizable submodular functions from [104]. It then optimizes the relaxed function exactly using graph cuts. The relaxations found by GRD provide very good approximate solutions to the original higher-order problem. However, in addition to the restriction on the maximum clique size (which appears difficult to overcome), GRD is also computationally much more intensive than HOCR.
Other general reduction strategies include the hypergraph-based reduction method of Chapter 4, and methods which choose between reduction strategies based on a separate inference problem [27].
3.2.4 Higher-order Submodular Functions
An alternative to reducing higher-order functions to ﬁrst-order is to instead try to generalize ﬂow algorithms to work directly on the higher-order functions.
We have already seen that the min-cut problem is only tractable when the associated cost function is submodular. The same holds true for higher-order functions. In fact, all submodular functions are exactly minimizable in polynomial time, with a current best asymptotic complexity of O(n^6) [73]. Unfortunately, O(n^6) is not practical for vision-sized inputs, so methods which are intermediate between min-cut and general submodular function minimization are necessary.
These methods are based on Submodular Flow, which has been studied for some time in the combinatorial optimization literature [15, 26]. Submodular flow was adapted for higher-order binary MRFs by [53], which proposed an algorithm called Sum-of-Submodular (SoS) flow. This algorithm works on binary MRFs where each clique function is submodular (and hence the entire function is a “sum of submodular” functions). Recall that the sum of any number of submodular functions is still submodular, so general submodular minimization such as [73] could be used. However, by exploiting the clique structure of MRFs, with cliques of size k ≪ n, SoS flow is able to achieve similar runtime to existing standard flow algorithms.
Intuitively, the difference between max ﬂow and sum-of-submodular ﬂow is that in addition to capacity and conservation constraints, we also have constraints that the ﬂow out of any set S is at most gC(S ∩ C).
The ﬁrst practical implementation of SoS function minimization was the Generic Cuts algorithm of [3], which used an augmenting paths algorithm for ﬂow. This was followed by the SoS IBFS algorithm of [19] (described in Chapter 5) which generalized the currently state-of-the-art IBFS algorithm for vision problems to work for Sum of Submodular functions.
Finally, we also have analogues of the cost-modification approaches that we described above for non-submodular first-order problems. For higher-order functions, Auxiliary Cuts [6] uses convexity and other properties of image functionals to compute upper bounds which are pairwise submodular functions. These upper bounds are iteratively minimized and updated similarly to LSA-AUX [31].
The Pseudo-Bound method [92] extends this idea by considering a parameterized family of functions which include a pairwise submodular upper bound, and ﬁnding all minimizers of the entire family using parametric max-ﬂow — by looking at all minimizers, a greater decrease in energy per iteration is obtained. Note that for both these methods, the upper bounds are all pairwise functions, even if the functions they approximate are higher-order.
A natively higher-order approach is to look for a Sum-of-Submodular upper bound to a non-submodular higher-order function. This approach will be explored in Chapter 6.
3.3 Primal Algorithms
The next three sections will cover methods for optimizing multilabel MRFs. A useful categorization of these algorithms sorts them by how they utilize the Linear Programming relaxation (using the Local Marginal Polytope) and its dual. In this section, we’ll cover algorithms that deal only with the original problem, and are therefore primal algorithms. More specifically, since they generally are only concerned with actual labelings and not the LP relaxation, they are primal-integral algorithms.
Most primal algorithms for optimizing MRFs have been some variant on local search. The basic recipe of these algorithms is to maintain some current state x_t, find the best solution x′ within some local neighborhood of x_t, and update x_{t+1} = x′.
The earliest algorithms for optimizing MRFs are essentially classic local search algorithms. Iterated Conditional Modes (ICM) [7] updates each variable x_i in turn, picking the best label given that all other labels remain constant, x_i = argmin_{x_i} f(x_i, x_{V−i}). This is a simple local search with neighborhood size |X_i| at each iteration. However, it is prone to get stuck in poor local optima, since variables cannot vary simultaneously. More recent versions of ICM include Block ICM [46], in which blocks of variables are updated simultaneously. In grid-structured MRFs (when the edges of the MRF form a grid, such as the pixels of an image) dynamic programming can be used to efficiently optimize over very large blocks in linear time [13].
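The coordinate-wise update of ICM can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the function icm, the chain energy, and the data vector are hypothetical and not taken from [7]; the energy is re-evaluated in full at every step, which a practical implementation would avoid.

```python
def icm(energy, labels, x, max_sweeps=100):
    """Iterated Conditional Modes: update one variable at a time until no label changes."""
    x = list(x)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(x)):
            current = x[i]
            best_label, best_energy = current, energy(x)
            for a in labels:
                x[i] = a  # tentatively relabel variable i, all others fixed
                e = energy(x)
                if e < best_energy:
                    best_label, best_energy = a, e
            x[i] = best_label
            changed = changed or best_label != current
        if not changed:
            break
    return x

# Hypothetical toy energy (illustrative only): squared data term plus a
# Potts smoothness term on a 5-variable chain.
data = [0, 0, 2, 2, 0]

def energy(x):
    unary = sum((xi - di) ** 2 for xi, di in zip(x, data))
    pairwise = sum(1 for a, b in zip(x, x[1:]) if a != b)
    return unary + pairwise
```

Started from the all-ones labeling with labels {0, 1, 2}, this sketch stops immediately at energy 5, although the labeling [0, 0, 2, 2, 0] has energy 2, which illustrates the poor local optima mentioned above.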
Simulated Annealing [28] searches in a small local neighborhood similar to ICM. However, instead of choosing the best label for x_i among all labels in X_i, a random label is chosen, weighted towards choosing lower energy labels more frequently than higher energy labels. By sometimes choosing suboptimal labels, a larger space of solutions can be explored. In fact, by decreasing the temperature (the rate at which suboptimal solutions are chosen) slowly enough, simulated annealing will find the true global minimum; however, convergence may take longer than a brute-force search of the entire label space X^V.
For multilabel problems, graph-cuts methods are also local search algorithms, although over a much larger neighborhood. The most widely used graph cut techniques, including α-expansion [12] and its generalization, fusion moves [66], repeatedly solve a ﬁrst-order binary MRF in order to minimize the original multilabel energy function. These move-making graph-cut algorithms maintain a current solution x ∈ XV, and at each iteration propose a new solution y ∈ XV. We get a binary problem by allowing each variable i to either keep its current label xi, or switch to the new label yi.
Both alpha-expansion and fusion moves are local search algorithms; however, the search neighborhood is much larger than usually encountered: because each variable has an independent binary choice, alpha-expansion can pick the best move among 2^{|V|} possible neighbors of x.
More precisely, if we let S be the set of variables which switch labels, then the new, fused labeling is denoted x[S ← y], which has label y_i for i ∈ S and x_i for i ∉ S. Then, from the original MRF energy f we get a binary function g(S) = f(x[S ← y]). This function g has the same clique structure as f, so we can optimize it using the techniques used for pseudoboolean functions described above. Once we have found the optimal S∗ for g (or some approximation of it, in case g is not submodular) we can update x_{t+1} = x_t[S∗ ← y_t], and iterate.
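The binary subproblem g(S) = f(x[S ← y]) can be made concrete with a short Python sketch. Here exhaustive enumeration over subsets S stands in for the min-cut or QPBO solve, so it is only usable on tiny instances; the names fuse and best_move_brute_force and the toy chain energy are hypothetical.

```python
def fuse(x, y, S):
    """Fused labeling x[S <- y]: label y_i for i in S, label x_i otherwise."""
    return [y[i] if i in S else x[i] for i in range(len(x))]

def best_move_brute_force(f, x, y):
    """Minimize g(S) = f(x[S <- y]) over all subsets S by enumeration."""
    n = len(x)
    best_S, best_val = frozenset(), f(x)
    for mask in range(1 << n):
        S = frozenset(i for i in range(n) if (mask >> i) & 1)
        val = f(fuse(x, y, S))
        if val < best_val:
            best_S, best_val = S, val
    return best_S, best_val

# Hypothetical toy energy (illustrative only): squared data term plus a
# Potts smoothness term on a 5-variable chain.
data = [0, 0, 2, 2, 0]

def energy(x):
    unary = sum((xi - di) ** 2 for xi, di in zip(x, data))
    pairwise = sum(1 for a, b in zip(x, x[1:]) if a != b)
    return unary + pairwise
```

With the constant proposal y = [0, 0, 0, 0, 0] (an expansion on label 0) and x the all-ones labeling, the best move lowers the toy energy from 5 to 4, since several variables may switch simultaneously.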

Alpha-expansion [11, 12] is a specialization of the above (historically, the first to be considered) in which the proposal y is a constant label, y_i = α for all i ∈ V. Alpha-expansion continues to be the most widely used graph-cuts algorithm for several reasons. It is simple to implement, as it doesn’t require any special domain knowledge to pick the proposals y. Furthermore, assuming that the clique functions f_{i,j} are metric, meaning they satisfy the following properties

\[
\begin{aligned}
f_{i,j}(a, a) &= 0 && \forall a \in X \\
f_{i,j}(a, b) &> 0 && \forall a \neq b \\
f_{i,j}(a, b) &= f_{i,j}(b, a) && \forall a, b \in X \\
f_{i,j}(a, c) &\leq f_{i,j}(a, b) + f_{i,j}(b, c) && \forall a, b, c \in X
\end{aligned}
\tag{3.2}
\]

then the binary subproblem g will always be submodular, and hence we can exactly find the optimal S∗ using min-cut. Finally, α-expansion can have provable approximation bounds (in the sense of Definition 28, Section 2.2.5): when the f_{i,j} are all Potts terms, with f_{i,j}(x_i, x_j) = λ_{i,j} whenever x_i ≠ x_j, and 0 otherwise, then alpha-expansion is always a 2-approximation [12]. When the f_{i,j} are all the same, and form a metric, then the work of [47] showed that the approximation ratio is 2 f_max / f_min, where f_max is the maximum value of f_{i,j} and f_min is the minimum nonzero value of f_{i,j}.
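The metric conditions (identity, positivity, symmetry, and the triangle inequality f(a, c) ≤ f(a, b) + f(b, c)) are easy to test by enumeration over a small label set. The following Python sketch does so; is_metric and the three example pairwise costs are hypothetical names introduced for illustration.

```python
from itertools import product

def is_metric(f, labels):
    """Check identity, positivity, symmetry and the triangle inequality on a finite label set."""
    for a, b in product(labels, repeat=2):
        if a == b and f(a, b) != 0:
            return False
        if a != b and f(a, b) <= 0:
            return False
        if f(a, b) != f(b, a):
            return False
    return all(f(a, c) <= f(a, b) + f(b, c)
               for a, b, c in product(labels, repeat=3))

# Hypothetical example pairwise costs (illustrative only).
def potts(a, b):
    return 0 if a == b else 1

def truncated_linear(a, b):
    return min(abs(a - b), 2)

def quadratic(a, b):
    return (a - b) ** 2
```

On labels {0, 1, 2, 3}, the Potts and truncated linear costs pass the check, while the quadratic cost fails the triangle inequality (for example, 4 = f(0, 2) > f(0, 1) + f(1, 2) = 2).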

For Fusion Moves [66], the binary subproblems may be non-submodular, in which case only approximate solutions to g can be found. However, proposals can be specifically chosen to better explore the label space X^V. The original paper [66] proposed a number of variants, including Jump Moves, which search through label spaces for problems like Optical Flow, by allowing the current label to move in horizontal or vertical translations from the current solution x. The work of [38] proposed Gradient Descent Fusion Moves, for energy functions which are differentiable (in particular, for Fields of Experts): the proposed move y is along the direction of the energy gradient ∇f(x), which quickly moves x towards lower energy solutions.
Finally, for special cases of energy functions, there are globally optimal solutions. In the case where the label set can be given an order X_i = {a_1 < a_2 < · · · < a_ℓ}, and the pairwise functions f_{i,j} are all convex according to this order, then there is a graph construction due to Ishikawa [39] which finds the globally optimal labeling in a single min-cut solve. This condition has been generalized to a multi-label submodularity condition by [84], which similarly has a globally optimal solution.
3.4 Dual Algorithms
The class of dual algorithms all make use of the Local Marginal Polytope, and in particular, of the dual program described in Section 2.9. The Local Marginal Polytope was originally introduced by [86], and extended to the higher-order case by [99]. Recall that the dual program is a maximization over dual variables λ_{C,i}(x_i), and that these dual variables can be interpreted as a reparameterization, in which some of the energy of the clique functions is partitioned among the unary terms.
Another way of interpreting the dual variables is as messages between variables. In particular, in the pairwise case, the dual variables λ_{C,i} for an edge C = {i, j} become a message from variable j to i, where λ_{i,j}(x_i) communicates some function of node j’s belief that the true state for i is x_i. For MRFs whose edges form a tree, this intuition can be made exact, with the Max-Product Belief Propagation algorithm [74]. For tree-structured MRFs, Max-Product BP can find the exact global optimum in linear time. Early message passing algorithms were developed without knowledge of the Local Marginal Polytope; however, global optimality can be proved by noting that Max-Product BP on a tree is actually a form of Dynamic Programming.
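The dynamic-programming view can be sketched for the simplest tree, a chain. The Python code below works in the energy (min-sum) domain, which corresponds to max-product on the exponentiated negative energies; chain_map, the toy unary costs, and the Potts pairwise cost are hypothetical and only meant as an illustration. The running time is linear in the number of variables (and quadratic in the number of labels).

```python
def chain_map(unary, pairwise):
    """Exact minimum-energy labeling of a chain MRF by forward messages and backtracking.

    unary[i][a] is the cost of label a at variable i; pairwise(a, b) is the
    cost of labels (a, b) on consecutive variables.
    """
    n, num_labels = len(unary), len(unary[0])
    msg = list(unary[0])  # msg[a]: best energy of the prefix ending in label a
    back = []
    for i in range(1, n):
        new_msg, ptr = [], []
        for b in range(num_labels):
            a_best = min(range(num_labels), key=lambda a: msg[a] + pairwise(a, b))
            new_msg.append(msg[a_best] + pairwise(a_best, b) + unary[i][b])
            ptr.append(a_best)
        msg = new_msg
        back.append(ptr)
    last = min(range(num_labels), key=lambda b: msg[b])
    labeling = [last]
    for ptr in reversed(back):  # follow back-pointers from the last variable
        labeling.append(ptr[labeling[-1]])
    labeling.reverse()
    return msg[last], labeling

# Hypothetical toy instance (illustrative only).
data = [0, 0, 2, 2, 0]
unary = [[(a - d) ** 2 for a in range(3)] for d in data]

def potts(a, b):
    return 0 if a == b else 1
```

On this toy chain, chain_map(unary, potts) returns energy 2 with labeling [0, 0, 2, 2, 0], which agrees with exhaustive search.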
For MRFs which are not trees (for example, image grids) Max-Product BP can be applied to each variable in an iterative algorithm called Loopy BP [25]. When the MRF graph has loops, LBP may fail to converge, and has been observed to enter cycles, without even reaching a local optimum. However, it has achieved good performance in many practical problems, and was a widely-used alternative to graph-cuts methods.
The first provably convergent message passing algorithm is the Tree-Reweighted Sequential message passing algorithm (TRW-S) of [50]. The analysis of TRW-S interprets the messages as reparameterizations of the energy function f, which provide a global lower bound. This lower bound is maximized, using max-product steps on subtrees of the graph. This lower bound is related to the Local Marginal Polytope, but using a different dual program (obtained by organizing the energy into higher-order terms on the subtrees before taking the dual).
Dual Decomposition [56] splits the objective into a set of overlapping terms, and uses subgradient ascent on the dual to enforce consistency among the labelings of the separate parts of the objective. In general, Dual Decomposition works for any splitting of the objective, so long as the subproblems can be exactly optimized efficiently. However, the subgradient ascent may take many iterations to converge.
Other message passing algorithms with provable convergence explicitly use the Local Marginal Polytope dual, including Max-Sum Diffusion [100] and MPLP [29]. The latter two algorithms can be interpreted as block-coordinate ascent on the dual program. Because the dual is not smooth (it is piecewise linear) block coordinate ascent will converge to a solution, but it may not be dual-optimal. Methods to smooth the dual objective can converge to the optimal dual solution: these include Accelerated Dual Decomposition [42] and Adaptive Diminishing Smoothing [81].
All of the above methods are for first-order MRFs; however, the same basic ideas translate to higher-order MRFs as well. A dual decomposition approach based on higher-order pattern-based priors, using Dynamic Programming as the optimizer for the subproblems, was proposed in [55]. Max-sum diffusion was applied to the higher-order case in [101], and TRW-S was similarly generalized to higher-order MRFs in [54]. These latter methods are based on the corresponding algorithms for first-order, using the higher-order Local Marginal Polytope proposed in [99].
3.5 Primal-Dual Algorithms
Finally, Primal-Dual methods use both the primal and dual programs simultaneously. The connection between graph cuts and primal-dual techniques was established by [57] which showed that α-expansion could be interpreted as simultaneously optimizing primal and dual solutions.
The general recipe of primal-dual algorithms is that they iteratively update a primal integer solution (i.e., a labeling x) similarly to alpha-expansion. However, they also maintain a dual solution λ, which guides the search during the binary min-cut solve. Furthermore, the primal and dual solutions are simultaneously updated, so as to satisfy invariants related to complementary slackness, which results in the ﬁnal solution having provable approximation bounds. These technical details will be expanded on in Chapter 7.
The primal-dual algorithm of [57] overcomes the most important limitation of the α-expansion algorithm, which is the requirement that the pairwise energy must be a metric [12]. These methods also extend the approximation bounds for alpha-expansion with metric energies from [47]. The same approximation ratio still holds, but over a much broader class of energy functions.
Empirically, keeping track of the dual variables also allows a number of implementation speedups compared to α-expansion, resulting in the very efﬁcient algorithm FastPD [59], which can be 3-9 times faster than alpha-expansion in practice.
For higher-order MRFs, the first primal-dual algorithm is the Sum-of-Submodular Primal Dual algorithm, SoSPD, which is covered in Chapter 7.
CHAPTER 4 HIGHER ORDER REDUCTIONS
We have already seen that first-order MRFs have well-understood and effective optimization algorithms, compared to higher-order MRFs. One way of bridging this gap is to find a way to transform a higher-order MRF into an equivalent first-order one. As described in Section 3.2.3, reduction methods can transform a binary MRF f(x) to a quadratic function g(x, y), such that f(x) = min_y g(x, y). Then, we can apply existing algorithms to the reduced first-order form g to get a solution to the higher-order original problem.
In this chapter, we will focus on binary MRFs, since the reduction methods discussed herein all work on binary problems only. Multi-label problems can be handled by repeated application of solving binary subproblems, as in (for example) alpha-expansion or fusion moves.
The main result of this chapter is a reduction method which exploits the hypergraph structure of the cliques C to transform a group of terms at once. For n binary variables, each of which appears in terms with k other variables, at worst we produce n non-submodular terms, while [37, 40] produces O(nk). We identify a property (called local completeness) under which our method performs even better, and show that under certain assumptions several important vision problems (including common variants of fusion moves) have this property. We show experimentally that our method produces smaller weight of non-submodular edges, and that this metric is directly related to the effectiveness of QPBO [51]. Running on the same field of experts dataset used in [37, 40] we optimally label significantly more variables (96% versus 80%) and converge more rapidly to a lower energy. Preliminary experiments suggest that some other higher-order MRFs used in stereo [102] and segmentation [1] are also locally complete and would thus benefit from our work.
4.1 Introduction
While graph-cuts are a popular method for solving ﬁrst-order MRFs, such as the benchmarks described in [90] and [45], they are much more difﬁcult to apply to higher-order MRFs. As a result, until recently this powerful optimization method has only been used for a few specialized higher-order MRFs, such as [48, 102].
The first general-purpose practical graph-cuts method for higher-order MRFs is that of Ishikawa [37, 40]. This method works by transforming the higher-order input MRF into an equivalent quadratic (pairwise) MRF by adding additional auxiliary variables and edges. The general class of such methods is known as higher-order reductions; this particular reduction is commonly referred to as Higher-Order Clique Reduction (HOCR). Since the resulting first-order MRF is non-submodular, it is optimized using QPBO, which produces a partial labeling (see Section 3.2.2). The quality of this partial labeling (i.e., the number of labeled pixels) is highly sensitive to the energy function.
A more theoretically-motivated approach to finding higher-order reductions is Generalized Roof Duality (GRD) [43, 44], which proposed a class of submodular relaxations for an arbitrary higher-order MRF with degree at most 4. GRD finds the best such relaxation by solving a linear program, and optimizes the relaxed function exactly using graph cuts. The relaxations found by GRD provide very good approximate solutions to the original higher-order problem. Beyond the restriction on the MRF’s degree, which appears difficult to overcome, GRD is also computationally much more intensive than HOCR.
In this chapter we propose an alternative construction to HOCR and GRD, with improved theoretical and experimental performance. Instead of considering terms in the energy function one at a time, we make use of the fact that the clique structure of an MRF is a hypergraph, in order to reduce many terms at once. We review existing reduction methods for solving higher-order MRFs with graph cuts in section 4.2. We present our new algorithm in section 4.3, and analyze its worst case performance in section 4.4. In section 4.5 we show that for problems with a property called local completeness our method performs even better. Under certain assumptions we prove that some important vision problems are locally complete, including the fields of experts MRF considered by Ishikawa. Experimental results are given in section 4.7, along with experimental evidence that other vision problems [1, 102] are also locally complete.
4.2 Related work
There are a number of methods for reducing an arbitrary multilinear polynomial over binary variables into a quadratic one. The performance of the different methods is summarized in ﬁgure 4.1.
For all methods, we are interested in the size of the obtained quadratic function, including the number of additional vertices and edges required, as these directly affect the size of the min-cut problem which will be solved by QPBO. We make a particular note of the number of non-submodular edges as well as the weight of these edges, as these can negatively impact the solution returned by QPBO [90], as confirmed in our experiments¹.

Method                      New variables   Non-submodular edges   Submodular edges   Non-submodular weight
Substitution [76]           O(nk)           O(nk)                  O(nk)              O(nkM)
Negative [24]               t               –                      td                 –
HOCR [37]                   O(td)           O(nk)                  O(td^2)            O(d^2 W)
GRD [43] (d ≤ 4)            n + O(t)        –                      O(td)              –
Ours (worst case)           n + O(td)       n                      O(td^2)            O(dW)
Ours (local completeness)   n + O(t)        n                      O(td)              O(dW)

Figure 4.1: Resources required to reduce t terms of degrees up to d, for an energy function with n variables each of which occurs with up to k other variables. W is the total weight of all positive terms in the higher-order function. Unlike the other algorithms listed, GRD is only defined for terms of limited degree. There is no clear notion of non-submodular edges in the relaxation produced by GRD, so we mark these entries “–”. Non-submodular weight is the total weight of non-submodular edges in the reduced function.

4.2.1 Reduction by substitution
The original reduction method was introduced by Rosenberg [76]. The reduction operates on the multilinear polynomial representation of f. The algorithm iteratively eliminates all occurrences of some product x_i x_j by introducing a new variable z, replacing x_i x_j by z everywhere it occurs, and then adding the following penalty terms to the energy function: M x_i x_j − 2M x_i z − 2M x_j z + 3M z, where M is a suitably large constant. This forces z to take the value of x_i x_j in any optimal solution.
¹ Note that the total weight of non-submodular edges is not a perfect measure of the performance of QPBO. For example, many functions can have non-submodular edges, but after permuting some labels become submodular [84], and these functions can be exactly minimized by QPBO. Nevertheless, our experiments show this is a useful heuristic.
If each variable is in terms with at most k other variables, this reduction can be done with O(nk) pairs, which results in O(nk) new variables, O(nk) non-submodular terms and O(nk) submodular quadratic terms.
Note that the non-submodular terms have large coefﬁcients. Experimentally it has been reported that QPBO performs very poorly on such energy functions (see, for example, [40, §8.3.4], which states that QPBO ﬁnds almost no persistencies).
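The behavior of the substitution penalty can be checked by enumeration. The Python sketch below verifies that M x_i x_j − 2M x_i z − 2M x_j z + 3M z vanishes exactly when z = x_i x_j, and applies one substitution step to a single cubic term; the function names and the choice M = 2 are assumptions made for illustration.

```python
from itertools import product

def rosenberg_penalty(xi, xj, z, M):
    """Penalty M*xi*xj - 2M*xi*z - 2M*xj*z + 3M*z used by reduction by substitution."""
    return M * xi * xj - 2 * M * xi * z - 2 * M * xj * z + 3 * M * z

def cubic(x1, x2, x3):
    """Hypothetical example term x1*x2*x3 (illustrative only)."""
    return x1 * x2 * x3

def cubic_reduced(x1, x2, x3, M=2):
    """Replace x1*x2 by z, add the penalty, and minimize over the new variable z."""
    return min(z * x3 + rosenberg_penalty(x1, x2, z, M) for z in (0, 1))
```

For every binary assignment the penalty is 0 when z = x_i x_j and at least M otherwise, so minimizing over z recovers the cubic term once M is at least the weight of the replaced term.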

4.2.2 Reducing negative-coefﬁcient terms

Kolmogorov and Zabih [52] for d = 3 and Freedman and Drineas [24] for d ≥ 3 suggested the following transformation for negative higher degree terms:

\[
-x_1 \cdots x_d = \min_{y \in \mathbb{B}} \; y \Big( (d - 1) - \sum_{j=1}^{d} x_j \Big) \tag{4.1}
\]

If we have t negative-coefficient terms of degree d, this gives t new variables and td submodular quadratic terms, but no non-submodular terms.

Let us note that the above equality remains valid even if we replace some of the x_j variables with their complements x̄_j = (1 − x_j). In [78] this was used to obtain a transformation for sparse functions, i.e., those for which only a few labelings x_C have f_C(x_C) ≠ 0 (see type-II transformations in [78]).
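The identity for negative terms, −x_1 · · · x_d = min_y y((d − 1) − Σ_j x_j), can be confirmed by enumerating all binary assignments for small d. The following Python sketch does this; the function names are hypothetical.

```python
from itertools import product
from math import prod

def negative_term(x):
    """The negative monomial -x_1 * ... * x_d."""
    return -prod(x)

def negative_term_reduced(x):
    """Right-hand side of the reduction: min over y in {0, 1} of y * ((d - 1) - sum_j x_j)."""
    d = len(x)
    return min(y * ((d - 1) - sum(x)) for y in (0, 1))
```

When all x_j equal 1 the coefficient of y is −1, so y = 1 gives −1; otherwise the coefficient is nonnegative and y = 0 gives 0.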

4.2.3 Reducing positive-coefﬁcient terms

The HOCR transformation [37, 40] was the first practical method for general higher-order functions. For a term of degree d, let n_d = ⌊(d − 1)/2⌋ and set c_{i,d} = 1 if d is odd and i = n_d, and c_{i,d} = 2 otherwise. Each positive term is reduced by

\[
x_1 \cdots x_d = \min_{u_1, \dots, u_{n_d}} \sum_{i=1}^{n_d} u_i \Big( c_{i,d} \Big( -\sum_{j=1}^{d} x_j + 2i \Big) - 1 \Big) + \sum_{i<j} x_i x_j
\]

For each term of degree d, we get O(d) new variables. Each new variable is connected to each original variable by a submodular edge, for a total of O(d^2) submodular edges. We also get non-submodular edges between all pairs of original variables x_i, x_j whenever x_i and x_j are in the same clique. If each variable occurs in terms with at most k other variables, then the number of non-submodular edges is O(nk) (note that if the pair x_i, x_j occurs in multiple cliques, we only count the pair once, as they can be combined to a single edge in the flow network). Finally, if the positive term has weight α > 0 then this term creates d(d − 1)/2 non-submodular edges of weight α. So, if the total weight of positive terms is W, then the quadratic function has non-submodular edges of total weight O(d^2 W).

Note that this reduction uses a large number of non-submodular edges: the d original variables are fully connected by positive weight edges. This is problematic, as it has been observed [90] that non-submodular edges can result in poor performance for graph cut optimizers like QPBO.
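The positive-term reduction can be checked numerically for small degrees. The Python sketch below assumes n_d = ⌊(d − 1)/2⌋ auxiliary variables u_1, ..., u_{n_d}, with c_{i,d} = 1 when d is odd and i = n_d, and c_{i,d} = 2 otherwise; the function names are hypothetical, and the minimization over u is done by enumeration rather than by graph cuts.

```python
from itertools import product
from math import prod

def hocr_quadratic(x, u):
    """Quadratic function of (x, u) whose minimum over u should equal x_1 * ... * x_d."""
    d = len(x)
    n_d = (d - 1) // 2
    s1 = sum(x)
    s2 = sum(x[i] * x[j] for i in range(d) for j in range(i + 1, d))
    total = s2  # the fully connected positive (non-submodular) pairwise part
    for i in range(1, n_d + 1):
        c = 1 if (d % 2 == 1 and i == n_d) else 2
        total += u[i - 1] * (c * (-s1 + 2 * i) - 1)
    return total

def hocr_reduced(x):
    """Minimize the quadratic function over all auxiliary assignments u."""
    n_d = (len(x) - 1) // 2
    return min(hocr_quadratic(x, u) for u in product((0, 1), repeat=n_d))
```

The pairwise sum over all i < j in this sketch is the source of the d(d − 1)/2 positive-weight edges discussed above.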

4.2.4 Generalized Roof Duality
Unlike the above methods, which are all rewrite rules for the individual terms of the multilinear polynomial, Generalized Roof Duality (GRD) [44] ﬁnds a reduction to quadratic form which is globally the best among a large class of candidates, called submodular relaxations.
GRD uses a characterization of all submodular functions expressible in quadratic form due to Živný et al. [104]. This reduction uses one additional variable for each term, as well as O(d) additional edges for a degree d term.
The reduction also ensures the existence of persistencies, similar to the persistencies of Roof Duality [33] (and its implementation, QPBO [51]) whereby after solving the submodular relaxation, each variable is assigned a value in {0, 1, ?}, and every variable taking value 0 or 1 in the partial labeling actually takes that value in the (possibly unknown) global optimum.
To ﬁnd the best submodular relaxation, GRD solves a linear program. Because GRD ﬁnds the tightest submodular relaxation, the returned labeling is typically of very high quality; however, solving the LP is computationally very expensive, making this algorithm impractical for large-sized problems. Instead of directly solving the LP, the authors also give heuristics to ﬁnd nearly optimal submodular relaxations. These heuristic relaxations (denoted GRD-heur) also give very good labelings. However, we will show in our experiments that even these heuristic methods are several times slower than HOCR and our technique.
Finally, it is worth noting that GRD can only be applied when all terms have degree 3 or 4, and it is doubtful that the method can be generalized. The reduction [104] used by GRD to convert the submodular relaxation to quadratic form has only been described for functions of arity 4. Furthermore, writing down an LP for the optimal submodular relaxation requires being able to compactly describe the set of submodular functions with terms of degree d, and this task is NP-hard for d ≥ 4. In contrast, our method and Ishikawa’s have no restriction on the degree of terms involved.
4.3 Reducing groups of higher-order terms

The terms of a multilinear polynomial form a hypergraph ℋ. The vertices are the polynomial’s variables, and there is a hyperedge H = {x_1, . . . , x_d} with weight α_H whenever the polynomial has a term α_H x_1 · · · x_d. In contrast to earlier methods which reduce term-by-term, our new method uses this hypergraph structure to reduce a group of terms all at once.

The two theorems below are concerned with reducing, respectively, all the positive or all the negative terms containing a single variable (or a small set of variables); we will write this common subset of variables as U. The most important special case of our reduction is shown in Figure 4.2, where we consider all positive terms which contain the variable x_1, i.e., U = {x_1}.

Theorem 58. Let ℋ be a set of terms such that each H ∈ ℋ contains U, i.e., U ⊆ H. Furthermore, we require all the hyperedges H have positive weights α_H > 0. Let f(x) = Σ_{H∈ℋ} α_H Π_{j∈H} x_j be this polynomial. Then f(x) is equal to

\[
\min_{y \in \{0,1\}} \Big( \sum_{H \in \mathcal{H}} \alpha_H \Big) y \prod_{j \in U} x_j + \sum_{H \in \mathcal{H}} \alpha_H \bar{y} \prod_{j \in H \setminus U} x_j . \tag{4.2}
\]

Proof. Given any assignment of the variables x_1, . . . , x_n, either (1) all the variables in U are 1, or (2) some variable in U is 0.

Case 1: Substituting 1 for the variables in U, f(x) is equal to Σ_{H∈ℋ} α_H Π_{j∈H\U} x_j and (4.2) is min_y (Σ_{H∈ℋ} α_H) y + Σ_{H∈ℋ} α_H ȳ Π_{j∈H\U} x_j, where ȳ = 1 − y. If we assign y = 1, then (4.2) becomes Σ_{H∈ℋ} α_H, and if we assign y = 0, then it becomes Σ_{H∈ℋ} α_H Π_{j∈H\U} x_j. This quantity is always less than or equal to Σ_{H∈ℋ} α_H, so the minimum is achieved when y = 0, in which case f(x) equals (4.2).

Case 2: The product Π_{j∈U} x_j is 0. Since all the terms of f(x) share the common subset U, f(x) = 0. Similarly, (4.2) is min_y Σ_{H∈ℋ} α_H ȳ Π_{j∈H\U} x_j. If we assign y = 1, then this sum is 0, whereas if we assign y = 0, then it is nonnegative, since each α_H is positive. Thus, the minimum is achieved when y = 1, in which case (4.2) is 0, hence equal to f(x).

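For every positive term containing the common subset $U$, equation (4.2) replaces it with a new term $\alpha_H \bar{y} \prod_{j \in H \setminus U} x_j$. To get a multilinear polynomial, we replace the negated variable $\bar{y}$ with $1 - y$, which splits each term into two:

$$\alpha_H\, \bar{y} \prod_{j \in H \setminus U} x_j = \alpha_H \prod_{j \in H \setminus U} x_j - \alpha_H\, y \prod_{j \in H \setminus U} x_j \tag{4.3}$$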

Corollary 59. When we apply equation (4.2) to a positive term, we obtain a positive term of smaller degree, and a negative term with y replacing the common subset U.
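As a sanity check on Theorem 58, the identity (4.2) can be verified by brute force on a small instance. The sketch below is illustrative only: the function names, the dictionary representation of terms (a frozenset of variable indices mapped to its coefficient) and the example coefficients are assumptions made here, not part of the released implementation.

```python
from itertools import product

def f_original(x, terms):
    """Evaluate sum_H alpha_H * prod_{j in H} x_j at the 0/1 assignment x."""
    return sum(a * all(x[j] for j in H) for H, a in terms.items())

def f_reduced(x, terms, U):
    """Evaluate the right-hand side of (4.2) by minimizing over y in {0, 1}."""
    total = sum(terms.values())
    values = []
    for y in (0, 1):
        first = total * y * all(x[j] for j in U)
        # (1 - y) plays the role of the negated variable y-bar.
        second = sum(a * (1 - y) * all(x[j] for j in H - U)
                     for H, a in terms.items())
        values.append(first + second)
    return min(values)

# Hypothetical example: three positive terms that all contain U = {x_0}.
U = frozenset({0})
terms = {
    frozenset({0, 1, 2}): 3,
    frozenset({0, 2, 3}): 5,
    frozenset({0, 1, 3}): 2,
}

for x in product((0, 1), repeat=4):
    assert f_original(x, terms) == f_reduced(x, terms, U)
```

The loop enumerates all 16 assignments, so it exercises both cases of the proof: assignments with x_0 = 1 (minimum at y = 0) and those with x_0 = 0 (minimum at y = 1).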

For reducing the negative-coefﬁcient terms all sharing some common subset, we have a similar theorem.

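Theorem 60. Consider $\mathcal{H}$ and $U$ as above, where now the coefficients $\alpha_H$ are negative for all $H$. Let $g$ be the corresponding polynomial. Then for any assignment of the variables, $g(x)$ is

$$\min_{y \in \{0,1\}} \sum_{H \in \mathcal{H}} -\alpha_H \Bigl(1 - \prod_{j \in U} x_j - \prod_{j \in H \setminus U} x_j\Bigr)\, y \tag{4.4}$$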

Proof. The proof is similar to the proof for Theorem 58. The minimum is achieved when $y = \prod_{j \in U} x_j$.
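The same kind of brute-force check applies to Theorem 60. The following sketch (helper names and example coefficients are hypothetical) confirms on a small instance that the minimum of (4.4) equals g(x), and that the choice of y equal to the product of the variables in U always attains it.

```python
from itertools import product

def prod_bits(x, idx):
    """Product of the 0/1 variables x_j for j in idx (1 for an empty idx)."""
    return int(all(x[j] for j in idx))

def g_original(x, terms):
    return sum(a * prod_bits(x, H) for H, a in terms.items())

def g_reduced_value(x, terms, U, y):
    """Value of the expression inside the minimum in (4.4) for a fixed y."""
    return sum(-a * (1 - prod_bits(x, U) - prod_bits(x, H - U)) * y
               for H, a in terms.items())

# Hypothetical example: negative terms sharing the pair U = {x_0, x_1}.
U = frozenset({0, 1})
terms = {
    frozenset({0, 1, 2}): -3,
    frozenset({0, 1, 3}): -4,
    frozenset({0, 1, 2, 3}): -1,
}

for x in product((0, 1), repeat=4):
    values = [g_reduced_value(x, terms, U, y) for y in (0, 1)]
    assert min(values) == g_original(x, terms)
    # y = prod_{j in U} x_j is always a minimizer (ties are possible).
    assert values[prod_bits(x, U)] == min(values)
```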

A crucial difference between this reduction and Theorem 58 is that in the positive case, we could let the common subset U be a single variable. However, applying Theorem 60 to $U = \{x_1\}$ removes the term $\alpha_H \prod_{j \in H} x_j$ and replaces it with $\alpha_H y \prod_{j \in H \setminus \{1\}} x_j$, another negative term of the same degree. Trying to apply this reduction repeatedly will thus never terminate. However, if U consists of two or more variables, then grouping all terms containing U and reducing results in smaller degree terms replacing every term that we start with.

Figure 4.2: Our main reduction. At left are all the original positive terms containing the common variable x1 (so αi > 0). At right are all the new terms we obtain from equation (4.2). The positive terms on top are just the original terms minus x1, and the negative terms on bottom are the original terms with y replacing x1.

4.3.1 Our method
Equations (4.2) and (4.4) can be used for different reduction strategies. Both depend upon the choice of common variables U. Besides choosing |U|, we can also decide the order to consider different choices of U; for example, which single variable to use to apply equation (4.2), or which pair of variables to use to apply equation (4.4).
We will focus on the simplest case: we let the common part U be a single variable xi, and reduce positive terms containing this variable via equation (4.2).

Negative terms will be reduced using the method of section 4.2.2. Note that more complicated schemes are also possible, such as picking pairs of variables and reducing both positive and negative terms containing this pair via equations (4.2) and (4.4).

Our method reduces a multilinear polynomial with higher-order terms to quadratic form in two steps:

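Step 1. Eliminate all higher-order positive terms by repeated application of Theorem 58, with the common subset U set to a single variable x1. Gather all positive terms containing x1, and replace them with equation (4.2). If $\mathcal{H}$ consists of all positive terms containing $x_1$, then

$$\sum_{H \in \mathcal{H}} \alpha_H \prod_{j \in H} x_j = \min_{y \in \{0,1\}} \Bigl(\sum_{H \in \mathcal{H}} \alpha_H\Bigr) x_1 y \tag{4.5a}$$

$$\qquad + \sum_{H \in \mathcal{H}} \alpha_H \prod_{j \in H \setminus \{1\}} x_j \tag{4.5b}$$

$$\qquad - \sum_{H \in \mathcal{H}} \alpha_H\, y \prod_{j \in H \setminus \{1\}} x_j \tag{4.5c}$$

The positive terms now form a hypergraph on one fewer variable, so repeat with $x_2, \dots, x_n$ until all positive terms are reduced.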

Step 2. All higher-order terms now have negative coefﬁcients. Reduce them term-by-term using the methods in section 4.2.2.

Note that equation (4.5) is simply the special case of equation (4.2) for a single variable. This special case is illustrated in ﬁgure 4.2.
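Step 1 can be sketched in a few lines of Python. This is not the released C++ implementation: the dictionary representation (frozenset of variable indices mapped to coefficient), the function names, and the `min_degree` threshold deciding which positive terms count as higher-order are all assumptions made for illustration. Step 2 (the negative terms) is deliberately left out.

```python
from itertools import product

def evaluate(poly, z):
    """Evaluate a multilinear polynomial at the 0/1 assignment z."""
    return sum(a * all(z[j] for j in H) for H, a in poly.items())

def _add(poly, key, coeff):
    """Add coeff to monomial `key`, merging with an existing term if present."""
    poly[key] = poly.get(key, 0) + coeff
    if poly[key] == 0:
        del poly[key]

def reduce_positive_terms(poly, n, min_degree=3):
    """Apply equation (4.5) once per original variable x_0, ..., x_{n-1}.

    New variables y receive indices n, n+1, ...; returns the new polynomial
    and the total number of variables.
    """
    poly = dict(poly)
    next_var = n
    for v in range(n):
        group = {H: a for H, a in poly.items()
                 if a > 0 and v in H and len(H) >= min_degree}
        if not group:
            continue
        y, next_var = next_var, next_var + 1
        for H, a in group.items():
            del poly[H]
            rest = H - {v}
            _add(poly, frozenset({v, y}), a)   # (4.5a): positive quadratic term
            _add(poly, rest, a)                # (4.5b): original term minus x_v
            _add(poly, rest | {y}, -a)         # (4.5c): y replaces x_v
    return poly, next_var

# Hypothetical input on n = 4 variables.
n = 4
poly = {
    frozenset({0, 1, 2}): 4,
    frozenset({0, 1, 3}): 2,
    frozenset({1, 2, 3}): 3,
    frozenset({0, 3}): -1,
}
reduced, total = reduce_positive_terms(poly, n)

# Minimizing the reduced polynomial over the new variables recovers the original.
for x in product((0, 1), repeat=n):
    best = min(evaluate(reduced, x + ys)
               for ys in product((0, 1), repeat=total - n))
    assert best == evaluate(poly, x)
```

On this input two new variables are introduced (one for x_0, one for x_1), and every remaining positive term has degree at most 2. Note that `_add` merges a new monomial into an existing one when the variable sets coincide, which is the effect analyzed in section 4.5.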


4.4 Worst case performance
The results of applying equation (4.5) consist of three parts: a positive quadratic term (4.5a); and for each term, a positive term on the original variables with x1 removed (4.5b); and a negative term with y replacing x1 (4.5c).
Note that in the course of the reduction, we may create a monomial on some variables x1, . . . , xd and another monomial on the same variables already existed in the input. In this case, we get lucky, since we can just sum the new term’s coefﬁcient with the existing monomial, which doesn’t increase the size of the representation. To analyze the worst-case performance, we will assume that this never happens. In section 4.5 we will revisit this possibility.
Under this assumption, each positive term of degree d that we start with will have a single variable removed every time we apply the reduction. To be fully reduced it must go through d − 1 applications of the rule, producing negative terms of degrees 2, . . . , d. Reducing these d − 1 negative terms by section 4.2.2 results in $O(d)$ new variables and $O(d^2)$ submodular quadratic terms.
Overall, to reduce t positive terms of degree d on n variables, in the worst case our method requires $n + O(td)$ new variables, $O(td^2)$ submodular terms and at most n non-submodular terms. Even in the worst case our algorithm's asymptotic performance is similar to HOCR (see figure 4.1). However, our method produces at most n non-submodular terms, compared to O(nk) for HOCR.
For the weight of non-submodular edges, each positive term contributes $\alpha_H$ to (4.5a) each time it is reduced, for a total of $(d-1)\alpha_H$ weight in non-submodular edges. If the total weight of positive terms is W, we get $O(dW)$ non-submodular weight in the reduced form, a factor d improvement over HOCR.
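To make the counting concrete, here is a worked instance for a single positive term of degree d = 3, following the d − 1 applications described above (the variable names y_1, y_2 are introduced here for illustration):

```latex
\begin{align*}
\alpha x_1 x_2 x_3
  &= \min_{y_1}\; \alpha x_1 y_1 + \alpha x_2 x_3 - \alpha y_1 x_2 x_3
  && \text{(reduce with } x_1\text{)} \\
\alpha x_2 x_3
  &= \min_{y_2}\; \alpha x_2 y_2 + \alpha x_3 - \alpha y_2 x_3
  && \text{(reduce with } x_2\text{)} \\
\alpha x_1 x_2 x_3
  &= \min_{y_1, y_2}\; \alpha x_1 y_1 + \alpha x_2 y_2 + \alpha x_3
     - \alpha y_1 x_2 x_3 - \alpha y_2 x_3 .
\end{align*}
```

The two applications produce negative terms of degrees 3 and 2, and the only non-submodular terms are $\alpha x_1 y_1$ and $\alpha x_2 y_2$, with total weight $2\alpha = (d-1)\alpha$.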
4.5 Local completeness
We can improve on this worst-case analysis for some common vision problems such as [77, 102]. We have identiﬁed a property of certain energy functions that we call local completeness, where our algorithm (unlike HOCR or GRD) has improved asymptotic performance. The basic idea behind local completeness is that whenever a monomial x1 · · · xd occurs in the input, we are also likely to see all the monomials on all subsets of these d variables as well. In essence, local completeness argues that typical inputs to vision problems are “bad” in the sense of having lots of terms. If a certain problem is locally complete, then this is a lower-bound argument to show that the input is necessarily large, and hence methods (such as ours) which exploit the shared structure of the graph will be more successful.
To be precise, consider a multilinear polynomial on the binary variables $x_1, \dots, x_n$, and denote by $\mathcal{H}$ the hypergraph of its monomials, as before. Note that $\mathcal{H}$ is not necessarily in minimal form (i.e., if $H \subseteq H'$ then both $H$ and $H'$ may be hyperedges in $\mathcal{H}$). Let $\mathcal{H}'$ be the "completed" hypergraph, formed by all subsets of edges in $\mathcal{H}$ (that is, $\mathcal{H}' = \bigcup_{H \in \mathcal{H}} 2^H$).

Definition 61. A polynomial is locally complete with completeness c (or has local completeness c) if $|\mathcal{H}| \ge c\,|\mathcal{H}'|$ for some $c \in (0, 1]$.
To explain the terminology, note that the larger hypergraph $\mathcal{H}'$ is obtained by completing our input $\mathcal{H}$, to include all the subsets of every term that we started with.

Every polynomial is locally complete for some completeness c, as we can always choose $c = |\mathcal{H}| / |\mathcal{H}'|$. However, we are interested in classes of problems which remain complete as the problem size grows, so we say that a family of polynomials is locally complete if there is a fixed c such that all the polynomials have local completeness c. For example, a family P of polynomials arising from a particular vision problem would be locally complete if we always had 1/2 of all subsets of terms appearing in all instances of P.
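The largest completeness c of a given polynomial can be computed directly from Definition 61. The sketch below is a minimal illustration with hypothetical function names. One assumption is flagged in the code: by default it ignores the empty set (the constant term) when forming the completed hypergraph, which is a convention chosen here; pass `min_size=0` to follow the definition literally.

```python
from itertools import combinations

def completed_hypergraph(H, min_size=1):
    """All subsets (with at least min_size elements) of every hyperedge in H."""
    completed = set()
    for edge in H:
        for k in range(min_size, len(edge) + 1):
            completed.update(frozenset(s) for s in combinations(sorted(edge), k))
    return completed

def local_completeness(H, min_size=1):
    """Largest c with |H| >= c |H'|, i.e. c = |H| / |H'|.

    Assumption: subsets smaller than min_size are ignored on both sides.
    """
    H = {frozenset(e) for e in H if len(e) >= min_size}
    return len(H) / len(completed_hypergraph(H, min_size))

# Hypothetical monomial supports: 5 of the 7 nonempty subsets of {0, 1, 2}.
monomials = [{0, 1, 2}, {0, 1}, {1, 2}, {0}, {1}]
c = local_completeness(monomials)
```

Here `c` is 5/7. A polynomial consisting of the single monomial on {0, 1, 2} would have completeness 1/7, while one containing all seven nonempty subsets has completeness 1.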

4.5.1 Performance on locally complete problems

Recall the procedure for reducing positive terms, using equation 4.5. We would like the extra positive term we create, with variables H \ {1}, to combine with some existing term. If it happens that H \ {1} is already a term with coefﬁcient βH\{1}, then we add αH to this coefﬁcient, and do not create a new term.

This motivates the deﬁnition of local completeness: the new positive terms in (4.5b) have variables which are subsets of our original terms, so if our energy function has local completeness c, the new positive terms will combine with existing terms a fraction c of the time.

Theorem 62. If an energy function has local completeness c, the procedure of Section 4.3.1 for reducing positive terms will result in at most $\frac{1}{c}|\mathcal{H}|$ negative coefficient terms.

Proof. By the definition of local completeness, $|\mathcal{H}'| \le \frac{1}{c}|\mathcal{H}|$. As a notational convenience, add in all the extra subsets contained in $\mathcal{H}'$ as monomials with coefficient 0, so there are now $|\mathcal{H}'|$ terms. Having done this, since $\mathcal{H}'$ is closed under subsets, the positive terms produced by (4.5b) will always combine with existing terms.

Applying equation (4.5) removes the term $\alpha_H \prod_{j \in H} x_j$, changes the coefficient on the term with variables $H \setminus \{1\}$, and adds a new negative term $-\alpha_H y \prod_{j \in H \setminus \{1\}} x_j$. The total number of terms remains constant.

Therefore, when we have finished reducing all positive terms, we are left with only negative terms, and we have as many as we started with, namely $|\mathcal{H}'| \le \frac{1}{c}|\mathcal{H}|$.

If we started with t terms of up to degree d on n variables, the entire reduction results in at most $n + \frac{1}{c}t$ new variables, $\frac{1}{c}td$ submodular terms and n non-submodular terms. For a family of locally complete inputs, c is constant, giving the asymptotic results in figure 4.1.

Local completeness is stronger than strictly necessary — we only really need that when reducing terms containing x1, all the terms H \ {1} already exist. In this case, the analysis depends on the order in which we pick variables. Local completeness gives the stronger property that no matter what order we choose variables to reduce, we always get a large fraction of terms combining.

Finally, we reiterate that local completeness does not state that such problems are easy to solve, but rather the opposite: such problems (which include many vision problems, as shown below) have intrinsically large representations as multilinear polynomials; but nevertheless, in such cases our reduction uses few additional variables and edges.


4.6 Locally complete energy functions in vision

We can show that under some reasonable assumptions an important class of vision problems will have locally complete energy functions. Speciﬁcally, we consider fusion moves [66] under an FoE prior [77] with random proposals, as used as a benchmark for HOCR in [37, 40].

The original (non-binary) energy function can be written as a sum over cliques C in the image, $\sum_C f_C(x_C)$. A single fusion move has an input image $I$ and a proposed image $I'$, and for every pixel there is a binary variable that encodes whether that pixel takes its intensity from $I$ or $I'$. This results in a binary energy function on these variables to compute the optimal fusion move.

We can better analyze fusion moves by moving to a continuous framework. Embed the original intensities in $\mathbb{R}$, and extend the clique energies $f_C$ to functions on $\mathbb{R}^d$. We need two assumptions: (1) $f_C$ is $d - 1$ times continuously differentiable and (2) each of the $d$ different mixed partials $\frac{\partial^{d-1} f}{\partial x_1 \cdots \widehat{\partial x_i} \cdots \partial x_d}$ (where $\widehat{\partial x_i}$ means to omit the $i$-th partial) take their zeros in a set of measure 0.

Theorem 63. Under these two assumptions, the set of proposed-current image pairs $(I, I')$ for which the fusion move binary energy function does not have local completeness 1 has measure 0 as a subset of $\mathbb{R}^n \times \mathbb{R}^n$.

We defer the proof of this theorem to Appendix A, and provide a proof sketch. We write the fusion move binary energy function in terms of n binary variables bi. Writing this as a multilinear polynomial in b, each clique C can result in terms tS for each subset S of C. We can show that the energy function is locally complete, if the coefﬁcient on tS is almost never (i.e., with probability 0) zero.

For example, here is how to calculate the coefficient on the term $b_1 b_2$ in a clique of size 3. If $I_1, I_2, I_3$ are the labellings in the current image on C, and $I'_1, I'_2, I'_3$ are the proposed labellings, then (taking $b_i = 1$ to mean that pixel $i$ takes its proposed label) the coefficient on $b_1 b_2$ is

$$f_C(I'_1, I'_2, I_3) - f_C(I'_1, I_2, I_3) - f_C(I_1, I'_2, I_3) + f_C(I_1, I_2, I_3) \tag{4.6}$$

Since the labels are in $\mathbb{R}$, the four 3-pixel images mentioned in this coefficient lie on a rectangle in $\mathbb{R}^3$. If we give each of these points $v$ heights of $f_C(v)$, then (4.6) is 0 if and only if the four points are coplanar.
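The coefficient computation generalizes to every subset of a clique by inclusion-exclusion. The sketch below illustrates this under stated assumptions: the function name, the convention that b_i = 1 selects the proposed label, and the example clique energy `f_C` (an arbitrary smooth function, not the FoE prior) are all chosen here for illustration.

```python
from itertools import combinations, product

def multilinear_coefficients(f_C, current, proposed):
    """Coefficients a_S of the fusion-move polynomial sum_S a_S prod_{i in S} b_i.

    Assumed convention: b_i = 1 means pixel i takes its proposed label.
    """
    d = len(current)

    def energy(T):
        # Clique energy when exactly the pixels in T take the proposed label.
        return f_C(tuple(proposed[i] if i in T else current[i] for i in range(d)))

    coeffs = {}
    for k in range(d + 1):
        for S in combinations(range(d), k):
            # Inclusion-exclusion over all subsets T of S.
            coeffs[S] = sum((-1) ** (k - m) * energy(T)
                            for m in range(k + 1)
                            for T in combinations(S, m))
    return coeffs

def f_C(v):
    # Hypothetical smooth clique energy on 3 pixels.
    return (v[0] - 2 * v[1] + v[2]) ** 2 + v[0] * v[1] * v[2]

current = (1, 4, 2)
proposed = (3, 0, 5)
coeffs = multilinear_coefficients(f_C, current, proposed)

# The coefficient on b_0 b_1 is the four-term alternating sum of (4.6),
# written here with 0-based pixel indices.
four_term = (f_C((3, 0, 2)) - f_C((3, 4, 2)) - f_C((1, 0, 2)) + f_C((1, 4, 2)))
assert coeffs[(0, 1)] == four_term
```

For these inputs every one of the eight coefficients can be inspected to see which monomials are present; a coefficient that happens to be exactly zero corresponds to a missing hyperedge in the sense of Definition 61.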

In general, we do not expect 4 arbitrary points to lie on a plane. However, if $f_C$ has no curvature, then any 4 such points will be coplanar. In the full proof, we show that if there exists any open ball of image pairs with zero coefficient on $b_1 b_2$, then the energy function is flat ($\frac{\partial^2 f}{\partial x_1 \partial x_2} = 0$) in the same ball (contradicting our assumption that the partials are nonzero almost everywhere). We also extend this to larger degree terms, to prove the general case.

Corollary 64. The energy functions obtained from fusion moves with FoE priors and proposals chosen as random images are locally complete with probability 1.

Proof. The functions $f_C$ for the FoE model given in [77] are infinitely differentiable, and their mixed partials have their zeros in a set of measure 0. Since the proposed images are chosen from a continuous distribution over $\mathbb{R}^n$, events of measure 0 occur with probability 0.

4.7 Experimental results
We have provided a freely available open source implementation of our algorithm at http://www.cs.cornell.edu/~afix/software.html. We experimentally compared our reduction with the two available, general-purpose higher-order reductions: HOCR [37, 40] and GRD [43, 44]. These three methods are all direct competitors in the types of energy functions they can handle (with the limitation that GRD is restricted to d ≤ 4). All methods have publicly available code implemented in C++, and provide very similar interfaces for setting up and optimizing a higher-order MRF.

Figure 4.3: Denoising examples. At left is the noisy input image, with our result in the middle and Ishikawa's at right. Results are shown after 30 iterations. More images are included in the supplemental material. To compare energy values with visual results, the images on the top row have energies 118,014, 26,103 and 38,304 respectively; those on the bottom have energies 118,391, 25,865 and 38,336.

Figure 4.4: Energy after each fusion move (left), and percentage of pixels labeled by QPBO (center), for the image at top of figure 4.3. Other images from [40] give very similar curves. (right) Fraction of pixels labeled by QPBO vs total weight of non-submodular edges (as a fraction of the total weight of all edges), along with best-fit lines for each method. [Plots: Energy vs. Time (seconds) and Labeled fraction vs. Time (seconds), each with curves for Our method, HOCR and GRD-heur.]
For all experiments, we only report the results from the heuristic version of GRD, GRD-heur. Because the exact version solves an LP, running this method on vision-sized inputs proved to be prohibitive. A single iteration of fusion move took an average of an hour, compared to 40 seconds for GRD-heur. Consequently, it was impossible to run this method on the full dataset. Fortunately, GRD-heur has been shown [43, 44] to have similar optimization quality to the exact GRD, at the gain of signiﬁcantly less computation time.
Our benchmark for evaluating the methods is the Fields of Experts prior for image denoising with fusion moves, using a dataset of 200 images. This same experiment was used in the original evaluation of HOCR [37], and later in the evaluation of GRD [43]. We use the same MRF, and as similar energy functions as possible. The fusion moves alternate between a randomly generated uniform image and a blurred image, and the energy function has clique size 4. We do multiple fusion moves on multiple images, so the effects of randomness are minimal.
To compare the effectiveness of each method on the individual binary subproblems, we ran all three algorithms on the first 30 fusion moves for each image, using the same starting point and proposal for each method (we averaged over only 30 iterations, because later iterations reduce the energy by much smaller amounts, and wash out the differences in the methods). The results, averaged over the 200 × 30 = 6000 fusion moves, are summarized in figure 4.6. Overall, we see that our method is strictly preferable to HOCR in all metrics, giving a better energy improvement per fusion move, labeling more pixels in QPBO, and taking less time. Compared to our method, GRD-heur labels more pixels and reduces the energy by an additional 35%; however, it takes over 7 times as long per iteration to compute.
To compare the effect of total-weight of non-submodular edges, we plotted the fraction of pixels labeled by QPBO vs. the fraction of non-submodular weight (divided by the total weight of all edges) in ﬁgure 4.4. There is a clear negative relation between these quantities — best ﬁt lines had slopes of -15.6 and -8.2 for our reduction and HOCR respectively. Correspondingly, our method had a lower average fraction of non-submodular weight, 19.7% vs 26.0%, and a higher fraction labeled by QPBO, 76.1% vs 59.4%.
We also tested the effect of choosing which order to reduce variables. The default for all experiments was to reduce the variables in order x1, . . . , xn. An alternative would be to choose the variables in decreasing order of number of positive terms. As predicted by our theoretical analysis, this difference had no measurable benefit. The "smart ordering" had 1% more average energy reduction, 0.4% fewer variables labeled, and won against the standard ordering (in terms of energy reduction) in only 42% of the 6000 fusion-moves. We also tested a random ordering, which performed somewhat worse than the standard ordering (better in only 2% of fusion moves, and 2% worse energy reduction). Because of extra bookkeeping, the smart ordering took 51% longer on average; consequently, we recommend using the standard order over more complicated schemes.

                   HOCR              GRD-heur          Our method
Final energy       32,199 (+2.3%)    31,375 (−0.3%)    31,473
Time (seconds)     2,050 (+102%)     5,587 (+450%)     1,012

Figure 4.5: Comparison of end-to-end performance on benchmarks in [40] at convergence of the fusion move optimization, averaged over all images. Relative performance compared to our method is shown in parentheses.

                          HOCR             GRD-heur         Our method
Energy improvement        1,302 (−45%)     3,183 (+35%)     2,351
Percent labeled by QPBO   59.4% (−22%)     86.3% (+13%)     76.1%
Time (seconds)            14.1 (+150%)     40.2 (+620%)     5.6

Figure 4.6: Performance comparison of reductions, on benchmarks in [40], averaged over 30 iterations of fusion move. Relative performance compared to our method in parentheses.

As a second experiment, we compared the end-to-end performance of using each optimizer all the way through a complete run of the fusion-move algorithm. These results are displayed in figures 4.4 and 4.5.^2 Our method converges much faster, despite GRD reducing the energy by slightly more each step. Overall, the computational inefficiency of GRD greatly outweighs the marginal improvement in per-subproblem solution quality for fusion move.

^2 We averaged together pairs of consecutive fusion moves in the graph shown at right in figure 4.4. This avoids the distracting sawtooth pattern visible in [37, 40], due to the alternation between random fusion moves and blurred fusion moves.

The sizes of the obtained reductions for our method and the other term-rewriting reduction, HOCR, are summarized in figure 4.7. Overall, our method does better in practice than the asymptotic analysis in figure 4.1 suggests. As predicted, we produce many fewer non-submodular terms, but we also produce fewer submodular terms (a relative improvement of 10%).

                        HOCR         Our method
Extra variables         224,346      236,806 (+6%)
Non-submodular terms    421,897      38,343 (−90%)
Total terms             1,133,811    677,183 (−40%)

Figure 4.7: Total size of reductions, on Ishikawa's benchmarks in [40]. Relative performance of our method in parentheses.
Visual results are shown in figure 4.3. In the boat image our results appear more accurate in smooth areas like the water, and the face image (shown magnified at bottom left) is also noticeably smoother. These results are after 30 fusion moves. The images after convergence (shown, along with more examples, in the supplemental material) are visually similar, though we still obtain lower energy.
Finally, we experimentally computed the local completeness of two early vision problems that are quite far from denoising, namely stereo [102] (clique size 3) and segmentation [1] (clique size 4). We analyzed the binary energy functions produced from 60 iterations of [102]. These energy functions have a very high local completeness; on average the energy functions are c-complete for c = .98, and their least locally complete energy function had c = .96. We also discovered that the higher-order segmentation energy function of [1] is absolutely locally complete (c = 1). These results suggest that our method may be particularly well suited to a number of important vision problems.

CHAPTER 5 SUM OF SUBMODULAR MINIMIZATION

The reduction methods of the previous chapter provide an algebraic method for transforming higher-order binary MRFs to ﬁrst-order. However, while this transformation preserves some features of the original energy function (e.g., the global minimum value remains the same) the resulting ﬁrst-order functions are not always easily solvable. In particular, we have seen that non-submodular terms in the resulting reduced energy can lead to poor solutions from optimizers like QPBO.

An alternate approach to minimizing binary MRFs is to apply ﬂow algorithms directly to the higher-order energy. Our goal is to preserve two key properties of max-ﬂow based solvers: (1) global optimality of the solution obtained and (2) fast performance on typical inputs for vision problems. To do this, we will need a generalization of max-ﬂow to the higher-order case — this generalization is called Sum-of-Submodular ﬂow.

Sum-of-Submodular flow, and the corresponding cut problem, Sum-of-Submodular minimization, occupy a middle ground between the max-flow min-cut problem, and general submodular function minimization. A set function $f : 2^V \to \mathbb{R}$ is Sum-of-Submodular (SoS) if it is a sum

$$f(S) = \sum_{i \in S} f_i + \sum_{i \notin S} f_{\bar{i}} + \sum_{C} f_C(S \cap C) \tag{5.1}$$

where each clique function $f_C$ is submodular. Here $f_i$ is the unary cost of including $i$ in $S$ and $f_{\bar{i}}$ the unary cost of excluding it.

Recall that a sum of submodular functions is itself submodular, so this is a special case of general submodular function minimization. However, we will show that the structure of f (in particular, that the cliques C form a hypergraph) allows much faster minimization than the $O(n^6)$ algorithm of [73]. Additionally, we also have that standard min-cut is a special case of SoS minimization, since for each directed arc $(i, j)$ with capacity $c_{i,j}$, the cost of that edge being cut can be written as a submodular clique function

$$f_{i,j}(S) = \begin{cases} c_{i,j} & i \in S,\ j \notin S \\ 0 & \text{otherwise} \end{cases} \tag{5.2}$$

So, SoS minimization lies in between min-cut and general submodular minimization:

MIN-CUT ⊆ SOS MINIMIZATION ⊆ SUBMODULAR MINIMIZATION (5.3)
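To illustrate the first inclusion in (5.3), the sketch below writes a tiny min-cut instance as an SoS function and minimizes it by exhaustive search. All names and capacities are hypothetical, and the brute-force search is only meant for instances of a few nodes; it is not the minimization algorithm developed in this chapter.

```python
from itertools import combinations

def subsets(items):
    """All subsets of `items`, as frozensets."""
    items = sorted(items)
    for k in range(len(items) + 1):
        for c in combinations(items, k):
            yield frozenset(c)

def is_submodular(f, ground):
    """Brute-force check of f(A) + f(B) >= f(A | B) + f(A & B)."""
    all_sets = list(subsets(ground))
    return all(f(A) + f(B) >= f(A | B) + f(A & B)
               for A in all_sets for B in all_sets)

def arc_clique_function(i, j, capacity):
    """Equation (5.2): the cost of cutting the directed arc (i, j)."""
    return lambda S: capacity if (i in S and j not in S) else 0

def sos_value(S, unary_in, unary_out, cliques):
    """Equation (5.1); `cliques` is a list of (C, f_C) pairs."""
    S = frozenset(S)
    return (sum(unary_in[i] for i in S)
            + sum(c for i, c in unary_out.items() if i not in S)
            + sum(f_C(S & C) for C, f_C in cliques))

# Hypothetical path network s -> 0 -> 1 -> 2 -> t.
unary_out = {0: 5, 1: 0, 2: 0}   # capacities of arcs (s, i): paid when i is not in S
unary_in = {0: 0, 1: 0, 2: 4}    # capacities of arcs (i, t): paid when i is in S
cliques = [
    (frozenset({0, 1}), arc_clique_function(0, 1, 2)),
    (frozenset({1, 2}), arc_clique_function(1, 2, 3)),
]

best = min(subsets({0, 1, 2}),
           key=lambda S: sos_value(S, unary_in, unary_out, cliques))
```

The minimizer is S = {0} with value 2, which is the capacity of the bottleneck arc (0, 1), as expected for a minimum cut on this path.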

In this chapter, we describe how a Sum-of-Submodular (SoS) function can be minimized by means of an SoS ﬂow network, and give a fast algorithm for solving this minimization, as originally presented in [19].
Throughout this chapter, we will assume that f is a set function, which is a sum of unary and clique functions of the form (5.1). We will also assume that f has been reparameterized such that $f_C \ge 0$ and $f_C(\emptyset) = f_C(C) = 0$, and the linear terms have $f_i, f_{\bar{i}} \ge 0$.
5.1 Sum of Submodular Minimization via Submodular Flow

For SoS functions, the cut problem is easy to describe: minimize f (S ) over all sets S ⊆ V. In this section, we detail the dual problem: Sum-of-Submodular ﬂow.


5.1.1 Deﬁnitions and Graph Construction
Submodular ﬂow has existed in the combinatorial optimization literature for some time [15, 26]. However, these algorithms are designed for full-order (or nth order) submodular functions, meaning there is no internal clique structure on the function f . The work of [53] was the ﬁrst to develop an algorithm for the clique structured case (SoS ﬂow), and the problem formulation and mathematical notation of this section are based on that work.
SoS ﬂow is similar to the max-ﬂow problem, in that there is a network of nodes and arcs on which we want to push ﬂow from s to t. However, the notion of residual capacity will be slightly modiﬁed from that of standard max-ﬂow.
We begin with a network $G = (V \cup \{s, t\}, A)$. We will denote $V + s + t$ by $\overline{V}$. As in the max-flow reduction for Graph Cuts, there are source and sink arcs $(s, i)$ and $(i, t)$ for every $i \in V$. Additionally, for each clique C, there is an arc $(i, j)_C$ for every pair $i, j \in C$.^1
Every arc $a \in A$ also has an associated residual capacity $c_a$. The residual capacities of arcs $(s, i)$ and $(i, t)$ are the familiar residual capacities from max-flow: these arcs have starting capacities $c_{s,i}$ and $c_{i,t}$ (determined by the unary terms of f), and whenever we push flow on a source or sink arc, we decrease the residual capacity by the same amount.
For the interior arcs, we need one further piece of information. In addition to residual capacities, we also keep track of residual clique functions $\bar{f}_C(S)$, related to the flow values by the following rule: whenever we push δ units of flow on arc $(i, j)_C$, we update $\bar{f}_C(S)$ by

$$\bar{f}_C(S) \leftarrow \begin{cases} \bar{f}_C(S) - \delta & i \in S,\ j \notin S \\ \bar{f}_C(S) + \delta & i \notin S,\ j \in S \\ \bar{f}_C(S) & \text{otherwise} \end{cases} \tag{5.4}$$

^1 To explain the notation, note that {i, j} might be in multiple cliques C, so we may have multiple edges (i, j) (that is, G is a multigraph). We distinguish between them by the subscript C.

A flow φ is a function $\phi : A \to \mathbb{R}_{\ge 0}$ which satisfies the usual conservation constraints (meaning flow into a node i is equal to the flow out of i for all $i \ne s, t$). The residual clique functions $\bar{f}_C$ result from applying (5.4) each time an arc has δ units of flow, so we have that

$$\bar{f}_C(S) = f_C(S) - \sum_{i \in S,\, j \notin S} \phi_{i,j,C} + \sum_{j \in S,\, i \notin S} \phi_{i,j,C} \tag{5.5}$$

That is, the residual clique function $\bar{f}_C(S)$ is the original clique function $f_C(S)$, minus the outflow from S along arcs in C, plus the flow into S along arcs in C.

The residual capacities of the interior arcs are chosen so that the $\bar{f}_C$ are always nonnegative. Accordingly, we define $c_{i,j,C} = \min_S \{ \bar{f}_C(S) \mid i \in S,\ j \notin S \}$. A flow is feasible if all residual capacities are nonnegative.
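The update rule (5.4) and the definition of the interior residual capacities can be sketched with an explicit table over all subsets of a clique. The class below is a hypothetical illustration (its name, methods and the example clique function are assumptions); storing all 2^|C| values is only sensible for small cliques.

```python
from itertools import combinations

class ResidualClique:
    """Table-based residual clique function for a single clique C (a sketch)."""

    def __init__(self, nodes, f_C):
        self.nodes = tuple(nodes)
        self.table = {}
        for k in range(len(self.nodes) + 1):
            for c in combinations(self.nodes, k):
                self.table[frozenset(c)] = f_C(frozenset(c))

    def capacity(self, i, j):
        """c_{i,j,C}: minimum residual value over sets S with i in S, j not in S."""
        return min(v for S, v in self.table.items() if i in S and j not in S)

    def push(self, i, j, delta):
        """Apply rule (5.4) for delta units of flow on arc (i, j)_C."""
        if delta > self.capacity(i, j):
            raise ValueError("push exceeds residual capacity")
        for S in self.table:
            if i in S and j not in S:
                self.table[S] -= delta
            elif j in S and i not in S:
                self.table[S] += delta

# Hypothetical clique on 3 nodes: cost 2 whenever the clique is split.
clique = ResidualClique((0, 1, 2), lambda S: 0 if len(S) in (0, 3) else 2)
before = clique.capacity(0, 1)
clique.push(0, 1, 2)
```

After the push, arc (0, 1) is saturated and the reverse arc (1, 0) has capacity 4, as in ordinary max-flow. Unlike ordinary max-flow, the arc (0, 2) is saturated as well, because the set {0} has become tight: arcs within one clique share capacity.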

5.1.2 Flow as a Reparameterization
The key to understanding Sum-of-Submodular ﬂow is that any ﬂow gives a reparameterization of the original cost function. That is, ﬂows in the graph deﬁned above are just different ways of re-writing the function f . Finding the maximum ﬂow will end up with a reparameterized f which is particularly easy to optimize, leading to an algorithm for ﬁnding the global minimum of f .
Interestingly, this same idea is used in several algorithms for energy minimization, including QPBO (see [51] for a very accessible review of this idea) and
the dual-LP methods for optimizing multilabel MRFs described in Section 3.4. We will also see this idea in the development of the SoSPD algorithm of Chapter 7.

To define this reparameterization, we first define the residual cost function $\bar{f}(S)$ to be

$$\bar{f}(S) = \sum_{i \in S} c_{i,t} + \sum_{i \notin S} c_{s,i} + \sum_{C} \bar{f}_C(S \cap C) \tag{5.6}$$

and also define the value of the flow φ to be the total outflow of the source:

$$\nu(\phi) = \sum_{i \in V} \phi_{s,i} \tag{5.7}$$

Lemma 65. Any flow φ gives a reparameterization of the original cost function, with

$$f(S) = \nu(\phi) + \bar{f}(S) \tag{5.8}$$

for all feasible flows φ and all sets $S \subseteq V$.

Recall that we defined a flow φ to be feasible whenever $\bar{f}_C \ge 0$ and $\bar{f}_i, \bar{f}_{\bar{i}} \ge 0$ for all C and i, where $\bar{f}_i = c_{i,t}$ and $\bar{f}_{\bar{i}} = c_{s,i}$ are the residual unary terms. In particular, if φ is feasible then $\bar{f}(S) \ge 0$ for all S, which immediately gives the following lower bound on the minimum of f.

Corollary 66. If φ is feasible, then

$$f(S) \ge \nu(\phi) \tag{5.9}$$

for all $S \subseteq V$.

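Proof of Lemma 65. As with most proofs of one function being a reparameterization of another, we have that various quantities have been added and subtracted to the energy in a way that cancels out. In this case, we expand out the residual capacities in (5.6):

$$\begin{aligned}
\bar{f}(S) &= \sum_{i \in S} (f_i - \phi_{i,t}) + \sum_{i \notin S} (f_{\bar{i}} - \phi_{s,i}) + \sum_{C} \Bigl[ f_C(S \cap C) - \sum_{i \in S,\, j \in C \setminus S} \phi_{i,j,C} + \sum_{j \in S,\, i \in C \setminus S} \phi_{i,j,C} \Bigr] \\
&= \sum_{i \in S} f_i + \sum_{i \notin S} f_{\bar{i}} + \sum_{C} f_C(S \cap C) - \sum_{i \in S} \Bigl( \phi_{i,t} + \sum_{C} \sum_{j \in C \setminus S} \phi_{i,j,C} \Bigr) + \sum_{i \notin S} \Bigl( -\phi_{s,i} + \sum_{C} \sum_{j \in S \cap C} \phi_{i,j,C} \Bigr) \\
&= f(S) - \phi(S, \overline{V} \setminus S) - \phi(\{s\}, V \setminus S) + \phi(V \setminus S, S)
\end{aligned} \tag{5.10}$$

where $\phi(A, B)$ is the total flow from a subset $A \subseteq \overline{V}$ to $B \subseteq \overline{V}$. We will use the fact that $\phi(A \cup B, C) = \phi(A, C) + \phi(B, C)$ when A, B and C are disjoint.

Because flow is conserved at all nodes $i \ne s, t$ we have that the flow into S equals the flow out of S, so $\phi(S, \overline{V} \setminus S) = \phi(\overline{V} \setminus S, S)$. Since no arcs leave t, $\phi(\overline{V} \setminus S, S) = \phi(\{s\}, S) + \phi(V \setminus S, S)$, and in particular, the last line above is

$$\begin{aligned}
\bar{f}(S) &= f(S) - \phi(\overline{V} \setminus S, S) - \phi(\{s\}, V \setminus S) + \phi(V \setminus S, S) \\
&= f(S) - \phi(\{s\}, S) - \phi(V \setminus S, S) - \phi(\{s\}, V \setminus S) + \phi(V \setminus S, S) \\
&= f(S) - \phi(\{s\}, V) = f(S) - \nu(\phi)
\end{aligned} \tag{5.11}$$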
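Lemma 65 can be checked numerically on a small instance. In the sketch below, the instance (three nodes, one clique covering all of them) and the flow are hypothetical and chosen by hand so that conservation holds at every node; the code then compares f(S) with ν(φ) plus the residual cost for every subset S.

```python
from itertools import combinations

V = (0, 1, 2)
f_in = {0: 1, 1: 0, 2: 4}     # unary cost if i is in S (capacity of arc (i, t))
f_out = {0: 5, 1: 2, 2: 0}    # unary cost if i is not in S (capacity of arc (s, i))
C = frozenset(V)

def f_C(S):
    # Hypothetical submodular clique function: cost 3 whenever C is split.
    return 0 if len(S) in (0, len(C)) else 3

# Hand-built flow: conservation holds at nodes 0, 1 and 2.
phi_s = {0: 2, 1: 1}              # flow on arcs (s, i)
phi_t = {2: 3}                    # flow on arcs (i, t)
phi_C = {(0, 1): 2, (1, 2): 3}    # flow on clique arcs (i, j)_C

def all_subsets():
    return [frozenset(c) for k in range(len(V) + 1) for c in combinations(V, k)]

def f(S):
    return (sum(f_in[i] for i in S)
            + sum(f_out[i] for i in V if i not in S)
            + f_C(S & C))

def residual_clique(S):
    """Equation (5.5): original value, minus outflow from S, plus inflow to S."""
    out = sum(v for (i, j), v in phi_C.items() if i in S and j not in S)
    inn = sum(v for (i, j), v in phi_C.items() if j in S and i not in S)
    return f_C(S & C) - out + inn

def f_residual(S):
    """Equation (5.6) with c_{i,t} = f_i - phi_{i,t}, c_{s,i} = f_{i-bar} - phi_{s,i}."""
    return (sum(f_in[i] - phi_t.get(i, 0) for i in S)
            + sum(f_out[i] - phi_s.get(i, 0) for i in V if i not in S)
            + residual_clique(S))

nu = sum(phi_s.values())
for S in all_subsets():
    assert f(S) == nu + f_residual(S)   # Lemma 65
    assert f(S) >= nu                   # Corollary 66 (this flow is feasible)
```

For this instance ν(φ) = 3 while the minimum of f is 4, so the flow is feasible but the lower bound of Corollary 66 is not yet tight.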

5.1.3 The Max-Flow Min-Cut Theorem for SoS Functions
The key theorem^2 relating SoS flow and SoS function minimization is a direct analogue of the max-flow min-cut theorem. Recall that in standard max-flow on a graph, a flow is a maximum flow if and only if there are no augmenting paths from s to t. Once we have found this maximum flow, the minimum cut is obtained by taking $S^*$ to be the set of nodes reachable from s along arcs of positive residual capacity (which, by definition, can't include t, otherwise there would be an augmenting path from s to t). In SoS flow, we get a directly analogous theorem.

^2 All proofs, theorems and lemmas in this section are adapted from [53].
Given a feasible SoS flow φ, define the set of residual arcs $A_\phi$ to be all arcs a with $c_a > 0$. An augmenting path is an s-t path along arcs in $A_\phi$. We will say that a feasible flow $\phi^*$ is maximal if there are no augmenting paths in $A_{\phi^*}$.^3

Theorem 67. Let $\phi^*$ be a maximal flow. Let $S^*$ be the set of all $i \in V$ reachable from s along arcs in $A_{\phi^*}$. Then $f(S^*)$ is the minimum value of f over all $S \subseteq V$, and $f(S^*) = \nu(\phi^*)$.

The simplest proof of this theorem uses the reparameterization result above (Lemma 65). We have mentioned that at a maximal flow $\phi^*$, the reparameterization $\bar{f}$ is particularly easy to minimize. In fact, its minimizer is the set $S^*$ defined above, which can be found by a depth-first search from s in the residual graph $A_{\phi^*}$.

Lemma 68. Let $\phi^*$ be a maximal flow, and let $S^*$ be the set of i reachable from s along arcs in $A_{\phi^*}$. Then

$$\bar{f}(S^*) = 0 \tag{5.12}$$

Then, the theorem follows immediately from this Lemma, Lemma 65 and Corollary 66, since $f(S) \ge \nu(\phi^*)$ for all sets $S \subseteq V$ and $f(S^*) = \bar{f}(S^*) + \nu(\phi^*) = \nu(\phi^*)$.

^3 As in standard max-flow, any maximal flow is in fact a maximum flow (i.e., all maximal flows have the same value $\nu(\phi^*)$). Note that this follows immediately from Theorem 67, since all maximal flows have value $f(S^*)$, but to avoid being circular, we'll state the theorem in terms of maximal flows.

Proof of Lemma 68. First, note that we have $\bar{f}_i = 0$ for $i \in S^*$, otherwise t would be reachable from some $i \in S^*$, which would give an augmenting path from s to t passing through i. Similarly, we have $\bar{f}_{\bar{i}} = 0$ for $i \notin S^*$, otherwise i would be reachable from s along arcs in $A_{\phi^*}$.

Now, we just need to show that $\bar{f}_C(S^* \cap C) = 0$ for all cliques C. Fix a C, and let $T = S^* \cap C$. If $T = \emptyset$ or $T = C$ then we're done, since $\bar{f}_C(\emptyset) = \bar{f}_C(C) = 0$. So, we can assume that $T$ and $C \setminus T$ are both nonempty.

Pick any $i \in T$ and $j \in C \setminus T$. Since $i \in S^*$ and $j \notin S^*$, we know that j is not reachable from i along arcs of positive residual capacity, so in particular we must have $c_{i,j,C} = 0$. Since $c_{i,j,C} = \min_{S \subseteq C:\, i \in S,\, j \notin S} \bar{f}_C(S)$ there is some $T_{i,j}$ with $\bar{f}_C(T_{i,j}) = 0$ and $i \in T_{i,j}$, $j \notin T_{i,j}$.
Now, let $T_i = \bigcap_{j \in C \setminus T} T_{i,j}$. We have that $i \in T_i$ and each $j \in C \setminus T$ is not in $T_i$ (since $j \notin T_{i,j} \supseteq T_i$). Let $T' = \bigcup_{i \in T} T_i$. We have that $T' = T$, since each $i \in T$ is in $T_i$, hence also in $T'$, and each $j \in C \setminus T$ is not in any of the $T_i$, hence not in $T'$.
Therefore, we can write $T = \bigcup_{i \in T} \bigcap_{j \in C \setminus T} T_{i,j}$, and $\bar{f}_C(T_{i,j}) = 0$ for all $i \in T$, $j \in C \setminus T$. So, we must have $\bar{f}_C(T) = 0$ since the zero sets of $\bar{f}_C$ form a lattice (i.e., they are closed under intersection and union), by Corollary 38.

The key idea of this proof, and the reason that this algorithm only works for submodular functions, is that the zero sets of nonnegative submodular functions form a particular structure called a lattice, meaning they are closed under intersections and unions. In fact, the flow values we're adding and subtracting are linear functions in $S$:
\begin{align}
\sum_{i \in S,\, j \notin S} \phi_{i,j,C} - \sum_{i \notin S,\, j \in S} \phi_{i,j,C}
&= \sum_{i \in S,\, j \notin S} \phi_{i,j,C} + \sum_{i \in S,\, j \in S} \phi_{i,j,C} - \sum_{i \in S,\, j \in S} \phi_{i,j,C} - \sum_{i \notin S,\, j \in S} \phi_{i,j,C} \notag \\
&= \sum_{i \in S,\, j \in C} \phi_{i,j,C} - \sum_{i \in C,\, j \in S} \phi_{i,j,C} \notag \\
&= \sum_{i \in S} \sum_{j \in C} (\phi_{i,j,C} - \phi_{j,i,C}) \notag \\
&= \sum_{i \in S} \psi_i \tag{5.13}
\end{align}

where $\psi_i := \sum_{j \in C} (\phi_{i,j,C} - \phi_{j,i,C})$ is the net outflow of node $i$ along arcs in $C$. Then, we have that $\bar f_C(S) = f_C(S) - \psi(S)$, and since $\bar f_C \ge 0$ we have in particular that $\psi \le f_C$ and $\psi(C) = f_C(C)$, so that $\psi$ is actually a base of $f_C$ (see Section 2.3.3).

So, the flow values we're adding and subtracting end up being a search over the base polytope of $f_C$. An arc $(i,j)_C$ becomes saturated exactly when there's a set $T_{i,j}$ with $i \in T_{i,j}$, $j \notin T_{i,j}$ which is tight. Then, since the tight sets of $\bar f_C$ form a lattice, we can take intersections and unions to get a single, consistent $S^*$ which includes all the nodes reachable from $s$, and which is itself a tight set.

5.2 IBFS for Submodular Flow
The max-flow min-cut theorem gives a simple algorithm for finding the minimizer of an SoS function: keep track of the current flow and residual graph Aφ, and at each iteration find an augmenting path from s to t until no more exist. It is easy to show that with integer cost functions this algorithm must terminate in finitely many steps. This augmenting path algorithm was used in Generic Cuts [3], which was the first application of SoS optimization to computer vision, and the first implementation of SoS flow.
For ﬂow on graphs, the current state of the art for computer vision applications is Incremental Breadth First Search (IBFS) [30]. This algorithm is based on the Boykov-Kolmogorov algorithm of [10], with an additional guarantee of polynomial time complexity. In this section, we show how to modify IBFS to compute maximum SoS ﬂows, giving a fast algorithm for sum-of-submodular optimization for typical computer vision inputs.
5.2.1 IBFS on Graphs
IBFS is an augmenting paths algorithm: at each step, it ﬁnds a path from s to t with positive residual capacity, and pushes ﬂow along it. Additionally, each augmenting path found is a shortest s-t path in Aφ. To ensure that the paths found are shortest paths, we keep track of distances ds(i) and dt(i) from s to i and from i to t, and search trees S and T containing all nodes of distance at most Ds from s or Dt from t respectively. Two invariants are maintained:
• For every i in S, the unique path from s to i in S is a shortest s-i path in Aφ.
• For every i in T, the unique path from i to t in T is a shortest i-t path in Aφ.
The algorithm proceeds by alternating between forward passes and reverse passes. In a forward pass, we attempt to grow the source tree S by one layer (a reverse pass attempts to grow T, and is symmetric). To grow S, we scan through the vertices i at distance Ds away from s, and examine each out-arc (i, j) that has positive residual capacity. If j is not in S or T, then we add j to S at distance level Ds + 1, and with parent i. If j ∈ S then (i, j) is a back-arc, and is not on the shortest path from s to j. If j is in T, then we found an augmenting path from s to t via the arc (i, j), so we can push flow on it.
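The forward pass just described can be sketched in a few lines. This is a simplified illustration under assumed data structures (plain dictionaries for arcs, residual capacities, tree membership, distances and parents); the name `grow_source_tree` is hypothetical, and a full IBFS implementation would continue scanning after an augmentation rather than return.

```python
def grow_source_tree(frontier, out_arcs, residual, tree, dist, parent, Ds):
    """Scan the nodes at distance Ds and try to grow S by one layer.

    Returns ("augment", (i, j)) as soon as an arc from S into T is found,
    otherwise ("grown", new_layer). All names here are illustrative.
    """
    new_layer = []
    for i in frontier:
        for j in out_arcs.get(i, ()):
            if residual[(i, j)] <= 0:
                continue                  # saturated arcs are not residual arcs
            if tree.get(j) == "T":
                return "augment", (i, j)  # s ~> i -> j ~> t is an augmenting path
            if tree.get(j) is None:       # free node: add it to S under parent i
                tree[j], dist[j], parent[j] = "S", Ds + 1, i
                new_layer.append(j)
            # if j is already in S, (i, j) is a back-arc and is ignored
    return "grown", new_layer

out_arcs = {"s": ["a", "b"], "a": ["b", "c"], "b": ["c"]}
residual = {("s", "a"): 2, ("s", "b"): 0, ("a", "b"): 1,
            ("a", "c"): 1, ("b", "c"): 1}
tree, dist, parent = {"s": "S", "c": "T"}, {"s": 0}, {}
status1, layer1 = grow_source_tree(["s"], out_arcs, residual, tree, dist, parent, Ds=0)
status2, arc2 = grow_source_tree(layer1, out_arcs, residual, tree, dist, parent, Ds=1)
```

In the toy run, the first pass adds only node a (the arc from s to b is saturated), and the second pass reaches the sink tree through the arc (a, c).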
The operation of pushing ﬂow may saturate some arcs (and cause previously saturated arcs to become unsaturated). If the parent arc of a node i becomes saturated, then i becomes an orphan. After each augmentation, we perform an adoption step, where each orphan ﬁnds a new parent. The details of the adoption step are similar to the relabel operation of the Push-Relabel algorithm [14], in that we search all potential parent arcs in Aφ for the neighbor with the lowest distance label, and make that node our new parent. If this increases the distance ds(i), then the children of i also become orphans and are recursively adopted as well, to maintain the shortest-path invariant.
5.2.2 Modifying IBFS for SoS Flow
In order to apply IBFS to the SoS flow problem (instead of standard graph-flow), all the basic data structures still make sense: we have a graph where the arcs a have residual capacities ca, and a maximum flow has been found if and only if there is no longer any augmenting path from s to t.
The main change for the SoS flow problem is that when we increase flow on an edge $(i,j)_C$, instead of just affecting the residual capacity of that arc and the reverse arc, we may also change the residual capacities of other arcs $(i',j')_C$ for $i',j' \in C$. A problematic case would be where $(i',j')_C$ is saturated because a set $S$ has $\bar f_C(S) = 0$, with $i' \in S$, $j' \notin S$. If we push $\delta$ units of flow from $i$ to $j$ going into $S$ (meaning $j \in S$ and $i \notin S$) then $\bar f_C(S)$ will now be $\delta > 0$, so $(i',j')$ is no longer saturated. If $d(j') > d(i') + 1$ before the push, then we would have created a shortcut between $i'$ and $j'$, which would violate our invariants on the trees S and T.
However, the following result ensures that this is not a problem. Let Aφ be the set of arcs with residual capacity, according to the current flow.
Lemma 69. If (a, b)C was previously saturated, but now has residual capacity as a result of increasing flow along (c, d), then (1) either a = d or there was an arc (a, d) ∈ Aφ and (2) either b = c or there was an arc (c, b) ∈ Aφ.
Corollary 70. Increasing flow on an edge never creates a shortcut between s and i, or from i to t.
These results are based on [26]; we will prove them in the next section.
Corollary 70 ensures that we never create any new shorter s-i or i-t paths not contained in S or T. A push operation may cause some edges to become saturated, but this is the same problem as in the normal max-flow case, and any orphans so created will be fixed in the adoption step. Therefore, all invariants of the IBFS algorithm are maintained, even in the submodular flow case.
The ﬁnal difference between IBFS and a standard augmenting paths algorithm is the “current arc heuristic”, which is a mechanism for avoiding iterating through all possible potential parents when performing an adoption step. In the case of Submodular Flows, it is also the case that whenever we create new residual arcs we maintain all invariants related to this current arc heuristic, so the same speedup applies here. We cover this heuristic in Section 5.4.

5.2.3 Running Time
The asymptotic complexity of the standard IBFS algorithm is $O(n^2 m)$. In the submodular-flow case, we still perform the same number of basic operations. However, note that finding the residual capacity of an arc $(i,j)_C$ requires minimizing $\bar f_C(S)$ over the sets $S$ separating $i$ and $j$. If $|C| = k$, this can be done in time $O(k^6)$ using [73]. However, for $k \ll n$, it will likely be much more efficient to use the naive $O(2^k)$ algorithm of searching through all values of $\bar f_C$. Overall, we add $O(2^k)$ work at each basic step of IBFS, so if we have $m$ cliques the total runtime is $O(n^2 m 2^k)$.
This runtime is better than the augmenting paths algorithm of [3], which takes time $O(n m^2 2^k)$. Additionally, IBFS has been shown to be very fast on typical vision inputs, independent of its asymptotic complexity [30].
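The naive exponential-time capacity computation mentioned above can be sketched as a direct enumeration of subsets. This is only an illustration: the residual clique function is assumed to be stored as a dictionary keyed by frozensets, and the names `subsets`, `residual_capacity` and the toy function `fbar` are hypothetical.

```python
from itertools import combinations

def subsets(clique):
    """Yield every subset of `clique` as a frozenset (2^k of them)."""
    items = list(clique)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def residual_capacity(fbar, clique, i, j):
    """Minimum of the residual clique function over sets containing i but not j."""
    return min(fbar[S] for S in subsets(clique) if i in S and j not in S)

# Toy residual function on C = {0, 1, 2}: fbar(S) = min(|S|, 3 - |S|),
# which is nonnegative, submodular, and zero on the empty set and on C.
C = (0, 1, 2)
fbar = {S: min(len(S), 3 - len(S)) for S in subsets(C)}
cap_01 = residual_capacity(fbar, C, 0, 1)
```

For this toy function the separating sets for the arc (0, 1) are {0} and {0, 2}, both of value 1, so the arc has one unit of residual capacity.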
5.3 Proof of the “No Shortcuts” Lemma
Lemma 69. If (a, b)C was previously saturated, but now has residual capacity as a result of increasing ﬂow along (c, d), then (1) either a = d or there was an arc (a, d) ∈ Aφ and (2) either b = c or there was an arc (c, b) ∈ Aφ.
Proof. The flow before the push on $(c,d)$ is denoted by $\phi$, and the set of all arcs with residual capacity for flow $\phi$ is $A_\phi$. Recall that when we increase flow on $(c,d)$, we change the residual clique functions $\bar f_C$ to $\bar f'_C$, where
\[
\bar f'_C(S) =
\begin{cases}
\bar f_C(S) - \delta & c \in S,\ d \notin S \\
\bar f_C(S) + \delta & c \notin S,\ d \in S \\
\bar f_C(S) & \text{otherwise}
\end{cases}
\tag{5.14}
\]
Additionally, we defined the residual capacity of the arc $(i,j)_C$ to be
\[ c_{i,j,C} = \min_S \{ \bar f_C(S) \mid i \in S,\ j \notin S \}. \tag{5.15} \]

We'll say that a set $S \subseteq C$ is saturated if $\bar f_C(S) = 0$, and that $S$ separates $i$ from $j$ if $i \in S$, $j \notin S$. We're interested in which saturated sets separate $i$ from $j$, so denote the family of such sets by $\mathcal{S}_{i,j}$, defined as
\[ \mathcal{S}_{i,j} = \{ S \subseteq C \mid i \in S,\ j \notin S,\ \bar f_C(S) = 0 \}. \tag{5.16} \]
By (5.15), $(i,j)_C$ is saturated if and only if $\mathcal{S}_{i,j} \neq \emptyset$. In particular, since $(a,b)_C$ is initially saturated, we know that there's some $S_{a,b} \in \mathcal{S}_{a,b}$.
Now, to prove (1): assume $a \neq d$ (otherwise, we're done). We want to show that the edge $(a,d)$ had nonzero residual capacity in the flow $\phi$, so by the above, we need to show that $\mathcal{S}_{a,d} = \emptyset$. Assume by way of contradiction that there exists $S_{a,d} \in \mathcal{S}_{a,d}$.
Let $S_a = S_{a,b} \cap S_{a,d}$. We know that $a \in S_a$ since it's in both $S_{a,b}$ and $S_{a,d}$. We also know that both $b$ and $d$ are not in $S_a$, since $b \notin S_{a,b}$ and $d \notin S_{a,d}$. Finally, $S_a$ is the intersection of two saturated sets, so it must also be saturated, by Corollary 38 (which says that the zero-valued sets of $\bar f$ form a lattice).
Therefore, $S_a$ is a set containing $a$ but not $b$ or $d$, and with $\bar f_C(S_a) = 0$. When we change flow on $(c,d)$, according to (5.14), we only increase the capacity of sets containing $d$. So after changing $\phi$, we still have $\bar f_C(S_a) = 0$. But then, $S_a$ separates $a$ from $b$, so $(a,b)_C$ continues to be saturated, a contradiction.
To prove (2), we use a similar argument. We have $b \neq c$ or else we're done. Assume by way of contradiction that there exists $S_{c,b} \in \mathcal{S}_{c,b}$. Let $S_c = S_{c,b} \cup S_{a,b}$. We know that $a, c \in S_c$ and $b \notin S_c$. Furthermore, since $S_c$ is the union of two zero-valued sets, it also has $\bar f_C(S_c) = 0$.
But then, since $c \in S_c$, we know that the capacity of $S_c$ doesn't increase, and therefore after changing $\phi$ it continues to be a saturated set. But since $a \in S_c$, $b \notin S_c$ this means that $(a,b)_C$ continues to be saturated, a contradiction.
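The statement of Lemma 69 can be checked by brute force on a small clique. The sketch below applies the update (5.14) and the capacity definition (5.15) to a toy residual function and tests conditions (1) and (2) for every arc that a push unsaturates. The helper names (`push`, `capacity`, `check_lemma_69`) and the toy function are assumptions made for illustration; this is a sanity check on one example, not a proof.

```python
from itertools import combinations, permutations

def subsets(clique):
    items = list(clique)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def capacity(fbar, clique, i, j):
    # Equation (5.15): minimum over sets containing i but not j.
    return min(fbar[S] for S in subsets(clique) if i in S and j not in S)

def push(fbar, c, d, delta):
    # Equation (5.14): update the residual clique function after a push on (c, d).
    new = {}
    for S, val in fbar.items():
        if c in S and d not in S:
            new[S] = val - delta
        elif d in S and c not in S:
            new[S] = val + delta
        else:
            new[S] = val
    return new

def check_lemma_69(fbar, clique, c, d, delta):
    """True if every arc unsaturated by pushing on (c, d) satisfies (1) and (2)."""
    after = push(fbar, c, d, delta)
    for a, b in permutations(clique, 2):
        if capacity(fbar, clique, a, b) == 0 and capacity(after, clique, a, b) > 0:
            ok1 = a == d or capacity(fbar, clique, a, d) > 0
            ok2 = b == c or capacity(fbar, clique, c, b) > 0
            if not (ok1 and ok2):
                return False
    return True

C = (0, 1, 2)
f0 = {S: min(len(S), 3 - len(S)) for S in subsets(C)}
f1 = push(f0, 0, 1, 1)                        # saturates (0, 1), (0, 2) and (2, 1)
lemma_holds = check_lemma_69(f1, C, 1, 0, 1)  # pushing back on (1, 0)
```

Pushing one unit on (0, 1) saturates the arcs (0, 2) and (2, 1) as a side effect, and pushing back on (1, 0) unsaturates them again, which is exactly the situation the lemma addresses.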

Corollary 70. Increasing ﬂow on an edge never creates a shortcut between s and i, or from i to t.

Proof. Assume we just increased flow on arc (c, d)C, causing (a, b)C to have positive residual capacity. We'll consider the case where c ∈ S; the case where d ∈ T follows by symmetry. Because we increased flow on (c, d)C, it is along the shortest path from s to d, so ds(d) = ds(c) + 1.

We know that (c, d)C is either a tree arc, or an arc from S to T and hence ds(d) = ds(c) + 1. Since either d = a or (a, d) ∈ Aφ, we have ds(d) ≤ ds(a) + 1. Similarly, since either c = b or (c, b) ∈ Aφ we have ds(b) ≤ ds(c) + 1. Putting these inequalities together, we get that

ds(b) ≤ ds(c) + 1 = ds(d) ≤ ds(a) + 1

(5.17)

Therefore, when we cause (a, b) to become unsaturated, we don't create a new shortest path from s to b. The argument that we don't create a new shortest path from a to t is symmetric.
5.4 The Current Arc Heuristic
Finally, we will describe the "current arc heuristic" of IBFS, and show how its invariants still hold in the submodular flow case. We will discuss the current-arc mechanism for the source tree S; the case regarding T is symmetric. Some useful terminology: an arc (u, v) is admissible if ds(v) = ds(u) + 1 and (u, v) ∈ Aφ. Only admissible arcs can be tree arcs.
For every v ∈ S, we maintain a list of potential parent arcs (u, v) in an unspecified order, denoted (u, v) ≺ (u′, v). The parent node is the parent of v in the tree S, denoted p(v). At any point in the algorithm, the current arc is the arc from this list which is currently the parent arc of v (i.e., the current arc is always (p(v), v)). We maintain the invariant that all arcs before the current arc (according to the order ≺) are not admissible. Therefore, when searching for a new parent for v in an adoption step, we don't have to scan the entire list of potential arcs, but can instead increment the current arc until we find an admissible arc. If we get to the end of the list, we know that all arcs into v are currently inadmissible, so we must increase the label of v. This scheme allows us to "charge" the scanning of edges to the relabels of v, and there are only O(n) relabels per vertex.
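The pointer-advancing search described above can be sketched as follows. This is an illustrative fragment only: the data layout (`in_arcs`, `current`, `dist`) and the function name `find_parent` are assumptions, and the relabel that follows an exhausted list is left to the caller.

```python
def find_parent(v, in_arcs, current, dist, has_residual):
    """Advance v's current-arc pointer to the first admissible in-arc.

    in_arcs[v] lists candidate parents of v in the fixed order, and
    current[v] indexes the current arc. Returns the new parent, or None
    if the list is exhausted (the caller would then relabel v).
    """
    arcs = in_arcs[v]
    k = current[v]
    while k < len(arcs):
        u = arcs[k]
        # admissible: positive residual capacity and dist[v] == dist[u] + 1
        if has_residual(u, v) and dist[v] == dist[u] + 1:
            current[v] = k
            return u
        k += 1
    current[v] = k
    return None

in_arcs = {"v": ["a", "b", "c"]}
dist = {"a": 1, "b": 1, "c": 1, "v": 2}
saturated = {("a", "v"), ("b", "v")}
current = {"v": 0}
p = find_parent("v", in_arcs, current, dist,
                lambda u, w: (u, w) not in saturated)
```

Because arcs before the pointer are never rescanned until v is relabeled, the total scanning work per vertex is bounded by its in-degree times its number of relabels.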
In order to make our scheme work for the submodular flow case, we specify a particular order of in-arcs for each v. Put a linear ordering < on V, and order the arcs so that if u < w then (u, v) ≺ (w, v). We use the same linear ordering for doing our breadth-first search, so we scan through the nodes at distance Ds in sorted order, according to <.
We need to maintain the invariant that arcs before the current arc (p(v), v) are not admissible. Equivalently, if w < p(v) then (w, v) is not an admissible arc.
When a vertex v ﬁrst enters S , IBFS sets the current arc to the ﬁrst arc in the list (because we’re scanning through nodes at distance Ds in <-sorted order), so the invariant is initially true. So, assume that (u, v) is the current arc of v at a particular stage in the algorithm, and consider all the ways we could violate the invariant. The ﬁrst two cases are already covered by the proof of correctness of the standard IBFS algorithm, but we repeat them brieﬂy for completeness.
• We could change the current arc (i.e., change the parent of v). But we only do this during an adoption step, when we know that (u, v) is inadmissible. All arcs before (u, v) are also inadmissible, and we scan forward through the list till we ﬁnd an admissible (w, v) with u < w (or reach the end), which maintains the invariant.
• We could relabel a vertex w with w < u and (w, v) ∈ Aφ. But relabels only ever increase the label of w. If d(w) < d(v) then d(w) = d(v)−1 (if d(w) < d(v)− 1 then (w, v) would already be on the shortest path to v). So, increasing the label of w makes (w, v) inadmissible.
• We could change the flow φ such that an arc (w, v) which was previously saturated becomes unsaturated, and d(v) = d(w) + 1. This is only a problem if w < u, as the invariant has nothing to say about creating admissible arcs after the current arc. This is illustrated in Figure 5.1 (Left). By Lemma 69, the only way we could cause (w, v) to become unsaturated is by pushing on some arc (x, y) such that (1) w = y or (w, y) ∈ Aφ and (2) x = v or (x, v) ∈ Aφ. Recall, from our proof of Corollary 70 that this implies d(v) ≤ d(x) + 1 and d(y) ≤ d(w) + 1.

[Figure 5.1: three panels (Left, Center, Right); in each, distance increases along the vertical axis and node-order increases along the horizontal axis.]
Figure 5.1: Illustrations of the flow network regarding the current arc heuristic. Saturated edges are dotted while unsaturated edges are solid; parent arcs are colored red. The linear order on nodes increases from left to right. (Left) A potential failure of the current arc heuristic. The node w is less than u and pushing on some other arc causes (w, v) to become unsaturated. (Center) A potential arrangement of the nodes, where (x, y) is the arc whose flow is increasing. Note that this configuration is impossible because the parent arc of y is (x, y) which comes after (w, y). Similarly the parent arc of v is (u, v) which comes after (x, v). (Right) The only possible configuration of these nodes. In this case, we don't create a violation of the current arc heuristic, since the new edge created has u < w.
Since we’re pushing on (x, y) we know that (x, y) is a tree-arc, so x = p(y) and ds(y) = ds(x) + 1. If w = y then d(w) = d(y) = d(x) + 1 ≥ d(v) so we don’t create an admissible arc. Similarly, if x = v then d(v) = d(x) = d(y) − 1 ≤ d(w) so we again don’t create an admissible arc.
Since neither w = y nor x = v, we're in the case of Figure 5.1 (Center), where there are arcs (x, y), (w, y) and (x, v), all of which are in Aφ. If we've created a new admissible arc (w, v), then d(v) = d(w) + 1, and hence d(v) = d(w) + 1 ≥ d(y) = d(x) + 1 ≥ d(v) so d(v) = d(y) = d(w) + 1 = d(x) + 1. Therefore, each of the three arcs (x, y), (w, y) and (x, v) is also admissible.
Since x = p(y) we know (since the invariant holds before we change φ) that x ≤ w, as (w, y) is admissible. Similarly, since (u, v) is the current arc of v, we must have that u ≤ x since (x, v) is an admissible arc. Therefore, u ≤ x ≤ w, and hence whenever we create a new admissible arc, we create one after the current arc of v, so the invariant is maintained. Since the current arc invariants are maintained, all the arguments regarding the runtime of IBFS hold in the submodular flow case, and we get an overall number of $O(n^2 m)$ basic steps, for a total of $O(n^2 m 2^k)$ time.

CHAPTER 6 SUBMODULAR UPPER BOUNDS FOR HIGHER ORDER ENERGY FUNCTIONS
6.1 Introduction
Now that we have a tool, Sum-of-Submodular flow, for optimizing binary submodular MRFs, we want to apply it in a graph-cuts algorithm for more general higher-order MRFs. The key challenge is that most higher-order vision MRFs are not submodular, including the denoising and stereo applications that popularized higher-order MRFs [77, 102]. Our approach for optimizing these nonsubmodular functions is to instead find a submodular function which is close to the original function, and then exactly optimize that submodular proxy. QPBO [8, 51] takes this approach, as does its generalization GRD [43].
Given an arbitrary binary function f, the natural choice of submodular proxy g is an upper bound on f. Since g ≥ f, when we make g small we also make f small. We will call g a submodular upper bound. Experimentally, minimizing submodular upper bounds has been successful at optimizing both pairwise binary functions with the local search algorithm LSA-AUX [31], as well as higher-order multilabel problems with Auxiliary Cuts [6] and SoSPD [22].
The main contribution of this chapter is to derive a principled submodular upper bound for higher-order MRFs. We show that we can minimize the distance between the submodular upper bound and the original function, and we give fast algorithms for ﬁnding approximately optimal upper bounds in practice.

6.2 Background and Related Work

For this chapter, the most useful deﬁnition of submodularity is the following equivalent condition from Theorem 32: f is submodular if

\[ f(S) + f(S + i + j) \le f(S + i) + f(S + j) \tag{6.1} \]
for all $S \subseteq V$ and $i, j \notin S$.
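Condition (6.1) can be checked directly by enumeration when the ground set is small. The following is a minimal sketch, assuming a set function stored as a dictionary keyed by frozensets; the names `all_subsets`, `is_submodular` and the two toy functions are illustrative only.

```python
from itertools import combinations

def all_subsets(V):
    items = sorted(V)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def is_submodular(f, V, tol=1e-9):
    """Check f(S) + f(S+i+j) <= f(S+i) + f(S+j) for all S and i, j not in S."""
    for S in all_subsets(V):
        rest = sorted(set(V) - S)
        for i, j in combinations(rest, 2):
            if f[S] + f[S | {i, j}] > f[S | {i}] + f[S | {j}] + tol:
                return False
    return True

V = {1, 2, 3}
# A concave function of |S| (values 0, 1, 1, 0), which is submodular.
concave_card = {S: min(len(S), 3 - len(S)) for S in all_subsets(V)}
# The product x1 * x2, which is supermodular and violates (6.1) at S = {}.
product = {S: 1.0 if {1, 2} <= S else 0.0 for S in all_subsets(V)}
```

The check visits every set and every pair outside it, so it is exponential in the size of the ground set and is only practical clique-by-clique.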

For higher-order functions, any submodular function can be minimized in $O(n^6)$ time [73]; however, this is impractical for vision-sized inputs. A more practical alternative is sum-of-submodular optimization, which fits between graph cuts and submodular optimization in complexity. The Submodular IBFS algorithm (Chapter 5) in particular is well-suited to vision inputs, as it generalizes the popular Boykov-Kolmogorov1 algorithm [10] to higher-order inputs, and for fixed-sized cliques runs in worst-case time $O(n^2 m)$ for $m$ cliques and $n$ variables (the same complexity as IBFS for pairwise inputs). However, it scales as $O(2^{|C|})$ as the clique size grows, so we will henceforth only consider higher-order inputs with a small constant for the size of the cliques $|C|$. Currently, all known methods for optimizing general higher-order functions have this exponential (or at least $O(|C|^6)$) dependence on the size of the cliques. Faster methods exist for certain special cases of energy functions (such as robust-$P^n$ [49], and other structured cost functions), but cannot handle arbitrary energies, as the current method does.

1The Boykov-Kolmogorov flow algorithm and IBFS [30] are very similar: the latter makes a small tweak to a subroutine which allows proving an $O(n^2 m)$ runtime. Despite this, the implementation of BK flow remains popular within the vision community.

For non-binary problems, graph cuts methods typically reduce the solution to a series of binary sub-problems, using move-making algorithms such as alpha-expansion [11] and fusion moves [66], as well as more sophisticated primal dual algorithms such as FastPD [58], and its higher-order generalization SoSPD [22].
The major restriction in the above is that the binary problems are required to be submodular. If the functions are not submodular, then the problem is NP-hard [52]. Yet many optimization problems encountered in practice are not submodular, so we need a submodular function whose optimum is close to the original.
Submodular upper bounds have been used before in several different methods to approximate arbitrary functions. The FastPD algorithm [58] uses a very simple upper bound: if, during a particular expansion move, the cut-cost of an edge would be negative, the variant PD3a simply truncates that cost to 0, ensuring the expansion move problem is submodular (this approach originated in [80]). A more sophisticated upper bound was employed in the LSA-AUX method [31], which found a series of upper bounds gt, each of which has gt(xt) = f (xt) at the current point xt. Each move solves a pairwise submodular minimization problem xt+1 = arg min gt(x) with graph cuts, giving a fast local-search method. However, for pairwise functions, submodular upper bounds are particularly simple: either the edge term fi, j is already submodular (in which case no upper bound is necessary) or it is supermodular, in which case the best upper bound is a linear function.
For higher-order functions, Auxiliary Cuts [6] uses convexity and other properties of image functionals to compute upper bounds which are pairwise submodular functions. These upper bounds are iteratively minimized and updated similarly to LSA-AUX. The Pseudo-Bound method [92] extends this idea by considering a parameterized family of functions which include a pairwise submodular upper bound, and finding all minimizers of the entire family using parametric max-flow; by looking at all minimizers, a greater decrease in energy per iteration is obtained. Note that for all these methods, the upper bounds are all pairwise functions, even if the functions they approximate are higher-order.
Our approach is based on a linear program for minimizing the distance between the submodular upper bound and the original function. We give new approximations to this LP that have nice theoretical properties, and which perform well in practice. For multilabel problems, these upper bounds allow approximate optimization of expansion and fusion moves, even when the move-making binary subproblem is non-submodular. We will show that by employing our upper bounds in a fusion-moves framework, better energy optimization is obtained.

6.2.1 Notation

Some notation we will use throughout the rest of the chapter: Recall that for binary problems, we are identifying set functions $f(S)$ with the vector notation $f(x)$, $x \in \{0,1\}^n$. For a set $S$ and element $i$, we will write $S + i$ for $S \cup \{i\}$. For a set of coefficients $a_i$ for $i \in V$, we let $a(S) = \sum_{i \in S} a_i$. Note that $a$ is a linear function. For a set of coefficients $b_{i,j}$ on edges $\{i,j\}$, we will write
\[ b(S) = \sum_{i<j :\, i,j \in S} b_{i,j} \tag{6.2} \]
Note that $b$ is a quadratic function. Whenever we use a pair of indices $i, j$, this is always an unordered pair, so $b_{i,j} = b_{j,i}$ and $\delta_{S,i,j} = \delta_{S,j,i}$. For two set functions $f, g : 2^V \to \mathbb{R}$ we will define the 1-norm and ∞-norm distance between them by treating the functions as length-$2^{|V|}$ vectors, i.e., $\|g - f\|_1 = \sum_S |g(S) - f(S)|$ and $\|g - f\|_\infty = \max_S |g(S) - f(S)|$.

6.3 Submodular Upper Bounds
Recall that a function is submodular if it satisfies (6.1) for all $S \subseteq V$ and $i, j \notin S$. Given this definition, we can write an optimization problem that minimizes, over all submodular upper bounds $g$, the $p$-norm distance between $g$ and an arbitrary function $f$:
\[
\begin{aligned}
\min_g\ & \|g - f\|_p \\
\text{s.t. } & g(S) \ge f(S) && \forall S \subseteq V \\
& g(S) + g(S + i + j) \le g(S + i) + g(S + j) && \forall S \subseteq V,\ i, j \notin S
\end{aligned}
\tag{6.3}
\]

For p = 1 and p = ∞ the problem (6.3) is a linear program. There are $2^{|C|}$ variables; however, for small clique sizes this can be solved exactly using off-the-shelf LP solvers. Note that small clique size is also the restriction for when sum-of-submodular flow is tractable, so this is not an additional restriction. We do apply this upper bound clique-by-clique, so that the final sum-of-submodular function $g$ is the sum $g = \sum_C g_C$ of upper bounds for each clique, $g_C \ge f_C$.
It's worth noting that distance measures other than the 1- and ∞-norm are possible here. For example, we also considered the 2-norm; however, this turns (6.3) into a quadratic program, which is harder to solve. Experimentally, we found that the 2-norm was outperformed by the ∞-norm on the examples in Section 8.2, while taking longer to compute.

One reason for choosing the ∞-norm is that we can prove a global approximation bound on the minimization problem $\min_x f(x)$. Let $k_C = \|g_C - f_C\|_\infty$ be the ∞-norm objective for (6.3) for each clique $C$. Then, we always have $f_C(x_C) \ge g_C(x_C) - k_C$ for any $x_C$, and hence $f(x^*) \ge g(x^*) - \sum_C k_C$ for the minimizing assignment $x^*$. That is, $\sum_C k_C$ is an additive approximation bound relating the minimum of $g$ and the minimum of the original function $f$. In practice, this is a very weak bound; however, it does argue that improving the ∞-norm distance between $f$ and $g$ will result in better energy after minimizing $g$.

Finally, we'll note that the upper bound for pairwise MRFs proposed in the LSA-AUX method of [31] is a special case of (6.3). This upper bound is obtained by taking each quadratic edge term $f_{i,j}(x_i, x_j) = b_{i,j} x_i x_j$ of the objective: if $b_{i,j} \le 0$ then the term is submodular, in which case the upper bound is $g_{i,j} = f_{i,j}$. If $b_{i,j} > 0$ then the term is supermodular, in which case they give a linear upper bound $g_{i,j}(x_i, x_j) = b_{i,j} \frac{x_i + x_j}{2}$. It is simple to check that this linear upper bound minimizes (6.3) for both p = 1 and p = ∞. Essentially, in the pairwise case, there aren't enough degrees of freedom, so all nontrivial submodular upper bounds are linear functions. This is not the case for cliques with |C| > 2.
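The pairwise rule above is easy to tabulate over the four assignments of a single edge. The sketch below builds the edge term and its bound and records the gap between them; the function name `pairwise_upper_bound` and the coefficient 3.0 are illustrative choices, not part of LSA-AUX itself.

```python
from itertools import product

def pairwise_upper_bound(b):
    """Return (f, g) on {0,1}^2 for the edge term f = b * xi * xj."""
    f = {(xi, xj): b * xi * xj for xi, xj in product((0, 1), repeat=2)}
    if b <= 0:
        g = dict(f)                                   # already submodular
    else:
        g = {(xi, xj): b * (xi + xj) / 2 for xi, xj in f}  # linear bound
    return f, g

f, g = pairwise_upper_bound(3.0)
gaps = {x: g[x] - f[x] for x in f}
```

For a positive coefficient, the bound is tight at (0, 0) and (1, 1) and exceeds the edge term by half the coefficient at the two mixed assignments.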

6.4 Upper Bound Approximations
In practice, even though we can solve (6.3) using linear programming, it is much more efficient to find approximate solutions (in some of our experiments, optimization using the LP upper bounds took upwards of an hour). In this section, we will present two alternative submodular upper bounds, both of which minimize the 1-norm objective of (6.3) under some additional constraints, and both of which are much easier to compute.
Our intuition for these upper bounds is to ﬁrst consider the question: what simple families of submodular functions are there? In particular, there are two classes of functions which can easily be shown to be submodular. The ﬁrst are quadratic (pairwise) functions, in which every quadratic term has non-positive coefﬁcient. The second family are functions of the form φ(S ) = h(|S |) where h is a concave function.
6.4.1 The Iterative Heuristic of SoSPD
The ﬁrst upper bound method for SoS functions was the heuristic employed by the original paper for the SoSPD [22] algorithm (Chapter 7). The basic intuition is to ﬁrst look at the equations deﬁning submodularity, f (S ) + f (S + i + j) ≤ f (S + i) + f (S + j). If this inequality is violated, we either lower f (S ), f (S + i + j) or raise f (S + i), f (S + j). Since we’re ﬁnding an upper bound, we can only increase the values. Thus, our only options are to increase f (S + i) or f (S + j).
The SoSPD heuristic simply iterates through all the sets in decreasing size, and increases both f (S + i) and f (S + j) by the amount needed to ﬁx the violated inequality. However, increasing f (S + i) may cause some other inequality to be violated (e.g., f (S − k) + f ((S − k) + i + k) ≤ f ((S − k) + i) + f ((S − k) + k) for some k). Therefore, this method needs to repeatedly iterate through all the sets S until all inequalities are satisﬁed.
The specifics of this method are given in the supplementary material of [22]; however, it was primarily intended as a simple heuristic for a subroutine of the SoSPD algorithm, and is both theoretically and experimentally inferior to the principled upper bounds below. This method is guaranteed to converge; however, there are no results concerning how close to the original function f this upper bound will be.

6.4.2 Quadratic-Based Submodular Upper Bounds

Consider a quadratic function $q(x) = \sum_i a_i x_i + \sum_{i<j} b_{i,j} x_i x_j$. In set notation, $q(S) = a(S) + b(S)$, and $q$ is submodular if and only if $b_{i,j} \le 0$ for all $i, j$. Our goal is to find a submodular quadratic function $q$ such that $g_q(S) := f(S) + q(S)$ is submodular and $g_q \ge f$.

To do so, define $\delta_{S,i,j} = f(S) + f(S + i + j) - f(S + i) - f(S + j)$. That is, $\delta_{S,i,j}$ is the amount by which the submodularity inequality for $S, i, j$ is violated (possibly negative if the inequality is satisfied). For $i, j$, define $\Delta_{i,j} = \max\{\max_{S :\, i,j \notin S} \delta_{S,i,j},\ 0\}$, the maximum violation over all constraints containing $i, j$.

Construct a quadratic submodular function $q^*$ by setting $b^*_{i,j} = -\Delta_{i,j}$ for all $i, j$ and $a^*_i = \frac{1}{2} \sum_j \Delta_{i,j}$ for all $i$. We now show that $q^*$ is actually the best $q$ which makes $g_q$ a submodular upper bound to $f$, under the 1-norm. More specifically, consider the program
\[
\begin{aligned}
\min_q\ & \|g_q - f\|_1 \\
\text{s.t. } & g_q = f + q \\
& g_q \ge f \\
& g_q \text{ and } q \text{ are submodular}
\end{aligned}
\tag{6.4}
\]


Theorem 71. q∗ is a minimizer of (6.4).

Before we prove Theorem 71, we will note that even though $q$ is quadratic, $g_q$ may not be. In particular, for the function $f$ of three variables, $f(x) = 2x_1x_2 - x_1x_2x_3$, we have $q^*(x) = x_1 + x_2 - 2x_1x_2$ and $g_{q^*}(x) = x_1 + x_2 - x_1x_2x_3$, which is submodular, but not quadratic.
Also, it's worth noting that $q^*$ is an approximate solution to (6.3) in that we are minimizing over a smaller set, by restricting $g$ to be $f + q$ for a submodular, quadratic $q$. However, we can compute $q^*$ quickly, by iterating first over all pairs $(i, j)$ and then over all $S$ with $i, j \notin S$ and computing $\delta_{S,i,j}$ for each. For small cliques $C$ this operation is very efficient (though it is $O(2^{|C|})$ as $|C|$ grows).
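The computation of $q^*$ described above can be sketched directly from the definitions of $\delta_{S,i,j}$ and $\Delta_{i,j}$, and checked on the three-variable example $f(x) = 2x_1x_2 - x_1x_2x_3$. The dictionary-of-frozensets representation and the function name `quadratic_upper_bound` are assumptions made for this illustration.

```python
from itertools import combinations

def all_subsets(C):
    items = sorted(C)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def quadratic_upper_bound(f, C):
    """Return (Delta, a_star, g) with g = f + q*, b* = -Delta, a*_i = sum_j Delta_ij / 2."""
    Delta = {}
    for i, j in combinations(sorted(C), 2):
        worst = 0.0
        for S in all_subsets(set(C) - {i, j}):
            # delta_{S,i,j}: violation of the submodularity inequality at S, i, j
            viol = f[S] + f[S | {i, j}] - f[S | {i}] - f[S | {j}]
            worst = max(worst, viol)
        Delta[(i, j)] = worst
    a = {i: 0.5 * sum(d for pair, d in Delta.items() if i in pair) for i in C}
    g = {}
    for S in all_subsets(C):
        q = sum(a[i] for i in S) - sum(d for (i, j), d in Delta.items()
                                       if i in S and j in S)
        g[S] = f[S] + q
    return Delta, a, g

# The example from the text: f(x) = 2*x1*x2 - x1*x2*x3.
C = {1, 2, 3}
f = {S: 2.0 * ({1, 2} <= S) - 1.0 * ({1, 2, 3} <= S) for S in all_subsets(C)}
Delta, a_star, g = quadratic_upper_bound(f, C)
```

On this example only the pair (1, 2) has a positive violation, so $q^*$ touches only $x_1$ and $x_2$, and the resulting $g$ agrees with $x_1 + x_2 - x_1x_2x_3$ on all eight assignments.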

We now prove Theorem 71. First, we show that the constraints of (6.4) can be re-written as inequalities on the coefﬁcients a and b of q.
Proposition 72. (6.4) is equivalent to
\[ \min_q \{ \|q\|_1 : q \ge 0,\ b_{i,j} \le -\Delta_{i,j}\ \forall i, j \} \tag{6.5} \]

Proof. The only non-trivial part of the equivalence is showing we can replace the submodularity constraints with $b_{i,j} \le -\Delta_{i,j}$ for all $i, j$. So, begin with the inequalities defining submodularity for $g_q$: for every $S$ and $i, j \notin S$ we have a constraint
\[ g_q(S) + g_q(S + i + j) - g_q(S + i) - g_q(S + j) \le 0. \tag{6.6} \]
Substituting in $g_q = f + q$, and recalling the definition of $\delta_{S,i,j}$, we have
\begin{align}
0 &\ge g_q(S) + g_q(S + i + j) - g_q(S + i) - g_q(S + j) \notag \\
&= f(S) + f(S + i + j) - f(S + i) - f(S + j) + q(S) + q(S + i + j) - q(S + i) - q(S + j) \notag \\
&= \delta_{S,i,j} + q(S) + q(S + i + j) - q(S + i) - q(S + j) \tag{6.7}
\end{align}
Then, we can expand out $q$ to get
\begin{align}
& \delta_{S,i,j} + q(S) + q(S + i + j) - q(S + i) - q(S + j) \notag \\
&= \delta_{S,i,j} + a(S) + a(S + i + j) - a(S + i) - a(S + j) + b(S) + b(S + i + j) - b(S + i) - b(S + j) \notag \\
&= \delta_{S,i,j} + b_{i,j} \tag{6.8}
\end{align}
Therefore, $g_q$ is submodular if and only if $b_{i,j} \le -\delta_{S,i,j}$ for all $S$ with $i, j \notin S$. As for the constraint that $q$ is submodular, this happens if and only if $b_{i,j} \le 0$. Therefore, we have that all these constraints can be replaced by $b_{i,j} \le -\Delta_{i,j}$ for all $i, j$.

Since $b^*_{i,j} = -\Delta_{i,j}$, to show that $q^*$ is feasible for (6.5) we just need to show that $q^* \ge 0$. Note that we can rearrange $q^*$ using vector notation as follows:
\begin{align}
q^*(x) &= \sum_i a^*_i x_i + \sum_{i<j} b^*_{i,j} x_i x_j \tag{6.9} \\
&= \sum_i \Big( \tfrac{1}{2} \sum_j \Delta_{i,j} \Big) x_i + \sum_{i<j} (-\Delta_{i,j}) x_i x_j \tag{6.10} \\
&= \sum_{i<j} \Big( \frac{x_i + x_j}{2} - x_i x_j \Big) \Delta_{i,j} \tag{6.11}
\end{align}
Then, we can check that for the four assignments of $x_i, x_j \in \{0, 1\}$ we have $\frac{x_i + x_j}{2} - x_i x_j \ge 0$, so $q^* \ge 0$, and hence $q^*$ is feasible for (6.5).


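Finally, to prove Theorem 71 we look at the objective $\|g - f\|_1 = \|q\|_1$.

\[
\begin{aligned}
\|q\|_1 = \sum_S q(S) &= \sum_S \sum_{i \in S} a_i + \sum_S \sum_{\substack{i,j \in S \\ i<j}} b_{i,j} && (6.12) \\
  &= \sum_i \sum_{S : i \in S} a_i + \sum_{i<j} \sum_{S : i,j \in S} b_{i,j} && (6.13) \\
  &= 2^{|C|-1} \sum_i a_i + 2^{|C|-2} \sum_{i<j} b_{i,j} && (6.14)
\end{aligned}
\]

Since for feasible $q$ we have $g_q(C) - f(C) \ge 0$ we must have $q(C) \ge 0$, and hence $\sum_i a_i \ge -\sum_{i<j} b_{i,j}$. Therefore, we get that $\|g_q - f\|_1 \ge -2^{|C|-2} \sum_{i<j} b_{i,j} \ge 2^{|C|-2} \sum_{i<j} \Delta_{i,j}$.

On the other hand

\[
\begin{aligned}
\|q^*\|_1 &= 2^{|C|-1} \sum_i a^*_i + 2^{|C|-2} \sum_{i<j} b^*_{i,j} \\
  &= 2^{|C|-1} \sum_i \Big( \tfrac{1}{2} \sum_j \Delta_{i,j} \Big) + 2^{|C|-2} \sum_{i<j} -\Delta_{i,j} \\
  &= 2^{|C|-2} \sum_{i<j} \Delta_{i,j}
\end{aligned}
\tag{6.15}
\]

so $q^*$ is optimal.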

6.4.3 Cardinality-Based Submodular Upper Bounds
In this section, we will focus on another family of functions, the concave cardinality-based functions $\phi(S) = h(|S|)$ where $h$ is concave. To motivate this, recall that such a $\phi$ is submodular if and only if $h$ is concave. The goal of this section is to find a concave cardinality-based function $\phi$ such that $g_\phi(S) := f(S) + \phi(S)$ is submodular and $g_\phi \ge f$. We are going to solve the following program:


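\[
\begin{aligned}
\min_{\phi}\ \ & \|g_\phi - f\|_p \\
\text{s.t.}\ \ & g_\phi = f + \phi \\
               & g_\phi \ge f, \\
               & g_\phi \text{ and } \phi \text{ are submodular.}
\end{aligned}
\tag{6.16}
\]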

Let’s introduce $n = |C|$, and $\psi_k = h(k)$ for $k = 0, 1, \ldots, n$ as a shorthand, so that $\phi(S) = h(|S|) = \psi_{|S|}$. We also let $\Delta_k = \max\{\max_{|S| = k-1,\ i,j \notin S} \delta_{S,i,j},\ 0\}$ for $k = 1, \ldots, n-1$, which is the maximum submodularity violation over all sets of size $k - 1$.² Note again that $\delta_{S,i,j}$ might be negative but $\Delta_k$ is always non-negative. Now, we can rewrite (6.16) in an equivalent form:

\[
\begin{aligned}
\min_{\psi}\ \ & \|\psi\|_p \\
\text{s.t.}\ \ & \psi \ge 0, \\
               & 2\psi_k \ge \psi_{k-1} + \psi_{k+1} + \Delta_k, \quad k = 1, \ldots, n-1
\end{aligned}
\tag{6.17}
\]

where we define

\[ \|\psi\|_1 := \sum_{k=0}^{n} \binom{n}{k} \psi_k, \qquad \|\psi\|_\infty = \max_k \psi_k \tag{6.18} \]

Lemma 73. (6.16) and (6.17) are equivalent.

Proof. Since $g_\phi(S) - f(S) = \psi_{|S|}$, the objective is $\|\psi\|_p = \|g_\phi - f\|_p$ by how we’ve just defined $\|\psi\|_p$. The constraint $\psi \ge 0$ is equivalent to $g_\phi \ge f$.

Next, for $g_\phi$ to be submodular, we require

\[ f(S + i) + \psi_{|S|+1} + f(S + j) + \psi_{|S|+1} \ge f(S) + \psi_{|S|} + f(S + i + j) + \psi_{|S|+2} \tag{6.19} \]

for all $S$ and $i, j \notin S$. Rearranging this, we get

\[ 2\psi_{|S|+1} - \psi_{|S|} - \psi_{|S|+2} \ge f(S) + f(S + i + j) - f(S + i) - f(S + j) = \delta_{S,i,j} \tag{6.20} \]

This must hold for all $S$ and $i, j \notin S$. Equivalently, for every $k$ (where $k = |S| + 1$) we have $2\psi_k - \psi_{k-1} - \psi_{k+1} \ge \max_{S : |S| = k-1,\ i,j \notin S} \delta_{S,i,j}$. Also note that since $h$ is concave we always have $2\psi_k - \psi_{k-1} - \psi_{k+1} \ge 0$. Therefore, the submodularity of $g_\phi$ and $\phi$ together is equivalent to the constraints $2\psi_k - \psi_{k-1} - \psi_{k+1} \ge \Delta_k$ for $k = 1, \ldots, n-1$.

² Recall that $\delta_{S,i,j}$ is defined as in Section 6.4.2, and gives the violation of the submodular constraint for $S, i, j$.

Since $2\psi_k - \psi_{k-1} - \psi_{k+1}$ is widely referred to as the discrete Laplacian operator, we will refer to these constraints as Laplacian constraints. Furthermore, if we require all the inequalities in the Laplacian constraints to be satisfied with equality, these are called the inhomogeneous Laplacian equations. Our algorithm to find the cardinality-based upper bound is to solve these Laplacian equations, in a procedure detailed below. We will show later in this section that this algorithm gives us a feasible submodular upper bound. Additionally, the resulting solution $\psi^*$ is optimal under the 1-norm and a 2-approximation to (6.16) under the ∞-norm.

Without loss of generality, we can assume $\psi_0 = \psi_n = 0$: since $\psi_0$ and $\psi_n$ only appear on the right-hand side of the Laplacian constraints, for any feasible solution $\psi$ we could decrease $\psi_0$ and $\psi_n$ to 0 without violating any of the constraints and without increasing the objective value.

After fixing $\psi_0$ and $\psi_n$ to be 0, there are $n - 1$ variables $\psi_1, \ldots, \psi_{n-1}$ and $n - 1$ Laplacian constraints remaining. We can uniquely solve the corresponding Laplacian equations since the coefficient matrix has full rank. Also note that this system is very efficient to solve: since the coefficient matrix $M$ is tridiagonal, we can easily factor it into an LU decomposition and, using forward and backward substitution, solve $M\psi = \Delta$ in $O(n)$ time. The bottleneck of the cardinality-based upper bound computation is still the computation of $\delta_{S,i,j}$ and $\Delta_k$, which takes exponential time in the clique size. The following lemma proves the correctness of the algorithm.
Lemma 74. Solving the Laplacian equations gives us a feasible solution for (6.17), where we define $\psi^* = M^{-1}\Delta$.

Proof. Clearly, all the Laplacian constraints are satisfied, since we treat them as equalities and solve the linear system for a point which satisfies all of them simultaneously. To show non-negativity, it’s straightforward to show that the inverse of the Laplacian coefficient matrix is a positive matrix; for completeness, we include this in Appendix B. Each $\Delta_k$ is non-negative by definition, hence our solution $\psi^* = M^{-1}\Delta$ is a positive matrix times a non-negative vector, and is therefore also non-negative.
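To make the $O(n)$ solve concrete, here is a minimal sketch of forward elimination and back substitution (the Thomas algorithm) specialized to the tridiagonal system with 2 on the diagonal and −1 off the diagonal. The function name `solve_laplacian` and the use of exact `Fraction` arithmetic are choices made for this illustration, not details taken from the thesis.

```python
from fractions import Fraction

def solve_laplacian(delta):
    """Solve 2*psi_k - psi_{k-1} - psi_{k+1} = Delta_k for k = 1..n-1
    with psi_0 = psi_n = 0. `delta` is [Delta_1, ..., Delta_{n-1}];
    the result is [psi_0, ..., psi_n]."""
    m = len(delta)
    if m == 0:
        return [Fraction(0), Fraction(0)]
    c = [Fraction(0)] * m  # modified super-diagonal coefficients
    d = [Fraction(0)] * m  # modified right-hand side
    c[0] = Fraction(-1, 2)
    d[0] = Fraction(delta[0]) / 2
    # Forward elimination: the sub-diagonal entry is -1 and the diagonal is 2,
    # so each pivot is 2 - (-1) * c[k-1] = 2 + c[k-1].
    for k in range(1, m):
        pivot = 2 + c[k - 1]
        c[k] = Fraction(-1) / pivot
        d[k] = (Fraction(delta[k]) + d[k - 1]) / pivot
    # Back substitution.
    psi = [Fraction(0)] * m
    psi[-1] = d[-1]
    for k in range(m - 2, -1, -1):
        psi[k] = d[k] - c[k] * psi[k + 1]
    return [Fraction(0)] + psi + [Fraction(0)]

# Example with n = 4 and violations (Delta_1, Delta_2, Delta_3) = (1, 0, 2).
psi = solve_laplacian([1, 0, 2])
```

For this example the solution is $\psi^* = (0, 5/4, 3/2, 7/4, 0)$, which is non-negative and satisfies every Laplacian constraint with equality, as Lemma 74 states.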

Now, let’s show some optimality properties of the cardinality-based upper bound. The following lemma is the key fact for the analysis.

Lemma 75. For all $k \le \frac{n}{2}$, all feasible $\psi$ have $\psi_k + \psi_{n-k} \ge L_k$, where $L_0 = 0$ and $L_k = L_{k-1} + \sum_{i=k}^{n-k} \Delta_i$. Solving the Laplacian equations gives us the $\psi$ which simultaneously minimizes the quantity $\psi_k + \psi_{n-k}$ for all $k$, meaning $\psi^*_k + \psi^*_{n-k} = L_k$.

Proof. We prove this by induction on k. The base case k = 0 is trivial since we enforce ψ0 = ψn = 0 in our algorithm, and the non-negativity of ψ ensures that ψ0 + ψn reaches its lower bound L0 = 0.


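For $k \ge 1$, we can sum up the $k$-th to the $(n-k)$-th Laplacian constraints, to get

\[ \sum_{l=k}^{n-k} (2\psi_l - \psi_{l-1} - \psi_{l+1}) \ge \sum_{l=k}^{n-k} \Delta_l \tag{6.21} \]

and hence

\[
\begin{aligned}
\sum_{l=k}^{n-k} \Delta_l &\le \sum_{l=k}^{n-k} (\psi_l - \psi_{l-1}) + \sum_{l=k}^{n-k} (\psi_l - \psi_{l+1}) \\
  &= \psi_{n-k} - \psi_{k-1} - \psi_{n-k+1} + \psi_k
\end{aligned}
\tag{6.22}
\]

and rearranging, we have

\[ \psi_k + \psi_{n-k} \ge \psi_{k-1} + \psi_{n-k+1} + \sum_{i=k}^{n-k} \Delta_i \ge L_{k-1} + \sum_{i=k}^{n-k} \Delta_i = L_k \tag{6.23} \]

Since we get the $\psi^*$’s from solving the Laplacian equations, all the above inequalities hold with equality. So inductively assuming $\psi^*_{k-1} + \psi^*_{n-k+1} = L_{k-1}$ we have

\[ \psi^*_k + \psi^*_{n-k} = \psi^*_{k-1} + \psi^*_{n-k+1} + \sum_{i=k}^{n-k} \Delta_i = L_k \tag{6.24} \]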
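Lemma 75 can be checked numerically on a small instance. The sketch below is only an illustration: it solves the Laplacian equations with the standard closed form for the inverse of the tridiagonal (2, −1) matrix, $(M^{-1})_{ij} = \min(i,j)\,(n - \max(i,j))/n$, which is a well-known identity used here as an assumption (the thesis's own argument is in its Appendix B). The function names are hypothetical.

```python
from fractions import Fraction

def solve_by_inverse(delta):
    """Return [psi_0, ..., psi_n] with psi = M^{-1} * Delta, using the
    closed form (M^{-1})_{ij} = min(i, j) * (n - max(i, j)) / n.
    Every entry of this inverse is positive for 1 <= i, j <= n - 1."""
    n = len(delta) + 1
    psi = [Fraction(0)] * (n + 1)
    for i in range(1, n):
        psi[i] = sum(Fraction(min(i, j) * (n - max(i, j)), n) * delta[j - 1]
                     for j in range(1, n))
    return psi

def lower_bounds(delta):
    """Return [L_0, ..., L_{floor(n/2)}] with L_0 = 0 and
    L_k = L_{k-1} + sum_{i=k}^{n-k} Delta_i."""
    n = len(delta) + 1
    L = [0]
    for k in range(1, n // 2 + 1):
        L.append(L[-1] + sum(delta[i - 1] for i in range(k, n - k + 1)))
    return L

delta = [1, 0, 2]  # (Delta_1, Delta_2, Delta_3), so n = 4
psi = solve_by_inverse(delta)
L = lower_bounds(delta)
n = len(delta) + 1
pair_sums = [psi[k] + psi[n - k] for k in range(len(L))]
```

Here `pair_sums` equals `L` term by term, which is the equality case $\psi^*_k + \psi^*_{n-k} = L_k$ of the lemma.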

Theorem 76. The cardinality-based upper bound is optimal under the p = 1 version of (6.17).

Proof. The 1-norm objective of (6.17) is $\sum_{k=0}^{n} \binom{n}{k} \psi_k$. Since the binomial coefficients are symmetric, this objective function is a positive linear combination of the sums $\psi_k + \psi_{n-k}$. For odd $n$ we have

\[ \|\psi\|_1 := \sum_{k=0}^{n} \binom{n}{k} \psi_k = \sum_{k=0}^{(n-1)/2} \binom{n}{k} (\psi_k + \psi_{n-k}) \tag{6.25} \]

and for even $n$:

\[ \|\psi\|_1 = \sum_{k=0}^{n/2-1} \binom{n}{k} (\psi_k + \psi_{n-k}) + \frac{1}{2} \binom{n}{n/2} (\psi_{n/2} + \psi_{n/2}) \tag{6.26} \]

Since, by Lemma 75, we separately minimize each sum $\psi_k + \psi_{n-k}$, we minimize the whole objective as well.


Theorem 77. The cardinality-based upper bound gives a 2-approximation under the p = ∞ version of (6.17).

Proof. Under the ∞-norm, the objective function in (6.17) is $\max_k \psi_k$. In Lemma 75, we have established the lower bound $\psi_k + \psi_{n-k} \ge L_k$, so that $\max\{\psi_k, \psi_{n-k}\} \ge \frac{L_k}{2}$. Hence, the minimum of (6.17) is at least $\max_k \frac{L_k}{2}$. We also know our choice of $\psi^*$ has $\psi^*_k + \psi^*_{n-k} = L_k$ for each $k$ and all the $\psi^*_k$ are non-negative, i.e., in the worst case, our algorithm can give us a feasible solution with objective value $\max_k L_k$. Therefore, our algorithm is a 2-approximation under the ∞-norm.

Additionally, we can prove the following general approximation ratio for an arbitrary p-norm ($p \ge 1$), which contains the previous two theorems as special cases. The proof for the general case is analogous to the proof for the 1-norm, and we will defer it to Appendix C.

Theorem 78. The cardinality-based upper bound gives a $2^{(1 - \frac{1}{p})}$-approximation for (6.17).


CHAPTER 7
A PRIMAL-DUAL ALGORITHM FOR HIGHER-ORDER MULTILABEL MARKOV RANDOM FIELDS

Now that we have adapted the state-of-the-art ﬂow algorithm for vision problems to solve higher-order submodular MRFs, and also shown how to handle arbitrary non-submodular binary MRFs using upper bounds, we can turn to the problem of handling multi-label problems. In this chapter we propose a new primal-dual energy minimization method for arbitrary higher-order multilabel MRFs. Primal-dual methods provide guaranteed approximation bounds, and can exploit information in the dual variables to improve their efﬁciency. Our algorithm generalizes the PD3 [57] technique for ﬁrst-order MRFs, and relies on the SoS IBFS algorithm of Chapter 5 to optimize the binary MRFs at each step. We provide approximation bounds similar to PD3 [57], and the method is fast in practice. It can optimize non-submodular MRFs, and additionally can incorporate problem-speciﬁc knowledge in the form of fusion proposals.
7.1 Higher-order Multi-label MRFs

In multi-label problems, we now allow the label set $X_i$ for each variable $i$ to be larger than just $\{0, 1\}$. We minimize the cost of the labeling $f : X \to \mathbb{R}$ defined by

\[ f(x) = \sum_i f_i(x_i) + \sum_{C \in \mathcal{C}} f_C(x_C). \tag{7.1} \]

$f_C$ is a function of just the variables $x_i$ with $i \in C$ (a subvector of $x$ which we denote $x_C$). We assume, without loss of generality, that $f_i, f_C \ge 0$ (by reparametrizing them using Lemma 19 of Section 2.1). Special cases include first-order MRFs where $|C| = 2$, and binary MRFs where $|X_i| = 2$; we are interested in the general case of higher-order multi-label MRFs, where we restrict neither $|C|$ nor $|X_i|$.

For multi-label MRFs, the most popular graph-cuts algorithms are based on alpha-expansion. Recall that alpha-expansion solves a series of binary problems, where each variable i can either keep its current label xi, or switch to a ﬁxed label α. Alpha-expansion cycles through all α ∈ Xi until no variable changes in an entire loop through all α.
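The outer loop of alpha-expansion described above can be sketched as follows. This is a toy illustration under a loudly stated assumption: the binary subproblem is solved by brute-force enumeration of all expansion moves, which is only feasible for a handful of variables, whereas the methods discussed in this thesis solve it with graph cuts or SoS flow. The function name `alpha_expansion` and the three-variable chain energy are made up for the example.

```python
from itertools import product

def alpha_expansion(energy, x, labels):
    """Cycle through labels alpha; in each move every variable either keeps
    its label or switches to alpha. Stop when a full pass changes nothing.
    The binary move is solved by exhaustive search (assumption: tiny problem)."""
    x = list(x)
    changed = True
    while changed:
        changed = False
        for alpha in labels:
            best, best_e = x, energy(x)
            for mask in product([False, True], repeat=len(x)):
                cand = [alpha if m else xi for m, xi in zip(mask, x)]
                e = energy(cand)
                if e < best_e:  # accept strict improvements only
                    best, best_e = cand, e
            if best != x:
                x, changed = best, True
    return x

# Toy chain of 3 variables and 3 labels: variable i prefers label i,
# and neighbouring variables pay a Potts cost of 1 when they disagree.
unary = [[0, 2, 2], [2, 0, 2], [2, 2, 0]]

def energy(x):
    data = sum(unary[i][x[i]] for i in range(3))
    smooth = sum(1 for i in range(2) if x[i] != x[i + 1])
    return data + smooth

result = alpha_expansion(energy, [0, 0, 0], [0, 1, 2])
```

Because only strictly improving moves are accepted, the energy decreases with every change and the loop terminates; on this toy input it ends at the labeling [0, 1, 2] with energy 2.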

For certain classes of pairwise functions $f_{i,j}$, alpha-expansion has provable approximation guarantees. If the $f_{i,j}$ are all Potts terms, then alpha-expansion is a 2-approximation. When the $f_{i,j}$ are all the same, and form a metric, then the work of [47] showed that the approximation ratio is $2 \frac{f^{\max}}{f^{\min}}$, where $f^{\max}$ is the maximum value of $f_{i,j}$ and $f^{\min}$ is the minimum nonzero value of $f_{i,j}$.

The connection between graph cuts and primal-dual techniques was established by [57], who showed that α-expansion could be interpreted as simultaneously optimizing primal and dual solutions. [57] proposed several primal-dual algorithms that generalized α-expansion and provided both theoretical and practical advantages. These methods apply to much more general energy functions and extend the approximation bounds of [47]. Empirically, keeping track of the dual variables allows a number of implementation speedups compared to α-expansion, resulting in the very efficient algorithm FastPD [59].

7.1.1 Summary of Our Method
In this chapter, we provide a generalization of the primal-dual algorithm PD3 of [57] that can efficiently minimize an arbitrary higher-order multilabel energy function. Briefly: PD3 relies on the max-flow / min-cut algorithm; the flow values update the dual variables and the min-cut updates the primal variables. Our method instead uses the SoS flow of Chapter 5, which can exactly minimize the class of Sum-of-Submodular functions, with a corresponding SoS max-flow.
Primal-dual methods rely on the optimality conditions for linear programming, in particular the complementary slackness conditions of Section 2.8. These conditions relate the slack in a constraint of the primal problem with the non-zeros of dual variables, and give necessary conditions for a pair of primal and dual solutions to be optimal.
Our algorithm begins with the Local Marginal Polytope (LMP) relaxation of Section 2.4. Recall that the LMP has two kinds of constraints, corresponding to the unary terms and clique-based terms in equation (7.1). We refer to the respective complementary slackness conditions as unary and clique slackness conditions. We will keep track of a primal solution x and (not necessarily feasible) dual solution λ. We will ensure that x, λ always satisfy the clique slackness conditions, and at each step of the algorithm, we will try to move x to be closer to satisfying unary slackness. The algorithm converges to a solution where both slackness conditions hold, but we generally lose feasibility. However, there exists some ρ such that λ/ρ is dual-feasible. This gives us a ρ-approximation algorithm for a class of functions we call weakly associative.
We review the related work in Section 7.2. Our algorithm is presented in Section 7.3, and we conclude with an experimental evaluation in Section 8.3.

7.2 Related Work
7.2.1 Graph Cut Methods and Higher-Order MRFs
The most popular graph cut methods for multilabel first-order MRFs rely on move-making techniques. Those methods, which notably include α-expansion [12] and fusion moves [66], reduce the multilabel problem to a series of binary subproblems which are then solved by max-flow [8, 52]. In α-expansion [12], the binary problem involves each pixel deciding whether to keep its current label or adopt a particular new label α. The expansion move algorithm also provides a guaranteed approximation bound.
[57, 58] proposed a primal-dual framework that generalizes α-expansion. They interpreted this algorithm as optimizing the primal and dual problem of the LP-relaxation of the MRF energy function simultaneously. In addition, the general primal-dual algorithm overcomes the most important limitation of the α-expansion algorithm, which is the requirement that the pairwise energy must be a metric [12]. The same approximation ratio still holds for a much broader class of energy functions. Furthermore, by tracking the dual variables to speed up the optimization, it can be 3-9 times faster in practice [59].
7.2.2 Linear Programming and Duality for MRFs
Much of the theory of MRF optimization algorithms revolves around a specific linear programming relaxation of (7.1) known as the Local Marginal Polytope formulation [86], which was extended to higher-order MRFs in [99]. Every linear program (LP) has a corresponding dual, and the dual program has resulted in efficient algorithms such as [56, 57, 59]. We derived the dual program for the Local Marginal Polytope in Section 2.9.

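Recall that the dual program has variables for each clique $C$, $i \in C$ and label $x_i$, denoted $\lambda_{C,i}(x_i)$; and is given by

\[
\begin{aligned}
\max_{\lambda}\ \ & \sum_i \min_{x_i} h_i(x_i) && && (7.2a) \\
& h_i(x_i) = f_i(x_i) + \sum_{C \ni i} \lambda_{C,i}(x_i) && \forall i && (7.2b) \\
& \sum_{i \in C} \lambda_{C,i}(x_i) \le f_C(x_C) && \forall C, x_C && (7.2c)
\end{aligned}
\]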

We can informally think of the dual variable λC,i(xi) as taking part of the cost fC(xC), and redistributing it to the unary terms. Following [57], the functions hi(xi) will be called the “height” of label xi at variable i, and semantically can be thought of as the original cost fi(xi), plus any redistribution λC,i from the cliques to the unary terms at i. The dual is always a lower bound on the value f (x) of any labeling.
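The redistribution picture can be made concrete with a tiny example. The sketch below is illustrative only: the instance (three variables, three labels, one clique), the dictionary layout `lam[(C, i, a)]` for $\lambda_{C,i}(a)$, and all function names are assumptions made for this example. It computes the heights $h_i$, the dual objective (7.2a), and checks dual feasibility (7.2c) by enumeration.

```python
from fractions import Fraction
from itertools import product

labels = [0, 1, 2]
variables = [0, 1, 2]
cliques = [(0, 1, 2)]
unary_table = [[0, 2, 3], [2, 0, 3], [3, 2, 0]]  # unary_table[i][a] = f_i(a)

def clique_cost(C, xC):
    """Toy f_C: 0 for a constant labeling of the clique, 2 otherwise."""
    return 0 if len(set(xC)) == 1 else 2

def energy(x):
    return (sum(unary_table[i][x[i]] for i in variables)
            + sum(clique_cost(C, tuple(x[i] for i in C)) for C in cliques))

def height(lam, i, a):
    """h_i(a) = f_i(a) + sum over cliques containing i of lambda_{C,i}(a)."""
    return unary_table[i][a] + sum(lam.get((C, i, a), 0) for C in cliques if i in C)

def dual_objective(lam):
    return sum(min(height(lam, i, a) for a in labels) for i in variables)

def is_dual_feasible(lam):
    """Check sum_{i in C} lambda_{C,i}(x_i) <= f_C(x_C) for every C and x_C."""
    for C in cliques:
        for xC in product(labels, repeat=len(C)):
            if sum(lam.get((C, i, a), 0) for i, a in zip(C, xC)) > clique_cost(C, xC):
                return False
    return True

# Move 2/3 of the clique cost onto each variable's preferred label and
# take 1/3 away from the other labels, which keeps (7.2c) satisfied.
C0 = cliques[0]
lam = {(C0, i, a): Fraction(2, 3) if a == i else Fraction(-1, 3)
       for i in C0 for a in labels}
```

For this hand-picked λ the dual objective is 2, which happens to equal the minimum energy (attained at the labeling (0, 1, 2)), so the lower bound is tight on this instance; in general the dual only bounds the optimum from below.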

7.2.3 Sum-of-Submodular Flow

We will summarize the most important features of SoS ﬂow from Chapter 5. We have a set of vertices V plus the source s and sink t, and arcs (s, i) and (i, t) for each i ∈ V. We are also given a sum-of-submodular function:

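\[ g(S) = \sum_{C \in \mathcal{C}} g_C(S \cap C) + \sum_{i \in S} c_{i,t} + \sum_{i \notin S} c_{s,i} \tag{7.3} \]

where $C \in \mathcal{C}$ are called cliques in $V$, and each $g_C$ is a submodular function, called a clique function, with

\[ g_C(\emptyset) = g_C(C) = \min_S g_C(S \cap C) = 0. \tag{7.4} \]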

Intuitively, the difference between max flow and sum-of-submodular flow is that in addition to capacity and conservation constraints, we will also require that the flow out of any set $S$ is at most $g_C(S \cap C)$. To be precise, a sum-of-submodular flow has flow values $\phi_{s,i}$ and $\phi_{i,t}$ on the source and sink edges, as well as flow values $\phi_{C,i}$ for each clique $C$ and $i \in C$. Then, a maximum sum-of-submodular flow is a solution to the following LP:

\[
\begin{aligned}
\max_{\phi}\ \ & \sum_i \phi_{s,i} && && (7.5a) \\
\text{s.t.}\ \ & \phi_{s,i} \le c_{s,i}, \quad \phi_{i,t} \le c_{i,t} && \forall i && (7.5b) \\
& \phi_{s,i} - \phi_{i,t} - \sum_{C \ni i} \phi_{C,i} = 0 && \forall i && (7.5c) \\
& \sum_{i \in S} \phi_{C,i} \le g_C(S) && \forall C,\ S \subseteq C && (7.5d)
\end{aligned}
\]

Here, (7.5b) are the capacity constraints for source and sink edges, with capacities given by the unary terms cs,i, ci,t, (7.5c) are the ﬂow-conservation constraints at i and (7.5d) are the additional constraints that the φC in a set S are at most gC(S ). [53] shows that this LP can be solved by a generalized ﬂow algorithm.
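As a small sanity check of (7.3) and (7.5), the sketch below evaluates the cut function by brute force on a toy network and verifies that a hand-constructed flow is feasible and has the same value as the minimum cut. Everything here is an assumption made for illustration: the network, the clique function $g_C(T) = \min(|T|, |C| - |T|)$, the flow values, and the function names. It is not SoS-IBFS, and it checks only the constraints written in (7.5).

```python
from itertools import combinations

V = [0, 1, 2]
cliques = [(0, 1, 2)]
c_s = {0: 2, 1: 0, 2: 1}  # capacities of the arcs (s, i)
c_t = {0: 0, 1: 2, 2: 0}  # capacities of the arcs (i, t)

def g_C(C, T):
    """Toy submodular clique function with g_C(empty) = g_C(C) = 0."""
    return min(len(T), len(C) - len(T))

def subsets(items):
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def cut_value(S):
    """g(S) from (7.3)."""
    return (sum(g_C(C, S & frozenset(C)) for C in cliques)
            + sum(c_t[i] for i in S)
            + sum(c_s[i] for i in V if i not in S))

def is_feasible(phi_s, phi_t, phi_C):
    """Check the constraints (7.5b), (7.5c) and (7.5d)."""
    for i in V:
        if phi_s[i] > c_s[i] or phi_t[i] > c_t[i]:
            return False
        through_cliques = sum(phi_C[C, i] for C in cliques if i in C)
        if phi_s[i] - phi_t[i] - through_cliques != 0:
            return False
    for C in cliques:
        for T in subsets(C):
            if sum(phi_C[C, i] for i in T) > g_C(C, T):
                return False
    return True

S_min = min(subsets(V), key=cut_value)

# One unit of flow: source -> 0 -> (clique) -> 1 -> sink.
phi_s = {0: 1, 1: 0, 2: 0}
phi_t = {0: 0, 1: 1, 2: 0}
phi_C = {((0, 1, 2), 0): 1, ((0, 1, 2), 1): -1, ((0, 1, 2), 2): 0}
flow_value = sum(phi_s.values())
```

On this instance the minimum cut is $S = \{0, 2\}$ with value 1, and the unit flow above is feasible with the same value, consistent with the min-cut max-flow statement for sum-of-submodular flow.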

Finally, we have a sum-of-submodular version of the min-cut max-flow theorem, originally from [53], and described in Section 5.1.3. If $\phi$ maximizes (7.5), and $S$ minimizes (7.3), then the objective value (7.5a) of $\phi$ is equal to $g(S)$. Furthermore, the notion of saturated edges extends to the clique function: (1) if $i \in S$ then $\phi_{i,t} = c_{i,t}$, (2) if $i \notin S$ then $\phi_{s,i} = c_{s,i}$, and most importantly (3) for every clique $C$, $g_C(S \cap C) = \sum_{i \in S} \phi_{C,i}$.

    Initialize x arbitrarily.
    Initialize λ_{C,i}(x_i) = (1/|C|) f_C(x_C), and λ_{C,i}(a) = 0 for a ≠ x_i.
    while unary slackness conditions are not satisfied do
        y ← result of proposal generator
        PRE-EDIT-DUALS(x, y, λ)
        x′, λ′ ← UPDATE-DUALS-PRIMALS(x, y, λ)
        POST-EDIT-DUALS(x′, λ′)
    end while
    return x

Algorithm 1: Our SoSPD algorithm.

7.3 The SoS Primal Dual Algorithm

Our algorithm, which we will call SoSPD, is designed around ensuring that two main conditions are satisﬁed regarding the primal and dual solutions. These conditions give us our approximation bound, as well as help design the rest of the algorithm. The conditions are complementary slackness conditions (Section 2.8), in which the inequalities in the dual that correspond to a particular primal solution are actually satisﬁed with equality.

Definition 79. Given a labeling $x$ and dual solution $\lambda$, we say that $x, \lambda$ satisfy the clique slackness conditions if the constraints in (7.2c) corresponding to $x_C$ are satisfied with equality. That is, we have

\[ \sum_{i \in C} \lambda_{C,i}(x_i) = f_C(x_C) \qquad \forall C \tag{7.6} \]

Proposition 80. If $x, \lambda$ satisfy the clique slackness conditions, then $f(x) = \sum_i h_i(x_i)$.


Proof. Remembering our redistribution argument, this means we have exactly partitioned $f_C(x_C)$ among the $\lambda$, so the sum of the heights is the original cost $f(x)$. That is,

\[
\begin{aligned}
\sum_i h_i(x_i) &= \sum_i \Big[ f_i(x_i) + \sum_{C \ni i} \lambda_{C,i}(x_i) \Big] \\
  &= \sum_i f_i(x_i) + \sum_C \sum_{i \in C} \lambda_{C,i}(x_i) \\
  &= \sum_i f_i(x_i) + \sum_C f_C(x_C) = f(x)
\end{aligned}
\tag{7.7}
\]

Definition 81. $x, \lambda$ satisfy the unary slackness conditions if for each $i$ we have $h_i(x_i) = \min_a h_i(a)$.

Corollary 82. If x, λ satisfy both the clique and unary slackness conditions, and λ is feasible, then x minimizes f .

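Proof. From Proposition 80, the sum of heights $\sum_i h_i(x_i)$ is equal to $f(x)$, and by the definition of unary slackness, the sum of heights is also equal to the dual objective. Since $\lambda$ is feasible, the dual objective is a lower bound on all possible values of $f$, so $x$ attains this lower bound and is therefore optimal.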

Since our original problem is NP-hard we can’t expect both slackness conditions to hold for a feasible dual $\lambda$ and integral primal $x$ (for any solution we can find in polynomial time). We instead apply a technique called dual scaling [57], in which we allow our duals to become slightly infeasible, but in a way that they can be multiplied by a scalar to become feasible. More specifically, the structure of (7.2) always allows us to scale down $\lambda$ by $\frac{1}{\rho}$ for some $\rho \ge 1$ to get a feasible solution. This gives us approximate optimality.

Lemma 83. If x and λ satisfy the unary and clique slackness conditions, and λ/ρ is dual feasible, then f (x) ≤ ρ f (x∗), where x∗ is the true optimum.


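Proof. Since $x, \lambda$ satisfy both slackness conditions, we know that $f(x) = \sum_i \min_{a_i} h_i(a_i)$, hence

\[
\begin{aligned}
f(x) &= \rho \sum_i \min_{a_i} \frac{1}{\rho} \Big[ f_i(a_i) + \sum_{C \ni i} \lambda_{C,i}(a_i) \Big] \\
     &\le \rho \sum_i \min_{a_i} \Big[ f_i(a_i) + \sum_{C \ni i} \frac{1}{\rho} \lambda_{C,i}(a_i) \Big] \\
     &\le \rho f(x^*)
\end{aligned}
\]

where the first inequality is because $f_i \ge 0$ (and $\rho \ge 1$), and the second from $\lambda/\rho$ being dual-feasible.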

Lemma 83 gives the basic motivation behind our algorithm. Between iterations, x, λ will always satisfy the clique slackness conditions, and the goal of each iteration is to change x to move to lower height labels. At the end of the algorithm, all the xi will be the lowest height labels for each i, and the unary slackness conditions are satisﬁed. Then, we’ll prove that there exists some ρ such that λ/ρ is dual-feasible, and hence we have a ρ-approximation algorithm.
The difﬁcult step in this algorithm is that when we change the labeling x to decrease the height, we must still maintain the clique slackness conditions. We cannot simply set each xi to the lowest height label, lest the clique slackness conditions cease to hold. Instead we simultaneously pick a set of labels to change, and adjust the dual variables such that the new clique slackness conditions are tight. For the higher order case, we can show that sum-of-submodular ﬂow is exactly the tool we need to ensure the clique slackness conditions still hold when changing labels.
At a high level, the algorithm works as follows. At each iteration, much like the α-expansion or fusion move algorithms, we have a current labeling $x$ and a proposed labeling $y$. We use sum-of-submodular flow to pick a set $S$ of variables that switch labels, and the max-flow min-cut theorem for sum-of-submodular flow will ensure that the new variables $x', \lambda'$ also satisfy the clique slackness conditions.
Our SoSPD technique is summarized in Algorithm 1, and each iteration has 3 subroutines. The main work of the algorithm occurs in UPDATE-DUALS-PRIMALS, which sets up the sum-of-submodular flow problem, and picks a set of variables to swap. We will describe this subroutine first, in Section 7.3.1, making some assumptions about $x, \lambda$ which may not hold in general. Then, it is the job of the other two subroutines, PRE-EDIT-DUALS and POST-EDIT-DUALS (Sections 7.3.2 and 7.3.3), to make sure these assumptions do hold, and that therefore the algorithm functions correctly.
7.3.1 Update-Duals-Primals
To begin with, we need notation for fusion moves [66]. If we have current and proposed labelings $x$ and $y$, and $S$ is the set of variables that change label, we’ll denote the fused labeling by $x' = x[S \leftarrow y]$, which has $x'_i = y_i$ if $i \in S$, and $x'_i = x_i$ if $i \notin S$.
Given our current state $x, \lambda$, we’re going to construct a sum-of-submodular flow network. The values $\phi_{C,i}$ will be the amount we add or subtract from $\lambda_{C,i}(y_i)$, and the source-sink flow $\phi_{s,i}, \phi_{i,t}$ will give the change in height of $h_i(y_i)$. We will only ever adjust the dual variables $\lambda_{C,i}(y_i)$ corresponding to the proposed labeling $y$.¹

¹ Note that if $x_i = y_i$, we do not change $\lambda_{C,i}(x_i)$. We could accomplish this by simply removing such $i$ from the flow network. However such vertices $i$ will have, by construction, no outgoing capacity in the network, so $\phi_{C,i}$ must always be 0.

The easy part is defining the source-sink capacities. If $h_i(y_i) < h_i(x_i)$ then we can raise the height of label $y_i$ by the difference, and still prefer to switch labels. Similarly, if $h_i(y_i) > h_i(x_i)$, we can lower the height of $y_i$ by the difference without creating a new label we’d prefer to swap to. We define source-sink capacities by $c_{s,i} = h_i(x_i) - h_i(y_i)$, $c_{i,t} = 0$ if $h_i(y_i) < h_i(x_i)$, and $c_{s,i} = 0$, $c_{i,t} = h_i(y_i) - h_i(x_i)$ otherwise, so that both capacities are always non-negative.
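The capacity rule can be written down directly. The sketch below assumes the convention that capacities are non-negative, so the source arc carries the height difference exactly when the proposed label is strictly lower; the function name `source_sink_capacities` and the list-based interface are hypothetical.

```python
def source_sink_capacities(h_x, h_y):
    """Given heights h_x[i] = h_i(x_i) of the current labels and
    h_y[i] = h_i(y_i) of the proposed labels, return (c_s, c_t) with
    c_s[i] = h_x[i] - h_y[i], c_t[i] = 0   if h_y[i] < h_x[i],
    c_s[i] = 0, c_t[i] = h_y[i] - h_x[i]   otherwise."""
    c_s, c_t = [], []
    for hx, hy in zip(h_x, h_y):
        if hy < hx:
            c_s.append(hx - hy)
            c_t.append(0)
        else:
            c_s.append(0)
            c_t.append(hy - hx)
    return c_s, c_t

# Variable 0 prefers the proposal, variable 1 prefers its current label,
# and variable 2 is indifferent.
c_s, c_t = source_sink_capacities([5, 2, 3], [3, 4, 3])
```

With these inputs only variable 0 gets source capacity and only variable 1 gets sink capacity, so exactly one of the two arcs at each vertex can carry flow.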

In addition to decreasing the heights of the variables, our other main concern is making sure that the clique slackness conditions continue to hold. Consider an individual clique $C$ for now, and let us examine what our labeling $x'_C$ could look like after a fusion step. The possible labelings are $x_C[S \leftarrow y_C]$ for each subset $S$ of $C$. We want to make sure that after the swap, $\sum_i \lambda_{C,i}(x'_i) = f_C(x'_C)$, so define a function $g_C$ equal to the difference:

\[ g_C(S) := f_C(x_C[S \leftarrow y_C]) - \sum_{i \in S} \lambda_{C,i}(y_i) - \sum_{i \notin S} \lambda_{C,i}(x_i) \tag{7.8} \]

For now, we’ll assume that (1) $g_C$ is a submodular function and (2) $g_C(\emptyset) = g_C(C) = 0$, $g_C(S) \ge 0$. These assumptions will end up being enforced by PRE-EDIT-DUALS, which we describe below.
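To see (7.8) and the two assumptions on a concrete clique, the following sketch tabulates $g_C$ for one α-expansion move and checks them by enumeration. The clique function (0 for constant labelings, 1 otherwise), the labelings, the dual values and the function names are all made up for this illustration; positions within the clique are indexed from 0.

```python
from fractions import Fraction
from itertools import combinations

def fuse(x, y, S):
    """Fused labeling x[S <- y]: take y_i for i in S and x_i otherwise."""
    return tuple(y[i] if i in S else x[i] for i in range(len(x)))

def clique_move_function(f_C, x_C, y_C, lam_x, lam_y):
    """Table of g_C(S) as in (7.8), where lam_x[i] = lambda_{C,i}(x_i)
    and lam_y[i] = lambda_{C,i}(y_i)."""
    k = len(x_C)
    g = {}
    for r in range(k + 1):
        for combo in combinations(range(k), r):
            S = frozenset(combo)
            g[S] = (f_C(fuse(x_C, y_C, S))
                    - sum(lam_y[i] for i in S)
                    - sum(lam_x[i] for i in range(k) if i not in S))
    return g

def is_submodular(g, k):
    """Check g(S) + g(S+i+j) <= g(S+i) + g(S+j) for all S and i, j not in S."""
    for S in g:
        outside = [v for v in range(k) if v not in S]
        for i, j in combinations(outside, 2):
            if g[S] + g[S | {i, j}] > g[S | {i}] + g[S | {j}]:
                return False
    return True

def f_C(xC):
    return 0 if len(set(xC)) == 1 else 1

x_C = (0, 0, 1)               # current labels, with f_C(x_C) = 1
y_C = (2, 2, 2)               # alpha-expansion proposal with alpha = 2
lam_x = [Fraction(1, 3)] * 3  # clique slackness: the duals sum to f_C(x_C)
lam_y = [Fraction(0)] * 3

g = clique_move_function(f_C, x_C, y_C, lam_x, lam_y)
```

In this example $g_C(\emptyset) = g_C(C) = 0$, every value is non-negative, and $g_C$ is submodular, so no correction by PRE-EDIT-DUALS would be needed; for general inputs these properties can fail, which is what that subroutine repairs.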

Under these assumptions the capacities $c$ and functions $g_C$ define a sum-of-submodular flow network, so we can find a flow $\phi$ and cut $S$ such that $g_C(S \cap C) = \sum_{i \in S} \phi_{C,i}$ (by the sum-of-submodular version of the max-flow min-cut theorem [53], paraphrased at the end of Section 7.2.3). Then, we set $x' = x[S \leftarrow y]$, and $\lambda'_{C,i}(y_i) = \lambda_{C,i}(y_i) + \phi_{C,i}$. By definition of $g_C$, we have

\[
\begin{aligned}
f_C(x'_C) &= g_C(S \cap C) + \sum_{i \in S} \lambda_{C,i}(y_i) + \sum_{i \notin S} \lambda_{C,i}(x_i) \\
  &= \sum_{i \in S} [\lambda_{C,i}(y_i) + \phi_{C,i}] + \sum_{i \notin S} \lambda_{C,i}(x_i) = \sum_i \lambda'_{C,i}(x'_i).
\end{aligned}
\]

Therefore, the primal and dual solutions satisfy the clique slackness conditions, and our source-sink capacities were chosen so that $h_i(x'_i) \le h_i(x_i)$.
Finally, note that unless every edge out of s gets saturated (and hence S = ∅) then at least one height has strictly decreased.
7.3.2 Pre-Edit-Duals
The job of PRE-EDIT-DUALS is to ensure that the assumptions we made in UPDATE-DUALS-PRIMALS are actually true. Namely, we need that (1) the function $g_C$ is submodular and (2) $g_C(\emptyset) = g_C(C) = 0$ and $g_C(S) \ge 0$.
For (1), ﬁrst note that if fC(xC[S ← yC]) is a submodular function of S , then so is gC, since a submodular function plus a linear function is still submodular. Such functions were called expansion-submodular in [19]. To handle general energy functions, we need an approach for the case where the fusion move is not submodular.
We take a similar approach to the PD3 variant PD3a [57], which finds an overestimate of the original energy function. For pairwise energies, finding a submodular overestimate simply consists of truncating negative capacities to 0. In our case, we must find a submodular upper bound, $\hat f_C(S)$, such that $\hat f_C(S) \ge f_C(x_C[S \leftarrow y_C])$. Our only other requirements are that $\hat f_C(\emptyset) = f_C(x_C)$, $\hat f_C(C) = f_C(y_C)$, and that $\hat f_C(\{i\}) \le \max_{x_C} f_C(x_C)$ for $i \in C$. We will use the methods of Chapter 6 for finding submodular upper bounds of functions. We consider each of the choices presented there in the experiments.

Having computed $\hat f_C$, we then substitute it for $f_C$, just for this iteration. To simplify the notation we will write $\hat f_C(x'_C)$ to mean $\hat f_C(S)$ wherever $x'_C = x_C[S \leftarrow y_C]$.
To establish assumption (2), we make use of Edmonds’ algorithm [16], described by Lemma 35 from Section 2.3.3. This states that for any submodular function $g$ with $g(\emptyset) = 0$, there is a vector $\psi$ such that $g(S) + \psi(S) \ge 0$ and $g(C) = -\psi(C)$ (where we are using the standard notation $\psi(S) := \sum_{i \in S} \psi_i$). In fact, the vector defined by $\psi_i = g(\{1, \ldots, i-1\}) - g(\{1, \ldots, i\})$ will suffice.
To ensure (2) holds, we start with $g_C(S)$ defined as in (7.8). Note that we have $g_C(\emptyset) = f_C(x_C) - \sum_{i \in C} \lambda_{C,i}(x_i)$, which by the clique slackness condition, we know is 0. We can therefore compute a $\psi$ as just described, and update $\lambda_{C,i}(y_i) \leftarrow \lambda_{C,i}(y_i) - \psi_i$. Since $g_C(S) + \psi(S) \ge 0$ and $g_C(C) + \psi(C) = 0$, when we update $g_C \leftarrow g_C + \psi$ with the new values of $\lambda$, we satisfy $g_C(S) \ge 0$ and $g_C(C) = 0$.
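The greedy vector used above is easy to compute and check by enumeration. The sketch below is illustrative: the function name `greedy_vector`, the 0-based element ordering, and the toy submodular function (a concave function of $|S|$ plus a modular term) are assumptions made for this example.

```python
from itertools import combinations

def greedy_vector(g, n):
    """psi_i = g({0, ..., i-1}) - g({0, ..., i}) for i = 0, ..., n-1,
    i.e. the formula from the text with elements numbered from 0."""
    return [g(frozenset(range(i))) - g(frozenset(range(i + 1)))
            for i in range(n)]

# Toy submodular function on C = {0, 1, 2} with g(empty) = 0:
# a concave function of the cardinality plus a modular term.
h = [0, 3, 4, 3]
m = [-1, 0, 1]

def g(S):
    return h[len(S)] + sum(m[i] for i in S)

n = 3
psi = greedy_vector(g, n)

def shifted(S):
    """g(S) + psi(S), the function obtained after the dual update."""
    return g(S) + sum(psi[i] for i in S)

all_sets = [frozenset(c) for r in range(n + 1)
            for c in combinations(range(n), r)]
```

Here `psi` comes out as [-2, -1, 0], and the shifted function is non-negative on every subset while vanishing on the empty set and on the full clique, which is exactly assumption (2).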

7.3.3 Post-Edit-Duals

Having run UPDATE-DUALS-PRIMALS, we know that $\hat f_C(x'_C) = \sum_i \lambda'_{C,i}(x'_i)$, where $\hat f_C$ is the upper bound substituted for $f_C$ in PRE-EDIT-DUALS. However, $\hat f$ might be an overestimate of $f$.

The subroutine POST-EDIT-DUALS enforces the clique slackness conditions, by setting $\lambda_{C,i}(y_i) = \frac{1}{|C|} f_C(x'_C)$ for each clique $C$. Note that if $\hat f$ is an overestimate, this can only ever decrease the sum of heights $\sum_i h_i(x'_i)$ (since we first average, and then subtract the overestimate from $\lambda$).

One ﬁnal property of POST-EDIT-DUALS: since fC(xC) ≥ 0, we always know that λC,i(xi) ≥ 0. We will use this in the proof of approximation ratio, momentarily.


7.3.4 Proof of Convergence
Much like the pairwise algorithms α-expansion and fusion move, we have monotonically decreasing energy.

Lemma 84. The objective value $f(x)$ is non-increasing.

Proof. First, recall that $x, \lambda$ satisfy the clique slackness conditions, so $f(x) = \sum_i h_i(x_i)$. We also know that PRE-EDIT-DUALS doesn’t change any of the heights $h_i(x_i)$, UPDATE-DUALS-PRIMALS can only decrease $h_i(x_i)$ (by definition of the source-sink capacities) and POST-EDIT-DUALS also doesn’t increase the sum of heights.

The convergence of our method is not guaranteed for arbitrarily bad fusion moves (for instance, we could have a bad proposal generator which always suggests labels which have greater height than $x_i$). For α-expansion proposals, however, convergence is guaranteed.

Proposition 85. With the proposal $y_i = \alpha$ for each $i$, at the end of the iteration either $f(x') < f(x)$ or $h_i(x_i) \le h_i(\alpha)$ for all $i$.

Proof. From the discussion of UPDATE-DUALS-PRIMALS, one of two things happens: (1) the height of at least one variable is strictly decreased, or (2) the minimum cut is $S = \emptyset$. If (1), then neither of the other subroutines increases the sum of heights, so by Proposition 80 we have $f(x') < f(x)$. If (2) then all edges out of $s$ are saturated, so UPDATE-DUALS-PRIMALS increased $h_i(\alpha)$ to be at least $h_i(x_i)$. Furthermore, $x' = x$ and so neither PRE-EDIT-DUALS nor POST-EDIT-DUALS changes any of the $\lambda_{C,i}(x_i)$, and therefore $h_i(x_i) \le h_i(\alpha)$ holds at the end of the iteration.

Lemma 86. If after running through iterations of α-expansion for every label α, f (x) does not strictly decrease, then the unary slackness conditions must hold, and the algorithm terminates.
Proof. Since every α-expansion iteration didn’t change the objective $f(x)$, by Proposition 85 each such iteration ensures that $h_i(\alpha) \ge h_i(x_i)$. Also note that a β-expansion for $\beta \ne \alpha$ doesn’t change any of the $h_i(\alpha)$. Therefore, the $x_i$ are all minimum height labels, and the unary slackness conditions are satisfied.

Overall, with integer costs, the objective decreases by at least 1 in each outer iteration, and therefore the algorithm eventually halts. The running time of each iteration is dominated by the SoS flow computation; we use SoS-IBFS [19], which has runtime $O(|V|^2 |\mathcal{C}| 2^k)$, where $k = \max |C|$. Note that this is exponential in the clique size, since we represent submodular functions as tables of $2^k$ values. However, this is also true of other state-of-the-art methods for higher-order MRFs such as [18, 40, 55]. It is difficult to provide a non-trivial bound on the number of α-expansion iterations, but in practice we always observe convergence after 4 passes through the label set.
7.3.5 Approximation Bounds
Let $f^{\max} = \max f_C(x_C)$, $f^{\min} = \min f_C(x_C)$, where the max and min are over all cliques $C$ and all non-constant labelings $x_C$ (meaning there is no $a$ with $x_i = a$ for all $i \in C$). There is a natural class of MRFs where $f^{\min} > 0$ (i.e. all non-constant labelings have positive costs), and where constant labelings have zero cost. We call such MRFs weakly associative; they encourage all variables in a clique to have the same label, but are otherwise unrestricted on non-constant labelings. This generalizes what [57] calls non-metric energies.

Our approximation ratio will be $\rho = k \cdot \frac{f^{\max}}{f^{\min}}$. Note that $\rho$ is finite only for a weakly associative MRF. This generalizes the approximation ratio for PD3, which is $2 \frac{f^{\max}}{f^{\min}}$.
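As a worked instance of this ratio, the sketch below computes $\rho$ by enumerating the labelings of a single clique function. It assumes, for simplicity, that every clique has the same size and shares the same function; the function name `approximation_ratio` and the example cost (number of distinct labels minus one) are made up for this illustration, and at least two labels are assumed.

```python
from itertools import product

def approximation_ratio(f_C, clique_size, labels):
    """Return rho = k * f_max / f_min, with max and min taken over the
    non-constant labelings of the clique, or None if f_C is not weakly
    associative (a constant labeling with nonzero cost, or f_min <= 0)."""
    nonconstant_costs = []
    for xC in product(labels, repeat=clique_size):
        if len(set(xC)) == 1:
            if f_C(xC) != 0:
                return None  # constant labelings must have zero cost
        else:
            nonconstant_costs.append(f_C(xC))
    f_max, f_min = max(nonconstant_costs), min(nonconstant_costs)
    if f_min <= 0:
        return None
    return clique_size * f_max / f_min

def distinct_labels_cost(xC):
    """Toy weakly associative clique cost: number of distinct labels minus 1."""
    return len(set(xC)) - 1

rho = approximation_ratio(distinct_labels_cost, 3, [0, 1, 2])
```

For cliques of size 3 this cost has $f^{\max} = 2$ and $f^{\min} = 1$, giving $\rho = 6$; a cost that charges anything for a constant labeling is rejected as not weakly associative.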

Theorem 87. SoSPD with α-expansion for a weakly associative MRF $f$ is a ρ-approximation algorithm, i.e., the primal solution $x$ at the end will have $f(x) \le \rho f(x^*)$.

Proof. The first task is to show that $\lambda$ doesn’t get too big. In particular, after any iteration, $\lambda_{C,i}(a_i) \le f^{\max}$ for all $a_i$. Note that after UPDATE-DUALS-PRIMALS, we have

\[ \phi_{C,i} \le g_C(\{i\}) := \hat f_C(\{i\}) - \sum_{j \ne i} \lambda_{C,j}(x_j) - \lambda_{C,i}(y_i) \]

where $\hat f_C$ is the submodular upper bound computed in PRE-EDIT-DUALS. Since we constructed $\hat f$ to have $\hat f(\{i\}) \le f^{\max}$ and POST-EDIT-DUALS from the previous iteration makes sure $\lambda_{C,i}(x_i) \ge 0$, we get $\lambda'_{C,i}(y_i) = \lambda_{C,i}(y_i) + \phi_{C,i} \le f_C^{\max}$. If POST-EDIT-DUALS in the present iteration changes $\lambda_{C,i}(y_i)$, it sets it to $\frac{1}{|C|} f_C(x') \le f_C^{\max}$. Therefore, $\lambda_{C,i}(y_i) \le f_C^{\max}$, and we don’t change $\lambda_{C,i}(a_i)$ for any $a_i \ne y_i$ in this iteration, so inductively, at the end of the algorithm $\lambda_{C,i}(a_i) \le f_C^{\max}$ for all labels $a_i$.

For feasibility, we need to show that (7.2c) holds for each clique $C$ and labeling $x_C$. For non-constant $x_C$ we have

\[ \sum_i \frac{1}{\rho} \lambda_{C,i}(x_i) \le \frac{|C| f_C^{\max}}{\rho} \le f_C^{\min} \le f_C(x_C) \]

For constant labeling $x_C = \alpha$, note that in the last α-expansion, PRE-EDIT-DUALS enforces that $\hat f_C(C) - \sum_i \lambda_{C,i}(\alpha) = g_C(C) = 0$, and neither of the other subroutines violates this. Therefore $\sum_i \frac{1}{\rho} \lambda_{C,i}(\alpha) = \frac{1}{\rho} \hat f_C(C) = \frac{1}{\rho} f_C(\alpha) = 0$, where the second equality is because we constructed $\hat f_C$ with $\hat f_C(C) = f(y_C)$, and the last is since $f$ is weakly associative.
Finally, Lemma 83 says that at convergence, f (x) is no more than ρ f (x∗).

CHAPTER 8 EXPERIMENTAL EVALUATION OF THE SOSPD ALGORITHM
The last two chapters presented an algorithm for optimizing general non-submodular higher-order MRFs, based on linear programming relaxations to the local marginal polytope, and using submodular flow and submodular upper bounds to solve the binary subproblems in each fusion or expansion move. In this chapter, we will give the experimental results for these algorithms, showing that they are also empirically faster than existing state-of-the-art algorithms.
For our benchmarks, we choose the two exemplar higher-order problems described in the introduction: the Field of Experts model from Section 1.5.2 and the curvature regularizing stereo model of Section 1.5.3. We will describe in detail the datasets and models used for the experiments in Section 8.1.
There are two main questions we want to answer with these experiments. In Section 8.2, we explore which of the several proposed submodular upper bounds is the most effective for optimization in typical computer vision input problems. Then, in Section 8.3 we demonstrate the speedup of the primal-dual SoSPD algorithm over existing algorithms for multilabel higher-order inference.
8.1 Benchmarks and Datasets
For our experiments, we are interested in benchmarks which represent typical higher-order vision inputs, and which are difﬁcult to optimize.

8.1.1 Field of Experts Denoising

The first benchmark, Fields of Experts denoising, we have already seen in the experiments for the reduction method of Chapter 4. This benchmark is based on the model of [77], and has been used in many higher-order optimization papers, including [40, 18, 43, 22]. The dataset consists of 100 grayscale images from the Berkeley Segmentation Database [70], with independent Gaussian noise added to each pixel.

Recall that the Field of Experts model has 255 labels for each pixel $x_i$, with labels corresponding to denoised 8-bit intensity values. The unary terms are an L2 data cost

$$\sum_i \frac{1}{2\sigma^2} \lVert x_i - y_i \rVert^2 \tag{8.1}$$

where $y_i$ is the observed, noisy image and σ is an estimate of the standard deviation of the Gaussian noise added to each pixel.

The Field of Experts prior is applied to each local patch of the image. We use each (overlapping) 2x2 patch, for cliques of size 4. The clique functions are given by

$$f_C(x_C) = \sum_{i=1}^{k} \alpha_i \log\left(1 + J_i^T x_C\right) \tag{8.2}$$

where the $J_i$ are learned linear filters passed through a nonlinear activation function log(1 + ·), and the $\alpha_i$ are learned mixture components between these activations. The exact coefficients were trained by maximum-likelihood estimation on a training set. To allow reproducibility of results, we obtained the specific energy functions to be minimized from the OpenGM benchmark [45].1

1Available at http://hci.iwr.uni-heidelberg.de/opengm2/.


8.1.2 Curvature Regularizing Stereo Reconstruction

The second benchmark is based on the second-order stereo model of [102], as described in Section 1.5.3. The stereo reconstruction algorithm of [102] encourages the disparity map to be piecewise smooth using a 2nd order prior, composed of all 1 × 3 and 3 × 1 patches in the image, each of which penalizes a robust function of the curvature of the disparity map.

A number of optimization methods are proposed in [102], which are composed to get the final result. The most important step consists of pre-generating a set of 14 piecewise-planar proposed disparity maps, and then using these as proposals to the fusion-move algorithm to improve the current disparity until convergence. This method is called SEGPLN in [102]. We use the truncated-quadratic cost, which penalizes the deviation of each 3 × 1 patch from a plane via the robust L2 cost:

$$f_{i,j,k}(x_i, x_j, x_k) = \min\left( \left( \Delta_{x_i} - 2\Delta_{x_j} + \Delta_{x_k} \right)^2,\ \tau \right). \tag{8.3}$$
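The truncated-quadratic patch cost can be sketched in a few lines. This is only an illustration: it assumes that the Δ terms are the disparities that the chosen proposals assign to three consecutive pixels, and the function name `curvature_cost` and the threshold value are hypothetical, not taken from the code of [102].

```python
def curvature_cost(d_i, d_j, d_k, tau):
    """Truncated squared second difference of three consecutive disparities.

    d_i, d_j, d_k are assumed to be the disparities of neighbouring pixels
    in a 1x3 or 3x1 patch; tau caps the penalty so that genuine depth
    discontinuities are not over-penalized.
    """
    second_difference = d_i - 2.0 * d_j + d_k
    return min(second_difference ** 2, tau)


# Disparities that lie on a plane have zero second difference.
planar = curvature_cost(1.0, 2.0, 3.0, tau=4.0)
# A sharp kink is penalized, but only up to the truncation value tau.
kinked = curvature_cost(1.0, 5.0, 1.0, tau=4.0)
```

The truncation is what makes the cost robust: beyond the threshold, a larger deviation from planarity costs nothing extra.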

To give a fair benchmark comparison, we use the simpliﬁed model used in [55] and later adapted by [22], which omits the binary occlusion labels for each pixel. Because the stereo-reconstruction problem is doing inference in a continuous domain (of all possible disparity values, as real numbers) this model also discretizes the problem by giving each pixel 14 discrete labels, one for each pre-generated proposal. Experimentally, the fusion-move part of [102] (in which the algorithm repeatedly proposes and fuses these 14 proposals) is the bulk of the time spent by the algorithm, as well as the most important for energy reduction. So, good performance of discrete optimizers for this step can dramatically improve the performance of the whole algorithm.


Note that this discretization still allows estimating sub-pixel disparities (since the proposals can take any ﬂoating point value) using only 14 labels.
Data was obtained by running the code2 for [102] and recording the proposed fusion moves and corresponding unary terms. The dataset consists of 3 stereo pairs, “cones”, “teddy” and “venus” from the Middlebury Stereo Dataset [82, 83] obtained from http://vision.middlebury.edu/stereo/data/.
8.2 Comparison of Upper Bound Methods
8.2.1 Experimental Setup
Our ﬁrst set of experiments tests the effectiveness of the various upper bound methods of Chapter 6. We are primarily interested in the performance of these upper bounds as a subroutine within the SoSPD algorithm, as SoSPD can handle multilabel higher-order problems, including the two benchmarks of Section 8.1. It is possible to apply submodular upper bounds directly to the optimization of non-submodular higher-order binary problems; however, these problems are typically much less interesting from an application perspective as most computer vision problems are multilabel.
In addition to SoSPD, we also test the effectiveness of the upper bounds in a fusion move algorithm [66], by using the submodular upper bound to convert each (possibly non-submodular) higher-order binary fusion move into a submodular one, and then solve the resulting SoS optimization problem using the SoS IBFS algorithm of Chapter 5. Note that the fusion-move algorithm can be considered a pure-primal version of the primal-dual algorithm SoSPD, in that both algorithms will take the same sequence of fusion or expansion moves, with the same optimal binary labeling at each step. Thus, when both algorithms use the same upper bound, they will arrive at the same answer, though typically SoSPD is more efficient. We observed in our experiments that for a given upper bound, fusion moves had nearly identical3 final energy to the corresponding SoSPD result, but took a little over twice as long on both the stereo and denoising datasets, so we did not include them in the tables here.

2http://www.robots.ox.ac.uk/~ojw/software.htm.

Method        Energy      Time (s)    % Best
SOSPD-QUAD    30712.58    16.049      0.00%
SOSPD-HEUR    30748.19    20.212      0.00%
SOSPD-CARD    31190.61    22.179      0.00%
SOSPD-LP1     30706.60    2099.896    10.00%
SOSPD-LP∞     30704.49    5233.162    90.00%

Table 8.1: Comparison of upper bound methods for Fields of Experts denoising averaged over 10 images. Results for SOSPD-LP1 and SOSPD-LP∞ computed with the Gurobi LP solver. Gradient descent proposals were used to generate fusion moves in SoSPD.

Method        Energy      Time (s)    % Best
SOSPD-QUAD    32593.23    15.881      100.00%
SOSPD-CARD    33074.61    21.908      0.00%
SOSPD-HEUR    32629.35    20.032      0.00%

Table 8.2: Comparison of upper bound methods for the full Fields of Experts denoising dataset, averaged over 100 images. Gradient descent proposals used for SOSPD and REDUCTION-FUSION.

Method        Energy          Time (s)    % Best
SOSPD-QUAD    8.958 × 10^9    104.421     0.00%
SOSPD-HEUR    8.952 × 10^9    102.311     0.00%
SOSPD-CARD    8.958 × 10^9    116.521     0.00%
SOSPD-LP1     8.953 × 10^9    115.281     0.00%
SOSPD-LP∞     8.948 × 10^9    119.012     100.00%

Table 8.3: Comparison of upper bound methods for the Stereo dataset, averaged across 3 stereo pairs, “cones”, “teddy” and “venus”. Results for SOSPD-LP1 and SOSPD-LP∞ were computed using our custom simplex implementation.
Both Fusion-moves and SoSPD allow a choice of fusion proposals at each iteration. For the stereo example, we simply cycle through the 14 labels, doing an expansion move on each one. For the Fields of Experts experiments, we use the gradient descent proposals of [38], which have been shown to be the most effective fusion proposals for this dataset.
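The label-cycling scheme used for the stereo example can be illustrated on a toy problem. The sketch below is not the SoSPD implementation: it replaces the sum-of-submodular flow solver with brute-force enumeration of each binary subproblem, which is only feasible for a handful of variables, and the names `expansion_move`, `cycle_expansions` and `toy_energy` are hypothetical.

```python
from itertools import product


def expansion_move(energy, labeling, alpha):
    """Best labeling reachable by switching any subset of variables to alpha.

    Brute force over all 2^n subsets, standing in for the binary solver.
    """
    best, best_energy = list(labeling), energy(labeling)
    for switch in product([False, True], repeat=len(labeling)):
        candidate = [alpha if s else label for s, label in zip(switch, labeling)]
        candidate_energy = energy(candidate)
        if candidate_energy < best_energy:
            best, best_energy = candidate, candidate_energy
    return best, best_energy


def cycle_expansions(energy, labeling, labels):
    """Cycle through the labels until a full sweep gives no improvement."""
    current, current_energy = list(labeling), energy(labeling)
    improved = True
    while improved:
        improved = False
        for alpha in labels:
            candidate, candidate_energy = expansion_move(energy, current, alpha)
            if candidate_energy < current_energy:
                current, current_energy = candidate, candidate_energy
                improved = True
    return current, current_energy


def toy_energy(x):
    """Quadratic data term plus absolute-difference smoothness on a chain."""
    observed = [0, 0, 2, 2]
    data = sum((a - b) ** 2 for a, b in zip(x, observed))
    smooth = sum(abs(a - b) for a, b in zip(x, x[1:]))
    return data + smooth


result, value = cycle_expansions(toy_energy, [1, 1, 1, 1], labels=[0, 1, 2])
```

A fusion move with an arbitrary proposal follows the same pattern, with the per-variable choice being between the current label and the proposed label rather than a single fixed α.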
For the implementation, we used the publicly available code from [22] for the implementation of sum-of-submodular ﬂow, as well as the SoSPD algorithm. The L1 and L∞ linear programs in (6.3) were solved by the linear programming package Gurobi. Additionally, for size 3 cliques in the stereo benchmark, the linear programs involved are very small (only 6 variables and 6 constraints) so we implemented a version of the simplex method with the constraints hardcoded, which was much faster than the general Gurobi solver. All code is in C++, and will be released under an open source license.4
We compare five different submodular upper bounds. We give each an abbreviation in the tables and figures: the p-norm minimizing upper bounds of Section 6.3, equation (6.3), we'll denote as LP1 and LP∞; the quadratic-based approximation of Section 6.4.2 is QUAD; the cardinality approximation of Section 6.4.3 is CARD; and the baseline heuristic of [22] (in Section 6.4.1) we'll denote by HEUR. We also use, for example, SOSPD-QUAD for the SoSPD algorithm with the quadratic approximation, and FUSION-LP1 to denote the fusion move algorithm with the 1-norm upper bound, etc.

3With differences largely due to stopping conditions.
4Available at www.cs.cornell.edu/~afix.

Figure 8.1: Visual results for Fields of Experts denoising with different upper bound methods. Top row, left to right: (a) SOSPD-HEUR. Bottom row: (b) SOSPD-QUAD (c) SOSPD-CARD.
8.2.2 Results
Figure 8.2: Reconstructed depth maps for stereo pair “cones”. Top row, left to right (with % of pixels within ±1 disparity) (a) REDUCTION-FUSION, 49.0% (b) SOSPD-QUAD, 49.9% Center row (c) SOSPD-CARD, 49.9%. (d) SOSPD-HEUR, 49.7% Bottom row (e) SOSPD-LP1, 50.0% (f) SOSPD-LP∞, 49.7%

Figure 8.3: Comparison of upper bound methods: Energy over time for the Fields of Experts denoising experiment, using the image in Figure 8.1. Reduction-Fusion, using the reduction method of Chapter 4 provided for comparison.

Our first experiment tests how close our proposed approximations come to the L1 and L∞ minimizing upper bounds. Because of the slow runtime of solving the LP for the L1 and L∞ upper bounds, we ran these on only the first 10 images of the Field of Experts dataset. Results of this experiment are summarized in Table 8.1. Overall, the L∞ upper bound performed the best in the majority (90%) of instances, while the L1 upper bound had only slightly higher energy. The norm-minimizing upper bounds together perform better than all other methods, indicating that these norms are a good measure to minimize for picking good upper bounds. Additionally, the pairwise approximation SoSPD-Quad was very close to the L∞ result, with the energy gap between them less than 1/4 the gap between SoSPD-Quad and the next competitor, SoSPD-Heur. This suggests that the proposed upper bound approximations can come close to the Linear Programming solution, while being more than 100 times faster.
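The two quantitative claims in this comparison can be checked directly against the averages reported in Table 8.1. The short calculation below uses only those table values; the variable names are ours.

```python
# Average energies and running times copied from Table 8.1.
energy = {"QUAD": 30712.58, "HEUR": 30748.19, "LP1": 30706.60, "LPinf": 30704.49}
seconds = {"QUAD": 16.049, "LP1": 2099.896, "LPinf": 5233.162}

# Energy gap between QUAD and the L-infinity bound, relative to the gap
# between QUAD and the next competitor HEUR.
gap_quad_to_lpinf = energy["QUAD"] - energy["LPinf"]
gap_heur_to_quad = energy["HEUR"] - energy["QUAD"]
gap_ratio = gap_quad_to_lpinf / gap_heur_to_quad

# How many times faster QUAD is than each LP-based bound.
speedup_vs_lp1 = seconds["LP1"] / seconds["QUAD"]
speedup_vs_lpinf = seconds["LPinf"] / seconds["QUAD"]
```

The ratio comes out a little under one quarter, and both speedups exceed a factor of 100, in line with the statements above.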
Next, we run all non-LP methods on the full denoising dataset, with results in Table 8.2. Notably, the pairwise approximation (SOSPD-QUAD) both has the best energy for every image in the dataset and is the fastest overall.
For the stereo example, results for the 3 stereo pairs are summarized in Table 8.3. Since the cliques were of size 3, we were able to use the custom simplex method mentioned above for the L1 and L∞ upper bounds. The L∞ upper bound achieved the best energy for all 3 images, while taking only 16% more time than the fastest method (and running nearly 4x faster than REDUCTION-FUSION, which uses no upper bound).
Across both datasets, we find that the linear programming based upper bounds have the best energy performance, particularly the ∞-norm upper bound. However, for cliques of size four or more, computing the linear programming solution becomes expensive. Thus, the quadratic-based approximation is also promising, as it is faster than the norm-based methods for both datasets, while still having very similar energy optimization performance. As expected, the methods all achieve very similar final energies, and correspondingly the visual results for all upper bound algorithms are nearly indistinguishable, as seen in Figures 8.1 and 8.2.
8.3 Evaluation of SoSPD
Having identified the best upper bound algorithms for different problems, we now compare the performance of SoSPD against existing state-of-the-art algorithms for higher-order multilabel inference. In the following experiments, we use the custom-simplex implementation of SOSPD-LP∞ for cliques of size 3, and SOSPD-QUAD for larger cliques.
For experimental comparisons, the method of [55] does not currently have publicly available code, so we are left with the class of fusion-reduction methods [18, 40, 43]. While the Generalized Roof Duality method of [43] can produce good solutions, it is typically much slower than [18, 40], and is restricted to cliques of size at most 4. We observed that it obtains similar or slightly worse energy values than SoSPD, while taking at least 10x more time, even for the heuristic version of GRD. We therefore focus on FGBZ [18] and HOCR [40] due to their speed and generality.

“Teddy”         Pixels within ±1    Final energy       Time
FGBZ-Fusion     83.3%               9.320 × 10^9       468s
HOCR-Fusion     83.8%               9.298 × 10^9       210s
GRD-Fusion      84.9%               9.256 × 10^9       1116s
SoSPD-Fusion    84.8%               9.172 × 10^9       129s

“Cones”         Pixels within ±1    Final energy       Time
FGBZ-Fusion     74.9%               1.1765 × 10^10     340s
HOCR-Fusion     74.2%               1.1789 × 10^10     172s
GRD-Fusion      75.2%               1.1690 × 10^10     1138s
SoSPD-Fusion    75.2%               1.1664 × 10^10     133s

Table 8.4: Evaluation of SoSPD: Numerical results for stereo reconstruction, for the two images in Figure 8.4.

Method            Energy @ 10s    Final energy    Time
FGBZ-Gradient     4.17 × 10^8     2.353 × 10^8    86s
HOCR-Gradient     4.35 × 10^8     2.368 × 10^8    78s
GRD-Gradient      6.72 × 10^8     2.348 × 10^8    776s
SoSPD-Gradient    2.87 × 10^8     2.347 × 10^8    42s

Table 8.5: Evaluation of SoSPD: Numerical results for denoising, averaged over the 100 images in the test set. For the second column, we stop each method after 10 seconds, and compare energy values.

Figure 8.4: Evaluation of SoSPD: visual results for stereo reconstruction. (a) Ground truth disparities, with results from (b) FGBZ-Fusion (c) SoSPD-Fusion and (d) SoSPD-Best-Fusion. Top row is the “teddy” image, bottom row is “cones”. Results for SoSPD have slightly more correct pixels, and converged much faster — see Table 8.4 for details.
8.3.1 Stereo reconstruction
We have two variants of SoSPD for this experiment, which only differ in the choice of proposed moves. The first, SoSPD-Fusion, rotates through the 14 labels, and successively chooses each to be an α-expansion proposal for that iteration. The second, SoSPD-Best-Fusion, uses an idea from [5] to pick the best α for each iteration. More specifically, we choose the α which will have the greatest total capacity leaving the source, in order to encourage as many nodes to switch to lower height labels as possible. We compared with the baselines, FGBZ-Fusion using the reduction [18], and HOCR-Fusion using the reduction [40]. Both methods cycle through the pre-generated proposals and perform fusion moves.
Numerical results are in Table 8.4 and images in Figure 8.4. Overall, the SoSPD variants and reduction methods reach similar energy and visual results; however, SoSPD is fastest overall (2.5x-3.5x vs FGBZ, 1.3x-1.5x vs HOCR).

[Plots of energy (vertical axis) against time in seconds (horizontal axis). Legend, stereo panels: FGBZ-Fusion, HOCR-Fusion, SoSPD-Alpha, SoSPD-Best-Alpha. Legend, denoising panel: FGBZ-Gradient, HOCR-Gradient, SoSPD-Alpha, SoSPD-Gradient.]

Figure 8.5: Evaluation of SoSPD: Energy reduction over time for the stereo images (top) “teddy” (center) “cones”. (bottom) Energy reduction over time for the denoising image “penguin”. Note that, in addition to converging faster, for a fixed time budget we achieve much better energy than the baseline.

Figure 8.6: Evaluation of SoSPD: Visual results for Field of Experts denoising. (top left) noisy image (top right) SoSPD-α (center left) FGBZ-Gradient, 10 sec (center right) SoSPD-Gradient, 10 sec (bottom left) FGBZ-Gradient at convergence (bottom right) SoSPD-Gradient at convergence.
8.3.2 Field of Experts denoising
SoSPD with α-expansion decreases the energy quickly initially, but gets stuck in poor local optima, with flat images as seen in Figure 8.6. Fortunately, gradient descent proposals [38] have been shown to be very effective at optimizing FoE priors. We call the combination of SoSPD with these fusion proposals SoSPD-Gradient.
We compare against fusion move with the same proposals, and the reductions of [18] and [40]. Overall, when comparing SoSPD vs. [18] for the same proposal method, SoSPD is signiﬁcantly faster, and achieves slightly lower energy at convergence. Additionally, given a ﬁxed time budget of 10 seconds, both the energy and visual results of SoSPD are signiﬁcantly better, as seen in Figure 8.6 and Table 8.5.

CHAPTER 9 STRUCTURED LEARNING OF SUM-OF-SUBMODULAR HIGHER ORDER ENERGY FUNCTIONS
Now that we’ve covered inference of MRFs, we will turn to the second question of optimization: modeling. SoS functions can naturally express higher order priors involving, e.g., local image patches; however, it is difﬁcult to fully exploit their expressive power because they have so many parameters. Rather than trying to formulate existing higher order priors as an SoS function, we take a discriminative learning approach, effectively searching the space of SoS functions for a higher order prior that performs well on our training set. We adopt a structural SVM approach [41, 95] and formulate the training problem in terms of quadratic programming; as a result we can efﬁciently search the space of SoS priors via an extended cutting-plane algorithm. We also show how the state-of-the-art max ﬂow method for vision problems [30] can be modiﬁed to efﬁciently solve the submodular ﬂow problem. Experimental comparisons are made against the OpenCV implementation of the GrabCut interactive segmentation technique [79], which uses hand-tuned parameters instead of machine learning. On a standard dataset [32] our method learns higher order priors with hundreds of parameter values, and produces signiﬁcantly better segmentations. While our focus is on binary labeling problems, we show that our techniques can be naturally generalized to handle more than two labels.
9.1 Introduction
Discrete optimization methods such as graph cuts [12, 52] have proven to be quite effective for many computer vision problems, including stereo [12], interactive segmentation [79] and texture synthesis [61]. The underlying optimization problem behind graph cuts is a special case of submodular function optimization that can be solved exactly using max flow [52]. Graph cut methods, however, are limited by their reliance on first-order priors involving pairs of pixels, and there is considerable interest in expressing priors that rely on local image patches such as the popular Field of Experts model [77].
While SoS functions have more expressive power, they also involve a large number of parameters. Rather than addressing the question of which existing higher order priors can be expressed as an SoS function, we take a discriminative learning approach and effectively search the space of SoS functions with the goal of ﬁnding a higher order prior that gives strong results on our training set.1
Our main contribution is to introduce the ﬁrst learning method for training such SoS functions, and to demonstrate the effectiveness of this approach for interactive segmentation using learned higher order priors. Following a Structural SVM approach [41, 95], we show that the training problem can be cast as a quadratic optimization problem over an extended set of linear constraints. This generalizes large-margin training of pairwise submodular (a.k.a. regular [52]) MRFs [2, 91, 94], where submodularity corresponds to a simple non-negativity constraint. To solve the training problem, we show that an extended cuttingplane algorithm can efﬁciently search the space of SoS functions.
1Since we are taking a discriminative approach, the higher-order energy function we learn does not have a natural probabilistic interpretation. We are using the word “prior” here somewhat loosely, as is common in computer vision papers that focus on energy minimization.

9.2 Related Work
Many learning problems in computer vision can be cast as structured output prediction, which allows learning outputs with spatial coherence. Among the most popular generic methods for structured output learning are Conditional Random Fields (CRFs) trained by maximum conditional likelihood [65], Maximum-Margin Markov Networks (M3N) [93], and Structural Support Vector Machines (SVM-struct) [95, 41]. A key advantage of M3N and SVM-struct over CRFs is that training does not require computation of the partition function. Among the two large-margin approaches M3N and SVM-struct, we follow the SVM-struct methodology since it allows the use of efﬁcient inference procedures during training.
In this paper, we will learn submodular discriminant functions. Prior work on learning submodular functions falls into three categories: submodular function regression [4], maximization of submodular discriminant functions, and minimization of submodular discriminant functions.
Learning of submodular discriminant functions where a prediction is computed through maximization has widespread use in information retrieval, where submodularity models diversity in the ranking of a search engine [103, 67] or in an automatically generated abstract [87]. While exact (monotone) submodular maximization is intractable, approximate inference using a simple greedy algorithm has approximation guarantees and generally excellent performance in practice.
The models considered in this paper use submodular discriminant functions where a prediction is computed through minimization. The most popular such models are regular MRFs [52]. Traditionally, the parameters of these models have been tuned by hand, but several learning methods exist. Most closely related to the work in this paper are Associative Markov Networks [94, 2], which take an M3N approach and exploit the fact that regular MRFs have an integral linear relaxation. These linear programs (LP) are folded into the M3N quadratic program (QP) that is then solved as a monolithic QP. In contrast, SVM-struct training using cutting planes for regular MRFs [91] allows graph cut inference also during training, and [17, 60] show that this approach has interesting approximation properties even for the multi-class case where graph cut inference is only approximate. More complex models for learning spatially coherent priors include separate training for unary and pairwise potentials [64], learning MRFs with functional gradient boosting [71], and the Pn Potts models, all of which have had success on a variety of vision problems. Note that our general approach for learning multi-label SoS functions, described in section 9.3.4, includes the Pn Potts model as a special case.
9.3 S3SVM: SoS Structured SVMs
In this section, we first review the SVM algorithm and its associated Quadratic Program (section 9.3.1). We then describe a general class of SoS discriminant functions which can be learned by SVM-struct (section 9.3.2) and explain this learning procedure (section 9.3.3). Finally, we generalize SoS functions to the multi-label case (section 9.3.4).

9.3.1 Structured SVMs
Structured output prediction describes the problem of learning a function h : X −→ Y where X is the space of inputs, and Y is the space of (multivariate and structured) outputs for a given problem. To learn h, we assume that a training sample of input-output pairs S = ((x1, y1), . . . , (xn, yn)) ∈ (X × Y)n is available and drawn i.i.d. from an unknown distribution. The goal is to ﬁnd a function h from some hypothesis space H that has low prediction error, relative to a loss function ∆(y, y¯). The function ∆ quantiﬁes the error associated with predicting y¯ when y is the correct output value. For example, for image segmentation, a natural loss function might be the Hamming distance between the true segmentation and the predicted labeling.
The mechanism by which Structural SVMs find a hypothesis h is to learn a discriminant function f : X × Y → R over input/output pairs. One derives a prediction for a given input x by minimizing f over all y ∈ Y.2 We will write this as hw(x) = argminy∈Y fw(x, y). We assume fw(x, y) is linear in two quantities w and Ψ, namely fw(x, y) = wT Ψ(x, y), where w ∈ RN is a parameter vector and Ψ(x, y) is a feature vector relating input x and output y. Intuitively, one can think of fw(x, y) as a cost function that measures how poorly the output y matches the given input x.
Ideally, we would find weights w such that the hypothesis $h_w$ always gives correct results on the training set. Stated another way, for each example $x_i$, the correct prediction $y_i$ should have low discriminant value, while incorrect predictions $\bar{y}_i$ with large loss should have high discriminant values. We write this constraint as a linear inequality in w

$$w^T \Psi(x_i, \bar{y}_i) \ge w^T \Psi(x_i, y_i) + \Delta(y_i, \bar{y}_i) \quad \forall \bar{y}_i \in \mathcal{Y}. \tag{9.1}$$

It is convenient to define $\delta\Psi_i(\bar{y}) = \Psi(x_i, \bar{y}) - \Psi(x_i, y_i)$, so that the above inequality becomes $w^T \delta\Psi_i(\bar{y}_i) \ge \Delta(y_i, \bar{y}_i)$.

2Note that the use of minimization departs from the usual language of [95, 41] where the hypothesis is argmax fw(x, y). However, because of the prevalence of cost functions throughout computer vision, we have replaced f by − f throughout.

Since it may not be possible to satisfy all these conditions exactly, we also add a slack variable to the constraint for each example i. Intuitively, the slack variable $\xi_i$ represents the maximum misprediction loss on the ith example. Since we want to minimize the prediction error, we add an objective function which penalizes large slack. Finally, we also penalize $\lVert w \rVert^2$ to discourage overfitting, with a regularization parameter C to trade off these costs.

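Quadratic Program 1. n-SLACK STRUCTURAL SVM

$$\min_{w,\ \xi \ge 0} \ \frac{1}{2} w^T w + \frac{C}{n} \sum_{i=1}^{n} \xi_i$$

$$\text{s.t.} \quad w^T \delta\Psi_i(\bar{y}_i) \ge \Delta(y_i, \bar{y}_i) - \xi_i \quad \forall i,\ \forall \bar{y}_i \in \mathcal{Y}$$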
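To make the roles of the slack variables concrete, the sketch below evaluates the n-slack objective for a fixed weight vector by enumerating a small output space. For fixed w, the smallest feasible slack for example i is the largest value of Δ(y_i, ȳ) − wᵀδΨ_i(ȳ) over all ȳ, floored at zero. The feature map `toy_psi`, the Hamming loss and all other names are illustrative assumptions, and enumeration is only possible because the toy output space has four elements.

```python
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hamming(y, y_bar):
    return sum(a != b for a, b in zip(y, y_bar))


def n_slack_objective(w, examples, psi, outputs, C):
    """Return 0.5 * w.w + (C / n) * sum_i xi_i with the tightest slacks for w."""
    slacks = []
    for x, y in examples:
        psi_true = psi(x, y)
        worst = 0.0  # slack variables are constrained to be nonnegative
        for y_bar in outputs:
            delta_psi = [a - b for a, b in zip(psi(x, y_bar), psi_true)]
            worst = max(worst, hamming(y, y_bar) - dot(w, delta_psi))
        slacks.append(worst)
    objective = 0.5 * dot(w, w) + C * sum(slacks) / len(examples)
    return objective, slacks


def toy_psi(x, y):
    """One feature: the number of positions where the output disagrees with the input."""
    return [float(hamming(x, y))]


outputs = list(product([0, 1], repeat=2))
examples = [((0, 1), (0, 1))]
tight_objective, tight_slacks = n_slack_objective([2.0], examples, toy_psi, outputs, C=1.0)
loose_objective, loose_slacks = n_slack_objective([0.5], examples, toy_psi, outputs, C=1.0)
```

With the larger weight every margin constraint holds without slack, so only the regularizer contributes; with the smaller weight the most distant output violates its margin and the slack term appears in the objective.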

9.3.2 Submodular Feature Encoding

We now apply the Structured SVM (SVM-struct) framework to the problem of learning SoS functions.
For the moment, assume our prediction task is to assign a binary label for each element of a base set V. We will cover the multi-label case in section 9.3.4. Since the labels are binary, prediction consists of assigning a subset S ⊆ V for each input (namely the set S of pixels labeled 1).
Our goal is to construct a feature vector Ψ that, when used with the SVM-struct algorithm of section 9.3.1, will allow us to learn sum-of-submodular energy functions. Let's begin with the simplest case of learning a discriminant function $f_{C,w}(S) = w^T \Psi(S)$, defined only on a single clique and which does not depend on the input x.

Intuitively, our parameters w will correspond to the table of values of the clique function $f_C$, and our feature vector Ψ will be chosen so that $w_S = f_C(S)$. We can accomplish this by letting Ψ and w have $2^{|C|}$ entries, indexed by subsets $T \subseteq C$, and defining $\Psi_T(S) = \delta_T(S)$ (where $\delta_T(S)$ is 1 if S = T and 0 otherwise). Note that, as we claimed,

$$f_{C,w}(S) = w^T \Psi(S) = \sum_{T \subseteq C} w_T\, \delta_T(S) = w_S. \tag{9.2}$$

If our parameters $w_T$ are allowed to vary over all of $\mathbb{R}^{2^{|C|}}$, then $f_C(S)$ may be an arbitrary function $2^C \to \mathbb{R}$, and not necessarily submodular. However, we can enforce submodularity by adding a number of linear inequalities. Recall that f is submodular if and only if f (A ∪ B) + f (A ∩ B) ≤ f (A) + f (B) for all A and B. Therefore, $f_{C,w}$ is submodular if and only if the parameters satisfy

wA∪B + wA∩B ≤ wA + wB : ∀A, B ⊆ C

(9.3)

These are just linear constraints in w, so we can add them as additional constraints to Quadratic Program 1. There are $O(2^{|C|})$ of them, but each clique has $2^{|C|}$ parameters, so this does not increase the asymptotic size of the QP.
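The indicator encoding and the submodularity constraints can be checked mechanically on a small clique. The sketch below is illustrative only: it stores w as a dictionary keyed by subsets, tests every pair (A, B) naively rather than using a reduced constraint set, and the helper names are hypothetical.

```python
from itertools import combinations


def subsets(clique):
    """All subsets of the clique as frozensets, in a fixed order."""
    items = sorted(clique)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def indicator_features(clique, S):
    """Psi_T(S) = 1 if S == T and 0 otherwise, one entry per subset T."""
    S = frozenset(S)
    return [1.0 if T == S else 0.0 for T in subsets(clique)]


def clique_energy(clique, w, S):
    """w^T Psi(S): the dot product simply picks out the table entry w_S."""
    features = indicator_features(clique, S)
    return sum(f * w[T] for f, T in zip(features, subsets(clique)))


def is_submodular(clique, w):
    """Check w[A | B] + w[A & B] <= w[A] + w[B] for every pair of subsets."""
    all_sets = subsets(clique)
    return all(w[A | B] + w[A & B] <= w[A] + w[B] + 1e-12
               for A in all_sets for B in all_sets)


clique = {0, 1, 2}
# A concave function of the cardinality satisfies the constraints ...
concave = {T: min(len(T), 2) for T in subsets(clique)}
# ... while a convex function of the cardinality violates them.
convex = {T: len(T) ** 2 for T in subsets(clique)}
```

In the learning problem these inequalities are not checked after the fact but imposed on w as linear constraints of the QP, so every feasible w already defines a submodular clique function.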
Theorem 88. By choosing feature vector $\Psi_T(S) = \delta_T(S)$ and adding the linear constraints (9.3) to Quadratic Program 1, the learned discriminant function $f_w(S)$ is the maximum margin function $f_C$, where $f_C$ is allowed to vary over all possible submodular functions $f : 2^C \to \mathbb{R}$.


Proof. By adding constraints (9.3) to the QP, we ensure that the optimal solution w defines a submodular $f_w$. Conversely, for any submodular function $f_C$, there is a feasible w defined by $w_T = f_C(T)$, so the optimal solution to the QP must be the maximum-margin such function.

To introduce a dependence on the data x, we can define $\Psi^{\text{data}}$ to be $\Psi^{\text{data}}_T(S, x) = \delta_T(S)\,\Phi(x)$ for an arbitrary nonnegative function $\Phi : X \to \mathbb{R}_{\ge 0}$.
Corollary 89. With feature vector Ψdata and adding linear constraints (9.3) to QP 1, the learned discriminant function is the maximum margin function fC(S )Φ(x), where fC is allowed to vary over all possible submodular functions.

Proof. Because Φ(x) is nonnegative, constraints (9.3) ensure that the discriminant function is again submodular.

Finally, we can learn multiple clique potentials simultaneously. If we have a neighborhood structure $\mathcal{C}$ with m cliques, each with a data-dependence $\Phi_C(x)$, we create a feature vector $\Psi^{\text{sos}}$ composed of concatenating the m different features $\Psi^{\text{data}}_C$.
Corollary 90. With feature vector $\Psi^{\text{sos}}$, and adding a copy of the constraints (9.3) for each clique C, the learned $f_w$ is the maximum margin f of the form

$$f(x, S) = \sum_{C \in \mathcal{C}} f_C(S)\, \Phi_C(x) \tag{9.4}$$

where the $f_C$ can vary over all possible submodular functions on the cliques C.

9.3.3 Solving the quadratic program

1: Input: S = ((x1, y1), . . . , (xn, yn)), C, ε
2: W ← ∅
3: repeat
4:   Recompute the QP solution with the current constraint set:
       $(w, \xi) \leftarrow \operatorname{argmin}_{w,\ \xi \ge 0}\ \frac{1}{2} w^T w + C\xi$
       s.t. for all $(\bar{y}_1, \ldots, \bar{y}_n) \in W$:
         $\frac{1}{n} w^T \sum_{i=1}^{n} \delta\Psi_i(\bar{y}_i) \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \bar{y}_i) - \xi$
       s.t. for all $C \in \mathcal{C}$, $A, B \subseteq C$:
         $w_{C,A \cup B} + w_{C,A \cap B} \le w_{C,A} + w_{C,B}$
5:   for i = 1, . . . , n do
6:     Compute the maximum violated constraint:
         $\hat{y}_i \leftarrow \operatorname{argmin}_{\hat{y} \in \mathcal{Y}} \{ w^T \Psi(x_i, \hat{y}) - \Delta(y_i, \hat{y}) \}$
       by using IBFS to minimize $f_w(x_i, \hat{y}) - \Delta(y_i, \hat{y})$.
7:   end for
8:   $W \leftarrow W \cup \{(\hat{y}_1, \ldots, \hat{y}_n)\}$
9: until the slack of the max-violated constraint is ≤ ξ + ε.
10: return (w, ξ)

Algorithm 2: S3SVM via the 1-Slack Formulation.

The n-slack formulation for SSVMs (QP 1) makes intuitive sense, from the point of view of minimizing the misprediction error on the training set. However, in practice it is better to use the 1-slack reformulation of this QP from [41]. Compared to n-slack, the 1-slack QP can be solved several orders of magnitude faster in practice, as well as having asymptotically better complexity.
The 1-slack formulation is an equivalent QP which replaces the n slack variables ξi with a single variable ξ. The loss constraints (9.1) are replaced with constraints penalizing the sum of losses across all training examples. We also include submodular constraints on w.


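Quadratic Program 2. 1-SLACK STRUCTURAL SVM

$$\min_{w,\ \xi \ge 0} \ \frac{1}{2} w^T w + C\xi$$

$$\text{s.t.} \quad \frac{1}{n} w^T \sum_{i=1}^{n} \delta\Psi_i(\bar{y}_i) \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \bar{y}_i) - \xi \quad \forall (\bar{y}_1, \ldots, \bar{y}_n) \in \mathcal{Y}^n$$

$$w_{C,A \cup B} + w_{C,A \cap B} \le w_{C,A} + w_{C,B} \quad \forall C \in \mathcal{C},\ A, B \subseteq C \tag{9.5}$$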

Note that we have a constraint for each tuple (y¯1, . . . , y¯n) ∈ Yn, which is an exponentially large set. Despite the large set of constraints, we can solve this QP to any desired precision by using the cutting plane algorithm. This algorithm keeps track of a set W of current constraints and solves the current QP with regard to those constraints. Then, given a solution (w, ξ), it finds the most violated constraint and adds it to W. Finding the most violated constraint consists of solving for each example xi the problem

$$\hat{y}_i = \operatorname{argmin}_{\hat{y} \in \mathcal{Y}} \ f_w(x_i, \hat{y}) - \Delta(y_i, \hat{y}). \tag{9.6}$$

Since the features Ψ ensure that fw is SoS, then as long as ∆ factors as a sum over the cliques C (for instance, the Hamming loss is such a function), then (9.6) can

be solved with Submodular IBFS. Note that this also allows us to add arbitrary

additional features for learning the unary potentials as well. Pseudocode for the

entire S3SVM learning is given in Algorithm 2.
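The separation step (9.6) can be illustrated with a minimal Python sketch. It is not the thesis implementation: Submodular IBFS is replaced by exhaustive enumeration, so it only works on toy problems, and the chain energy `chain_energy`, its two weights, and the example data are all hypothetical stand-ins for $f_w$ chosen for illustration.

```python
from itertools import product

def hamming(y_true, y_pred):
    """Hamming loss: number of positions where two labelings differ."""
    return sum(a != b for a, b in zip(y_true, y_pred))

def chain_energy(x, y, unary_weight, pair_weight):
    """Toy stand-in for f_w(x, y): data term plus Potts smoothness on a chain."""
    data = unary_weight * sum(a != b for a, b in zip(x, y))
    smooth = pair_weight * sum(y[k] != y[k + 1] for k in range(len(y) - 1))
    return data + smooth

def most_violated_labeling(x, y_true, unary_weight, pair_weight):
    """Loss-augmented inference as in (9.6), by brute force over all labelings.

    Returns the labeling y_hat minimizing f_w(x, y_hat) - Delta(y_true, y_hat).
    """
    def objective(y_hat):
        return chain_energy(x, y_hat, unary_weight, pair_weight) - hamming(y_true, y_hat)
    return min(product((0, 1), repeat=len(y_true)), key=objective)

# Hypothetical training example: noisy input and its ground-truth labeling.
x_noisy = (1, 1, 0, 1)
y_true = (1, 1, 0, 0)
y_hat = most_violated_labeling(x_noisy, y_true, unary_weight=1.0, pair_weight=0.6)
```

In a cutting-plane loop, the tuple of such `y_hat` labelings (one per training example) would be the constraint added to W. The enumeration costs 2^n evaluations, which is exactly what the submodular flow solver avoids.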

9.3.4 Generalization to multi-label prediction
Submodular functions are intrinsically functions of binary variables. In order to handle the multi-label case, we use expansion moves [12] to reduce the multi-label optimization problem to a series of binary subproblems, where each pixel may either switch to a given label α or keep its current label. If every binary subproblem of computing the optimal expansion move is an SoS problem, we will call the original multi-label energy function an SoS expansion energy.

Let L be our label set, with output space $\mathcal{Y} = L^V$. Our learned function will have the form $f(y) = \sum_{C \in \mathcal{C}} f_C(y_C)$ where $f_C : L^C \to \mathbb{R}$. For a clique C and label ℓ, define $C_\ell = \{ i \in C \mid y_i = \ell \}$, i.e., the subset of C taking label ℓ.

Theorem 91. If all the clique functions are of the form

\[
f_C(y_C) = \sum_{\ell \in L} g_\ell(C_\ell)
\tag{9.7}
\]

where each $g_\ell$ is submodular, then any expansion move for the multi-label energy function f will be SoS.

Proof. Fix a current labeling y, and let B(S) be the energy when the set S switches to label α. We can write B(S) in terms of the clique functions and sets $C_\ell$ as

\[
B(S) = \sum_{C \in \mathcal{C}} \Big( g_\alpha(C_\alpha \cup S) + \sum_{\ell \ne \alpha} g_\ell(C_\ell \setminus S) \Big)
\tag{9.8}
\]

We use a fact from the theory of submodular functions: if f(S) is submodular, then for any fixed T both f(T ∪ S) and f(T \ S) are also submodular as functions of S. Therefore, B(S) is SoS.
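Theorem 91 can be sanity-checked numerically on a single small clique. The sketch below is only an illustration under stated assumptions: the per-label functions `g` are hypothetical (a weighted square root of the set size, which is submodular because it is a concave function of cardinality), and submodularity is tested by brute force over all pairs of subsets rather than by any method from the thesis.

```python
from itertools import combinations, product
from math import sqrt

def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(h, ground, tol=1e-9):
    """Check h(A | B) + h(A & B) <= h(A) + h(B) for all A, B inside `ground`."""
    all_sets = list(subsets(ground))
    return all(h(a | b) + h(a & b) <= h(a) + h(b) + tol
               for a in all_sets for b in all_sets)

def expansion_energy(g, labels, y, alpha, clique):
    """Return the function B(S) of (9.8) for one clique and current labeling y."""
    parts = {l: frozenset(i for i in clique if y[i] == l) for l in labels}
    def B(S):
        keep = sum(g[l](parts[l] - S) for l in labels if l != alpha)
        return g[alpha](parts[alpha] | S) + keep
    return B

labels = (0, 1, 2)
clique = frozenset(range(4))
# Hypothetical per-label functions g_l(S) = (l + 1) * sqrt(|S|).
g = {l: (lambda S, w=l + 1: w * sqrt(len(S))) for l in labels}

# Every current labeling y and every expansion label alpha gives a submodular B.
all_moves_submodular = all(
    is_submodular(expansion_energy(g, labels, y, alpha, clique), clique)
    for y in product(labels, repeat=len(clique))
    for alpha in labels
)
```

Replacing `g` with a non-submodular function such as `len(S) ** 2` makes the check fail, which is a quick way to see that the hypothesis of the theorem is doing real work.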

Figure 9.1: Example images from the binary segmentation results. From left to right, the columns are (a) the original image (b) the noisy input (c) results from Generic Cuts [3] (d) our results.

Theorem 91 characterizes a large class of SoS expansion energies. These functions generalize commonly used multi-label clique functions, including the $P^n$ Potts model [48]. The $P^n$ model pays cost $\lambda_\ell$ when all pixels in the clique are equal to label ℓ, and $\lambda_{max}$ otherwise. We can write this as an SoS expansion energy by letting $g_\ell(S) = \lambda_\ell - \lambda_{max}$ if S = C and 0 otherwise. Then $\sum_{\ell} g_\ell(C_\ell)$ is equal to the $P^n$ Potts model, up to an additive constant. Generalizations such as the robust $P^n$ model [49] can be encoded in a similar fashion. Finally, in order to learn these functions, we let Ψ be composed of copies of $\Psi_{data}$, one for each $g_\ell$, and add corresponding copies of the constraints (9.3).
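The $P^n$ Potts encoding described above can be checked on a toy clique. This is a small sketch with made-up costs (`lam` and `lam_max` are hypothetical values); it confirms that the sum of the per-label functions differs from the $P^n$ Potts cost only by the constant $\lambda_{max}$. Each $g_\ell$ is submodular precisely when $\lambda_\ell \le \lambda_{max}$, which the example values satisfy.

```python
from itertools import product

def pn_potts(y_clique, lam, lam_max):
    """P^n Potts cost: lam[l] if every pixel has label l, lam_max otherwise."""
    first = y_clique[0]
    if all(v == first for v in y_clique):
        return lam[first]
    return lam_max

def sos_form(y_clique, lam, lam_max):
    """Sum over labels of g_l(C_l), where g_l(S) = lam[l] - lam_max iff S == C."""
    n = len(y_clique)
    total = 0.0
    for label, cost in lam.items():
        c_label = [i for i in range(n) if y_clique[i] == label]
        if len(c_label) == n:          # C_l is the whole clique
            total += cost - lam_max
    return total

# Hypothetical costs for three labels on a 3-pixel clique.
lam = {0: 1.0, 1: 2.0, 2: 0.5}
lam_max = 3.0
max_gap = max(
    abs(pn_potts(y, lam, lam_max) - (sos_form(y, lam, lam_max) + lam_max))
    for y in product(lam, repeat=3)
)
```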
As a ﬁnal note: even though the individual expansion moves can be computed optimally, α-expansion still may not ﬁnd the global optimum for the multi-labeled energy. However, in practice α-expansion ﬁnds good local optima, and has been used for inference in Structural SVM with good results, as in [60].
9.4 Experimental Results
In order to evaluate our algorithms, we focused on binary denoising and interactive segmentation. For binary denoising, Generic Cuts [3] provides the most natural comparison since it is a state-of-the-art method that uses SoS priors. For interactive segmentation the natural comparison is against GrabCut [79], where we used the OpenCV implementation. We ran our general S3SVM method, which can learn an arbitrary SoS function, and also considered the special case of only using pairwise priors. For both the denoising and segmentation applications, we significantly improve on the accuracy of the hand-tuned energy functions.
9.4.1 Binary denoising
Our binary denoising dataset consists of a set of 20 black and white images. Each image is 400 × 200 pixels and shows either a set of geometric lines or a hand-drawn sketch (see Figure 9.1). We were unable to obtain the original data used by [3], so we created our own similar data by adding independent Gaussian noise at each pixel.
For denoising, the hand-tuned Generic Cuts algorithm of [3] posed a simple MRF, with unary terms equal to the absolute distance from the noisy input, and an SoS prior, where each 2 × 2 clique penalizes the square root of the number of edges within that clique whose endpoints have different labels. There is a single parameter λ, which is the tradeoff between the unary energy and the smoothness term. The neighborhood structure $\mathcal{C}$ consists of all 2 × 2 patches of the image.
Our learned prior includes the same unary terms and clique structure, but instead of the square-root smoothness prior, we learn a clique function g to get an MRF $E_{SVM}(y) = \sum_i |y_i - x_i| + \sum_{C \in \mathcal{C}} g(y_C)$. Note that each clique has the same energy function as every other, so this is analogous to a graph cuts prior where each pairwise edge has the same attractive potential. Our energy function has 16 total parameters (one for each possible value of g, which is defined on 2 × 2 patches).
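To make the 16-parameter energy concrete, the following Python sketch evaluates it on a small image, with g stored as a 16-entry table indexed by the 4-bit pattern of a 2 × 2 patch. The table layout, the function names, and the choice of which four edges count inside a patch (the grid edges only, no diagonals) are assumptions made for this illustration. `sqrt_prior_table` builds the hand-tuned square-root prior, with λ applied as a multiplier on the smoothness term (also an assumption), so that the same evaluator covers both the baseline and a learned table.

```python
from math import sqrt

# Pixels of a 2 x 2 patch are ordered (top-left, top-right, bottom-left,
# bottom-right). Assumption: only the four grid edges are counted.
PATCH_EDGES = ((0, 1), (2, 3), (0, 2), (1, 3))

def sqrt_prior_table(lam):
    """16-entry table for the hand-tuned prior: lam * sqrt(number of cut edges)."""
    table = []
    for code in range(16):
        bits = [(code >> (3 - k)) & 1 for k in range(4)]
        cut = sum(bits[a] != bits[b] for a, b in PATCH_EDGES)
        table.append(lam * sqrt(cut))
    return table

def denoising_energy(y, x, g):
    """E(y) = sum_i |y_i - x_i| + sum over 2 x 2 patches of g[patch pattern].

    y: binary labeling (list of rows of 0/1 ints), x: noisy input of the same
    shape, g: sequence of 16 clique costs.
    """
    rows, cols = len(y), len(y[0])
    unary = sum(abs(y[r][c] - x[r][c]) for r in range(rows) for c in range(cols))
    clique = 0.0
    for r in range(rows - 1):
        for c in range(cols - 1):
            code = (y[r][c] << 3) | (y[r][c + 1] << 2) | (y[r + 1][c] << 1) | y[r + 1][c + 1]
            clique += g[code]
    return unary + clique

# Toy 3 x 3 example: clean input, labeling with the center pixel flipped.
x_img = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
y_img = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
energy = denoising_energy(y_img, x_img, sqrt_prior_table(lam=1.0))
```

A learned prior would simply supply a different 16-entry table `g`, subject to the submodularity constraints on its entries.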
We randomly divided the 20 input images into 10 training images and 10 test images. The loss function was the Hamming distance between the correct, noise-free image and the predicted image. To hand tune the value λ, we picked the value which gave the minimum pixel-wise error on the training set. S3SVM training took only 16 minutes.
Numerically, S3SVM performed significantly better than the hand-tuned method, with an average pixel-wise error of only 4.9% on the training set, compared to 28.6% for Generic Cuts. The time needed to do inference after training was similar for both methods: 0.82 sec/image for S3SVM vs. 0.76 sec/image for Generic Cuts. Visually, the S3SVM images are significantly cleaner looking, as shown in Figure 9.1.
9.4.2 Interactive segmentation
The input to interactive segmentation is a color image, together with a set of sparse foreground/background annotations provided by the user. See Figure 9.2 for examples. From the small set of labeled foreground and background pixels, the prediction task is to recover the ground-truth segmentation for the whole image.
Figure 9.2: Example images from binary segmentation results. Input with user annotations are shown at top, with results below. Row labels: Input, GrabCut, S3SVM-AMN, S3SVM.

Our baseline comparison is the GrabCut algorithm, which solves a pairwise CRF. The unary terms of the CRF are obtained by fitting a Gaussian Mixture Model to the histograms of pixels labeled as being definitely foreground or background. The pairwise terms are a standard contrast-sensitive Potts potential, where the cost of pixels i and j taking different labels is equal to $\lambda \cdot \exp(-\beta |x_i - x_j|)$ for some hand-coded parameters β, λ. Our primary comparison is against the OpenCV implementation of GrabCut, available at www.opencv.org.
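The contrast-sensitive Potts term is simple enough to write down directly. The snippet below is a sketch rather than the OpenCV code: the function name is hypothetical, and it assumes that $|x_i - x_j|$ is the Euclidean distance between the two color vectors.

```python
from math import exp, sqrt

def contrast_sensitive_potts(x_i, x_j, y_i, y_j, lam, beta):
    """Pairwise cost lam * exp(-beta * |x_i - x_j|) if labels differ, else 0.

    x_i, x_j: color tuples of the two pixels; y_i, y_j: their labels.
    Assumption: |.| is the Euclidean norm of the color difference.
    """
    if y_i == y_j:
        return 0.0
    dist = sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))
    return lam * exp(-beta * dist)

# Cutting between similar colors is expensive; cutting across a strong
# color edge is cheap.
cost_flat = contrast_sensitive_potts((10, 10, 10), (12, 10, 10), 0, 1, lam=2.0, beta=0.5)
cost_edge = contrast_sensitive_potts((10, 10, 10), (90, 10, 10), 0, 1, lam=2.0, beta=0.5)
```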
As a special case, our algorithm can be applied to pairwise-submodular energy functions, for which it solves the same optimization problem as in Associative Markov Networks (AMNs) [94, 2]. Automatically learning parameters allows us to add a large number of learned unary features to the CRF.
As a result, in addition to the smoothness parameter λ, we also learn the relative weights of approximately 400 features describing the color values near a pixel, and relative distances to the nearest labeled foreground/background pixel. Further details on these features can be found in the Supplementary Material. We refer to this method as S3SVM-AMN.
Our general S3SVM method can incorporate higher-order priors instead of just pairwise ones. In addition to the unary features used in S3SVM-AMN, we add a sum-of-submodular higher-order CRF. Each 2 × 2 patch in the image has a learned submodular clique function. To obtain the benefits of the contrast-sensitive pairwise potentials for the higher-order case, we cluster (using k-means) the x and y gradient responses of each patch into 50 clusters, and learn one submodular potential for each cluster. Note that S3SVM automatically allows learning the entire energy function, including the clique potentials and unary potentials (which come from the data) simultaneously.
We use a standard interactive segmentation dataset from [32] of 151 images with annotations, together with pixel-level segmentations provided as ground truth. These images were randomly sorted into training, validation and testing sets, of size 75, 38 and 38 respectively. We trained both S3SVM-AMN and S3SVM on the training set for various values of the regularization parameter c, picked the value of c which gave the best accuracy on the validation set, and report the results for that value on the test set.
The overall performance is shown in the table below. Training time is measured in seconds, and testing time in seconds per image. Our implementation, which used the submodular ﬂow algorithm based on IBFS discussed in section 5.2, will be made freely available under the MIT license.

Algorithm     Average error    Training (s)    Testing (s/image)
GrabCut       10.6 ± 1.4%      n/a             1.44
S3SVM-AMN     7.5 ± 0.5%       29000           0.99
S3SVM         7.3 ± 0.5%       92000           1.67

Learning and validation were performed 5 times with independently sampled training sets. The averages and standard deviations shown above are from these 5 samples.

Figure 9.3: A multi-label segmentation result, on data from [36]. The purple label represents vegetation, red is rhino/hippo and blue is ground. There are 7 labels in the input problem, though only 3 are present in the output we obtain on this particular image.
While our focus is on binary labeling problems, we have conducted some preliminary experiments with the multi-label version of our method described in section 9.3.4. A sample result is shown in Figure 9.3, using an image taken from the Corel dataset used in [36].

CHAPTER 10 CONCLUSION
In this thesis, we set out to expand the set of models for which fast inference is possible. As we noted in the introduction, modeling and inference are tightly coupled: applications demand more accurate solutions, which require more sophisticated models, but they are limited to the kinds of problems for which fast inference algorithms are known.
For computer vision problems, Markov Random Fields continue to be the tool of choice for encoding spatial relations between pixels in an image. We've shown that many useful properties of images cannot be encoded by first-order MRFs, and that many natural features of images, especially their local, patch-based statistics, are best encoded by higher-order MRFs.
When developing optimization algorithms for higher-order MRFs, we have been able to leverage existing graph-cuts methods, using reduction methods to turn higher-order MRFs into ﬁrst-order problems. We’ve also seen that the basic ideas behind graph-cuts can be generalized to higher-order models, including alpha-expansion and other primal large-neighborhood search algorithms.
For designing higher-order versions of graph-cuts methods, the key concept appears to be submodularity. This has been known to be the necessary condition for first-order MRFs since [52]; however, the condition for first-order graphs is much simpler than the general case. In particular, we have found that Sum-of-Submodular inference is a natural middle ground between min-cut based inference and fully general submodular function minimization, with

MIN-CUT ⊆ SOS MINIMIZATION ⊆ SUBMODULAR MINIMIZATION (10.1)

In particular, we are able to take advantage of the clique structure by treating the cliques as a hypergraph over the variables. This allows a fairly straightforward generalization of augmenting-path algorithms to the higher-order case (including the state of the art for vision inputs, IBFS [30]). In this algorithm, we noted that submodularity is the key property for augmenting paths to find a globally optimal solution.
For multilabel problems, our key tool has been Linear Programming, and in particular, the Local Marginal Polytope relaxation of the MRF inference problem. The primary feature of linear programming based algorithms is that they can actually say something about the global behavior of the problem: using duality, we get a global lower bound on the optimal solution. This is in contrast with many popular primal-only algorithms such as alpha-expansion and fusion moves, which make local choices (even if they are searching over a very large local neighborhood), and which cannot say anything about the global optimum.
Furthermore, we've seen that the LP dual is also useful to speed up inference algorithms, by guiding the binary subproblems within a fusion-move algorithm, as done by the primal-dual algorithm SoSPD. In particular, we have generalized the FastPD algorithm [59] for first-order MRFs to work on higher-order problems, with many of the same speedups over pure-primal algorithms. Finally, we've used the dual LP to prove approximation ratios for our algorithm, giving a guaranteed bound on how far we can be from the optimum solution.
A major limitation of higher-order MRFs is that they are difficult to design by hand, as they have many more parameters than first-order models. We have explored one method for learning higher-order models, using a Structural SVM approach; however, many more learning algorithms are possible. A key feature of this learning algorithm (and related ones), though, is that it requires repeated application of inference. As a result, every time new inference algorithms allow new models to be efficiently optimized, we can do learning on these models as well.
Finally, we will note that higher-order MRFs are a heavyweight solution for many applications. The algorithms presented in this thesis have brought higher-order MRFs from being largely intractable to being reasonably fast to optimize. However, the algorithms involved still scale poorly with the clique size (typically $O(2^{|C|})$) and generally require minutes to run, compared to the milliseconds required for real-time performance. Still, when maximum accuracy is the goal, they offer much greater flexibility for modeling and encoding of constraints.

APPENDIX A LOCAL COMPLETENESS

We are considering a general labeling problem, with variables $x_1, \ldots, x_n$, taking labels in sets $L_1, \ldots, L_n$. In an MRF the energy function can be written as a sum of clique energies: there is some set $\mathcal{C}$ of cliques, and clique functions $f_C$ such that

\[
E(x_1, \ldots, x_n) = \sum_{C} f_C(x_C)
\tag{A.1}
\]

To minimize this energy with respect to a fusion move, we have an input image $I$ and a proposed image $I'$, and for each pixel a binary variable $b_k$ that encodes whether the k-th pixel takes label $I_k$ or $I'_k$.

The energy function is now a sum of clique energies over these n binary variables. For a clique C of size d, we can write the clique energy fC as a sum of terms in the binary variables and their negations, by specifying the energy pointwise for each possible assignment of the d binary variables in C:

• For each assignment $\gamma \in \mathbb{B}^d$, where $\mathbb{B} = \{0, 1\}$, let $f(\gamma)$ be the energy of the clique in the fused image according to γ. For instance, with d = 4 and γ = (0, 1, 1, 0) we have $f(\gamma) = f_C(I_0, I'_1, I'_2, I_3)$.
• Let $b(\gamma)$ be the term whose i-th literal is $b_i$ or $\bar{b}_i$, according to whether $\gamma_i$ is 1 or 0 respectively. For γ = (0, 1, 1, 0), we would have $b(\gamma) = \bar{b}_0 b_1 b_2 \bar{b}_3$.
• Note that the term $b(\gamma)$ is 1 exactly when the binary variables $(b_1, \ldots, b_d)$ take the assignment γ. Thus, we can write the clique energy as:

\[
f_C(b_1, \ldots, b_d) = \sum_{\gamma \in \mathbb{B}^d} f(\gamma)\, b(\gamma)
\tag{A.2}
\]

The first step of our reduction is to transform this to a multilinear polynomial by substituting $1 - b_i$ for $\bar{b}_i$ each time a negated variable occurs. This could possibly result in terms with nonzero coefficients for each subset of $b_1, \ldots, b_d$.

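For each subset $S \subseteq \{b_1, \ldots, b_d\}$, we can actually calculate the coefficient on the term $t_S = \prod_{j \in S} b_j$. Let $\Gamma_S$ be the set of assignments $\gamma \in \mathbb{B}^d$ with $\gamma_i = 0$ for $i \notin S$, and let

\[
\sigma(\gamma) =
\begin{cases}
1 & \text{the number of 0s in } \gamma \text{ is even} \\
-1 & \text{otherwise}
\end{cases}
\tag{A.3}
\]

Then, after we substitute $(1 - b_i)$ for all occurrences of $\bar{b}_i$ and collect all terms with the same variables, the coefficient on the term $t_S$ is, up to an overall sign of $(-1)^{d - |S|}$ which does not affect whether it vanishes,

\[
\mathrm{Coeff}(t_S) = \sum_{\gamma \in \Gamma_S} \sigma(\gamma) f(\gamma)
\tag{A.4}
\]

Therefore, to show that our energy function is locally dense, it suffices to show that $\mathrm{Coeff}(t_S)$ is never (or rarely) 0 for any proper subset $S \subsetneq \{b_1, \ldots, b_d\}$ (note that we don't care if $t_{\{x_1, \ldots, x_d\}}$ has coefficient 0, since it is not a subset of any other term).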
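The coefficient formula can be checked on a small table of clique energies. The following Python sketch is an illustration, not code from the thesis: it expands the pointwise form (A.2) into a multilinear polynomial and verifies that evaluating the polynomial reproduces the table. The function names and the example table are hypothetical. The sign used in the code, $(-1)^{|S| - |\gamma|}$ with $|\gamma|$ the number of 1s in γ, is what the substitution $\bar{b}_i = 1 - b_i$ produces directly; it agrees with σ(γ) from (A.3) up to the factor $(-1)^{d - |S|}$, which is the same for every γ in $\Gamma_S$.

```python
from itertools import product

def multilinear_coefficients(f):
    """Expand a pseudo-boolean table into multilinear form.

    f: dict mapping each assignment gamma (a 0/1 tuple of length d) to f(gamma).
    Returns a dict mapping S (frozenset of variable indices) to the coefficient
    of the monomial prod_{j in S} b_j.
    """
    d = len(next(iter(f)))
    coeff = {}
    for s_bits in product((0, 1), repeat=d):
        S = frozenset(i for i in range(d) if s_bits[i])
        total = 0
        for gamma, value in f.items():
            support = frozenset(i for i in range(d) if gamma[i])
            if support <= S:  # gamma is 0 outside S, i.e. gamma is in Gamma_S
                total += (-1) ** (len(S) - len(support)) * value
        coeff[S] = total
    return coeff

def evaluate(coeff, b):
    """Evaluate the multilinear polynomial at the 0/1 assignment b."""
    return sum(c for S, c in coeff.items() if all(b[i] for i in S))

# Hypothetical clique energy table on d = 3 binary variables.
table = {g: (3 * g[0] + 5 * g[1] * g[2] - 7 * g[0] * g[2] + 2) ** 2
         for g in product((0, 1), repeat=3)}
coeffs = multilinear_coefficients(table)
reproduces_table = all(evaluate(coeffs, g) == table[g] for g in table)
```

Local density in the sense used above then amounts to asking how many of the entries of `coeffs` for proper subsets S are nonzero.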

We can obtain a general theorem about the binary energy functions corresponding to fusion moves, by moving to a continuous framework. We embed the original intensities in $\mathbb{R}$, and extend the clique energies $f_C$ to functions on $\mathbb{R}^d$. We need two assumptions: (1) $f_C$ is d − 1 times continuously differentiable and (2) each of the d different mixed partials $\frac{\partial^{d-1} f}{\partial x_1 \cdots \widehat{\partial x_i} \cdots \partial x_d}$ (where $\widehat{\partial x_i}$ means to omit the i-th partial) takes its zeros in a set of measure 0.

Theorem 92. Under these two assumptions the set of proposed-current image pairs $(I, I')$ for which the fusion move binary energy function does not have local density 1 has measure 0 as a subset of $\mathbb{R}^n \times \mathbb{R}^n$.

Proof. By the above argument, it suffices to show that for a clique C on variables $x_1, \ldots, x_d$ and each $S \subsetneq \{x_1, \ldots, x_d\}$, the current and proposed images $(I_C, I'_C)$ which have $\mathrm{Coeff}(t_S) = 0$ form a set of measure 0 in $\mathbb{R}^d \times \mathbb{R}^d$.

For every fusion move $(I_C, I'_C)$ and assignment $\gamma \in \mathbb{B}^d$, we get a point $v^{(\gamma)}$ in $\mathbb{R}^d$, the result of fusion on just the clique pixels:

\[
v^{(\gamma)}_k =
\begin{cases}
I_k & \gamma_k = 0 \\
I'_k & \gamma_k = 1
\end{cases}
\]

If we list out these points, we get the set

\[
(I_1, \ldots, I_{d-1}, I_d),\ (I_1, \ldots, I_{d-1}, I'_d),\ (I_1, \ldots, I'_{d-1}, I_d),\ \ldots,\ (I'_1, \ldots, I'_{d-1}, I_d),\ (I'_1, \ldots, I'_d)
\]

containing each possible fusion of $I_C$ and $I'_C$. Note that these $2^d$ points are the vertices of an axis-aligned rectangular prism in $\mathbb{R}^d$. Denote these points as $\mathrm{Verts}(I_C, I'_C)$. Every fusion move $(I_C, I'_C)$ gives a rectangular prism in this fashion.

Now, fix S, and let the bad set of fusion moves, B, be those $(I_C, I'_C)$ for which $\mathrm{Coeff}(t_S) = 0$. To produce a contradiction, assume that B has nonzero measure. It is a closed set, so it contains an open ball. So there is some fusion move $(x_0, y_0)$ and radius δ such that for all $\epsilon_x, \epsilon_y \in \mathbb{R}^d$ with $|\epsilon_x|, |\epsilon_y| < \delta$, the fusion move $(x_0 + \epsilon_x, y_0 + \epsilon_y)$ is still a bad fusion move.

Now, since $\mathrm{Coeff}(t_S) = 0$ for the fusion move $(x_0, y_0)$, we can manipulate equation A.4 to get that

\[
f(v^{(1,1,\ldots,1)}) = - \sum_{\gamma \in \Gamma_S \setminus \{(1,\ldots,1)\}} \sigma(\gamma) f(v^{(\gamma)})
\tag{A.5}
\]

Since there's an open ball of bad fusion moves around $(x_0, y_0)$, for $\epsilon \in \mathbb{R}^d$ with $|\epsilon| < \delta$, the fusion move $(x_0, y_0 + \epsilon)$ is also bad. If we set $\epsilon^{(\gamma)}$ equal to $(\gamma_1 \epsilon_1, \ldots, \gamma_d \epsilon_d)$ (i.e. the vector which is $\epsilon_i$ when $\gamma_i$ is 1, and 0 otherwise), then the fusion move $(x_0, y_0 + \epsilon)$ gives a rectangular prism with vertices $v^{(\gamma)} + \epsilon^{(\gamma)}$ for $\gamma \in \mathbb{B}^d$. Then, since $(x_0, y_0 + \epsilon)$ is still a bad fusion move, we can again manipulate equation A.4 to get

\[
f(v^{(1,1,\ldots,1)} + \epsilon) = - \sum_{\gamma \in \Gamma_S \setminus \{(1,\ldots,1)\}} \sigma(\gamma) f(v^{(\gamma)} + \epsilon^{(\gamma)})
\tag{A.6}
\]

Let $g_\gamma(\epsilon) = f(v^{(\gamma)} + \epsilon^{(\gamma)})$. Notice that since $\epsilon^{(\gamma)}_i = 0$ whenever $\gamma_i = 0$, this function only depends on the variables $x_i$ where $\gamma_i = 1$. Thus, for $\gamma \in \Gamma_S \setminus \{(1, \ldots, 1)\}$, the function $g_\gamma$ depends on at most d − 2 of the $\epsilon_i$. This is because S is a proper subset of $\{x_1, \ldots, x_d\}$, so all the γ have $\gamma_d = 0$, and then we remove the element (1, . . . , 1) which has d − 1 1s.

Therefore, the mixed partial $\frac{\partial^{d-1} g_\gamma}{\partial x_1 \cdots \partial x_{d-1}}$ is 0 for all $|\epsilon| < \delta$. Therefore, since $f(v^{(1,\ldots,1)} + \epsilon)$ is a linear combination of the $g_\gamma$, it also has this mixed partial equal to 0. But then, we have found a set of nonzero measure (the open ball of radius δ around $y_0$) with mixed partial 0, contradicting our hypothesis.

Therefore, the set of bad fusion moves in fact must have measure 0.

APPENDIX B LAPLACIAN EQUATIONS

In the proof of Lemma 74 in the main paper, in order to show that the solution of the Laplacian equation Mψ = ∆ is non-negative, we claimed that the inverse $M^{-1}$ of the coefficient matrix is non-negative (meaning each entry is non-negative), and hence that $\psi = M^{-1}\Delta$ is non-negative (since ∆ is also non-negative). For completeness, we now show that $M^{-1}$ is non-negative.

It’s useful to have the following fact about the inverse of a tridiagonal matrix [96]:

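Lemma 93. The inverse of a non-singular tridiagonal matrix M

\[
M =
\begin{pmatrix}
a_1 & b_1 & & & \\
c_1 & a_2 & b_2 & & \\
 & c_2 & \ddots & \ddots & \\
 & & \ddots & \ddots & b_{n-1} \\
 & & & c_{n-1} & a_n
\end{pmatrix}
\]

is given by

\[
(M^{-1})_{ij} =
\begin{cases}
(-1)^{i+j} \left( \prod_{k=i}^{j-1} b_k \right) \theta_{i-1} \phi_{j+1} / \theta_n, & \text{if } i \le j \\
(-1)^{i+j} \left( \prod_{k=j}^{i-1} c_k \right) \theta_{j-1} \phi_{i+1} / \theta_n, & \text{if } i > j
\end{cases}
\tag{B.1}
\]

where

\[
\theta_i = a_i \theta_{i-1} - b_{i-1} c_{i-1} \theta_{i-2} \quad \text{for } i = 2, 3, \ldots, n
\tag{B.2}
\]

\[
\phi_i = a_i \phi_{i+1} - b_i c_i \phi_{i+2} \quad \text{for } i = n-1, \ldots, 1
\tag{B.3}
\]

with initial values $\theta_0 = 1$, $\theta_1 = a_1$, $\phi_{n+1} = 1$, $\phi_n = a_n$.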

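In our Laplacian equations, we have $a_i = 2$ for all i and $b_j = c_j = -1$ for all j. Substituting these into Lemma 93, we have:

\[
(M^{-1})_{ij} =
\begin{cases}
\theta_{i-1} \phi_{j+1} / \theta_n, & \text{if } i \le j \\
\theta_{j-1} \phi_{i+1} / \theta_n, & \text{if } i > j
\end{cases}
\tag{B.4}
\]

where

\[
\theta_i = 2\theta_{i-1} - \theta_{i-2} \ \text{ for } i = 2, 3, \ldots, n, \qquad
\phi_i = 2\phi_{i+1} - \phi_{i+2} \ \text{ for } i = n-1, \ldots, 1
\tag{B.5}
\]

with initial values $\theta_0 = \phi_{n+1} = 1$, $\theta_1 = \phi_n = 2$.

It is easy to show by induction that $\theta_i = i + 1$ and $\phi_i = n + 2 - i$ from their recursive definition in (B.5). Therefore, we have

\[
(M^{-1})_{ij} =
\begin{cases}
\dfrac{i \cdot (n + 1 - j)}{n + 1}, & \text{if } i \le j \\[2ex]
\dfrac{j \cdot (n + 1 - i)}{n + 1}, & \text{if } i > j
\end{cases}
\tag{B.6}
\]

Every entry in (B.6) is positive for 1 ≤ i, j ≤ n, so $M^{-1}$ is a positive matrix in our Laplacian equations.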
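The closed form (B.6) is easy to verify by direct multiplication. The sketch below is a check written for this appendix rather than part of the thesis code; it uses exact rational arithmetic from the Python standard library so that no floating-point tolerance is needed, and the helper names are hypothetical.

```python
from fractions import Fraction

def laplacian_matrix(n):
    """Tridiagonal matrix with 2 on the diagonal and -1 beside it."""
    return [[2 if i == j else -1 if abs(i - j) == 1 else 0
             for j in range(n)] for i in range(n)]

def closed_form_inverse(n):
    """Entries of (B.6), computed with 1-based indices i and j."""
    inverse = []
    for i in range(1, n + 1):
        row = []
        for j in range(1, n + 1):
            lo, hi = min(i, j), max(i, j)
            row.append(Fraction(lo * (n + 1 - hi), n + 1))
        inverse.append(row)
    return inverse

def matmul(a, b):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

# Check M * M^{-1} = I and positivity of every entry for several sizes.
formula_holds = all(
    matmul(laplacian_matrix(n), closed_form_inverse(n)) == identity(n)
    for n in range(1, 8)
)
all_positive = all(
    entry > 0 for n in range(1, 8) for row in closed_form_inverse(n) for entry in row
)
```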

APPENDIX C APPROXIMATION RATIO FOR CARDINALITY UPPER BOUNDS

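Theorem 79. The cardinality-based upper bound gives a $2^{(1 - \frac{1}{p})}$-approximation.

Proof. Similar to the proof of Theorem 76, we can rewrite the objective for odd n as

\[
\|\psi\|_p := \left( \sum_{k=0}^{n} \binom{n}{k} \psi_k^p \right)^{\frac{1}{p}}
= \left( \sum_{k=0}^{(n-1)/2} \binom{n}{k} \left( \psi_k^p + \psi_{n-k}^p \right) \right)^{\frac{1}{p}}
\tag{C.1}
\]

and for even n:

\[
\|\psi\|_p = \left( \sum_{k=0}^{n/2-1} \binom{n}{k} \left( \psi_k^p + \psi_{n-k}^p \right)
+ \frac{1}{2} \binom{n}{n/2} \left( \psi_{n/2}^p + \psi_{n/2}^p \right) \right)^{\frac{1}{p}}
\tag{C.2}
\]

Let's use $\Psi_k = \psi_k^p + \psi_{n-k}^p$ as a shorthand. We can see the objective can be represented as

\[
\|\psi\|_p := \left( \sum_{k=0}^{\lfloor n/2 \rfloor} a_k \Psi_k \right)^{\frac{1}{p}}
\tag{C.3}
\]

with $a_k \ge 0$ for all k.

Consider $\psi^*$ and $\bar{\psi}$ as the true optimal solution and our approximate solution. Define $\Psi^*$ and $\bar{\Psi}$ accordingly. Considering each term $\Psi_k$, we must have $\Psi^*_k \ge \frac{L_k^p}{2^{p-1}}$ since the RHS is the solution of the following program:

\[
\begin{aligned}
\min_{\psi} \quad & \Psi_k \\
\text{s.t.} \quad & \Psi_k = \psi_k^p + \psi_{n-k}^p \\
& \psi_k + \psi_{n-k} \ge L_k, \\
& \psi_k, \psi_{n-k} \ge 0
\end{aligned}
\tag{C.4}
\]

where the minimizer is achieved by $\psi_k = \psi_{n-k} = \frac{L_k}{2}$.¹

¹ Recall $L_k$ is the lower bound of $\psi_k + \psi_{n-k}$ which only depends on ∆.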

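Meanwhile, we must have $\bar{\Psi}_k \le L_k^p$ since the RHS is the solution of the following program:

\[
\begin{aligned}
\max_{\psi} \quad & \Psi_k \\
\text{s.t.} \quad & \Psi_k = \psi_k^p + \psi_{n-k}^p \\
& \psi_k + \psi_{n-k} = L_k, \\
& \psi_k, \psi_{n-k} \ge 0
\end{aligned}
\tag{C.5}
\]

where the maximizer is achieved by either $\psi_k = 0, \psi_{n-k} = L_k$ or $\psi_k = L_k, \psi_{n-k} = 0$.²

Therefore, we must have $\frac{\bar{\Psi}_k}{\Psi^*_k} \le 2^{p-1}$ for all k. As a non-negative linear combination of non-negative numbers, we also have

\[
\sum_{k=0}^{\lfloor n/2 \rfloor} a_k \bar{\Psi}_k \le 2^{p-1} \sum_{k=0}^{\lfloor n/2 \rfloor} a_k \Psi^*_k
\tag{C.6}
\]

hence

\[
\left( \sum_{k=0}^{\lfloor n/2 \rfloor} a_k \bar{\Psi}_k \right)^{\frac{1}{p}}
\le 2^{(1 - \frac{1}{p})} \left( \sum_{k=0}^{\lfloor n/2 \rfloor} a_k \Psi^*_k \right)^{\frac{1}{p}}
\tag{C.7}
\]

² Recall we proved in Lemma 5 in the main paper that our approximation $\bar{\psi}$ must make each sum $\bar{\psi}_k + \bar{\psi}_{n-k}$ achieve its lower bound $L_k$, hence we have the second constraint.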
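The two bounds used in the proof above, $L_k^p / 2^{p-1} \le \psi_k^p + \psi_{n-k}^p \le L_k^p$ whenever $\psi_k + \psi_{n-k} = L_k$ with both terms non-negative, can be spot-checked numerically. The Python sketch below samples the segment $\psi_k + \psi_{n-k} = L$ on a grid. It assumes p ≥ 1, and the function names and sample values are chosen only for illustration.

```python
def pair_power_sum(a, b, p):
    """The quantity Psi_k = a^p + b^p for one pair (psi_k, psi_{n-k})."""
    return a ** p + b ** p

def bounds_hold(L, p, steps=100, tol=1e-9):
    """Check L^p / 2^(p-1) <= a^p + b^p <= L^p on a grid of a + b = L, a, b >= 0."""
    lower = L ** p / 2 ** (p - 1)
    upper = L ** p
    for t in range(steps + 1):
        a = L * t / steps
        value = pair_power_sum(a, L - a, p)
        if value < lower - tol or value > upper + tol:
            return False
    return True

def approximation_ratio(p):
    """The factor 2^(1 - 1/p) from (C.7)."""
    return 2 ** (1 - 1 / p)

checked = all(bounds_hold(3.0, p) for p in (1, 1.5, 2, 3))
```

For p = 1 the ratio is 1 (the two bounds coincide), and it approaches 2 as p grows.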

BIBLIOGRAPHY
[1] Björn Andres, Jörg H. Kappes, Ullrich Köthe, Christoph Schnörr, and Fred A. Hamprecht. An empirical comparison of inference algorithms for graphical models with higher order factors using OpenGM. In DAGM-Symposium, pages 353–362, 2010.
[2] Dragomir Anguelov, Benjamin Taskar, Vassil Chatalbashev, Daphne Koller, Dinkar Gupta, Geremy Heitz, and Andrew Y. Ng. Discriminative learning of Markov Random Fields for segmentation of 3D scan data. In CVPR, pages 169–176, 2005.
[3] Chetan Arora, Subhashis Banerjee, Prem Kalra, and S. N. Maheshwari. Generic cuts: an efﬁcient algorithm for optimal inference in higher order MRF-MAP. In ECCV, 2012.
[4] Maria-Florina Balcan and Nicholas J. A. Harvey. Learning submodular functions. In ACM Symposium on Theory of Computing (STOC), pages 793–802, 2011.
[5] D. Batra and P. Kohli. Making the right moves: Guiding alpha-expansion using local primal-dual gaps. In CVPR, pages 1865–1872, 2011.
[6] I. Ben Ayed, L. Gorelick, and Y. Boykov. Auxiliary cuts for general classes of higher order functionals. In CVPR, 2013.
[7] J. Besag. On the statistical analysis of dirty pictures (with discussion). Journal of the Royal Statistical Society, Series B, 48(3):259–302, 1986.
[8] E. Boros and P. L. Hammer. Pseudo-boolean optimization. Discrete Applied Mathematics, 123(1-3), 2002.
[9] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[10] Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of min-cut/max-ﬂow algorithms for energy minimization in vision. TPAMI, 26(9):1124–1137, 2004.
[11] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. In International Conference on Computer Vision (ICCV), pages 377–384, 1999.
[12] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. TPAMI, 23(11):1222–1239, 2001.
[13] Qifeng Chen and Vladlen Koltun. Fast MRF optimization with application to depth reconstruction. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 3914–3921, 2014.
[14] B. V. Cherkassky and A. V. Goldberg. On implementing push-relabel method for the maximum ﬂow problem. Algorithmica, 19:390–410, 1997.
[15] J. Edmonds and R. Giles. A min-max relation for submodular functions on graphs. Annals of Discrete Mathematics, 1:185–204, 1977.
[16] Jack Edmonds. Submodular functions, matroids, and certain polyhedra. In Michael Jünger, Gerhard Reinelt, and Giovanni Rinaldi, editors, Combinatorial Optimization Eureka, You Shrink!, volume 2570 of Lecture Notes in Computer Science, pages 11–26. Springer Berlin Heidelberg, 2003.
[17] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In International Conference on Machine Learning (ICML), pages 304–311, 2008.
[18] A. Fix, A. Gruber, E. Boros, and R. Zabih. A graph cut algorithm for higher-order Markov Random Fields. In ICCV, 2011.
[19] A. Fix, T. Joachims, S. Park, and R. Zabih. Structured learning of sum-ofsubmodular higher order energy functions. In ICCV, 2013.
[20] Alexander Fix and Sameer Agarwal. Duality and the Continuous Graphical Model, pages 266–281. Springer International Publishing, Cham, 2014.
[21] Alexander Fix, Joyce Chen, Endre Boros, and Ramin Zabih. Computer Vision – ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part I, chapter Approximate MRF Inference Using Bounded Treewidth Subgraphs, pages 385–398. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[22] Alexander Fix, Chen Wang, and Ramin Zabih. A primal-dual algorithm for higher-order multilabel Markov Random Fields. In CVPR, 2014. Supplemental Material at www.cs.cornell.edu/~afix/.
[23] L. Ford and D. Fulkerson. Flows in Networks. Princeton University Press, 1962.
[24] D. Freedman and P. Drineas. Energy minimization via graph cuts: Settling what is possible. In CVPR, 2005.
[25] Brendan Frey and David MacKay. A revolution: Belief propagation in graphs with cycles. In Neural Information Processing Systems (NIPS), 1997.
[26] S Fujishige and X Zhang. A push/relabel framework for submodular ﬂows and its reﬁnement for 0-1 submodular ﬂows. Optimization, 38(2):133–154, 1996.
[27] Andrew C Gallagher, Dhruv Batra, and Devi Parikh. Inference for order reduction in markov random ﬁelds. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1857–1864. IEEE, 2011.
[28] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. TPAMI, 6:721–741, 1984.
[29] Amir Globerson and Tommi S. Jaakkola. Fixing max-product: Convergent message passing algorithms for map lp-relaxations. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 553–560. Curran Associates, Inc., 2008.
[30] Andrew V. Goldberg, Sagi Hed, Haim Kaplan, Robert E. Tarjan, and Renato F. Werneck. Maximum ﬂows by incremental breadth-ﬁrst search. In European Symposium on Algorithms, pages 457–468. Springer-Verlag, 2011.
[31] Lena Gorelick, Yuri Boykov, Olga Veksler, Ismail Ben Ayed, and Andrew Delong. Submodularization for binary pairwise energies. In CVPR, 2014.
[32] V. Gulshan, C. Rother, A. Criminisi, A. Blake, and A. Zisserman. Geodesic star convexity for interactive image segmentation. In CVPR, 2010.
[33] P. L. Hammer, P. Hansen, and B. Simeone. Roof duality, complementation and persistency in quadratic 0-1 optimization. Mathematical Programming, 28:121–155, 1984.
[34] P.L. Hammer and S. Rudeanu. Boolean Methods in Operations Research and Related Areas. Springer, 1968.
[35] J. M. Hammersley and P. E. Clifford. Markov random ﬁelds on ﬁnite graphs and lattices. Unpublished manuscript, 1971.
[36] Xuming He, Richard S. Zemel, and Miguel Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, pages 695–703, 2004.
[37] H. Ishikawa. Higher-order clique reduction in binary graph cut. In CVPR, 2009.
[38] H. Ishikawa. Higher-order gradient descent by fusion-move graph cut. In ICCV, 2009.
[39] Hiroshi Ishikawa. Exact optimization for Markov Random Fields with convex priors. TPAMI, 25(10):1333–1336, 2003.
[40] Hiroshi Ishikawa. Transformation of general binary MRF minimization to the ﬁrst order case. TPAMI, 33(6), 2010.
[41] T. Joachims, T. Finley, and Chun-Nam Yu. Cutting-plane training of structural svms. Machine Learning, 77(1):27–59, 2009.
[42] Vladimir Jojic, Stephen Gould, and Daphne Koller. Accelerated dual decomposition for map inference. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 503–510, 2010.
[43] F. Kahl and P. Strandmark. Generalized roof duality for pseudo-boolean optimization. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 255–262, 2011.
[44] Fredrik Kahl and Petter Strandmark. Generalized roof duality. Discrete Applied Mathematics, 160(16–17):2419–2434, 2012.
[45] Jörg H. Kappes, Bjoern Andres, Fred A. Hamprecht, Christoph Schnörr, Sebastian Nowozin, Dhruv Batra, Sungwoong Kim, Bernhard X. Kausler, Jan Lellmann, Nikos Komodakis, et al. A comparative study of modern inference techniques for discrete energy minimization problems. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1328–1335. IEEE, 2013.
[46] B. M. Kelm, N. Mueller, B. H. Menze, and F. A. Hamprecht. Bayesian estimation of smooth parameter maps for dynamic contrast-enhanced MR images with block-ICM. In Computer Vision and Pattern Recognition Workshop, 2006. CVPRW '06. Conference on, pages 96–96, June 2006.
[47] Jon Kleinberg and Éva Tardos. Approximation algorithms for classification problems with pairwise relationships: metric labeling and Markov Random Fields. J. ACM, 49(5):616–639, 2002. ACM Press.
[48] Pushmeet Kohli, M. Pawan Kumar, and Philip H.S. Torr. P3 and beyond: Move making algorithms for solving higher order functions. TPAMI, 31(9):1645–1656, 2008.
[49] Pushmeet Kohli, Lubor Ladicky, and Philip Torr. Robust higher order potentials for enforcing label consistency. IJCV, 82:302–324, 2009.
[50] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. In International Workshop on Artiﬁcial Intelligence and Statistics (AISTATS), 2005.
[51] V. Kolmogorov and C. Rother. Minimizing nonsubmodular functions with graph cuts-a review. TPAMI, 29(7):1274–1279, July 2007. Earlier version appears as technical report MSR-TR-2006-100.
[52] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? TPAMI, 26(2):147–59, 2004.
[53] Vladimir Kolmogorov. Minimizing a sum of submodular functions. Discrete Appl. Math., 160(15):2246–2258, October 2012.
[54] Vladimir Kolmogorov and Thomas Schoenemann. Generalized sequential tree-reweighted message passing. CoRR, abs/1205.6352, 2012.
[55] N. Komodakis and N. Paragios. Beyond pairwise energies: Efﬁcient optimization for higher-order MRFs. In CVPR, pages 2985–2992, 2009.
[56] Nikos Komodakis, Nikos Paragios, and Georgios Tziritas. Mrf energy minimization and beyond via dual decomposition. In IN: IEEE PAMI., 2011.
[57] Nikos Komodakis and Georgios Tziritas. A new framework for approximate labeling via graph cuts. In International Conference on Computer Vision (ICCV), 2005.
[58] Nikos Komodakis and Georgios Tziritas. Approximate labeling via graph cuts based on linear programming. TPAMI, 29(8):1436–1453, 2007.
[59] Nikos Komodakis, Georgios Tziritas, and Nikos Paragios. Fast primal-dual strategies for MRF optimization. Technical Report 0605, Ecole Centrale de Paris, 2006.
[60] H. Koppula, A. Anand, T. Joachims, and A. Saxena. Semantic labeling of 3D point clouds for indoor scenes. In Conference on Neural Information Processing Systems (NIPS), 2011.
[61] Vivek Kwatra, Arno Schödl, Irfan Essa, Greg Turk, and Aaron Bobick. Graphcut textures: Image and video synthesis using graph cuts. SIGGRAPH, 2003.
[62] Dongjin Kwon, Kyong Joon Lee, Il Dong Yun, and Sang Uk Lee. Nonrigid image registration using dynamic higher-order MRF model. In ECCV, pages 373–386, 2008.
[63] Lubor Ladicky, Chris Russell, Pushmeet Kohli, and Philip H. S. Torr. Computer Vision – ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part V, chapter Graph Cut Based Inference with Co-occurrence Statistics, pages 239–253. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
[64] Lubor Ladicky, Christopher Russell, Pushmeet Kohli, and Philip H. S. Torr. Associative hierarchical CRFs for object class image segmentation. In ICCV, pages 739–746, 2009.
[65] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[66] V Lempitsky, C Rother, S Roth, and A Blake. Fusion moves for Markov Random Field optimization. TPAMI, 32(8):1392–1405, Aug 2010.
[67] Hui Lin and Jeff Bilmes. Learning mixtures of submodular shells with application to document summarization. In UAI, pages 479–490, 2012.
[68] L. Lovász. Submodular functions and convexity, pages 235–257. Springer Berlin Heidelberg, Berlin, Heidelberg, 1983.
[69] Carsten Lund and Mihalis Yannakakis. On the hardness of approximating minimization problems. J. ACM, 41(5):960–981, September 1994.
[70] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
[71] D. Munoz, J. A. Bagnell, N. Vandapel, and M. Hebert. Contextual classification with functional max-margin Markov networks. In CVPR, pages 975–982, 2009.
[72] Claudia Nieuwenhuis, Eno Töppe, Lena Gorelick, Olga Veksler, and Yuri Boykov. Efficient regularization of squared curvature. CoRR, abs/1311.1838, 2013.
[73] James B. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Math. Program., 118(2):237–251, January 2009.
[74] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988.
[75] R. Potts. Some generalized order-disorder transformations. Proceedings of the Cambridge Philosophical Society, 48:106–109, 1952.
[76] I.G. Rosenberg. Reduction of bivalent maximization to the quadratic case. Technical report, Centre d’Études de Recherche Opérationnelle, 1975.
[77] Stefan Roth and Michael Black. Fields of experts. IJCV, 82:205–229, 2009.
[78] C. Rother, P. Kohli, W. Feng, and J.Y. Jia. Minimizing sparse higher order energy functions of discrete variables. In CVPR, pages 1382–1389, 2009.
[79] C. Rother, V. Kolmogorov, and A. Blake. “GrabCut” - interactive foreground extraction using iterated graph cuts. SIGGRAPH, 23(3):309–314, 2004.
[80] C. Rother, S. Kumar, V. Kolmogorov, and A. Blake. Digital tapestry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[81] Bogdan Savchynskyy, Stefan Schmidt, Jörg Kappes, and Christoph Schnörr. Efficient MRF energy minimization via adaptive diminishing smoothing. Uncertainty in Artificial Intelligence, UAI-2012, pages 746–755, 2012.
[82] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3):7–42, 2002.
[83] Daniel Scharstein and Richard Szeliski. High-accuracy stereo depth maps using structured light. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, pages I–195. IEEE, 2003.
[84] Dmitrij Schlesinger. Exact solution of permuted submodular minsum problems. In Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 28–38. 2007.
[85] Alexander Shekhovtsov, Pushmeet Kohli, and Carsten Rother. Pattern Recognition: Joint 34th DAGM and 36th OAGM Symposium, Graz, Austria, August 28-31, 2012. Proceedings, chapter Curvature Prior for MRF-Based Segmentation and Shape Inpainting, pages 41–51. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[86] M.I. Shlezinger. Syntactic analysis of two-dimensional visual signals in the presence of noise. Cybernetics, 12(4):612–628, 1976.
[87] R. Sipos, P. Shivaswamy, and T. Joachims. Large-margin learning of submodular summarization models. In Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2012.
[88] Petter Strandmark and Fredrik Kahl. Energy Minimization Methods in Computer Vision and Pattern Recognition: 8th International Conference, EMMCVPR 2011, St. Petersburg, Russia, July 25-27, 2011. Proceedings, chapter Curvature Regularization for Curves and Surfaces in a Global Optimization Framework, pages 205–218. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
[89] Paul Swoboda, Bogdan Savchynskyy, Jörg H. Kappes, and Christoph Schnörr. Partial optimality by pruning for MAP-inference with general graphical models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[90] Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, and Carsten Rother. A comparative study of energy minimization methods for Markov Random Fields. TPAMI, 30(6):1068–1080, 2008.
[91] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In ECCV, pages II: 582–595, 2008.
[92] Meng Tang, Ismail Ben Ayed, and Yuri Boykov. Pseudo-bound optimization for binary energies. In ECCV, 2014.
[93] B. Taskar, C. Guestrin, and D. Koller. Maximum-margin Markov networks. In Advances in Neural Information Processing Systems (NIPS), 2003.
[94] Benjamin Taskar, Vassil Chatalbashev, and Daphne Koller. Learning associative Markov networks. In ICML, 2004.
[95] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In International Conference on Machine Learning (ICML), pages 104–112, 2004.
[96] Riaz A. Usmani. Inversion of a tridiagonal Jacobi matrix. Linear Algebra and Its Applications, 212:413–414, 1994.
[97] Chaohui Wang, Nikos Komodakis, and Nikos Paragios. Markov random field modeling, inference & learning in computer vision & image understanding: A survey. Computer Vision and Image Understanding, 117(11):1610–1627, 2013.
[98] Chaohui Wang, Olivier Teboul, Fabrice Michel, Salma Essafi, and Nikos Paragios. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2010: 13th International Conference, Beijing, China, September 20-24, 2010, Proceedings, Part III, chapter 3D Knowledge-Based Segmentation Using Pose-Invariant Higher-Order Graphs, pages 189–196. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
[99] Y. Weiss, C. Yanover, and T. Meltzer. MAP estimation, linear programming and belief propagation with convex free energies. In Uncertainty in AI, 2007.
[100] T. Werner. A linear programming approach to max-sum problem: A review. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(7):1165–1179, July 2007.
[101] Tomáš Werner. High-arity interactions, polyhedral relaxations, and cutting plane algorithm for soft constraint optimisation (MAP-MRF). In CVPR 2008: Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 109–116, Madison, USA, June 2008. IEEE Computer Society, Omnipress.
[102] Oliver Woodford, Philip Torr, Ian Reid, and Andrew Fitzgibbon. Global stereo reconstruction under second-order smoothness priors. TPAMI, 31:2115–2128, 2009.
[103] Yisong Yue and T. Joachims. Predicting diverse subsets using structural SVMs. In International Conference on Machine Learning (ICML), pages 271–278, 2008.
[104] Stanislav Živný, David Cohen, and Peter Jeavons. The expressive power of binary submodular functions. In Mathematical Foundations of Computer Science 2009, volume 5734 of LNCS, pages 744–757. 2009.