Algebra of inference in graphical models revisited


Siamak Ravanbakhsh, Russell Greiner
Computing Science Dept., University of Alberta, Edmonton, AB, Canada T6G 2E8

arXiv:1409.7410v2 [cs.AI] 29 Sep 2014

{mravanba, rgreiner}@ualberta.ca



Abstract—Graphical models use the intuitive and well-studied methods of graph theory to implicitly represent dependencies between the variables of large systems, modelling the global behaviour of a complex system by specifying only local factors. The variational perspective poses inference as optimization and provides a rich framework to study inference when the object of interest is a (log) probability. However, graphical models can operate on a much wider set of algebraic structures. This paper builds on the work of Aji and McEliece (2000) to formally and broadly express what constitutes an inference problem in a graphical model. We then study the computational complexity of inference and show that inference in any commutative semiring is NP-hard under randomized reduction. By confining inference to the four basic operations of min, max, sum and product, we introduce an inference hierarchy with an eye on computational complexity, and we establish the limits of message passing using the distributive law in solving the problems in this hierarchy.

Learning and inference of the global properties of a system of interacting local functions is a pattern that has been rediscovered (almost independently) in different fields of science. Graphical models have been used in bioinformatics, neuroscience, communication theory, statistical physics, image processing, compressed sensing, robotics, sensor networks, social networks, natural language processing, speech recognition, combinatorial optimization and artificial intelligence. The variational perspective [1] expresses inference as constrained optimization, and message passing procedures are often derived using the method of Lagrange multipliers. The variational perspective is in essence concerned with probabilities – in particular the exponential family – and "probabilistic" graphical models. Aji and McEliece [2] offer a different viewpoint on message passing as the generalized distributive law. According to this view, message passing procedures are systematic procedures to leverage the distributive law in a broad class of algebraic objects. When inference is confined to (log) probabilities, these two approaches result in the same message passing procedures [3]. Here we use the term "graphical model" rather than "probabilistic graphical model" to stress this generality. Although the framework of Aji and McEliece explains some widely used procedures (from matrix multiplication to the fast Fourier transform) as message passing, it is unable to formulate general inference problems in graphical models. For example, a well-studied problem such as marginal-MAP inference (a.k.a. most probable explanation), which does not allow direct application of the distributive law, does not fit within this framework.

In section 1, we backtrack from the structure used by Aji and McEliece – i.e., the commutative semiring – to express inference in graphical models using the more basic form of a commutative semigroup. Section 2 focuses on four semigroups defined by summation, multiplication, minimization and maximization to construct a hierarchy of inference problems within PSPACE, such that the problems in the same class of the hierarchy belong to the same complexity class. Here, we encounter some new inference problems and establish the completeness of problems at the lower levels of the hierarchy, such as min-max and sum-min, w.r.t. their complexity classes. In section 3 we augment our simple structures with two properties to obtain message passing. Here, we also observe that replacing a semigroup with an abelian group gives us normalized marginalization as a form of inference inquiry. Then, using a previously unexplored property of commutative semirings – i.e., the existence of identity and annihilator elements – section 3.2 shows that inference in any commutative semiring is NP-hard. We also review the BP equations and obtain a negative result regarding the possibility of using the distributive law to perform efficient inference with more than one marginalization operation.

1 THE PROBLEM OF INFERENCE

We use commutative semigroups both to define what a graphical model represents and to define inference over this graphical model.

Definition 1.1. A commutative semigroup is a tuple G = (Y∗, ⊗), where Y∗ is a set and ⊗ : Y∗ × Y∗ → Y∗ is a binary operation that is (I) associative: a ⊗ (b ⊗ c) = (a ⊗ b) ⊗ c and (II) commutative: a ⊗ b = b ⊗ a for all a, b, c ∈ Y∗. If there is an identity element 1 such that a ⊗ 1 = a, and every element a ∈ Y∗ has an inverse a⁻¹ (often written 1/a) such that a ⊗ a⁻¹ = 1, the commutative semigroup is an abelian group.

Here, the associativity and commutativity properties of a commutative semigroup make the operations invariant to the order of the elements. In general, these properties are not "vital"


and one may define inference starting from a magma.¹

Example 1.1.
• The set of strings with the concatenation operation forms a semigroup with the empty string as its identity element. However, this semigroup is not commutative.
• The set of natural numbers N with summation defines a commutative semigroup.
• The integers modulo n with addition define an abelian group.
• The power-set 2^S of any set S with the intersection operation defines a commutative semigroup with S as its identity element.
• The set of natural numbers with the greatest common divisor operation defines a commutative semigroup.
• In fact, any semilattice is a commutative semigroup.
• Given two commutative semigroups on two sets Y∗ and Z∗, their Cartesian product is also a commutative semigroup.

Factor-graphs [5], Markov networks [6], Bayesian networks [7] and Forney graphs [8] are all different forms of graphical models. Since factor-graphs can represent Markov networks, Bayesian networks and Forney graphs, we base our treatment on factor-graphs. Here, we only consider graphical models over discrete variables.

Let x = {x1, . . . , xN} be a set of N discrete variables, xi ∈ Xi, where Xi is the domain of xi and x ∈ X = X1 × . . . × XN. Let I ⊆ N = {1, 2, . . . , N} denote a subset of variable indices and let xI = {xi | i ∈ I} ∈ XI be the tuple of variables in x indexed by the subset I. A factor fI : XI → YI is a function over a subset of variables.

Definition 1.2. A factor-graph is a pair (F, G) such that
• F = {fI} is a collection of factors with collective range Y = ∪_I YI,
• F has a polynomial-sized representation in N,
• G = (Y∗, ⊗) is a commutative semigroup, where Y∗ is the closure of Y w.r.t. ⊗.
It compactly represents the expanded (joint) form

    q(x) = ⊗_I fI(xI)    (1)
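As a concrete (toy) illustration of Definition 1.2 and eq (1), the expanded form of a small factor-graph can be tabulated by brute force. The factor tables and variable names below are hypothetical, chosen only to make the sketch runnable:

```python
from itertools import product

# A toy factor-graph over three binary variables with expansion
# operation ⊗ = product. Each factor maps an assignment of its scope
# (a subset of variable indices) to a value; the tables are illustrative.
factors = {
    (0, 1): lambda t: 2.0 if t[0] == t[1] else 0.5,   # f_{01}(x0, x1)
    (1, 2): lambda t: 1.0 + t[0] + t[1],              # f_{12}(x1, x2)
}

def expanded_form(x):
    """q(x) = ⊗_I f_I(x_I), eq (1), with ⊗ = product."""
    q = 1.0
    for scope, f in factors.items():
        q *= f(tuple(x[i] for i in scope))
    return q

# The expanded form has |X| = 2^3 entries: exponential in N in general,
# which is exactly why the factored representation is the useful one.
table = {x: expanded_form(x) for x in product((0, 1), repeat=3)}
```

Note that only the local factors are stored; the full table is materialized here solely to make the exponential blow-up of the joint form visible.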

1. Magma [4] generalizes semigroup, as it requires neither the associativity property nor an identity element. Here, to use a magma (in Definition 1.2), the elements of Y∗ and/or X should be ordered so as to avoid ambiguity in the order of pairwise operations over the set.

Note that the connection between the set of factors F and the commutative semigroup is through the "range" of the factors. Since each factor fI requires O(|XI|) space for its representation, F needs O(Σ_I |XI|) space. Sufficient conditions for this number to be polynomial are: 1) |I| = O(log(N)) ∀I and 2) |F| = Poly(N). However, the first condition is not necessary, as for example exponential-sized sparse factors can have a polynomial representation.

F can be conveniently represented as a bipartite graph that includes two sets of nodes: variable nodes xi and factor nodes I. A variable node i (note that we will often identify a variable xi with its index "i") is connected to a factor node I if and only if i ∈ I. We will use ∂ to denote the neighbours of a variable or factor node in the factor-graph – that is, ∂I = {i | i ∈ I} (which is the set I) and ∂i = {I | i ∈ I}.

A marginalization operation shrinks the expanded form q(x) using another commutative semigroup with binary operation ⊕. Inference is a combination of expansion and one or more shrinkage operations, which can be difficult due to the exponential size of the expanded form.

Definition 1.3. Given a function q : XJ → Y and a commutative semigroup G = (Y∗, ⊕), where Y∗ is the closure of Y w.r.t. ⊕, the marginal of q for I ⊂ J is

    q(xJ\I) := ⊕_{xI} q(xJ)    (2)

where ⊕_{xI} q(xI) is short for ⊕_{xI∈XI} q(xI) and means the operation ⊕ is performed over the set of all values that the tuple xI ∈ XI can take. We can think of q(x) as a |J|-dimensional tensor and of marginalization as performing the ⊕ operation over the axes in the set I. The result is another |J \ I|-dimensional tensor (or function) that we call the marginal. If the marginalization is over all the dimensions in J, we denote the marginal by q(∅) instead of q(x∅) and call it the integral of q. Now we define an inference problem as a sequence of marginalizations over the expanded form of a factor-graph.

Definition 1.4. An inference problem seeks

    q(xJ0) = ⊕^M_{xJM} ⊕^{M−1}_{xJM−1} . . . ⊕^1_{xJ1} ⊗_I fI(xI)    (3)

where
• Y∗ is the closure of Y (the collective range of the factors) w.r.t. ⊕^1, . . . , ⊕^M and ⊗,
• Gm = (Y∗, ⊕^m) ∀ 1 ≤ m ≤ M and Ge = (Y∗, ⊗) are all commutative semigroups,
• J0, . . . , JM partition the set of variable indices N,
• q(xJ0) has a polynomial representation in N.

Note that ⊕^1, . . . , ⊕^M refer to potentially different operations, as each belongs to a different semigroup. When J0 = ∅, we call the inference problem integration (denoting the inquiry by q(∅)); otherwise we call it marginalization. Here, having a constant-sized J0 is not always enough to ensure that q(xJ0) has a polynomial representation in N. This is because the size of q(xJ0) for any individual xJ0 ∈ XJ0 may grow exponentially with N (e.g., see claim 2.1).

Example 1.2. A "probabilistic" graphical model is defined using the expansion semigroup Ge = (ℜ+, ×) and often the marginalization semigroup Gm = (ℜ+, +); the expanded form represents the unnormalized joint probability q(x) = ∏_I fI(xI), whose marginal probabilities are simply called marginals. Replacing the summation with the marginalization semigroup Gm = (ℜ+, max) seeks the maximum-probability state, and the resulting integration problem q(∅) = max_x ∏_I fI(xI) is known as maximum a posteriori (MAP) inference. Alternatively, by adding a second marginalization operation to the summation, we get the marginal MAP inference

    q(xJ0) = max_{xJ2} Σ_{xJ1} ∏_I fI(xI)    (4)

where here ⊕^1 = Σ, ⊕^2 = max and ⊗ = ∏. If the object of interest is the negative log-probability (a.k.a. energy), the product expansion semigroup is replaced by Ge = (ℜ, sum). Instead of the sum marginalization semigroup, we can use the log-sum-exp semigroup Gm = (ℜ, ⊕), where a ⊕ b := −log(e^{−a} + e^{−b}). The integral in this case is the (negative) log-partition function. If we change the marginalization semigroup to Gm = (ℜ, min), the integral is the minimum energy (corresponding to MAP).
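The marginal MAP query of eq (4) can be evaluated by brute force on a small model, which makes the roles of the two marginalization semigroups explicit. The factor tables below are illustrative, not from the paper:

```python
from itertools import product

# Brute-force marginal MAP, eq (4): max over x_{J2} of the sum over
# x_{J1} of the product of factors. Toy model: three binary variables,
# J1 = {0, 1} (summed out), J2 = {2} (maximized), J0 = {} so the
# result is the integral q(∅).
vals = (0, 1)
f_a = {(x0, x1): 0.4 + 0.2 * (x0 == x1) for x0, x1 in product(vals, vals)}
f_b = {(x1, x2): 0.3 + 0.1 * x1 + 0.2 * x2 for x1, x2 in product(vals, vals)}

def q(x):
    # expanded form with ⊗ = product
    return f_a[(x[0], x[1])] * f_b[(x[1], x[2])]

integral = max(                                   # ⊕^2 = max over x_{J2}
    sum(q((x0, x1, x2))                           # ⊕^1 = sum over x_{J1}
        for x0, x1 in product(vals, vals))
    for x2 in vals
)
```

Swapping `max` for `sum` in the outer loop recovers plain sum-product integration, and replacing both with `min`/`max` gives the other single-semigroup queries of Example 1.2.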

The operation min (and likewise max) is a choice function, which means min_{a∈A} a ∈ A. The implication is that if ⊕^M in Definition 1.4 is min (or max), we can replace it with arg min_{xJM} (or arg max_{xJM}) without changing the inference problem. For example, in performing min-max inference we may ask for x∗ = arg min_x max_I fI(xI).
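The choice-function observation has a direct computational reading: the same enumeration that computes the min-max integral also yields the minimizing assignment. A brute-force sketch with hypothetical factors:

```python
from itertools import product

# Since min is a choice function, q(∅) = min_x max_I f_I(x_I) is
# attained by some assignment, so we can also recover
# x* = arg min_x max_I f_I(x_I). Two binary variables, toy factors.
vals = (0, 1)
factors = [
    lambda x: 3 - 2 * x[0],      # factor over x0
    lambda x: 1 + x[0] + x[1],   # factor over x0, x1
]

def worst_factor(x):
    # inner marginalization: max over factors
    return max(f(x) for f in factors)

x_star = min(product(vals, vals), key=worst_factor)  # arg min replaces min
q_min_max = worst_factor(x_star)                     # same value as min-max integral
```

Replacing `min(..., key=...)` by a plain `min` over the values gives q(∅) itself; the two computations coincide because min selects an element of its argument set.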

2 THE INFERENCE HIERARCHY

Often, the complexity class is concerned with the decision version of the inference problem in Definition 1.4. The decision version of an inference problem asks a yes/no question about the integral: q(∅) ≥? q for a given q. Here, we produce a hierarchy of inference problems in analogy to the polynomial [9], counting [10] and arithmetic [11] hierarchies. We assume the following in Definition 1.4: (I) any two consecutive marginalization operations are distinct (⊕^l ≠ ⊕^{l+1} ∀ 1 ≤ l < M); (II) the marginalization index sets Jl ∀ 1 ≤ l ≤ M are non-empty – moreover, if |Jl| = O(log(N)) we call this marginalization operation a polynomial marginalization, as here |XJl| = Poly(N); (III) inference involves M (1 < M ≤ N) marginalization operations – that is, M may grow with N; (IV) Y∗ ⊆ ℜ+. Note that we can re-express any inference problem to enforce (I) and (II), so these assumptions do not impose any restriction.

Define five inference families Σ, Π, Φ, Ψ, ∆. The families are associated with the "last" marginalization operation – i.e., ⊕^M in Definition 1.4. Σ is the family of inference problems where ⊕^M = +. Similarly, Π is associated with product, Φ with minimization and Ψ with maximization. ∆ is the family of inference problems where the last marginalization is polynomial (i.e., |JM| = O(log(N)), regardless of ⊕^M). Each family is parameterized by a subscript M and two sets S and D (e.g., ΦM(S, D) is an inference "class" in family Φ). As before, M is the number of marginalization operations, S is the set of indices of the (exponential) sum-marginalizations and D is the set of indices of the polynomial marginalizations. We will see that these arguments are sufficient to identify the complexity class of inference.

Example 2.1. Sum-min-sum-product is a short notation for the decision problem

    Σ_{xJ3} min_{xJ2} Σ_{xJ1} ∏_I fI(xI) ≥? q

where J1, J2 and J3 partition N. Let us assume J1 = {2, . . . , N/2}, J2 = {N/2 + 1, . . . , N} and J3 = {1}. Since we have three marginalization operations, M = 3. Here the first and second marginalizations are exponential and the third one is polynomial (since |J3| is constant); therefore D = {3}. Since the only exponential summation is ⊕^1_{xJ1} = Σ_{xJ1}, S = {1}. In our inference hierarchy, this problem belongs to the class ∆3({1}, {3}). Alternatively, if we assume J1, J2 and J3 all grow linearly with N, the corresponding inference problem becomes a member of Σ3({1, 3}, ∅).

Define the base members of the families as

    Σ0(∅, ∅) := {sum}    Π0(∅, ∅) := {prod}
    Φ0(∅, ∅) := {min}    Ψ0(∅, ∅) := {max}
    ∆1(∅, {1}) := {sum − sum, min − min, max − max}

where none of these initial members are actual inference problems; they only provide the basis for the recursive definition of the different families. The exception is ∆1(∅, {1}), which contains three inference problems.²

Let ΞM(S, D) := ΣM(S, D) ∪ ΠM(S, D) ∪ ΦM(S, D) ∪ ΨM(S, D) ∪ ∆M(S, D) be the union of all families with parameters M, S and D. Also let ΞM \ ΣM(S, D) be short notation for ΞM(S, D) \ ΣM(S, D) – i.e., the inference problems that are in ΞM(S, D) and not in ΣM(S, D). Now define the inference family members recursively:

    ΣM+1(S ∪ {M + 1}, D) := {sum − ξ | ξ ∈ ΞM \ ΣM(S, D)}
    ΦM+1(S, D) := {min − ξ | ξ ∈ ΞM \ ΦM(S, D)}
    ΨM+1(S, D) := {max − ξ | ξ ∈ ΞM \ ΨM(S, D)}
    ΠM+1(S, D) := ∅
        where |JM+1| = ω(log(N)), M > 0, and

    ∆M+1(S, D ∪ {M + 1}) := {⊕ − ξ | ξ ∈ ΞM(S, D)}
        where |JM+1| = O(log(N)), M > 1 and ⊕ ∈ {sum, min, max}.

In the definitions above, we ignored inference problems in which product appears as one of the marginalization semigroups (e.g., product-sum). The following claim explains this choice.

Claim 2.1.³ For ⊕^M = prod, the inference query q(xJ0) can have an exponential representation in N, and inference is outside PSPACE.⁴

2. We treat M = 1 for ∆ specially, as in this case the marginalization operation cannot be polynomial. This is because if |J1| = O(log(N)), then |J0| = Ω(N), which violates the conditions in the definition of the inference problem.
3. All proofs appear in the appendix.
4. PSPACE is the class of problems that can be solved by a (non-deterministic) Turing machine in polynomial space.
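The bookkeeping of Example 2.1 – deriving M, S, D and the family of an inference class from the sequence of marginalization operations – can be sketched as a small helper. The function and its string output format are hypothetical, introduced only for illustration:

```python
# Hypothetical helper mirroring Example 2.1: given the marginalization
# operations 1..M (innermost first) and the set of 1-based indices that
# are polynomial (|J_l| = O(log N)), compute S, D and the family name
# determined by the last marginalization.
def classify(ops, polynomial):
    """ops: e.g. ['sum', 'min', 'sum']; polynomial: set of indices."""
    M = len(ops)
    D = set(polynomial)
    # S: indices of exponential sum-marginalizations
    S = {l for l, op in enumerate(ops, 1) if op == 'sum' and l not in D}
    if M in D:
        family = 'Delta'                      # last marginalization is polynomial
    else:
        family = {'sum': 'Sigma', 'min': 'Phi',
                  'max': 'Psi', 'prod': 'Pi'}[ops[-1]]
    return f"{family}_{M}({sorted(S)}, {sorted(D)})"

# sum-min-sum-product with |J3| constant, as in Example 2.1:
print(classify(['sum', 'min', 'sum'], polynomial={3}))  # prints Delta_3([1], [3])
```

With all index sets growing linearly (`polynomial=set()`), the same query lands in Sigma_3([1, 3], []), matching the alternative reading of Example 2.1.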

2.1 Single marginalization

The inference classes in the hierarchy with one marginalization are

    ∆1(∅, {1}) = {min − min, max − max, sum − sum}    (5)
    Ψ1(∅, ∅) = {max − min, max − sum, max − prod}    (6)
    Φ1(∅, ∅) = {min − max, min − sum, min − prod}    (7)
    Σ1({1}, ∅) = {sum − prod, sum − min, sum − max}    (8)

Here we establish the complexity of the problems starting from ∆1.

Proposition 2.2. Sum-sum, min-min and max-max inference are in P.

Max-sum and max-prod are widely studied, and it is known that their decision version is NP-complete [12]. By reduction from satisfiability we can show that max-min inference [13] is also NP-hard.

Theorem 2.3. The decision version of max-min inference that asks max_x min_I fI(xI) ≥? q is NP-complete.

This means all the problems in Ψ1(∅, ∅) are in NP (and in fact are complete w.r.t. their complexity class). In contrast, the problems in Φ1(∅, ∅) are in coNP.⁵ Note that by changing the decision problem from q(∅) ≥? q to q(∅) ≤? q, the complexity classes of the problems in the Φ and Ψ families are reversed (i.e., the problems in Φ1(∅, ∅) become NP-complete and the problems in Ψ1(∅, ∅) become coNP-complete). To our knowledge, the only complexity result available for Σ1({1}, ∅) concerns the PP-completeness⁶ of sum-product inference [14], [15]. Here, we prove the same result for sum-min (sum-max) inference.

Theorem 2.4. The sum-min decision problem Σ_x min_I fI(xI) ≥? q is PP-complete for Y = {0, 1}.

At this point we have reviewed all the problems in the hierarchy that have only a single marginalization operation and proved that ∆1, Ψ1, Φ1 and Σ1 are complete w.r.t. P, NP, coNP and PP respectively.

5. coNP is the class of decision problems in which the "NO instances" have a polynomial-time verifiable witness or proof.
6. PP is the class of problems that are polynomially solvable using a non-deterministic Turing machine, where the acceptance condition is that the majority of computation paths accept.

2.2 Complexity for the general case

Let ℧(.) denote the complexity class of an inference class in the hierarchy. In obtaining the complexity class of problems with M > 1, we use the following fact, which is also used in the polynomial hierarchy: P^NP = P^coNP [16]. In fact P^(NP^A) = P^(coNP^A) for any oracle A. This means that by adding a polynomial marginalization to the problems in ΦM(S, D) and ΨM(S, D), we get the same complexity class ℧(∆M+1(S, D ∪ {M + 1})). The following gives a recursive definition of the complexity class for the problems in the inference hierarchy:⁷

    ℧(ΦM+1(S, D)) = coNP^℧(ΞM\ΦM(S,D))    (9)
    ℧(ΨM+1(S, D)) = NP^℧(ΞM\ΨM(S,D))    (10)
    ℧(ΣM+1(S ∪ {M + 1}, D)) = PP^℧(ΞM\ΣM(S,D))    (11)
    ℧(∆M+1(S, D ∪ {M + 1})) = P^℧(ΞM\∆M(S,D))    (12)

7. We do not prove completeness w.r.t. complexity classes beyond the first level of the hierarchy and only assert membership.

Example 2.2. Consider the marginal-MAP inference of eq(4). The decision version of this problem, q(∅) ≥? q, is a member of Ψ2({1}, ∅), which also includes max−sum−min and max−sum−max. The complexity of this class according to eq(10) is ℧(Ψ2({1}, ∅)) = NP^PP. Indeed, marginal-MAP is known to be "complete" w.r.t. NP^PP [17]. Now suppose that the max-marginalization over xJ2 is polynomial (e.g., |J2| is constant). Then marginal-MAP belongs to ∆2({1}, {2}), with complexity P^PP. This is because a Turing machine can enumerate all zJ2 ∈ XJ2 in polynomial time and call its PP oracle to see if q(xJ0 | zJ2) ≥? q, where

    q(xJ0 | zJ2) = Σ_{xJ1} ∏_I fI(xI\J2, zI∩J2)

and accept if any of its calls to the oracle accepts, rejecting otherwise. Here, fI(xI\J2, zI∩J2) is the reduced factor, in which all the variables in xJ2 are fixed to zJ2∩I.

The example above also hints at the rationale behind the recursive definition of the complexity class for each inference class in the hierarchy. Consider the inference family Φ: adding an exponential-sized min-marginalization to an inference problem with known complexity A requires a Turing machine to non-deterministically enumerate the zJM ∈ XJM possibilities, then call the A oracle with the "reduced factor-graph" – in which xJM is clamped to zJM – and reject iff any of the calls to the oracle rejects. In contrast, for the inference family Ψ, the Turing machine accepts iff any of the calls to the oracle accepts. A similar argument explains the recursive definition of complexity for the families Σ and ∆. Here, Toda's theorem [18] has an interesting implication w.r.t. the hierarchy. This theorem states that PP is as hard as the polynomial hierarchy, which means that min−max−min−. . .−max inference for an arbitrary, but constant, number of min and max operations appears below sum-product inference in the inference hierarchy.

2.3 Logical sub-domain of the hierarchy

By restricting the domain Y∗ to {0, 1}, min and max become isomorphic to logical AND (∧) and OR (∨) respectively, where 1 ≅ TRUE and 0 ≅ FALSE. By considering the restriction of the inference hierarchy to these two operations, we can express quantified satisfiability (QSAT) as inference in a graphical model, where ∧ ≅ ∀ and ∨ ≅ ∃ – i.e.,


    min_{xJM} max_{xJM−1} . . . max_{xJ2} min_{xJ1} min_I fI(xI) ≅ ∀xJM ∃xJM−1 . . . ∃xJ2 ∀xJ1 ⋀_I fI(xI)

where each factor plays the role of a clause: fI(xI) = max_{i∈∂I} xi ≅ ⋁_{i∈∂I} xi.

By adding the summation operation, we can express stochastic satisfiability [14] as well. Both QSAT and stochastic SAT are PSPACE-complete. Therefore, if we can show that inference in the inference hierarchy is in PSPACE, it follows that the inference hierarchy is PSPACE-complete as well.⁸

Theorem 2.5. The inference hierarchy is PSPACE-complete.
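The QSAT-as-inference correspondence can be exercised directly: evaluating nested min/max marginalizations over {0, 1} decides a quantified formula. The small formula below is an illustrative example, not one from the paper:

```python
# QSAT as min-max inference on Y* = {0, 1}: the quantified formula
#     ∀x2 ∃x1 (x1 ∨ x2) ∧ (¬x1 ∨ ¬x2)
# corresponds to  min_{x2} max_{x1} min_I f_I(x_I),
# with ∧ ≅ min ≅ ∀ and ∨ ≅ max ≅ ∃ (clauses encoded as max-factors).
clauses = [
    lambda x1, x2: max(x1, x2),          # x1 ∨ x2
    lambda x1, x2: max(1 - x1, 1 - x2),  # ¬x1 ∨ ¬x2
]

def and_of_clauses(x1, x2):
    # conjunction over factors: min plays the role of ∧
    return min(c(x1, x2) for c in clauses)

qsat_value = min(                         # ∀x2 ≅ min over x2
    max(and_of_clauses(x1, x2)            # ∃x1 ≅ max over x1
        for x1 in (0, 1))
    for x2 in (0, 1)
)
# qsat_value == 1 iff the quantified formula is TRUE
```

Swapping the order of `min` and `max` changes the quantifier prefix, which is exactly why the hierarchy distinguishes the order of marginalization operations.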

8. Note that this result is only valid because we are ignoring the product operation for marginalization.

3 COMMUTATIVE SEMIRINGS

Our definition of inference was based on an expansion operation and one or more marginalization operations. If we assume only a single marginalization semigroup, polynomial-time inference is still not generally possible. However, if we further assume that the expansion operation is distributive over marginalization and the factor-graph has no loops, exact polynomial-time inference is possible.

Definition 3.1. A commutative semiring S = (Y∗, ⊕, ⊗) is the combination of two commutative semigroups Ge = (Y∗, ⊗) and Gm = (Y∗, ⊕) with two additional properties:
(I) distributive property: a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c) ∀ a, b, c ∈ Y∗;
(II) identity elements 1_⊕ and 1_⊗ such that 1_⊕ ⊕ a = a and 1_⊗ ⊗ a = a. Moreover, 1_⊕ is an annihilator for Ge = (Y∗, ⊗): a ⊗ 1_⊕ = 1_⊕ ∀ a ∈ Y∗.

The mechanism of efficient inference using the distributive law can be seen in a simple example: instead of calculating min(a + b, a + c), using the fact that summation distributes over minimization, we may obtain the same result using a + min(b, c), which requires fewer operations. We investigate the second property in section 3.2.

Example 3.1. The following are some examples of commutative semirings:
• sum-product (ℜ+, +, ×),
• max-product (ℜ+ ∪ {−∞}, max, ×) and ({0, 1}, max, ×),
• min-max (S, min, max) on any ordered set S,
• min-sum (ℜ ∪ {∞}, min, +) and ({0, 1}, min, +),
• or-and ({TRUE, FALSE}, ∨, ∧),
• union-intersection (2^S, ∪, ∩) for any power-set 2^S,
• the semiring of natural numbers with least common multiple and greatest common divisor (N, lcm, gcd),
• the symmetric difference-intersection semiring for any power-set (2^S, ∇, ∩).

Many of the semirings above are isomorphic – e.g., φ(y) = −log(y) defines an isomorphism between min-sum and max-product. It is also easy to show that the or-and semiring is isomorphic to the min-sum/max-product semiring on Y∗ = {0, 1}.

These inference problems have different properties indirectly inherited from their commutative semirings: for example, since both operations have inverses, sum-product is a field [4]. The availability of an inverse for the ⊗ operation – i.e., when Ge is an abelian group – has an important implication for inference: the expanded form of eq(1) can be normalized, and we may inquire about normalized marginals

    p(xJ) = ⊕_{x\J} p(x)    (13)

where

    p(x) := (1 / q(∅)) ⊗ ⊗_I fI(xI)

and 1/q(∅) denotes the ⊗-inverse of the integral. That is, when working with the normalized expanded form and normalized marginals, we always have ⊕_{xJ} p(xJ) = 1_⊗.
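The two properties in Definition 3.1 can be checked mechanically on a finite sample of values. The sketch below does this for the min-sum semiring (ℜ ∪ {∞}, min, +), where the ⊕-identity is ∞, the ⊗-identity is 0, and ∞ annihilates +; the sampled values are arbitrary:

```python
from itertools import product

# Mechanical check of Definition 3.1 for min-sum (R ∪ {∞}, min, +).
INF = float('inf')
oplus = min                        # marginalization operation ⊕
otimes = lambda a, b: a + b        # expansion operation ⊗
one_oplus, one_otimes = INF, 0.0   # 1_⊕ and 1_⊗

sample_vals = [0.0, 1.5, 2.0, 7.0, INF]
for a, b, c in product(sample_vals, repeat=3):
    # (I) distributive property: a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c)
    assert otimes(a, oplus(b, c)) == oplus(otimes(a, b), otimes(a, c))
for a in sample_vals:
    assert oplus(one_oplus, a) == a           # 1_⊕ is the ⊕ identity
    assert otimes(one_otimes, a) == a         # 1_⊗ is the ⊗ identity
    assert otimes(a, one_oplus) == one_oplus  # 1_⊕ annihilates ⊗
```

The same harness applies verbatim to or-and (`oplus = max`, `otimes = min` on {0, 1}) or any other finite sample of a candidate semiring; a sampled check is of course only a sanity test, not a proof.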

Example 3.2. Since Ge = (ℜ, +) and Ge = (ℜ, ×) are both abelian groups, min-sum and sum-product inference have normalized marginals. For min-sum inference this means min_{xJ} p(xJ) = 1_sum = 0. However, for min-max inference, normalized marginals are not defined.

3.1 First semiring property: distributive law

If the factor-graph is loop-free, we can use the distributive law to make inference tractable. Assuming q(xK) (or q(xk)) is the marginal of interest, form a tree with K (or k) as its root. Then, starting from the leaves, we can use the distributive law to move the ⊕ inside the ⊗ and define "messages" from the leaves towards the root as follows:

    qi→I(xi) = ⊗_{J∈∂i\I} qJ→i(xi)    (14)

    qI→i(xi) = ⊕_{x\i} ( fI(xI) ⊗ ⊗_{j∈∂I\i} qj→I(xj) )    (15)

where, for example, eq(15) defines the message from a factor I to a variable i closer to the root. By starting from the leaves and calculating the messages towards the root, we obtain the marginal over the root node as the product of the incoming messages:

    q(xk) = ⊗_{K∈∂k} qK→k(xk)

In fact, we can designate any subset of variables xA (and the factors within those variables) to be the root. Then the set of all incoming messages to A produces the marginal

    q(xA) = ( ⊗_{I⊆A} fI(xI) ) ⊗ ( ⊗_{i∈A, J∈∂i, J⊈A} qJ→i(xi) )

This procedure is known as Belief Propagation (BP), which is sometimes prefixed with the corresponding semiring – e.g., sum-product BP. Even though BP is guaranteed to produce correct answers when the factor-graph is a tree and in a few other cases [19], [20], [21], [22], it performs surprisingly well when applied as a fixed-point iteration to graphs with loops [23], [24]. Here, when the ⊗ operation has an inverse, the messages and marginals are normalized.
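On a tree, the message updates of eqs (14)-(15) reproduce the brute-force marginal exactly. A minimal sum-product sketch on a three-variable chain x0 – f_a – x1 – f_b – x2, rooted at x1, with hypothetical factor tables:

```python
from itertools import product

# Sum-product BP on a chain: leaves x0 and x2 send the ⊗-identity,
# so the factor-to-root messages, eq (15), are plain marginalizations.
vals = (0, 1)
f_a = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.7}  # f_a(x0, x1)
f_b = {(0, 0): 0.5, (0, 1): 0.6, (1, 0): 0.1, (1, 1): 0.8}  # f_b(x1, x2)

# Messages toward the root x1: ⊕ = sum over the leaf variable.
m_a = {x1: sum(f_a[(x0, x1)] for x0 in vals) for x1 in vals}
m_b = {x1: sum(f_b[(x1, x2)] for x2 in vals) for x1 in vals}

# Marginal at the root = ⊗ of incoming messages, eq (14).
bp_marginal = {x1: m_a[x1] * m_b[x1] for x1 in vals}

# Brute force: q(x1) = sum_{x0, x2} f_a(x0, x1) * f_b(x1, x2)
brute = {
    x1: sum(f_a[(x0, x1)] * f_b[(x1, x2)] for x0, x2 in product(vals, vals))
    for x1 in vals
}
assert all(abs(bp_marginal[x1] - brute[x1]) < 1e-12 for x1 in vals)
```

Replacing `sum` with `max` (and keeping `*`) turns the same code into max-product BP, which is the semiring-genericity the section emphasizes.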


3.1.1 The limits of message passing

Observing the application of the distributive law in semirings, a natural question to ask is: can we use the distributive law for polynomial-time inference on loop-free graphical models for any of the inference problems at the higher levels of the inference hierarchy – or, in general, for any inference problem with more than one marginalization operation? This question is further motivated by the fact that, when loops exist, the same scheme may become a powerful approximation technique. When we have more than one marginalization operation, a natural assumption in using the distributive law is that the expansion operation distributes over all the marginalization operations – e.g., as in min-max-sum, min-max-min and xor-or-and. Consider the simplest case with three operators ⊕^1, ⊕^2 and ⊗, where ⊗ distributes over both ⊕^1 and ⊕^2. Here the integration problem is

    q(∅) = ⊕^2_{x\I} ⊕^1_{xI} ⊗_J fJ(xJ)

In order to apply the distributive law for each pair (⊕^1, ⊗) and (⊕^2, ⊗), we need to be able to commute the ⊕^1 and ⊕^2 operations. That is, we require

    ⊕^2_{xJ} ⊕^1_{xI} g(xI∪J) = ⊕^1_{xI} ⊕^2_{xJ} g(xI∪J)    (16)
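The commutation requirement in eq (16) can be tested numerically. With arbitrary values over two binary variables, min and max already fail to commute as marginalizations, while two sum-marginalizations trivially do:

```python
# Numerical check of eq (16) for two binary variables:
# ⊕2 over xj of (⊕1 over xi of g)  vs  ⊕1 over xi of (⊕2 over xj of g).
# The table values are arbitrary.
g = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 3.0, (1, 1): 2.0}  # g(xi, xj)

def nested(outer, inner):
    # outer marginalizes xj after inner has marginalized xi
    return outer(inner(g[(xi, xj)] for xi in (0, 1)) for xj in (0, 1))

lhs = nested(max, min)                                         # max_{xj} min_{xi} g
rhs = min(max(g[(xi, xj)] for xj in (0, 1)) for xi in (0, 1))  # min_{xi} max_{xj} g
# lhs != rhs: min and max do not commute on this table.

# Two identical marginalizations commute, consistent with Theorem 3.1.
assert nested(sum, sum) == sum(g.values())
```

This is only a single counterexample, but that is all eq (16) needs: since it must hold for every table g, one failing instance rules out the pair of operations.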

Now, consider a simple case involving two binary variables xi and xj, where g(x{i,j}) is given by

    g(xi, xj):    xj = 0    xj = 1
    xi = 0          a         c
    xi = 1          b         d

Applying eq(16) to this simple case (i.e., I = {i}, J = {j}), we require (a ⊕^1 b) ⊕^2 (c ⊕^1 d) = (a ⊕^2 c) ⊕^1 (b ⊕^2 d). The following theorem leads immediately to a negative result:

Theorem 3.1. [25]:

    (a ⊕^1 b) ⊕^2 (c ⊕^1 d) = (a ⊕^2 c) ⊕^1 (b ⊕^2 d)  ∀ a, b, c, d  ⟹  ⊕^1 = ⊕^2

which implies that the direct application of the distributive law to tractably and exactly solve any inference problem at the higher levels of the inference hierarchy is unfeasible, even for tree structures. This limitation was previously known for marginal-MAP inference [17]. For the min and max operations, if we slightly change the inference problem (from pure assignments xJl ∈ XJl to a distribution over assignments, a.k.a. mixed strategies), then as a result of the celebrated minimax theorem [26] the min and max operations commute, and we can address problems with min and max marginalization operations using message-passing-like procedures. We note that [27] is indeed addressing this variation of min-max-product.

3.2 Second semiring property: identity elements

We can apply the identity and annihilator of a commutative semiring to define constraints:

Definition 3.2. A constraint is a factor fI : XI → {1_⊕, 1_⊗} whose range is limited to the identity and the annihilator of the expansion monoid.

Here, fI(x) = 1_⊕ iff x is forbidden and fI(x) = 1_⊗ iff it is permissible. A constraint satisfaction problem (CSP) is any inference problem on a semiring in which all factors are constraints. Note that this allows the definition of the "same" CSP on any commutative semiring. This has an important implication for inference on commutative semirings.

Theorem 3.2. Inference in any commutative semiring is NP-hard under randomized polynomial-time reduction.

Example 3.3. BP can be applied to approximately solve CSPs as an incomplete solver [28]. Here we may use the sum-product semiring (to define a uniform distribution over solutions) or alternatively perform BP using the or-and semiring. In fact, the procedure known as warning propagation [29] is simply BP applied to the or-and semiring.

Example 3.4. Inference on the parity-SAT semiring ({TRUE, FALSE}, xor, ∧) asks whether the number of SAT solutions is even or odd. A corollary to theorem 3.2 is that parity-SAT is NP-hard under randomized reduction, which is indeed the case [30]. If we swap the ⊗ and ⊕ operations we get another inference problem, called XOR-SAT, which is not defined on a semiring. XOR-SAT is in P, as one may use Gaussian elimination to solve it in polynomial time [31].

CONCLUSION

This paper addresses the following basic questions about inference in graphical models. (I) "What is an inference problem in a graphical model?" We use the combination of commutative semigroups and a factor-graph to answer this question in a broad sense. (II) "How difficult is inference?" By confining inference to the four operations of min, max, sum and product on subsets of the non-negative real domain, we build an inference hierarchy that organizes inference problems into complexity classes. This hierarchy reduces to the polynomial hierarchy and formulates quantified Boolean satisfiability and stochastic satisfiability as inference problems in graphical models. Moreover, we prove that inference for any commutative semiring is NP-hard under randomized reduction, which generalizes previous results for particular semirings. (III) "When does the distributive law help?" We show that the application of the distributive law in performing standard inference is limited to one marginalization operation.

REFERENCES

[1] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1-2, pp. 1–305, 2007. [Online]. Available: http://www.nowpublishers.com/product.aspx?product=MAL&doi=2200000001
[2] S. M. Aji and R. J. McEliece, "The generalized distributive law," IEEE Transactions on Information Theory, vol. 46, no. 2, pp. 325–343, 2000.
[3] S. Aji and R. McEliece, "The generalized distributive law and free energy minimization," in Allerton Conf., 2001.
[4] C. C. Pinter, A Book of Abstract Algebra. Courier Dover Publications, 2012.


[5] F. R. Kschischang and B. J. Frey, "Factor graphs and the sum-product algorithm," IEEE Transactions on Information Theory, 2001.
[6] P. Clifford, "Markov random fields in statistics," Disorder in Physical Systems, pp. 19–32, 1990.
[7] J. Pearl, "Bayesian networks: A model of self-activated memory for evidential reasoning," 1985.
[8] G. D. Forney Jr, "Codes on graphs: normal realizations," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 520–548, 2001.
[9] L. J. Stockmeyer, "The polynomial-time hierarchy," Theoretical Computer Science, vol. 3, no. 1, pp. 1–22, 1976.
[10] K. W. Wagner, "The complexity of combinatorial problems with succinct input representation," Acta Informatica, vol. 23, no. 3, pp. 325–356, 1986.
[11] H. Rogers, Theory of Recursive Functions and Effective Computability, 1987.
[12] S. E. Shimony, "Finding MAPs for belief networks is NP-hard," Artificial Intelligence, vol. 68, no. 2, pp. 399–410, 1994.
[13] S. Ravanbakhsh, C. Srinivasa, B. Frey, and R. Greiner, "Min-max problems on factor-graphs," ICML, 2014.
[14] M. L. Littman, S. M. Majercik, and T. Pitassi, "Stochastic boolean satisfiability," Journal of Automated Reasoning, vol. 27, no. 3, pp. 251–296, 2001.
[15] D. Roth, "On the hardness of approximate reasoning," in Proceedings of the 13th International Joint Conference on Artificial Intelligence, vol. 1. Morgan Kaufmann Publishers Inc., 1993, pp. 613–618.
[16] C. H. Papadimitriou, Computational Complexity. John Wiley and Sons Ltd., 2003.
[17] J. D. Park and A. Darwiche, "Complexity results and approximation strategies for MAP explanations," J. Artif. Intell. Res. (JAIR), vol. 21, pp. 101–133, 2004.
[18] S. Toda, "PP is as hard as the polynomial-time hierarchy," SIAM Journal on Computing, vol. 20, no. 5, pp. 865–877, 1991.
[19] S. M. Aji, G. B. Horn, and R. J. McEliece, "On the convergence of iterative decoding on graphs with a single cycle," in Proc. 1998 IEEE Int. Symp. Information Theory, 1998.
[20] Y. Weiss, "Correctness of belief propagation in Gaussian graphical models of arbitrary topology," Neural Computation, 2001. [Online]. Available: http://www.mitpressjournals.org/doi/abs/10.1162/089976601750541769
[21] M. Bayati, D. Shah, and M. Sharma, "Maximum weight matching via max-product belief propagation," in ISIT. IEEE, 2005, pp. 1763–1767.
[22] A. Weller and T. S. Jebara, "On MAP inference by MWSS on perfect graphs," arXiv preprint arXiv:1309.6872, 2013.
[23] K. Murphy, Y. Weiss, and M. Jordan, "Loopy-belief propagation for approximate inference: An empirical study," in Proc. Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI), 1999, pp. 467–475.
[24] R. G. Gallager, "Low-density parity-check codes," IRE Transactions on Information Theory, vol. 8, no. 1, pp. 21–28, 1962.
[25] B. Eckmann and P. J. Hilton, "Group-like structures in general categories I: multiplications and comultiplications," Mathematische Annalen, vol. 145, no. 3, pp. 227–255, 1962.
[26] J. Von Neumann and O. Morgenstern, Theory of Games and Economic Behavior (60th Anniversary Commemorative Edition). Princeton University Press, 2007.
[27] M. Ibrahimi, A. Javanmard, Y. Kanoria, and A. Montanari, "Robust max-product belief propagation," in Signals, Systems and Computers (ASILOMAR), 2011 Conference Record of the Forty-Fifth Asilomar Conference on. IEEE, 2011, pp. 43–49.
[28] H. A. Kautz, A. Sabharwal, and B. Selman, "Incomplete algorithms," 2009.
[29] A. Braunstein, M. Mezard, M. Weigt, and R. Zecchina, "Constraint satisfaction by survey propagation," Physics, no. 2, p. 8, 2002.
[30] L. G. Valiant and V. V. Vazirani, "NP is as easy as detecting unique solutions," Theoretical Computer Science, vol. 47, pp. 85–93, 1986.
[31] T. J. Schaefer, "The complexity of satisfiability problems," in Proceedings of the Tenth Annual ACM Symposium on Theory of Computing. ACM, 1978, pp. 216–226.
[32] L. G. Valiant, "The complexity of computing the permanent," Theoretical Computer Science, vol. 8, no. 2, pp. 189–201, 1979.

APPENDIX

Proof: (claim 2.1) We show this for an integration problem. Since the integral can be obtained from a polynomial-sized marginal, an exponential representation of the integral implies an exponential representation of the marginal. For example, if q(x_i) has a polynomial representation in N, then the same should hold for q(∅) = ⊕_{x_i} q(x_i). To see why this integral has an exponential representation in N, consider its simplified form

    q(∅) = ∏_{x_I ∈ X_I} q(x_I)

where q(x_I) is the result of inference up to the last marginalization step, which is product, and X_I grows exponentially with N. Since q(x_I) for each x_I ∈ X_I has a constant size, say c, the size of q(∅) is

    log(q(∅)) = Σ_{x_I} c = c |X_I|

which is exponential in N. The exponential representation of the query also means the corresponding computation is outside PSPACE.

Proof: (proposition 2.2) To show that these inference problems are in P, we provide polynomial-time algorithms for them:

• sum-sum is short for

    q(∅) = Σ_x Σ_I f_I(x_I)

which asks for the sum, over all assignments x ∈ X, of the sum of all the factors. It is easy to see that each factor value f_I(x_I), ∀ I, x_I, is counted |X_{\I}| times in the summation above. Therefore we can rewrite the integral above as

    q(∅) = Σ_I Σ_{x_I} |X_{\I}| f_I(x_I)

where the new form involves a polynomial number of terms and is therefore easy to calculate.

• min-min (similarly for max-max) is short for

    q(∅) = min_x min_I f_I(x_I)

where the query seeks the minimum achievable value of any factor. We can obtain this by scanning the ranges of all factors and reporting the minimum value in polynomial time.

Proof: (theorem 2.3) Given x, it is easy to verify the decision problem, so max-min-decision belongs to NP. To show NP-completeness, we reduce 3-SAT to a max-min inference problem such that the 3-SAT instance is satisfiable iff the max-min value is q(∅) ≥ 1, and unsatisfiable otherwise. Simply define one factor per clause of the 3-SAT instance, such that f_I(x_I) = 1 if x_I satisfies the clause, and any number less than one otherwise. With this construction, the max-min value max_x min_{I∈F} f_I(x_I) is one iff the original SAT problem was satisfiable; otherwise it is less than one. This reduces 3-SAT to max-min-decision.

Proof: (theorem 2.4) Recall that PP is the class of problems that are polynomially solvable using a non-deterministic Turing machine, where the acceptance condition is that the majority of paths accept. To see that deciding Σ_x min_I f_I(x_I) ≥ q is in PP, enumerate all x ∈ X non-deterministically and for each assignment calculate min_I f_I(x_I) in polynomial time (where


each path accepts iff min_I f_I(x_I) = 1) and accept iff at least q of the paths accept.

Given a matrix A ∈ {0,1}^{N×N}, the problem of calculating its permanent

    perm(A) = Σ_{z ∈ S_N} ∏_{i=1}^{N} A_{i,z_i}

where S_N is the set of permutations of 1,...,N, is #P-complete, and the corresponding decision problem is PP-complete [32]. To prove theorem 2.4, we reduce the problem of calculating the permanent to sum-min integration in a factor-graph. There are different ways of constructing this factor-graph. Here, we introduce a binary variable model, where each factor is exponential in N but due to sparsity has a polynomial representation. Let x = {x_{1:1},...,x_{1:N},...,x_{N:N}} ∈ {0,1}^{N·N} be a set of binary variables, one for every element of A. The following constraint factors ensure that exactly one x_{i:j} in each row and each column is non-zero. In the following, 1(cond) is the indicator function, equal to one if cond = TRUE and zero otherwise.

• row uniqueness factor:

    f_{i,:}(x_{i,:}) = 1(Σ_j x_{i:j} = 1)   ∀ 1 ≤ i ≤ N

• column uniqueness factor:

    f_{:,j}(x_{:,j}) = 1(Σ_i x_{i:j} = 1)   ∀ 1 ≤ j ≤ N

• a local factor takes A into account:

    f_{i,j}(x_{i:j}) = 1(x_{i:j} = 1 → A_{i,j} = 1)

Note that the range of all the factors is in {0,1}. Moreover, min_I f_I(x_I) = 1 iff all the uniqueness conditions are true and x_{i:j} = 1 → A_{i,j} = 1; otherwise min_I f_I(x_I) = 0. But this means min_I f_I(x_I) is equivalent to ∏_{i=1}^{N} A_{i,z_i} for a permutation z, and therefore

    Σ_x min_I f_I(x_I) = Σ_{z ∈ S_N} ∏_{i=1}^{N} A_{i,z_i} = perm(A)

This reduces the problem of calculating the permanent to that of sum-min inference. Using this reduction, the corresponding decision version, which asks whether perm(A) ≤ q (and is PP-complete), also reduces to deciding q(∅) ≤ q.

Proof: (theorem 2.5) To prove that a problem is PSPACE-complete, we have to show that 1) it is in PSPACE and 2) a PSPACE-complete problem reduces to it. We already saw in section 2.3 that QSAT, which is PSPACE-complete, reduces to the inference hierarchy. It is also not difficult to show that the inference hierarchy is contained in PSPACE. Let

    q(x_{J^0}) = ⊕^M_{x_{J^M}} ⊕^{M−1}_{x_{J^{M−1}}} ... ⊕^1_{x_{J^1}} ⊗_I f_I(x_I)

be any inference problem in the hierarchy. We can simply iterate over all values of z ∈ X in nested loops or using a recursion. Let (i) : {1,...,N} → {1,...,M} be the index of the marginalization that involves x_i – that is, i ∈ J^{(i)}. Moreover, let i_1,...,i_N be an ordering of the variable indices such that (i_k) ≤ (i_{k+1}). Algorithm 1 uses this notation to demonstrate this procedure using nested loops. Note that here we loop over individual domains X_{i_k} rather than X_{J^m}, and track only temporary tuples q_{i_k}, so that the space complexity remains polynomial in N.

input : ⊕^M_{x_{J^M}} ⊕^{M−1}_{x_{J^{M−1}}} ... ⊕^1_{x_{J^1}} ⊗_I f_I(x_I)
output: q(x_{J^0})
for each z_{J^0} ∈ X_{J^0} do        // loop over the query domain
    for each z_{i_N} ∈ X_{i_N} do    // loop over X_{i_N}
        ...
        for each z_{i_1} ∈ X_{i_1} do    // loop over X_{i_1}
            q_1(z_{i_1}) := ⊗_I f_I(z_I);
        end
        q_2(z_{i_2}) := ⊕^{(i_1)}_{x_{i_1}} q_1(x_{i_1})
        ...
        q_N(z_{i_N}) := ⊕^{(i_{N−1})}_{x_{i_{N−1}}} q_{N−1}(x_{i_{N−1}})
    end
    q(z_{J^0}) := ⊕^{(i_N)}_{x_{i_N}} q_N(x_{i_N})
end
Algorithm 1: inference in PSPACE
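The nested-loop procedure of Algorithm 1 can be sketched in code. The following minimal recursive version is ours, not from the paper: the toy factors, variable names, and operator choices are all illustrative. One marginalization operator binds each variable, outermost first, and only a single partial assignment lives in memory at a time, so space stays polynomial while time is exponential in the number of variables.

```python
# Toy instance of nested-loop inference (recursive form): every name and
# factor below is an illustrative assumption, not from the paper.

factors = [  # each factor: (scope, table), scope = tuple of variable indices
    ((0, 1), {(a, b): 1.0 + a * b for a in (0, 1) for b in (0, 1)}),
    ((1,),   {(b,): 2.0 - b for b in (0, 1)}),
]

def joint(assign):
    """The combination of all factors (here ordinary product) at a full assignment."""
    p = 1.0
    for scope, table in factors:
        p *= table[tuple(assign[v] for v in scope)]
    return p

def infer(ops, domains):
    """ops[k] is the marginalization operator binding variable k (outer to inner).
    Only one partial assignment is kept per recursion level: polynomial space."""
    def rec(level, assign):
        if level == len(ops):
            return joint(assign)
        return ops[level](rec(level + 1, assign + [z]) for z in domains[level])
    return rec(0, [])

# q(∅) = max_{x0} sum_{x1} f_{01}(x0, x1) * f_1(x1)
q = infer([max, sum], [(0, 1), (0, 1)])   # = max(3.0, 4.0) = 4.0
```

Swapping the operator list, e.g. `[min, sum]` or `[max, sum, min]`, changes which problem in the hierarchy is solved without touching the rest of the code.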

Proof: (theorem 3.2) To prove that inference in any semiring S = (Y*, 1⊕, 1⊗), where 1⊕ and 1⊗ denote the additive and multiplicative identities, is NP-hard under randomized polynomial reduction, we reduce "unique satisfiability" (USAT) to an inference problem on any semiring. USAT is a so-called "promise problem" that asks whether a satisfiability problem, promised to have either zero or one satisfying assignment, is satisfiable. [30] prove that a polynomial-time randomized algorithm (RP) for USAT implies RP = NP.

For this reduction, consider a set of binary variables x ∈ {0,1}^N, one per variable in the given instance of USAT. For each clause, define a constraint factor f_I such that f_I(x_I) = 1⊗ if x_I satisfies that clause and f_I(x_I) = 1⊕ otherwise. This means x is a satisfying assignment for USAT iff q(x) = ⊗_I f_I(x_I) = 1⊗. If the instance is unsatisfiable, the integral is q(∅) = ⊕_x 1⊕ = 1⊕ (by definition of 1⊕). If the instance is satisfiable, there is only a single assignment x* for which q(x*) = 1⊗, and therefore the integral evaluates to 1⊗. Therefore we can decide the satisfiability of USAT by performing inference on any semiring, relying only on the properties of the identities. The satisfying assignment can be recovered using a decimation procedure, assuming access to an oracle for inference on the semiring.
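As a sketch of this construction, the following exercises the clause-factor reduction on one concrete commutative semiring, min-plus, where ⊕ = min with identity 1⊕ = ∞ and ⊗ = + with identity 1⊗ = 0. Satisfiability is decided purely by checking which identity the integral returns. The semiring dictionary, the DIMACS-style signed-integer clause encoding, and all names are our illustrative choices, not from the paper.

```python
from itertools import product

# One arbitrary semiring choice: min-plus. 1⊕ = inf is absorbing under ⊗ = +.
INF = float("inf")
semiring = {"plus": min, "times": lambda a, b: a + b,
            "one_plus": INF, "one_times": 0.0}

def clause_factor(clause, assign):
    """Constraint factor f_I: 1⊗ if the clause is satisfied, 1⊕ otherwise."""
    satisfied = any(assign[abs(lit) - 1] == (lit > 0) for lit in clause)
    return semiring["one_times"] if satisfied else semiring["one_plus"]

def integral(clauses, n_vars):
    """q(∅) = ⊕_x ⊗_I f_I(x_I): equals 1⊗ iff some assignment satisfies all clauses."""
    q = semiring["one_plus"]
    for bits in product((False, True), repeat=n_vars):
        q_x = semiring["one_times"]
        for clause in clauses:
            q_x = semiring["times"](q_x, clause_factor(clause, bits))
        q = semiring["plus"](q, q_x)
    return q

sat_q = integral([[1, -2], [2]], 2)   # satisfiable by x1 = x2 = True: q(∅) = 1⊗ = 0.0
unsat_q = integral([[1], [-1]], 1)    # unsatisfiable: q(∅) = 1⊕ = inf
```

Because the decision depends only on whether q(∅) equals 1⊕ or 1⊗, the same driver should work unchanged for any other commutative semiring with distinct identities, e.g. the Boolean (∨, ∧) semiring.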