Approximation Algorithms for Probabilistic Decoding

Irina Rish, Kalev Kask and Rina Dechter


Department of Information and Computer Science, University of California, Irvine

firinar,kkask,[email protected]

Abstract

It was recently shown that the problem of decoding messages transmitted through a noisy channel can be formulated as a belief updating task over a probabilistic network [13]. Moreover, it was observed that iterative application of the (linear-time) belief propagation algorithm designed for polytrees [14] outperformed state-of-the-art decoding algorithms, even though the corresponding networks may have many cycles. Much of the recent research in coding focuses on explaining this phenomenon. This paper demonstrates empirically that an approximation algorithm, approx-mpe, for solving the most probable explanation (MPE) problem, developed within the recently proposed mini-bucket elimination framework [3], outperforms iterative belief propagation on classes of coding networks that have bounded induced width. Our experiments suggest that decoders based on finding an MPE assignment (block-wise decoders) can be superior to the commonly used belief updating decoders (bit-wise decoders).

1 Introduction

In this paper we evaluate the quality of a recently proposed approximation scheme, called mini-bucket, for probabilistic decoding. Recently, a class of parameterized approximation algorithms based on the bucket-elimination framework was proposed and analyzed [3]. The approximation scheme uses as a controlling parameter a bound on the size of the probabilistic functions created during variable elimination, allowing a trade-off between accuracy and efficiency [7, 5]. The algorithms were presented and analyzed for several tasks, such as finding the most probable explanation (mpe), finding the maximum a posteriori hypothesis (map), and belief updating. Encouraging results were obtained on randomly generated noisy-or networks and on the CPCS networks [15]. Clearly, more testing is necessary to determine the regions of applicability of this class of algorithms; in particular, testing on realistic applications is mandatory. When it was recently observed that the problem of decoding noisy messages can be described as a belief updating task over a probabilistic network, we decided to use this domain as our next benchmark.

This domain was particularly interesting because it was observed that iterative application of the (linear-time) belief propagation algorithm (IBP), designed for probabilistic polytrees [14], outperformed state-of-the-art decoding algorithms. Since the belief propagation algorithm was designed for singly-connected networks, and since the probabilistic networks that correspond to coding instances may have many cycles, this empirical observation is quite intriguing. Indeed, recent research in coding centers on explaining this phenomenon.

In this paper we evaluate and compare the performance of the mini-bucket approximation algorithms and the Iterative Belief Propagation (IBP) algorithm on several classes of coding-like networks, varying the length of the input codeword, the family size of the coding belief network, and the channel noise. The algorithms tested are iterative belief propagation, the exact bucket-elimination algorithm elim-mpe for computing the mpe, the exact bucket-elimination algorithm elim-map for computing the maximum a posteriori hypothesis, and approx-mpe(i), the mini-bucket approximation algorithm for finding the mpe. For each simulation we computed the bit error rate (BER), the performance measure commonly used by the coding community, namely, the fraction of incorrectly decoded information bits. We ran our experiments on several classes of parity-check coding networks, including structured parity-check codes, Hamming codes, and randomly generated parity-check codes. On structured parity-check codes, our mini-bucket approximation algorithm significantly outperformed iterative belief propagation; on the Hamming codes and random parity-check codes the IBP algorithm was better. As expected, exact elim-mpe was the most accurate algorithm in all cases. However, on random networks running exact elim-mpe was not feasible due to the large induced width (as is well known, variable-elimination algorithms are exponential in the induced width of the network). Our experiments suggest that decoders based on finding an MPE assignment (block-wise decoders) can sometimes be superior to the commonly used belief updating decoders (bit-wise decoders).

* This work was partially supported by NSF grant IRI-9157636, by Air Force Office of Scientific Research grant AFOSR 900136, and by Rockwell International and Amada of America.

2 Notation and definitions

Definition 1 [graph concepts]: A directed graph is a pair G = (V, E), where V = {X_1, ..., X_n} is a set of elements and E = {(X_i, X_j) | X_i, X_j in V} is the set of edges. If (X_i, X_j) is in E, we say that X_i points to X_j. For each variable X_i, pa(X_i), or pa_i, is the set of variables pointing to X_i in G, while the set of child nodes of X_i, denoted ch(X_i), comprises the variables that X_i points to. The family of X_i, F_i, includes X_i and its child variables. A directed graph is acyclic if it has no directed cycles. An ordered graph is a pair (G, d), where G is an undirected graph and d = X_1, ..., X_n is an ordering of the nodes. The width of a node in an ordered graph is the number of the node's neighbors that precede it in the ordering. The width of an ordering d, denoted w(d), is the maximum width over all nodes. The induced width of an ordered graph, w*(d), is the width of the induced ordered graph obtained by processing the nodes from last to first; when node X is processed, all its neighbors that precede it in the ordering are connected. The induced width of a graph, w*, is the minimal induced width over all its orderings; it is also known as the tree-width [1]. For more information see [8, 6]. A polytree is an acyclic directed graph whose underlying undirected graph (ignoring the arrows) has no loops. The moral graph of a directed graph G is the undirected graph obtained by connecting the parents of every node in G and removing the arrows.
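To make the width and induced-width definitions concrete, the following small sketch (our own illustration, not part of the paper) computes both quantities for an undirected graph given as an adjacency map; the 4-cycle example and all names are invented for the illustration.

    # Width and induced width of an ordered graph (Definition 1), minimal sketch.
    def width_and_induced_width(adj, order):
        """adj: dict mapping node -> set of neighbors; order: list of nodes."""
        pos = {x: i for i, x in enumerate(order)}

        # Width of the ordering: max number of earlier neighbors over all nodes.
        width = max(sum(1 for y in adj[x] if pos[y] < pos[x]) for x in order)

        # Induced width: process nodes from last to first, connecting the
        # earlier neighbors of each processed node to each other.
        induced = {x: set(nbrs) for x, nbrs in adj.items()}
        induced_width = 0
        for x in reversed(order):
            earlier = [y for y in induced[x] if pos[y] < pos[x]]
            induced_width = max(induced_width, len(earlier))
            for a in earlier:
                for b in earlier:
                    if a != b:
                        induced[a].add(b)
        return width, induced_width

    # Example: a 4-cycle A-B-C-D-A under the ordering A, B, C, D.
    adj = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "A"}}
    print(width_and_induced_width(adj, ["A", "B", "C", "D"]))  # prints (2, 2)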

Definition 2 [belief networks]: Let X = {X_1, ..., X_n} be a set of random variables over multi-valued domains D_1, ..., D_n. A belief network (BN) is a pair (G, P), where G is a directed acyclic graph and P = {P_i}, with P_i = {P(X_i | pa_i)} the conditional probability matrices associated with X_i. An assignment (X_1 = x_1, ..., X_n = x_n) can be abbreviated to x = (x_1, ..., x_n). The BN represents the probability distribution P(x_1, ..., x_n) = Π_{i=1}^{n} P(x_i | x_{pa(X_i)}), where x_S is the projection of x onto a subset of variables S. An evidence set e is an instantiated subset of variables.

Given a belief network and evidence, the following tasks can be defined. The most probable explanation (mpe) task is to find an assignment x^o = (x^o_1, ..., x^o_n) such that P(x^o) = max_x P(x | e). The belief assessment of X_i = x_i is to find bel(x_i) = P(X_i = x_i | e). Given a set of hypothesized variables A = {A_1, ..., A_k}, A a subset of X, the maximum a posteriori hypothesis (map) task is to find an assignment a^o = (a^o_1, ..., a^o_k) such that P(a^o) = max_{a_k} Σ_{X−A} P(x | e), where a = {A_1 = a_1, ..., A_k = a_k}.
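As a toy illustration of the three tasks just defined, the following sketch (our own; the two-variable network and its numbers are invented) computes the mpe, the belief in a variable, and a map assignment by brute-force enumeration.

    # Brute-force mpe, belief, and map on a two-variable network A -> B with
    # evidence B = 1.  All probabilities below are invented for the example.
    P_A = {0: 0.6, 1: 0.4}                      # P(A)
    P_B_given_A = {0: {0: 0.9, 1: 0.1},         # P(B | A=0)
                   1: {0: 0.3, 1: 0.7}}         # P(B | A=1)

    def joint(a, b):
        return P_A[a] * P_B_given_A[a][b]

    evidence_b = 1
    # mpe: most probable full assignment consistent with the evidence.
    mpe = max(((a, evidence_b) for a in (0, 1)), key=lambda ab: joint(*ab))
    # belief assessment of A: P(A = a | B = 1), obtained by normalization.
    z = sum(joint(a, evidence_b) for a in (0, 1))
    belief_A = {a: joint(a, evidence_b) / z for a in (0, 1)}
    # map over the hypothesis set {A}: here it maximizes the belief in A.
    map_A = max(belief_A, key=belief_A.get)

    print(mpe, belief_A, map_A)   # (1, 1), {0: ~0.176, 1: ~0.824}, 1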

3 Noisy channel coding

The goal of channel coding is to provide reliable communication through a noisy channel. A commonly used Additive White Gaussian Noise (AWGN) channel model implies that independent Gaussian noise with variance σ² is added to each transmitted bit, thus introducing errors into the input signal, namely, a vector of information bits U = (u_1, ..., u_K), u_k in {0, 1}. The error can be decreased by transmitting additional code bits X = (x_1, ..., x_m), obtained from the input signal using some encoding scheme. A systematic error-correcting encoding [13] is a mapping C : U -> (U, X). The transmitted vector (U, X) of length N = K + m is called the codeword. The vector Y obtained by transmitting the codeword through the channel is called the channel output. The ratio R = K/N, called the code rate, is the ratio of information bits to the total number of bits transmitted through the channel. The decoding task is to correctly restore the input information bits from the channel output Y. The common decoding error measure is the bit error rate (BER), i.e. the percentage of incorrectly decoded bits. The goal of coding is to reduce the BER as much as possible while keeping the code rate relatively low. Given the channel noise level σ and a fixed code rate R, there is a theoretical limit on the performance of a decoder, called Shannon's limit, that yields the smallest achievable BER. Unfortunately, Shannon's results [17] are non-constructive; he proved the existence of codes that achieve the limit, but did not provide such codes. Finding a code that achieves Shannon's limit is still an open problem. Also, "as far as is known, a code with a low-complexity optimal decoding algorithm cannot achieve high performance" [13]. Therefore, efficient approximate decoding algorithms are needed. Recently, several low-error coding schemes have been proposed, such as turbo codes [9], low-density parity-check codes [11] and low-density generator matrix codes [2], that achieve near-Shannon-limit performance. This is considered "the most exciting and potentially important development in coding theory in many years" [13]. It was also shown that the decoding problem can be formulated as probabilistic inference over a belief network, and that the decoding algorithms used by all these codes correspond to iterative application of the belief propagation algorithm developed by Pearl [14]. Since Pearl's algorithm is guaranteed to give an exact answer only on singly-connected networks, it is quite surprising that applying it to the multiply-connected coding networks yields such good results. A satisfactory theoretical explanation of this fact is still lacking; some initial analysis on particular network classes was done by [18].

We experimented with several types of parity-check codes. Parity-check codes compute a vector X of parity-check bits, given the information vector U, using a parity-check matrix H. The codeword X' = (U, X) must satisfy the equation HX' = 0, where summation modulo two is used in the matrix multiplication. For example, the parity-check matrix for the Hamming code with N=7 and K=4 is

    1 1 0 1 1 0 0
    1 0 1 1 0 1 0
    0 1 1 1 0 0 1
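The parity-check matrix above is in systematic form (its last three columns form an identity), so each parity bit is a mod-2 sum of information bits selected by one row. The following sketch, our own illustration built on that matrix, encodes four information bits and verifies the condition H x' = 0 (mod 2).

    # Systematic (7,4) Hamming encoding and codeword check, illustrative only.
    H = [[1, 1, 0, 1, 1, 0, 0],
         [1, 0, 1, 1, 0, 1, 0],
         [0, 1, 1, 1, 0, 0, 1]]

    def encode(u):
        """u: list of K=4 information bits -> codeword of N=7 bits (u, x)."""
        # Each parity bit is the mod-2 sum of the information bits selected by
        # the first four entries of one row of H.
        x = [sum(h * b for h, b in zip(row[:4], u)) % 2 for row in H]
        return u + x

    def check(codeword):
        """True iff H * codeword = 0 (mod 2)."""
        return all(sum(h * c for h, c in zip(row, codeword)) % 2 == 0 for row in H)

    for u in ([0, 0, 0, 0], [1, 0, 1, 1], [1, 1, 1, 1]):
        cw = encode(u)
        print(u, "->", cw, check(cw))   # every generated codeword passes the check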

Given m, the number of parity-check bits, the parity-check matrix H contains m rows and 2^m − 1 columns, representing all possible 0-1 tuples except the all-zero tuple. Therefore, the corresponding codeword length N is 2^m − 1, and the length of the information vector, K, is 2^m − 1 − m [12]. A more general class of parity-check codes is called generator matrix codes [13]. Given a generator matrix G, the code is defined by the equation GU = X; note that no restrictions on K and N are imposed. The Hamming code can be viewed as a particular type of such codes, where the generator matrix can be easily obtained from the Hamming parity-check matrix. A code can be represented by a belief network whose nodes represent the bits in the U, X, and Y vectors. A belief network for a structured parity-check code with K=5 and N=10 is shown in Figure 1. Nodes u_i, i = 1, ..., K, correspond to the input bits, nodes x_i, i = 1, ..., m, correspond to the parity-check bits, and nodes y^u_i and y^x_i correspond to the output bits. The generator matrix determines the set of parents of each parity-check bit and the corresponding (deterministic) conditional probability function. Each output bit y has only one parent, the corresponding codeword bit x, and the conditional probability density P(y | x = μ) is the Gaussian distribution N(μ, σ), where μ is 0 or 1.
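To connect the Gaussian output model to the BER measure defined earlier, here is a small simulation sketch (our own illustration, not one of the decoders studied in this paper): code bits are sent as 0/1 through an AWGN channel and a naive threshold decoder is scored by its bit error rate.

    # AWGN channel plus a naive per-bit threshold decoder, illustrative only.
    import random

    def transmit(bits, sigma):
        return [b + random.gauss(0.0, sigma) for b in bits]

    def threshold_decode(received):
        return [1 if y > 0.5 else 0 for y in received]

    def bit_error_rate(sent, decoded):
        return sum(s != d for s, d in zip(sent, decoded)) / len(sent)

    random.seed(0)
    bits = [random.randint(0, 1) for _ in range(10000)]
    for sigma in (0.3, 0.5, 0.7):
        ber = bit_error_rate(bits, threshold_decode(transmit(bits, sigma)))
        print(f"sigma={sigma}: BER ~ {ber:.4f}")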


Figure 1: Structured parity-check code with K=5, P=3.

Note that another interpretation of generator matrix codes is to view each parity-check bit x_i as the result of a binary XOR operation (i.e., summation modulo two) computed sequentially over its parents in the network. Such a structure is called causal independence and can be exploited in probabilistic inference [10, 19, 16]; however, this property will not be exploited here. The purpose of decoding is to find an assignment to the information bits that matches the original input. Namely, probabilistic decoding can be formulated as solving an optimization task on a belief network. There are two approaches to decoding, which we will call bit-wise and block-wise decoding. The first aims at finding the most probable assignment to each information bit separately, while the second looks for the most probable assignment to the whole information sequence (block), given the observed signal. The decoding task is then either

1. (bit-wise decoding) to find, for each k, 1 <= k <= K,

   u'_k = arg max_{u_k} P(u_k | Y),

   where Y is the observed output, or

2. (block-wise decoding) to find a most probable assignment (also called a maximum a posteriori probability, or MAP, assignment) to the input information bits given the output,

   U' = arg max_U P(U | Y),

   or, alternatively, to find a most probable explanation (MPE) assignment to all input bits,

   (U', X') = arg max_{U,X} P(U, X | Y).

Note that the bit-wise task requires finding the posterior probability of each information bit, and can be solved by belief updating algorithms. The block-wise task can be solved by the elim-map and elim-mpe algorithms, or by the corresponding mini-bucket approximations, described next.
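The following toy sketch (our own construction, with an invented K=2 code whose single parity bit is u1 XOR u2) contrasts the two criteria by exhaustive enumeration: block-wise decoding picks the jointly most probable information block, while bit-wise decoding maximizes each bit's marginal posterior.

    # Bit-wise vs. block-wise decoding by enumeration, illustrative only.
    import math
    from itertools import product

    SIGMA = 0.5

    def likelihood(y, bit, sigma=SIGMA):
        # Gaussian density (up to a constant) of observing y when 'bit' was sent.
        return math.exp(-(y - bit) ** 2 / (2 * sigma ** 2))

    def posterior_table(y_obs):
        """Unnormalized P(u1, u2 | Y), assuming a uniform prior over the inputs."""
        table = {}
        for u1, u2 in product((0, 1), repeat=2):
            codeword = (u1, u2, u1 ^ u2)
            table[(u1, u2)] = math.prod(likelihood(y, b) for y, b in zip(y_obs, codeword))
        return table

    y_obs = (0.9, 0.2, 0.4)      # a noisy observation of the codeword for (1, 0)
    table = posterior_table(y_obs)

    # Block-wise: the single most probable information block.
    block = max(table, key=table.get)

    # Bit-wise: maximize each bit's marginal posterior separately.
    bitwise = tuple(
        max((0, 1), key=lambda v: sum(p for u, p in table.items() if u[k] == v))
        for k in range(2))

    print("block-wise:", block, " bit-wise:", bitwise)

On longer blocks the two criteria can disagree, which is exactly the distinction between the bit-wise and block-wise decoders compared in the experiments below.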


4 Algorithms

In this section we present the algorithms to be evaluated in the subsequent sections. We start with a brief overview of the bucket-elimination framework and present algorithm elim-mpe for computing the mpe and algorithm approx-mpe(i), an approximation algorithm for the same task. Finally, we present iterative belief propagation (IBP), which will be compared to the first two.

4.1 Bucket-elimination algorithms

Bucket elimination was introduced recently as a unifying algorithmic framework for variable-elimination algorithms for probabilistic and deterministic reasoning [4]. In particular, it was shown that many algorithms for probabilistic inference, such as belief updating, finding the most probable explanation, finding the maximum a posteriori hypothesis, and calculating the maximum expected utility, can be expressed as bucket-elimination algorithms [3]. Normally, the input to a bucket-elimination algorithm consists of a knowledge base specified by a collection of functions or relations (e.g., clauses for propositional satisfiability, constraints, or conditional probability matrices for belief networks). Given a variable ordering, the algorithm partitions the functions into buckets, each associated with a single variable, and processes the buckets in reverse order, from the last variable to the first. Each bucket is processed by a variable-elimination procedure over the functions in the bucket, and the resulting new function is placed in a lower bucket. The performance of bucket-elimination algorithms is time and space exponential in the induced width (also known as tree-width) of the problem's interaction graph [8, 1]. Depending on the variable ordering, the induced width will vary, and this leads to different performance guarantees.

Figure 2 shows elim-mpe, the bucket-elimination algorithm for computing the mpe [3]. Given a variable ordering, the conditional probability matrices are partitioned into buckets, where each matrix is placed in the bucket of the highest variable it mentions. When processing the bucket of X_p, a new function is generated by taking the maximum relative to X_p over the product of the functions in that bucket. The resulting function is placed in the appropriate lower bucket. The complexity of the algorithm, which is determined by the complexity of processing each bucket (step 2), is exponential in the number of variables in the bucket. This number is captured by the induced width of the ordering, yielding time and space complexity exponential in the induced width w* of the network's moral graph [3, 6]. For example, the induced width of the network in Figure 4 after moralization is 3. Clearly, in the moral graph all the variables of a given family are fully connected; consequently, the induced width is at least as large as the largest parent set, m. If we want to distinguish the complexity of processing a network from the complexity of specifying it, we should compare w* to m.
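The formal algorithm appears in Figure 2 below. As a concrete illustration, the following simplified sketch (our own; it assumes binary variables and omits evidence handling and the forward assignment-generation step) implements the backward max-product phase of bucket elimination and returns the mpe value.

    # A simplified bucket-elimination sketch for the mpe value.  Factors are
    # dicts mapping assignment tuples (over their scope) to numbers.
    from itertools import product

    def max_out(factor, scope, var):
        """Maximize 'factor' over 'var'; returns (new_factor, new_scope)."""
        idx = scope.index(var)
        new_scope = [v for v in scope if v != var]
        out = {}
        for assign, val in factor.items():
            key = tuple(a for i, a in enumerate(assign) if i != idx)
            out[key] = max(out.get(key, 0.0), val)
        return out, new_scope

    def multiply(factors_with_scopes):
        """Pointwise product of factors over the union of their scopes (binary vars)."""
        scope = sorted({v for _, s in factors_with_scopes for v in s})
        out = {}
        for assign in product((0, 1), repeat=len(scope)):
            value = 1.0
            for f, s in factors_with_scopes:
                value *= f[tuple(assign[scope.index(v)] for v in s)]
            out[assign] = value
        return out, scope

    def elim_mpe_value(cpts, order):
        """cpts: list of (factor, scope) pairs.  Returns max_x prod_i P_i(x)."""
        buckets = {v: [] for v in order}
        for f, s in cpts:                       # each factor goes to the bucket
            buckets[max(s, key=order.index)].append((f, s))  # of its highest var
        result = 1.0
        for v in reversed(order):               # process buckets last to first
            if not buckets[v]:
                continue
            prod, scope = multiply(buckets[v])
            h, new_scope = max_out(prod, scope, v)
            if new_scope:                       # send h down to a lower bucket
                buckets[max(new_scope, key=order.index)].append((h, new_scope))
            else:                               # h is a constant: fold it in
                result *= h[()]
        return result

    # Example: chain A -> B -> C with P(A), P(B|A), P(C|B); numbers are invented.
    P_A  = ({(0,): 0.6, (1,): 0.4}, ["A"])
    P_BA = ({(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7}, ["A", "B"])
    P_CB = ({(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.4, (1, 1): 0.6}, ["B", "C"])
    print(elim_mpe_value([P_A, P_BA, P_CB], ["A", "B", "C"]))  # 0.6*0.9*0.8 = 0.432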

Algorithm elim-mpe
Input: A belief network BN = {P_1, ..., P_n}; an ordering of the variables, d; observations e.
Output: The most probable assignment.
1. Initialize: Partition BN into bucket_1, ..., bucket_n, where bucket_i contains all matrices h_1, ..., h_j whose highest variable is X_i. Put each observed variable into its appropriate bucket. Let S_1, ..., S_j be the subsets of variables in the processed bucket on which the matrices (new or old), h_1, ..., h_j, are defined.
2. Backward: For p = n downto 1, do for h_1, h_2, ..., h_j in bucket_p, do:
   - (bucket with observed variable) If bucket_p contains X_p = x_p, assign X_p = x_p to each h_i and put each resulting function into its appropriate bucket.
   - Else, generate the functions h^p = max_{X_p} Π_{i=1}^{j} h_i and x^o_p = argmax_{X_p} h^p. Add h^p to the bucket of the largest-index variable in U_p = (∪_{i=1}^{j} S_i) − {X_p}.
3. Forward: Assign values in the ordering d, using the recorded functions x^o in each bucket.

Figure 2: Algorithm elim-mpe

4.2 Mini-bucket algorithms

Since processing a bucket may create functions of much larger arity than the input functions, we proposed in [7] to approximate these functions by a collection of smaller-arity functions. Let h_1, ..., h_j be the functions in the bucket of X_p, and let S_1, ..., S_j be the variable subsets on which those functions are defined. When elim-mpe processes the bucket of X_p, it computes the function h^p = max_{X_p} Π_{i=1}^{j} h_i. A brute-force approximation method migrates the maximization operator inside the multiplication, generating instead a new function g^p = Π_{i=1}^{j} max_{X_p} h_i. Obviously, h^p <= g^p. Each maximized function has lower arity than h^p, and each of these functions can be moved, separately, to a lower bucket. When the algorithm reaches the first variable, it has computed an upper bound on the mpe. A maximizing tuple can then be generated by instantiating the variables from the first bucket to the last, using the information recorded in each bucket.

This idea was extended to yield a collection of parameterized approximation algorithms by partitioning each bucket into mini-buckets of varying sizes and applying the elimination operator to each mini-bucket rather than to a single function. This yields approximations of varying degrees of accuracy and efficiency. Let Q' = {Q_1, ..., Q_r} be a partitioning into mini-buckets of the functions h_1, ..., h_j in X_p's bucket. If the mini-bucket Q_l contains the functions h_{l_1}, ..., h_{l_t}, the approximation computes g^p = Π_{l=1}^{r} max_{X_p} Π_{l_i} h_{l_i}. Clearly, as the partitionings become coarser (fewer, larger mini-buckets), both the complexity and the accuracy of the algorithm increase. Algorithm approx-mpe(i,m) is described in Figure 3. It is parameterized by two indexes that control the partitionings: a partitioning Q' into mini-buckets is an (i, m)-partitioning if each mini-bucket contains at most m (nonsubsumed) functions and the total number of variables in a mini-bucket does not exceed i. A function f subsumes a function g if each argument of g is also an argument of f.

It was shown [7] that algorithm approx-mpe(i,m) computes an upper bound on the mpe in time O(m · exp(2i)) and space O(m · exp(i)), where i <= n and m <= 2^i.
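Before the formal description in Figure 3, the following sketch (our own simplification, in which every function depends only on the eliminated variable X_p) illustrates the mini-bucket bound: the product of the mini-bucket maxima upper-bounds the exact maximum of the full product.

    # max_x prod_i h_i(x)  <=  prod_l max_x prod_{i in Q_l} h_i(x), illustrated.
    def prod_over(functions, x):
        result = 1.0
        for h in functions:
            result *= h(x)
        return result

    def exact_max(functions, domain=(0, 1)):
        return max(prod_over(functions, x) for x in domain)

    def mini_bucket_bound(partition, domain=(0, 1)):
        bound = 1.0
        for mini_bucket in partition:
            bound *= max(prod_over(mini_bucket, x) for x in domain)
        return bound

    # Three functions of a binary X_p, split into two mini-buckets {h1, h2}, {h3}.
    h1 = lambda x: 0.9 if x == 0 else 0.2
    h2 = lambda x: 0.1 if x == 0 else 0.8
    h3 = lambda x: 0.7 if x == 0 else 0.4

    print(exact_max([h1, h2, h3]))               # 0.064 (attained at x = 1)
    print(mini_bucket_bound([[h1, h2], [h3]]))   # 0.16 * 0.7 = 0.112, an upper bound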

Algorithm approx-mpe(i,m)
Input: A belief network BN = {P_1, ..., P_n} and an ordering of the variables, d.
Output: An upper bound on the most probable assignment, given evidence e.
1. Initialize: Partition BN into bucket_1, ..., bucket_n, where bucket_i contains all matrices whose highest variable is X_i. Let S_1, ..., S_j be the subsets of variables in bucket_p on which the matrices (old or new) are defined.
2. (Backward) For p = n downto 1, do:
   - If bucket_p contains X_p = x_p, assign X_p = x_p to each h_i and put each resulting function into its appropriate bucket.
   - Else, for h_1, h_2, ..., h_j in bucket_p, do: Generate an (i, m)-mini-bucket-partitioning, Q' = {Q_1, ..., Q_r}. For each Q_l in Q' containing h_{l_1}, ..., h_{l_t}, generate the function h^l = max_{X_p} Π_{i=1}^{t} h_{l_i}. Add h^l to the bucket of the largest-index variable in U_l = (∪_{i=1}^{t} S_{l_i}) − {X_p}.
3. (Forward) For p = 1 to n, given x_1, ..., x_{p−1}, choose a value x_p of X_p that maximizes the product of all the functions in X_p's bucket.

Figure 3: Algorithm approx-mpe(i,m)

Example 1: Consider the network in Figure 4. Assume we use the ordering (B, C, D, E, F, G, H, I), to which we apply both algorithm elim-mpe and its simplest approximation, with m = 1 and i = n. Initially, the bucket of each variable contains at most one conditional probability matrix:

bucket(I) = P(I|H,G)
bucket(H) = P(H|E,F)
bucket(G) = P(G|E,D)
bucket(F) = P(F|B)
bucket(E) = P(E|C,B)
bucket(D) = P(D|C)
bucket(C) = P(C)
bucket(B) = P(B)

Processing the buckets from top to bottom, elim-mpe generates new functions, denoted h:

bucket(I) = P(I|H,G)
bucket(H) = P(H|E,F), h^I(H,G)
bucket(G) = P(G|E,D), h^H(E,F,G)
bucket(F) = P(F|B), h^G(E,F,D)
bucket(E) = P(E|C,B), h^F(E,B,D)
bucket(D) = P(D|C), h^E(C,B,D)
bucket(C) = P(C), h^D(C,B)
bucket(B) = P(B), h^C(B)

where h^I(H,G) = max_I P(I|H,G), h^H(E,F,G) = max_H P(H|E,F) · h^I(H,G), and so on. In bucket(B) we compute the mpe value max_B P(B) · h^C(B), and then we can generate the mpe tuple while going forward. If, instead, we process the buckets by approx-mpe(n,1), we get (denoting by h' the functions computed by approx-mpe(n,1) that differ from those generated by elim-mpe):

bucket(I) = P(I|H,G)
bucket(H) = P(H|E,F), h^I(H,G)
bucket(G) = P(G|E,D), h'^H(G)
bucket(F) = P(F|B), h'^H(E,F)
bucket(E) = P(E|C,B), h'^F(E), h'^G(E,D)
bucket(D) = P(D|C), h'^E(D)
bucket(C) = P(C), h'^E(C,B), h'^D(C)
bucket(B) = P(B), h'^C(B), h'^F(B)

Algorithms elim-mpe and approx-mpe(n,1) first differ in their processing of bucket(H): instead of recording a single function on three variables, h^H(E,F,G), as elim-mpe does, approx-mpe(n,1) records two functions, one on G alone and one on E and F. Once approx-mpe(n,1) has processed all buckets, we can generate a tuple in a greedy fashion, as in elim-mpe: we choose the value of B that maximizes the product of the functions in B's bucket, then a value of C maximizing the product of the functions in bucket(C), and so on.

Figure 4: A belief network with joint distribution P(i,h,g,f,e,d,c,b) = P(i|h,g) P(h|e,f) P(g|e,d) P(f|b) P(e|c,b) P(d|c) P(c) P(b).

4.3 Iterative belief propagation

Iterative Belief Propagation (IBP) computes a belief for every variable in the network. This is done by iteratively executing Pearl's belief propagation algorithm for polytrees while ignoring the cycles in the network [14]. The algorithm works by computing causal and diagnostic support (the π and λ matrices, respectively) for each variable in the network. The belief in variable x can then be computed as BEL(x) = α · λ_x · π_x, where α is a normalizing constant. In addition to the π and λ matrices, each variable computes π and λ messages that it sends to its parents and children, respectively. These messages are used to propagate information through the network. Belief propagation was defined for networks that have a polytree structure; an activation schedule specifies the order in which variables are updated. The algorithm can be applied even if the network is cyclic, although in that case there is no guarantee that it will ever converge to the correct values. A formal description of the algorithm is given in Figure 5. It takes an activation schedule as an input; we used an activation schedule that first updates the input variables of the coding network and then updates the parity-check variables. Note that evidence variables are not updated.

Iterative Belief Propagation (IBP) for Bayesian Networks
Input: A belief network BN = {P_1, ..., P_n}, observations E, an activation schedule A, and a number of iterations I.
Output: Belief in every variable.
1. Initialize: Compute initial λ and π matrices for all variables. For an evidence variable x_i = j, set λ_{x_i}(k) equal to 1 for k = j and to 0 for k ≠ j. For non-evidence variables, set the λ matrix equal to the identity. Set the π matrix of each variable equal to the prior probability of the variable. Set the λ and π messages of all variables equal to identity.
2. Non-activation variables: Compute λ and π messages for each variable that is in the network but not in the activation schedule. Normally, all variables other than evidence variables should be in the activation schedule.
3. Iterations: Execute the following step I times.
4. One iteration: For v = 1 to |A| do:
   - Let x_v be the variable in the activation schedule corresponding to v.
   - Compute a new matrix λ_{x_v} according to the formula λ_{x_v}(j) = Π_i λ_{y_i,x_v}(j), where the λ_{y_i,x_v} are the λ messages sent to x_v by its children y_i.
   - Compute a new matrix π_{x_v} according to the formula π_{x_v}(j) = Σ_{u_1,...,u_m} P(x_v = j | u_1, ..., u_m) Π_i π_{u_i,x_v}(u_i), where the π_{u_i,x_v} are the π messages sent to x_v by its parents u_i.
   - Compute new λ messages λ_{x_v,u_i} that x_v sends to its parents u_1, ..., u_m according to the formula λ_{x_v,u_i}(k) = α Σ_j λ_{x_v}(j) Σ_{u_l : l ≠ i} P(x_v = j | u_1, ..., u_m) Π_{l ≠ i} π_{u_l,x_v}(u_l), where α is a normalization constant.
   - Compute new π messages π_{x_v,y_i} that x_v sends to its children y_i according to the formula π_{x_v,y_i}(j) = (π_{x_v}(j) · λ_{x_v}(j)) / λ_{y_i,x_v}(j).

Figure 5: Iterative Belief Propagation Algorithm
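As a fragmentary illustration of the absorption formulas in Figure 5 (our own sketch, not the authors' implementation), the following code computes the lambda and pi matrices and the resulting belief for a single variable, given incoming messages; the conditional probability table and message values are invented.

    # One lambda/pi absorption step for a single variable x, illustrative only.
    def lambda_matrix(child_messages):
        """lambda_x(j) = product over children of lambda_{y_i,x}(j)."""
        size = len(child_messages[0])
        out = [1.0] * size
        for msg in child_messages:
            out = [a * b for a, b in zip(out, msg)]
        return out

    def pi_matrix(cpt, parent_messages):
        """pi_x(j) = sum over parent assignments of P(x=j|u) * prod_i pi_{u_i,x}(u_i).

        cpt maps each parent-assignment tuple to a list over the values of x.
        """
        size = len(next(iter(cpt.values())))
        out = [0.0] * size
        for u_assign, row in cpt.items():
            weight = 1.0
            for i, u_val in enumerate(u_assign):
                weight *= parent_messages[i][u_val]
            for j in range(size):
                out[j] += weight * row[j]
        return out

    def belief(lam, pi):
        unnorm = [l * p for l, p in zip(lam, pi)]
        z = sum(unnorm)
        return [v / z for v in unnorm]

    # Binary x with two parents; an XOR-like noisy table, chosen arbitrarily.
    cpt = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
           (1, 0): [0.2, 0.8], (1, 1): [0.9, 0.1]}
    parents = [[0.7, 0.3], [0.5, 0.5]]      # pi messages from the two parents
    children = [[0.4, 0.6]]                 # lambda message from one child
    print(belief(lambda_matrix(children), pi_matrix(cpt, parents)))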

5 Experimental Methodology

We experimented with several types of coding problems, including parity-check codes with both structured and random parent sets, and Hamming codes. Hamming codes were introduced in the previous section. Both structured and random parity-check code networks have K information bits and N transmitted bits, and each of the (N − K) parity-check bits has a fixed number of parents, P. So far, we have experimented only with parity-check networks having N = 2K, i.e., code rate R = 1/2. In the class of random parity-check codes, each parity bit has P parents chosen randomly from the K input bits, whereas in the class of structured parity-check codes, parity-check bit x_i has the parents {u_(i+j) mod K : 0 <= j < P}. A belief network for a structured parity-check code with K=5 and P=3 is shown in Figure 1. Thus, structured codes are modeled as four-layer belief networks having K nodes in each layer. The two inner layers (channel input nodes) correspond to the input information bits u_i, 0 <= i < K, and to the parity-check bits x_i, 0 <= i < K. The two other layers (channel output nodes) correspond to the transmitted information and parity-check bits (y^u_i and y^x_i, respectively). The input nodes have binary domains (0 and 1), while the output nodes are real-valued variables (input bits plus Gaussian noise).

Given K, P, and a Gaussian noise parameter σ, a sample coding network is generated as follows. First, the appropriate belief network structure is created. Then, we simulate an input signal (an assignment to the information bits) assuming a uniform random distribution. Next, we compute the values of the parity-check bits and simulate the corresponding channel output by adding Gaussian noise with the given σ to each channel input bit. The decoding algorithm receives the network with the observed output nodes as evidence, and it has to recover the input bits.

We experimented with the following decoding algorithms: iterative belief propagation (IBP), the exact elimination algorithms for finding the MAP and MPE assignments (elim-map and elim-mpe, respectively), and the approx-mpe(i,m) algorithm. We experimented only with different values of the parameter i, leaving the number of functions in each mini-bucket unbounded; therefore, from now on we use the shorter notation approx-mpe(i). We also sometimes use a value of i that is smaller than the number of parents, P (e.g., approx-mpe(1)); in this case the number of variables in a mini-bucket is bounded by max(i, P + 1) rather than by i. The BER of each algorithm is plotted versus the Gaussian noise σ. Since in all our experiments the BER of elim-map coincided with that of elim-mpe, only the results for elim-mpe are reported.


6 Results and Discussion

The results on the three classes of coding networks are summarized in Figures 6-8. For more details, see the tables in the Appendix.

Figure 6: The bit error rate (BER) of decoding algorithms for structured parity-check codes with R=1/2 and (a) K=25, P=4, (b) K=50, P=4, (c) K=25, P=7, and (d) K=50, P=7, as a function of the channel noise level sigma (σ). Belief propagation (BP) with one and ten iterations (IBP(1) and IBP(10)) is compared to elim-mpe, approx-mpe(1) and approx-mpe(7), on 1000 instances per point. The induced width of the networks was 6 for (a) and (b), and 12 for (c) and (d).


Figure 7: The bit error rate (BER) of decoding algorithms for random parity-check codes with K=50, P=4 and R=1/2, as a function of the channel noise level sigma (σ). Belief propagation (BP) with one and ten iterations (IBP(1) and IBP(10)) is compared to approx-mpe(1) and approx-mpe(7), on 1000 instances per point. The induced width of the networks varied from 30 to 40.

Figure 8: The bit error rate (BER) of decoding algorithms for Hamming codes with (a) K=4, N=7 and (b) K=11, N=15, as a function of the channel noise level sigma (σ). Belief propagation (BP) with one and five iterations (IBP(1) and IBP(5)) is compared against elim-mpe, approx-mpe(1) and approx-mpe(7), on 10000 instances per point. The induced width of the coding networks was (a) 3 and (b) 9, respectively.

Iterative belief propagation (IBP) with m iterations is denoted IBP(m). On the class of parity-check codes we ran up to 10-15 iterations of IBP; the results usually converge after 10 iterations. We report the results for IBP(1) and IBP(10). On Hamming codes, even fewer iterations (m=5) were sufficient. Algorithm approx-mpe(i) was tested for several values of i. Its accuracy increases with increasing i, and the algorithm becomes exact for i > w*, where w* is the induced width of the network.

6.1 Structured parity-check codes

Comparison of the algorithms on the structured parity-check code networks with K=25 and 50 and P=4 and 7 (Figure 6) shows that:

1. As expected, algorithm elim-mpe (the exact MPE decoder) is always the most accurate decoder.
2. Not surprisingly, 10 iterations of IBP result in a much lower BER than just one iteration.
3. The approximation algorithm approx-mpe(i) is close to elim-mpe, which is also expected given the low induced width of these networks (w* = 6 for P=4 and w* = 12 for P=7).
4. A more interesting observation is that approx-mpe(i) outperforms IBP on all of these networks. As the number of parents increases from P=4 (Figures 6(a) and 6(b)) to P=7 (Figures 6(c) and 6(d)), the difference between IBP and approx-mpe(i) becomes even more pronounced. On the networks with P=7, both approx-mpe(1) and approx-mpe(7) achieve an order of magnitude smaller error than IBP(10).

Next, we compared the results for each algorithm separately while increasing the number of parents from P=4 to P=7 (see also Table 1). The error of IBP(1) practically does not change, the error of the exact elim-mpe changes only slightly, while the errors of IBP(10) and approx-mpe(i) increase; the BER of IBP(10) increased more dramatically with the larger parent set. The induced width of the network grows with the parent-set size and affects the quality of both IBP and approx-mpe(i). In the case of P=4 (induced width 6), approx-mpe(7) coincides with elim-mpe; in the case of P=7 (induced width 12), neither approximation algorithm was exact, yet both were still closer to the exact results than IBP.

In summary, the results on structured parity-check codes demonstrate that approx-mpe(i), even for i=1, was always at least as accurate as belief propagation with 10 iterations. On networks with larger parent sets, approx-mpe(1) and approx-mpe(7) were an order of magnitude better than IBP(10), and about two orders of magnitude better than IBP(1). These results indicate that approx-mpe(i) may be a better decoder than IBP on networks having relatively small induced width.

6.2 Random parity-check code

On randomly generated parity-check networks (Figure 7) the picture was reversed: approx-mpe(i) was worse than IBP(10), although as good as IBP(1). Elim-mpe always ran out of memory on those networks (the induced width exceeded 30). The results are not surprising, since approx-mpe(i) is not likely to be accurate when the bound i is much lower than the induced width. However, it is not clear why IBP was so much better in this case. Of course, realistic codes are carefully designed rather than generated randomly, since devising good codes is the central issue in coding theory.

6.3 Hamming codes

We tested the belief propagation and mini-bucket approximation algorithms on two Hamming code networks, one with K=4, N=7 and the other with K=11, N=15. The results are shown in Figure 8. Again, the most accurate results are produced by elim-mpe decoding. Since the induced width of the (7,4) Hamming network is only 3, approx-mpe(7) coincides with the exact algorithm. IBP(1) is much worse than the rest of the algorithms, while IBP(5) is very close to the exact elim-mpe. Algorithm approx-mpe(1) is slightly worse than IBP(5). On the larger Hamming code network the results are similar, except that both approx-mpe(1) and approx-mpe(7) are inferior to IBP(5), which is in turn inferior to elim-mpe. These results are expected, since the induced width of that network is larger (w* = 9), so that an approximation with bound 7 or less is suboptimal. Since the networks were quite small, the runtime of all algorithms was less than a second, and the time of IBP(5) was comparable to the time of the exact elim-mpe.

In summary, we see that exact MPE decoding is always superior to IBP decoding. An optimal algorithm is expected to be superior; however, since elim-mpe minimizes the block error while IBP minimizes the bit error, this outcome was not obvious in advance. For structured parity-check networks having bounded induced width, approx-mpe(i) also outperforms IBP by one or more orders of magnitude. However, when the induced width is large (e.g., for random parity-check codes), approx-mpe(i) is less accurate than IBP.

7 Conclusions

This paper studies empirically the performance of several approximation algorithms developed for probabilistic inference, using probabilistic decoding as a test bed. Specifically, the mini-bucket algorithm for approximating the most probable explanation (mpe) is compared against the following algorithms: the exact elim-mpe, elim-map, and a variation of Pearl's belief propagation algorithm. The latter was recently shown to be highly effective for decoding when applied iteratively. We evaluated these algorithms on coding-like problems using parity-check and Hamming codes. Our results show that on coding networks of small induced width, the mini-bucket algorithm outperformed iterative belief propagation. However, on networks having a large induced width (the random parity-check networks) and on the larger Hamming network, the iterative approach outperformed the basic mini-bucket algorithm. In all cases, and as expected, decoding using the optimal elim-mpe algorithm was best. As expected, we observe a dependence between the network's induced width and the quality of the mini-bucket approach; such a dependence is less clear for iterative belief propagation.

We also observe increased accuracy as we employ more powerful mini-bucket algorithms. Our experiments were restricted to networks having small parent sets, since both the mini-bucket approach and the belief propagation approach are time and space exponential in the parent-set size. This limitation can be removed: coding-like networks possess causal independence [10, 19], and they can be transformed into networks having families of size three only. Indeed, in coding practice, the belief propagation algorithm is implemented to be linear in the family size, which allows it to process networks of arbitrary family size. We plan to extend our algorithms to exploit causal independence as well, and hope to extend our experimental evaluation accordingly.

Acknowledgments

We wish to thank Padhraic Smyth and Robert McEliece for insightful discussions and providing various information about the coding domain.


Appendix

Table 1: Bit error rate and runtime of the algorithms on structured parity-check codes. Notation: (1) is IBP(1), (2) is IBP(10), (3) is elim-mpe, (4) is approx-mpe(1), (5) is approx-mpe(7).

(1)

(2)

BER (3)

(4)

(5)

(1)

(2)

Time (3)

(4)

(5)

0.01 0.02 0.02 0.02 0.01 0.01 0.01 0.01

0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04

0.07 0.10 0.06 0.06 0.06 0.06 0.06 0.06

0.13 0.19 0.12 0.12 0.12 0.12 0.12 0.12

0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03

0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09

0.12 0.08 0.12 0.08 0.12 0.12 0.13 0.12

0.24 0.16 0.24 0.16 0.24 0.24 0.25 0.24

K=25, P=4, R=1/2, 1000 experiments per row

0.28 0.32 0.35 0.40 0.45 0.50 0.56 0.63

1.8e-2 3.0e-2 3.8e-2 5.2e-2 6.6e-2 8.0e-2 9.0e-2 1.0e-1

1.0e-3 4.1e-3 7.4e-3 1.7e-2 3.2e-2 4.8e-2 6.4e-2 8.4e-2

2.8e-4 1.1e-3 3.2e-3 1.0e-2 2.3e-2 3.9e-2 6.2e-2 8.4e-2

induced width = 6 8.8e-4 2.8e-4 0.02 2.5e-3 1.1e-3 0.02 5.4e-3 3.2e-3 0.02 1.4e-2 1.0e-2 0.02 2.9e-2 2.3e-2 0.02 4.6e-2 3.9e-2 0.02 6.4e-2 6.2e-2 0.02 8.3e-2 8.4e-2 0.02

0.16 0.16 0.16 0.16 0.16 0.16 0.16 0.16

0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03

K=25, P=7, R=1/2, 1000 experiments per row

0.28 0.32 0.35 0.40 0.45 0.50 0.56 0.63

1.8e-2 3.0e-2 3.8e-2 5.2e-2 6.6e-2 8.0e-2 9.0e-2 1.0e-1

6.0e-3 1.1e-2 1.7e-2 3.3e-2 4.9e-2 6.5e-2 8.1e-2 1.0e-1

0.28 0.32 0.32 0.40 0.45 0.50 0.56 0.63

1.9e-2 3.0e-2 3.9e-2 5.3e-2 6.8e-2 8.0e-2 9.4e-2 1.1e-1

1.2e-3 3.7e-3 7.8e-3 1.8e-2 3.3e-2 5.1e-2 7.1e-2 9.2e-2

2.4e-4 1.2e-3 2.0e-3 8.8e-3 2.5e-2 4.2e-2 6.3e-2 8.9e-2

induced width = 12 8.0e-4 8.0e-4 0.36 3.9e-3 3.9e-3 0.53 6.1e-3 6.1e-3 0.33 2.1e-2 2.1e-2 0.33 3.7e-2 3.7e-2 0.33 5.9e-2 5.9e-2 0.33 7.4e-2 7.4e-2 0.33 9.3e-2 9.3e-2 0.33

3.89 5.78 3.55 3.55 3.55 3.55 3.56 3.56

1.17 1.73 1.07 1.07 1.07 1.07 1.07 1.07

K=50, P=4, R=1/2, 1000 experiments per row

1.4e-4 1.3e-3 3.9e-3 1.1e-2 2.5e-2 4.1e-2 6.3e-2 8.9e-2

induced width = 6 4.3e-4 1.4e-4 0.03 2.0e-3 1.3e-3 0.03 5.4e-3 3.9e-3 0.03 1.3e-2 1.1e-2 0.03 2.8e-2 2.5e-2 0.03 4.4e-2 4.1e-2 0.03 6.8e-2 6.3e-2 0.03 9.5e-2 8.9e-2 0.03

0.33 0.33 0.33 0.33 0.33 0.33 0.33 0.33

0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06

K=50, P=7, R=1/2, 1000 experiments per row

0.32 0.35 0.40 0.45 0.50 0.56 0.59 0.63

3.0e-2 3.9e-2 5.5e-2 6.8e-2 8.1e-2 9.5e-2 1.0e-1 1.1e-1

1.1e-2 1.8e-2 3.3e-2 5.1e-2 6.9e-2 8.8e-2 9.6e-2 1.1e-1

9.6e-4 2.7e-3 1.0e-2 2.4e-2 4.4e-2 7.3e-2 8.3e-2 9.7e-2

induced width = 12 2.4e-3 2.4e-3 0.66 5.3e-3 5.3e-3 0.43 1.6e-2 1.6e-2 0.66 3.2e-2 3.2e-2 0.43 5.2e-2 5.2e-2 0.66 7.7e-2 7.7e-2 0.66 8.8e-2 8.8e-2 0.69 1.0e-1 1.0e-1 0.67

7.15 4.71 7.15 4.71 7.15 7.15 7.47 7.22

2.82 1.86 2.82 1.86 2.82 2.82 2.94 2.85

Table 2: Bit error rate and runtime of the algorithms on random parity-check codes. Notation: (1) is IBP(1), (2) is IBP(10), (3) is elim-mpe, (4) is approx-mpe(1), (5) is approx-mpe(7), (6) is approx-mpe(13).

(1)

(2)

(3)

BER (4)

(5)

(6)

(1)

(2)

Time (3) (4)

(5)

(6)

0.10 0.10 0.10 0.10 0.10

1.59 1.59 1.59 1.59 1.58

K=50, P=4, R=1/2, 1000 experiments per row

0.22 0.32 0.40 0.50 0.63

5.8e-3 3.0e-2 5.4e-2 7.9e-2 1.1e-1

1.1e-4 1.4e-3 8.2e-3 3.7e-2 8.7e-2

{ { { { {

induced width 3.1e-3 2.5e-3 2.4e-2 2.5e-2 5.3e-2 5.8e-2 8.7e-2 9.2e-2 1.2e-1 1.3e-1

from 30 to 40 7.6e-4 0.04 1.6e-2 0.04 4.9e-2 0.04 9.0e-2 0.04 1.2e-1 0.04


0.34 0.34 0.34 0.34 0.34

{ { { { {

0.03 0.03 0.03 0.03 0.03

References

[1] S. Arnborg, D.G. Corneil, and A. Proskurowski. Complexity of finding embeddings in a k-tree. SIAM Journal on Algebraic and Discrete Methods, 8(2):177-184, 1987.
[2] J.-F. Cheng. Iterative Decoding. PhD thesis, 1997.
[3] R. Dechter. Bucket elimination: A unifying framework for probabilistic inference. In Proc. Twelfth Conf. on Uncertainty in Artificial Intelligence, pages 211-219, 1996.
[4] R. Dechter. Bucket elimination: A unifying framework for processing hard and soft constraints. CONSTRAINTS: An International Journal, 2:51-55, 1997.
[5] R. Dechter. Mini-buckets: A general scheme for generating approximations in automated reasoning. In Proc. Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), Japan, pages 1297-1302, 1997.
[6] R. Dechter. Bucket elimination: A unifying framework for probabilistic reasoning. In M. I. Jordan (Ed.), Learning in Graphical Models, Kluwer Academic Press, 1998.
[7] R. Dechter and I. Rish. A scheme for approximating probabilistic inference. In Proc. Thirteenth Conf. on Uncertainty in Artificial Intelligence, 1997.
[8] R. Dechter and J. Pearl. Network-based heuristics for constraint-satisfaction problems. Artificial Intelligence, 34:1-38, 1987.
[9] G. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon limit error-correcting coding and decoding: Turbo codes. In Proc. 1993 International Conference on Communications (Geneva, May 1993), pages 1064-1070, 1993.
[10] D. Heckerman and J. Breese. A new look at causal independence. In Proc. Tenth Conf. on Uncertainty in Artificial Intelligence, pages 286-292, 1994.
[11] D.J.C. MacKay and R.M. Neal. Near Shannon limit performance of low density parity check codes. Electronics Letters, 33:457-458, 1996.
[12] R.J. McEliece. Private communication, 1998.
[13] R.J. McEliece, D.J.C. MacKay, and J.-F. Cheng. Turbo decoding as an instance of Pearl's belief propagation algorithm. To appear in IEEE Journal on Selected Areas in Communications, 1997.
[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Mateo, California, 1988.
[15] M. Pradhan, G. Provan, B. Middleton, and M. Henrion. Knowledge engineering for large belief networks. In Proc. Tenth Conf. on Uncertainty in Artificial Intelligence, 1994.
[16] I. Rish and R. Dechter. On the impact of causal independence. Technical report, Information and Computer Science, UCI, 1998.
[17] C.E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 623-656, 1948.
[18] Y. Weiss. Belief propagation and revision in networks with loops. In NIPS-97 Workshop on Graphical Models, 1997.
[19] N.L. Zhang and D. Poole. Exploiting causal independence in Bayesian network inference. Journal of Artificial Intelligence Research, 5:301-328, 1996.
