Parallel Graduated Assignment Algorithm for Multiple Graph Matching Based on a Common Labelling*

David Rodenas, Francesc Serratosa, and Albert Solé

Universitat Rovira i Virgili, Department d'Enginyeria Informàtica i Matemàtiques, 43007 Tarragona, Spain
[email protected], {francesc.serratosa,albert.sole}@urv.cat

Abstract. This paper presents a new parallel algorithm to compute multiple graph matching based on the Graduated Assignment. The aim of developing this parallel algorithm is to perform multiple graph matching on a current desktop computer but, instead of executing the code on the generic processor, executing parallel code on the graphics processor unit. Our new algorithm is ready to take advantage of the capabilities of incoming desktop computers. When comparing the classical algorithm (executed on the main processor) with our parallel algorithm (executed on the graphics processor unit), experiments show an important speed-up in run time.

Keywords: Graph Common Labelling, Graduated Assignment, Parallel architecture, Low-cost computer, CUDA, Multiple Graph Matching.

1 Introduction

Classification is a task of pattern recognition that attempts to assign each input value to one of a given set of classes. Pattern recognition algorithms generally aim to provide a reasonable answer for all possible inputs and to match inputs with classes. Pattern recognition is studied in many fields such as psychology, cognitive science and computer science. Depending on the application, the inputs of the pattern recognition model, or objects to be classified, are described by different representations. The most usual representation is a set of real values, but other common ones are strings, trees or graphs. Graph structures have more capacity to capture the knowledge of the model, but their comparison or matching is also computationally more expensive.

Sometimes, in graph-based pattern recognition applications, given a set of graphs that all represent equivalent or related structures, it is required to find globally consistent correspondences among all those graphs. These correspondences are called a Common Labelling (CL). Algorithms like [1] and [2] perform pairwise matching and

* This research was partially supported by the Consolider Ingenio 2010 project CSD2007-00018 and by the CICYT project DPI 2010-17112.


reconstruct a general correspondence; other algorithms like [3] use the Graduated Assignment [4] to generate the CL by matching all graph nodes to a virtual node set in polynomial time.

Nowadays, desktop computer architectures have evolved towards supercomputing architectures. These architectures generally provide multiple processors and a complex memory hierarchy. A simple desktop computer may contain tens of small processors called cores, some of them in the main processor [5], but most of them in auxiliary coprocessors such as graphics processors [6]. Most algorithms are designed to be executed on a single core of a general-purpose processor. Consequently, they are not designed to take advantage of all the resources available on current desktop computers. By not taking these resources into account, a gap appears between the effective algorithm performance and the potential algorithm performance. This gap will increase at the same rate as the core count of desktop computers increases.

This paper presents a new research project that aims to adapt classical graph algorithms to up-to-date desktop computers. Intensive computation tasks, such as the graduated assignment algorithm [3], are computed on the graphics processor. In this framework, computing the common labelling algorithm on a desktop computer can make use of the available resources.

The bases of our work are described in Section 2 (original algorithm) and Section 3 (computer architecture and parallel programming model). The new parallel algorithm is explained in Section 4. Section 5 compares the run time of the sequential algorithm with that of the new parallel algorithm on two well-known graph databases. Section 6 concludes the paper.

2 Multiple Graph Matching and Computer Architecture

In this section, we introduce a common-labelling algorithm and its behaviour. We also present a set of equations which are used to transform the algorithm to be executed on multi-core computers.

2.1 Attributed Graphs and Multiple Graph Matching

Graduated Assignment [3] is considered to have one of the best run-time performances among the most popular common labelling algorithms. This algorithm approximates a distance and a labelling between many graphs using a method whose run time is polynomial with respect to the order of the graphs. The result of the CL algorithm is a set of probability matrices $\{P_h^1, P_h^2, \ldots, P_h^N\}$, where each matrix represents the probability of matching a node of graph $G_p$ to a virtual node. Since the values of any matrix $P_h^p$ are continuous, a discretisation process of the probability matrix [7] is applied to obtain the final labelling between graph nodes. This process is out of the scope of this paper.

Given a set of graphs $\{G_1, G_2, \ldots, G_N\}$ (each with R vertices) and their respective adjacency matrices $\{A_1, A_2, \ldots, A_N\}$, the general outline of the Graduated Assignment CL is shown in Algorithms 1 and 2.


Algorithm 1. General diagram of the Graduated Assignment Common Labelling

Algorithm 2. Approx_Q function description
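The overall control flow of the method can be summarised by the host-side sketch below. The step names correspond to the functions parallelised in Section 4; the annealing loop over the control parameter beta, the step signatures and the stopping constants are illustrative assumptions taken from the Graduated Assignment framework [4], not the authors' exact pseudocode.

```cuda
#include <functional>

// Hypothetical step interface; the real steps are the kernels described in Section 4.
struct GraduatedAssignmentSteps {
    std::function<void()> computePf;          // eqs. (4)-(5)
    std::function<void()> approxQ;            // Algorithm 2 / eqs. (6)-(11)
    std::function<void(float)> exponentiate;  // eq. (12), takes the annealing parameter
    std::function<void()> stochastic;         // Sinkhorn normalisation, eqs. (2)-(3)
    std::function<float()> convergence;       // eqs. (13)-(14), returns the change measure
};

// Sketch of the outer structure of the Graduated Assignment Common Labelling.
void commonLabelling(const GraduatedAssignmentSteps& s,
                     float beta0, float betaMax, float betaRate,
                     float eps, int maxInnerIters) {
    for (float beta = beta0; beta <= betaMax; beta *= betaRate) {  // annealing loop (assumed)
        for (int it = 0; it < maxInnerIters; ++it) {
            s.computePf();
            s.approxQ();
            s.exponentiate(beta);
            s.stochastic();
            if (s.convergence() < eps) break;
        }
    }
}
```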

$C_{aibj}^{pq}$ represents the compatibility of labelling edge (a,b) of graph $G_p$ to edge (i,j) of graph $G_q$ and their respective ending nodes. In order to optimise the computation of $C_{aibj}^{pq}$, it is defined as:

$$C_{aibj}^{pq} = C_{ai}^{pq}\; C_{bj}^{pq}\; \mathrm{dist}\!\left(A_{ab}^{p},\, A_{ij}^{q}\right) \qquad (1)$$

$C_{ai}^{pq}$ is the precomputed distance between vertex a of graph $G_p$ and vertex i of graph $G_q$, and the dist function determines the distance defined by the existence of the edge (a,b) in graph $G_p$ and of the edge (i,j) in graph $G_q$. The function Stochastic obtains a doubly stochastic matrix [3] using the Sinkhorn method [8], which alternates a row and a column normalisation, as follows:

begin
  do until convergence

$$\forall p:\quad P_h^{p}[a,i] \leftarrow \frac{P_h^{p}[a,i]}{\sum_{k} P_h^{p}[a,k]} \qquad (2)$$

$$\forall p:\quad P_h^{p}[a,i] \leftarrow \frac{P_h^{p}[a,i]}{\sum_{k} P_h^{p}[k,i]} \qquad (3)$$

end

The Sinkhorn method has been parallelised for many high-performance architectures, such as vector machines [9] and Connection Machines [10].
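As a concrete reference for equations (2) and (3), a minimal sequential sketch of the Sinkhorn normalisation is given below. The row-major matrix layout, the tolerance, the iteration cap and the crude convergence check are assumptions made for the sketch, not part of the original algorithm description.

```cuda
#include <cmath>
#include <vector>

// Alternately normalise the rows (2) and the columns (3) of a non-negative
// R x C matrix, stored row-major, until it is (nearly) doubly stochastic.
void sinkhorn(std::vector<float>& P, int R, int C,
              float tol = 1e-4f, int maxIt = 100) {
    for (int it = 0; it < maxIt; ++it) {
        float change = 0.f;
        for (int a = 0; a < R; ++a) {                 // row normalisation, eq. (2)
            float s = 0.f;
            for (int k = 0; k < C; ++k) s += P[a * C + k];
            for (int k = 0; k < C; ++k) P[a * C + k] /= s;
        }
        for (int i = 0; i < C; ++i) {                 // column normalisation, eq. (3)
            float s = 0.f;
            for (int k = 0; k < R; ++k) s += P[k * C + i];
            for (int k = 0; k < R; ++k) {
                float v = P[k * C + i] / s;
                change += std::fabs(v - P[k * C + i]); // crude convergence measure (sketch only)
                P[k * C + i] = v;
            }
        }
        if (change < tol) break;
    }
}
```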

3 Computer Architecture and Programming Model

In this section, we introduce a desktop computer architecture and we relate it to a parallel programming model. We also present a set of directives to specify parallel algorithms.


Fig. 1. General view of a desktop computer architecture

Fig. 2. Generic GPGPU architecture

Fig. 3. CUDA logical execution space

3.1 Desktop Computer Architecture

Current desktop computers are composed of two processors: a generic multi-core processor (composed of a few cores) and a Graphics Processing Unit (GPU, composed of tens of small cores). Both processors have access to the main memory (Figure 1). Current GPUs are General-Purpose GPUs (GPGPUs) dedicated to intensive computations, mainly addressed to graphic tasks. They are able to execute simple functions usually called kernels. GPGPUs are massively multi-threaded architectures. They are composed of several multiprocessors (Figure 2), each of which has multiple cores and a shared memory. Cores are the processing units that compute thread instructions. The shared memory has multiple banks, which can serve data simultaneously to multiple threads. This shared memory has a small size but very low latency.

3.2 Parallel Programming Model

Our parallel programming model is CUDA [11]. This programming framework allows mixing sequential C code, executed on the generic processor, with kernels, executed on the GPGPU. When the sequential code reaches a kernel, it configures a logical grid of blocks (Figure 3) and launches its execution on the GPGPU. A kernel code is executed concurrently by all the threads of the grid of blocks. Each block is physically mapped onto a GPGPU multiprocessor. The threads of a block are executed on the cores of one multiprocessor, and the block memory is mapped onto the shared memory of the multiprocessor (Figures 2 and 3).


We define four directives to express parallel tasks: a directive that distributes the iterations of a loop across the blocks of the grid, a directive that distributes the iterations of a loop across the threads of a block, a directive that fetches data from main memory into block memory, and a directive that synchronises the threads of a block.
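To make the mapping between these directives and CUDA constructs concrete, a small, purely illustrative program is shown below: block-level parallel loops become the block index, thread-level parallel loops become the thread index, fetches into block memory become copies into __shared__ arrays, and thread synchronisation becomes __syncthreads(). The kernel name, the matrix sizes and the operation (a single row normalisation) are assumptions chosen only for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One block per matrix row ("distribute across blocks"); threads cover the
// columns ("distribute across threads"); the row is fetched once into block
// (shared) memory and the threads synchronise before using it.
__global__ void rowNormalise(float* M, int C) {
    extern __shared__ float row[];               // block memory
    int a = blockIdx.x;                          // row handled by this block
    for (int k = threadIdx.x; k < C; k += blockDim.x)
        row[k] = M[a * C + k];                   // fetch into block memory
    __syncthreads();                             // synchronise the block threads

    __shared__ float sum;
    if (threadIdx.x == 0) {
        sum = 0.f;
        for (int k = 0; k < C; ++k) sum += row[k];
    }
    __syncthreads();

    for (int k = threadIdx.x; k < C; k += blockDim.x)
        M[a * C + k] = row[k] / sum;             // all threads reuse the fetched data
}

int main() {
    const int R = 4, C = 8;
    float h[R * C];
    for (int i = 0; i < R * C; ++i) h[i] = 1.f + (i % C);

    float* d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    rowNormalise<<<R, 32, C * sizeof(float)>>>(d, C);   // grid of R blocks
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("row 0, col 0 after normalisation: %f\n", h[0]);
    return 0;
}
```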

4 A Parallel Solution for the Graph Matching Problem

This section presents a parallel solution for the six main functions of the Graduated Assignment Common Labelling (Algorithms 1 and 2): Pf computation, Approx_Q, Q computation, Exponentiation, Stochastic and the Convergence computation.

4.1 Parallel Pf Computation

The Pf computation is very close to a traditional matrix multiplication, but it is simpler because all accesses are ordered as consecutive row accesses. The classical algorithm computes the following equation:

$$\forall p \in [1..N],\ \forall q \in [1..N],\ \forall a,\ \forall i:\qquad P_f^{pq}[a,i] = \sum_{k} P_h^{p}[a,k]\; P_h^{q}[i,k] \qquad (4)$$

but, with the aim of implementing a parallel algorithm, loops q and a are reordered:

$$\forall p \in [1..N],\ \forall a,\ \forall q \in [1..N],\ \forall i:\qquad P_f^{pq}[a,i] = \sum_{k} P_h^{p}[a,k]\; P_h^{q}[i,k] \qquad (5)$$

We assume that R elements can be stored in block memory. The new algorithm that computes (5) parallelises loops p and a through blocks; thus, there is only one k-row $P_h^{p}[a,k]$ per block. We fetch this k-row of $P_h^{p}[a,k]$ into block memory so that all threads of the same block use the same data. The parallel algorithm is the following:

Algorithm 3. Parallel algorithm for Pf Computation
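A possible CUDA realisation of Algorithm 3 is sketched below. The memory layout (row-major, R columns per matrix), the block geometry and the kernel name are assumptions; the points taken from the text are that each block is bound to a (p, a) pair, that the row of $P_h^{p}$ is fetched once into block memory, and that the threads of the block then sweep the (q, i) space.

```cuda
// Sketch of the parallel Pf computation (eq. 5).  Assumed layouts (row-major):
//   Ph: N x R x R,      Ph[(p*R + a)*R + k]        = P_h^p[a,k]
//   Pf: N x N x R x R,  Pf[((p*N + q)*R + a)*R + i] = P_f^pq[a,i]
__global__ void computePf(const float* Ph, float* Pf, int N, int R) {
    int p = blockIdx.x;                       // loop p parallelised through blocks
    int a = blockIdx.y;                       // loop a parallelised through blocks
    extern __shared__ float rowA[];           // the single k-row P_h^p[a,.] of this block

    for (int k = threadIdx.x; k < R; k += blockDim.x)
        rowA[k] = Ph[(p * R + a) * R + k];    // fetch into block memory
    __syncthreads();

    for (int qi = threadIdx.x; qi < N * R; qi += blockDim.x) {
        int q = qi / R, i = qi % R;           // loops q and i parallelised through threads
        float acc = 0.f;
        for (int k = 0; k < R; ++k)
            acc += rowA[k] * Ph[(q * R + i) * R + k];
        Pf[((p * N + q) * R + a) * R + i] = acc;
    }
}

// Assumed launch: one block per (p, a) pair, R floats of block memory.
//   dim3 grid(N, R);
//   computePf<<<grid, 128, R * sizeof(float)>>>(dPh, dPf, N, R);
```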


4.2 Parallel Computation of the Approximate Q Matrix

The Approx_Q function (presented in Algorithm 2) can be rewritten as follows:

$$\forall p \in [1..N],\ \forall a,\ \forall w_1:\qquad Q^{p}[a,w_1] = \sum_{q=1}^{N}\sum_{i}\Big(\sum_{b}\sum_{j} C_{aibj}^{pq}\; P_f^{pq}[b,j]\Big)\; P_h^{q}[i,w_1] \qquad (6)$$

If we convert $v_1$ into a hyper-matrix, we can split (6) into two expressions:

$$\forall p,q \in [1..N],\ \forall a,\ \forall i:\qquad v_1^{pq}[a,i] = \sum_{b}\sum_{j} C_{aibj}^{pq}\; P_f^{pq}[b,j] \qquad (7)$$

$$\forall p,q \in [1..N],\ \forall a,\ \forall i,\ \forall w_1:\qquad Q^{p}[a,w_1] \mathrel{+}= v_1^{pq}[a,i]\; P_h^{q}[i,w_1] \qquad (8)$$

Loops q and i in (8) can be reordered to convert the Q computation into summations:

$$\forall p \in [1..N],\ \forall a,\ \forall w_1:\qquad Q^{p}[a,w_1] = \sum_{q=1}^{N}\sum_{i} v_1^{pq}[a,i]\; P_h^{q}[i,w_1] \qquad (9)$$

(9) is discussed in Section 4.3. (7) is the most complex task of the whole algorithm. We assume that multiple matrices of B x B elements can be stored in block memory. We apply the loop tiling technique [12] in order to expose sub-matrices small enough to fit in block memory. Writing $a = cB+d$, $i = kB+l$, $b = eB+f$ and $j = uB+v$, (7) becomes:

$$\forall p,q,\ \forall c,k,\ \forall d,l \in [1..B]:\qquad v_1^{pq}[a,i] = \sum_{e}\sum_{u}\sum_{f=1}^{B}\sum_{v=1}^{B} C_{aibj}^{pq}\; P_f^{pq}[b,j] \qquad (10)$$

If we apply (1) to (10), we obtain the following expression:

$$\forall p,q,\ \forall c,k,\ \forall d,l \in [1..B]:\qquad v_1^{pq}[a,i] = \sum_{e}\sum_{u}\sum_{f=1}^{B}\sum_{v=1}^{B} C_{ai}^{pq}\; C_{bj}^{pq}\; \mathrm{dist}\!\left(A_{ab}^{p}, A_{ij}^{q}\right)\; P_f^{pq}[b,j] \qquad (11)$$

We parallelise p, q, c and k through blocks. In order to avoid parallel reductions, we parallelise only d and l through threads; as a result, each thread computes a unique $v_1^{pq}[a,i]$ element. Loop tiling is also applied to the summations: the b and j iterations are decomposed. The objective is that all operations inside the f, v summations share the same data (the sub-matrices indexed by f, v, d and l of $P_f^{pq}[b,j]$, $C_{ai}^{pq}$, $C_{bj}^{pq}$, $A_{ab}^{p}$ and $A_{ij}^{q}$) between all the d, l threads of the same block. For each e, u iteration, a new set of sub-matrices is fetched and all threads are synchronized in order to use the same data. As an optimization, data fetching is performed as soon as possible; for example, the $C_{ai}^{pq}$ indices are constant for each thread across all summations, so this sub-matrix is fetched before any summation and reused for all summations of all threads of the same block. The parallelised algorithm is the following:

Algorithm 4. Parallel algorithm for V1 Computation (11, 1, and 7)
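The tiled computation of (11) could look roughly like the following CUDA kernel. The tile size B, the memory layouts, the dist function and the way the compatibility factors are combined are assumptions; the structural points taken from the text are the block mapping over (p, q, c, k), the thread mapping over (d, l), the per-(e, u) fetch of sub-matrices into block memory, and the early fetch of the $C_{ai}^{pq}$ tile.

```cuda
#define B 8   // tile size (assumed); R is assumed divisible by B for brevity

// Assumed layouts (row-major):
//   Pf : N*N*R*R  Pf[((p*N+q)*R+b)*R+j] = P_f^pq[b,j]
//   C  : N*N*R*R  C [((p*N+q)*R+a)*R+i] = C_ai^pq
//   A  : N*R*R    A [(p*R+a)*R+b]       = A^p_ab
//   V1 : N*N*R*R  V1[((p*N+q)*R+a)*R+i] = v1^pq[a,i]
__global__ void computeV1(const float* Pf, const float* C, const float* A,
                          float* V1, int N, int R) {
    int p = blockIdx.x / N, q = blockIdx.x % N;   // blocks: p, q, c, k
    int c = blockIdx.y, k = blockIdx.z;
    int d = threadIdx.x, l = threadIdx.y;         // threads: d, l (one v1 element each)
    int a = c * B + d, i = k * B + l;

    __shared__ float sCai[B][B];   // C_ai^pq tile: constant over the summations
    __shared__ float sPf [B][B];   // P_f^pq tile for the current (e,u)
    __shared__ float sCbj[B][B];   // C_bj^pq tile for the current (e,u)
    __shared__ float sAp [B][B];   // A^p rows of tile c, columns of tile e
    __shared__ float sAq [B][B];   // A^q rows of tile k, columns of tile u

    sCai[d][l] = C[((p * N + q) * R + a) * R + i];   // fetched once, reused at the end

    float acc = 0.f;
    for (int e = 0; e < R / B; ++e) {
        for (int u = 0; u < R / B; ++u) {
            int b = e * B + d, j = u * B + l;
            __syncthreads();                          // protect the previous tiles
            sPf [d][l] = Pf[((p * N + q) * R + b) * R + j];
            sCbj[d][l] = C [((p * N + q) * R + b) * R + j];
            sAp [d][l] = A [(p * R + c * B + d) * R + e * B + l];  // A^p[cB+d, eB+l]
            sAq [d][l] = A [(q * R + k * B + d) * R + u * B + l];  // A^q[kB+d, uB+l]
            __syncthreads();                          // all tiles visible to the block

            for (int f = 0; f < B; ++f)               // b' = e*B + f
                for (int v = 0; v < B; ++v)           // j' = u*B + v
                    acc += sCbj[f][v]
                         * fabsf(sAp[d][f] - sAq[l][v])  // dist on binary edge attributes (assumed)
                         * sPf[f][v];
        }
    }
    V1[((p * N + q) * R + a) * R + i] = sCai[d][l] * acc;
}

// Assumed launch:
//   dim3 grid(N * N, R / B, R / B), block(B, B);
//   computeV1<<<grid, block>>>(dPf, dC, dA, dV1, N, R);
```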

4.3 Parallel Computation of the Ph Matrix

We have grouped all the $P_h$ computation into the same parallel kernel: Q computation (9), Exponentiation, Stochastic (2 and 3) and the Convergence test (see Algorithm 1). The Exponentiation and Convergence-test expressions are:

$$\forall p \in [1..N],\ \forall a,\ \forall w_1:\qquad P_h^{p}[a,w_1] = \exp\!\big(\beta\; Q^{p}[a,w_1]\big) \qquad (12)$$

$$\mathrm{converged} = \Big(\sum_{p=1}^{N}\sum_{a}\sum_{w_1} \big|\, P_h^{p}[a,w_1] - \hat{P}_h^{p}[a,w_1] \,\big| < \varepsilon \Big) \qquad (13)$$

where $\hat{P}_h^{p}$ denotes the value of $P_h^{p}$ at the previous iteration.

The convergence test is split into two parts in order to compute the local convergences in parallel:

$$\forall p \in [1..N]:\quad \mathrm{conv}^{p} = \sum_{a}\sum_{w_1} \big|\, P_h^{p}[a,w_1] - \hat{P}_h^{p}[a,w_1] \,\big|, \qquad \mathrm{converged} = \Big(\sum_{p=1}^{N} \mathrm{conv}^{p} < \varepsilon \Big) \qquad (14)$$

All the grouped computations present an initial loop over p and two more loops over the a and i (or $w_1$) indexes. In order to minimise copies from main memory to block memory, we assume that each block of threads processes a single, unique p value. As a result, all threads of the same block keep the same $P_h^{p}$, and no communication of temporary $P_h^{p}$ values to other blocks is required. Block threads are parallelised through the a and i (or $w_1$) indexes. The new algorithm is:


Algorithm 5. Parallel algorithm for Ph Computation (9, 12, 2, 3, 14)
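A fused kernel along the lines of Algorithm 5 could be sketched as follows. The memory layouts, the ping-pong input/output buffers for $P_h$, the parameter beta, the fixed Sinkhorn iteration count and the atomicAdd-based block reduction are assumptions made for the sketch; the structural points from the text are one block per graph p, $P_h^{p}$ kept in block memory, and a per-block local convergence value that the host later sums and compares against the threshold.

```cuda
// One block per graph p.  Assumed layouts (row-major):
//   Ph : N*R*M    Ph[(p*R+a)*M+w]        = P_h^p[a,w]
//   V1 : N*N*R*R  V1[((p*N+q)*R+a)*R+i]  = v1^pq[a,i]
//   conv: N       one local convergence value per graph, eq. (14)
__global__ void updatePh(const float* PhIn, float* PhOut, const float* V1,
                         float* conv, int N, int R, int M, float beta) {
    extern __shared__ float sPh[];          // new P_h^p (R x M) in block memory
    int p = blockIdx.x;
    int t = threadIdx.x, nt = blockDim.x;

    // Q computation (9) followed by exponentiation (12), element by element
    for (int e = t; e < R * M; e += nt) {
        int a = e / M, w = e % M;
        float q_aw = 0.f;
        for (int q = 0; q < N; ++q)
            for (int i = 0; i < R; ++i)
                q_aw += V1[((p * N + q) * R + a) * R + i] * PhIn[(q * R + i) * M + w];
        sPh[e] = expf(beta * q_aw);
    }
    __syncthreads();

    // Sinkhorn normalisation (2)-(3); a fixed iteration count is assumed for brevity
    for (int it = 0; it < 30; ++it) {
        for (int a = t; a < R; a += nt) {               // row normalisation, eq. (2)
            float s = 0.f;
            for (int w = 0; w < M; ++w) s += sPh[a * M + w];
            for (int w = 0; w < M; ++w) sPh[a * M + w] /= s;
        }
        __syncthreads();
        for (int w = t; w < M; w += nt) {               // column normalisation, eq. (3)
            float s = 0.f;
            for (int a = 0; a < R; ++a) s += sPh[a * M + w];
            for (int a = 0; a < R; ++a) sPh[a * M + w] /= s;
        }
        __syncthreads();
    }

    // write back and accumulate the local convergence value of graph p, eq. (14)
    __shared__ float localConv;
    if (t == 0) localConv = 0.f;
    __syncthreads();
    for (int e = t; e < R * M; e += nt) {
        float oldv = PhIn[(p * R + e / M) * M + e % M];
        PhOut[(p * R + e / M) * M + e % M] = sPh[e];
        atomicAdd(&localConv, fabsf(sPh[e] - oldv));
    }
    __syncthreads();
    if (t == 0) conv[p] = localConv;   // the host sums conv[0..N-1] and tests the threshold
}

// Assumed launch: one block per graph, R*M floats of block memory.
//   updatePh<<<N, 128, R * M * sizeof(float)>>>(dPhIn, dPhOut, dV1, dConv, N, R, M, beta);
```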

5 Experimental Evaluation

We have implemented the sequential algorithm [3] and the proposed parallel algorithm. Both algorithms have been tested on the same desktop computer: the serial algorithm is executed on the main processor (a generic Intel architecture) and the parallel algorithm is executed on the GPGPU parallel architecture. Table 1 shows the characteristics of the architectures on which the serial algorithm (first row) and the parallel algorithm (second row) are executed.

Table 1. Architectures used in the comparison and their characteristics

Alg.          Computer         Proc. / GPGPU    GHz  Power  Cores  Threads  Bandwidth
Serial [3]    ViewSonic VT132  Intel Atom 330   1.6  8 W    2      4 (1)    5 GB/s
New Parallel  ViewSonic VT132  NVIDIA 9400M     1.1  10 W   16     1536     5 GB/s

(1) There are 4 threads available, but algorithm [3] uses only one thread and one core.

We have used two databases in which nodes are defined over a two-dimensional domain that represents their plane position (x, y). Edges have a binary attribute that represents the existence of a line between the two terminal points. The first dataset is a high-noise-level subset of the Letter dataset created at the University of Bern [13]. This dataset is composed of 15 classes and 150 graphs per class representing the Roman alphabet, i.e. A, E, F, .., X, Y and Z.

The second dataset, called the GREC dataset and created at the Universitat Autònoma de Barcelona [13], is composed of 22 classes and 50 graphs per class representing symbols from architectural and electronic drawings.

We have selected 5 classes of each dataset to compare execution speed. For each class we have randomly selected a number of graphs N, with N ∈ {5, 10, 15, 25, 50, 75, 100, 150} for the Letter dataset and N ∈ {5, 10, 15, 25, 50} for the GREC dataset. The Letter dataset classes selected are {1, 6, 8, 12, 13}, each one with a mean number of nodes of {5.3, 5.3, 5.3, 5.4, 4.4}. The GREC dataset classes selected are {5, 8, 14, 15, 21}, each one with a mean of {19.4, 8.6, 12.7, 20.7, 17.14}.

Figure 4 shows the mean run time for each one of the five Letter dataset classes, for the serial and parallel algorithm experiments, for different numbers of graphs. Figure 5 shows the mean run time for each one of the five GREC dataset classes, for the serial and parallel algorithm experiments, for different numbers of graphs. The obtained distance is not shown since the sequential and parallel algorithms obtain exactly the same result. A clear improvement in the run time can be observed when the parallel algorithm is used.

Fig. 4. Letter run time of the serial and parallel algorithms, and speed-up with respect to the number of graphs, for each selected class. Vertical axes are in log scale.

Fig. 5. GREC run time of the serial and parallel algorithms, and speed-up with respect to the number of graphs, for each selected class. Vertical axes are in log scale.





6 Conclusions and Future Work

We have presented a parallel algorithm which can take advantage of the computational resources present on current desktop computers. Results show a significant speed-up of the run time of the multiple graph-matching algorithm. The aim is to demonstrate that it is possible to perform some modifications to the classical algorithm in order to take advantage of existing resources and to use low-cost computers to perform pattern recognition processes based on graphs. Future efforts will focus on parallelising other graph-matching algorithms, applying these solutions to other architectures (such as common multi-core processors and vector operations) and identifying a subset of common operations to simplify future algorithm adaptations.

References

[1] Bonev, B., Escolano, F., Lozano, M.A., Suau, P., Cazorla, M.A., Aguilar, W.: Constellations and the unsupervised learning of graphs. In: Escolano, F., Vento, M. (eds.) GbRPR 2007. LNCS, vol. 4538, pp. 340–350. Springer, Heidelberg (2007)
[2] Solé-Ribalta, A., Serratosa, F.: On the Computation of the Common Labelling of a Set of Attributed Graphs. In: 14th Iberoamerican Congress on Pattern Recognition, pp. 137–144 (2009)
[3] Solé-Ribalta, A., Serratosa, F.: Graduated Assignment Algorithm for Finding the Common Labelling of a Set of Graphs. In: Hancock, E.R., Wilson, R.C., Windeatt, T., Ulusoy, I., Escolano, F. (eds.) SSPR&SPR 2010. LNCS, vol. 6218, pp. 180–190. Springer, Heidelberg (2010)
[4] Gold, S., Rangarajan, A.: A Graduated Assignment Algorithm for Graph Matching. IEEE TPAMI 18(4), 377–388 (1996)
[5] Gochman, S., Mendelson, A., Naveh, A., Rotem, E.: Introduction to Intel Core Duo Processor Architecture. Intel Technology Journal 10(2) (May 2006)
[6] Owens, J.: Streaming architectures and technology trends. GPU Gems 2, 457–470 (2005)
[7] Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2), 83–97 (1955)
[8] Sinkhorn, R.: A Relationship Between Arbitrary Positive Matrices and Doubly Stochastic Matrices. The Annals of Mathematical Statistics 35(2), 876–879 (1964)
[9] Zenios, S.A., Iu, S.-L.: Vector and parallel computing for matrix balancing. Annals of Operations Research 22, 161–180 (1990)
[10] Zenios, S.A.: Matrix balancing on a massively parallel Connection Machine. ORSA Journal on Computing 2, 112–125 (1990)
[11] NVIDIA CUDA, http://developer.nvidia.com/object/cuda.html
[12] Xue, J.: Loop Tiling for Parallelism. Kluwer Academic Publishers, Dordrecht (2000)
[13] Riesen, K., Bunke, H.: IAM graph database repository for graph based pattern recognition and machine learning. In: da Vitoria Lobo, N., Kasparis, T., Roli, F., Kwok, J.T., Georgiopoulos, M., Anagnostopoulos, G.C., Loog, M. (eds.) S+SSPR 2008. LNCS, vol. 5342, pp. 287–297. Springer, Heidelberg (2008)