Computers and Structures 82 (2004) 2401–2411 www.elsevier.com/locate/compstruc

Block constrained versus generalized Jacobi preconditioners for iterative solution of large-scale Biot's FEM equations

K.K. Phoon a,*, K.C. Toh b,*, X. Chen a

a Department of Civil Engineering, National University of Singapore, Block E1A, #07-03, 1 Engineering Drive 2, Singapore 117576, Singapore
b Department of Mathematics, National University of Singapore, 2 Science Drive 2, Singapore 117543, Singapore

Received 7 October 2003; accepted 30 April 2004
Available online 24 August 2004

Abstract

The generalized Jacobi (GJ) diagonal preconditioner coupled with the symmetric quasi-minimal residual (SQMR) method has been demonstrated to be efficient for solving the 2 x 2 block linear system of equations arising from discretized Biot's consolidation equations. However, one may further improve the performance by employing a more sophisticated non-diagonal preconditioner. This paper proposes to employ a block constrained preconditioner Pc that uses the same 2 x 2 block matrix, but with its (1, 1) block replaced by a diagonal approximation. Numerical results on a series of 3-D footing problems show that the SQMR method preconditioned by Pc is about 55% more efficient time-wise than the counterpart preconditioned by GJ when the problem size increases to about 180,000 degrees of freedom. Over the range of problem sizes studied, the Pc-preconditioned SQMR method incurs about 20% more memory than the GJ-preconditioned counterpart. The paper also addresses crucial computational and storage issues in constructing and storing Pc efficiently to achieve superior performance over GJ on commonly available PC platforms.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: Block constrained preconditioner; Generalized Jacobi preconditioner; Biot's consolidation equations; Three-dimensional finite-element discretization; Symmetric quasi-minimal residual (SQMR) method

1. Introduction

The finite-element discretization of Biot's consolidation equations typically gives rise to a large symmetric indefinite linear system of equations of the form

* Corresponding authors. Tel.: +65 6874 6783; fax: +65 6779 1635 (K.K. Phoon); fax: +1 6172589214 (K.C. Toh). E-mail addresses: [email protected] (K.K. Phoon), [email protected] (K.C. Toh).

    Ax = b,    with    A = [  K     B
                              B^T  -C  ]                                    (1)

where b ∈ R^(m+n), K ∈ R^(m x m) is symmetric positive definite, B ∈ R^(m x n) has full column rank, and C ∈ R^(n x n) is symmetric positive semi-definite. A more detailed description of Biot's consolidation equations is given elsewhere [1]. It suffices to note here that the ratio of the number of pore pressure degrees of freedom (DOFs) n to the number of displacement DOFs m is about 0.1 (see Table 1 for examples). The matrix A is generally sparse. Fig. 1 illustrates the sparsity pattern of A for a 5 x 5 x 5 meshed footing problem.
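As an illustration only (not from the paper), the following Python/SciPy sketch assembles a small matrix with the same 2 x 2 saddle-point layout as Eq. (1), using random stand-in blocks for K, B and C rather than an actual Biot discretization:

# Illustrative only: a small symmetric indefinite matrix with the 2 x 2 block
# layout of Eq. (1), A = [[K, B], [B^T, -C]], built from random stand-in blocks.
import scipy.sparse as sp

m, n = 200, 20                                       # n/m ~ 0.1, as in Table 1

K0 = sp.random(m, m, density=0.02, random_state=0)
K = (K0 @ K0.T + sp.identity(m)).tocsr()             # symmetric positive definite stand-in
B = sp.random(m, n, density=0.05, random_state=1)    # tall m x n block (full column rank with high probability)
C0 = sp.random(n, n, density=0.20, random_state=2)
C = (C0 @ C0.T).tocsr()                              # symmetric positive semi-definite stand-in

A = sp.bmat([[K, B], [B.T, -C]], format="csr")       # block layout of Eq. (1)
print(A.shape, A.nnz)                                # (m + n, m + n) sparse matrix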



Table 1
3-D finite-element meshes

                               5x5x5     12x12x12   16x16x16   20x20x20   24x24x24
Number of elements (ne)        125       1728       4096       8000       13824
Number of nodes                756       8281       18785      35721      60625

DOFs
  Pore pressure (n)            180       2028       4624       8820       15000
  Displacement (m)             1640      21576      50656      98360      169296
  Total (m + n)                1820      23604      55280      107180     184296
  n/m (%)                      11.0      9.4        9.1        9.0        8.9

No. non-zeros (nnz)
  nnz(B)                       22786     375140     907608     1812664    3174027
  nnz(B)/(nm) (%)              7.72      0.86       0.39       0.21       0.125
  nnz(C)                       3328      46546      110446     215818     373030
  nnz(C)/(n^2) (%)             10.27     1.13       0.52       0.28       0.17
  nnz(Ŝ)                       10944     187974     461834     921294     1614354
  nnz(Ŝ)/(n^2) (%)             33.78     4.57       2.16       1.18       0.72

Fig. 1. Sparsity structure of the coefficient matrix A arising from the finite-element solution of Biot's consolidation problem for the 3-D mesh (5 x 5 x 5) shown in the left panel (quadrant symmetric).

A wide class of constrained problems involving mixed finite-element formulations also produces the above 2 x 2 block matrix structure [2]. When Eq. (1) is large scale, say arising from 3-D soil-structure interaction problems, storing the matrix A in explicit global form may be very expensive. Furthermore, solving Eq. (1) via a sparse direct method such as sparse LU factorization can also be prohibitively expensive. To avoid these difficulties, solving Eq. (1) via a Krylov subspace iterative solver becomes a necessity. The obvious advantage of an iterative solver is that the global matrix A need not be assembled explicitly, since the matrix-vector multiplications required in each step of an iterative solver can be computed at the element-by-element (EBE) level. Iterative solvers, when appropriately preconditioned, may produce a sufficiently accurate approximate solution to Eq. (1) in a moderate number of iterative steps that is much smaller than the dimension of the system. If each preconditioning step is not too expensive, and the number of steps needed for convergence is moderate, the combined effect may allow an iterative solver to solve Eq. (1) much faster, and with less memory demand, than a direct solver.
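To make the matrix-free idea concrete, the sketch below wraps an element-by-element product as a SciPy LinearOperator; the element arrays elem_mats and elem_dofs are assumed inputs (hypothetical names, not from the paper), and MINRES merely stands in for SQMR, which SciPy does not provide:

# Sketch: element-by-element (EBE) matrix-vector product wrapped as a
# LinearOperator, so a Krylov solver can form A*v without assembling A.
# 'elem_mats' holds one dense p x p element matrix per element and
# 'elem_dofs' the corresponding global DOF indices (assumed inputs).
import numpy as np
from scipy.sparse.linalg import LinearOperator, minres

def ebe_matvec(v, elem_mats, elem_dofs, ndof):
    y = np.zeros(ndof)
    for Ae, dofs in zip(elem_mats, elem_dofs):
        y[dofs] += Ae @ v[dofs]          # gather, multiply, scatter-add
    return y

def make_operator(elem_mats, elem_dofs, ndof):
    return LinearOperator(
        (ndof, ndof),
        matvec=lambda v: ebe_matvec(v, elem_mats, elem_dofs, ndof),
        dtype=float,
    )

# Example use with a symmetric Krylov solver (only mat-vecs are needed):
# A_op = make_operator(elem_mats, elem_dofs, ndof)
# x, info = minres(A_op, b, rtol=1e-6, maxiter=2000)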


There are several popular symmetric iterative methods for solving Eq. (1), such as SYMMLQ, MINRES, and SQMR. The first two solvers require the use of symmetric positive definite preconditioners, which is an unnatural restriction for an indefinite system. The SQMR method, on the other hand, allows the use of arbitrary symmetric preconditioners [3] and is thus more versatile. Thus in this paper, we shall use SQMR as the iterative solver for Eq. (1).

It is well known that to successfully solve a linear system such as Eq. (1) by an iterative solver within a reasonable amount of time, preconditioning the solver is crucial. For the linear system arising from Biot's consolidation equations, there is currently only one preconditioner that has been demonstrated to be effective in terms of time and storage for the solution of large-scale 3-D problems on a modest PC platform. This preconditioner is the generalized Jacobi (GJ) diagonal preconditioner, which has the following global form [1,4]:

    P̂_α = [ diag(K)    0
            0          α diag{C + B^T [diag(K)]^(-1) B} ]                    (2)

The motivation for the construction of the above GJ preconditioner comes from a theoretical eigenvalue clustering result developed by Murphy et al. [5] for a linear system of the form given by Eq. (1) but with C = 0. In a practical implementation, the above global form of P̂_α is replaced by an element-by-element (EBE) form constructed from the following pseudo-code:

    For i = 1 to m:      p̂_ii = k_ii                                         (3a)

    For i = m + 1 to m + n:
        p̂_ii = α [ Σ_{j=1}^{m} b_ji^2 / k_jj + c_ii ]
             = α [ Σ_{j=1}^{m} ( Σ_{e=1}^{ne} b^e_ji )^2 / k_jj + c_ii ]
             ≈ α [ Σ_{j=1}^{m} Σ_{e=1}^{ne} (b^e_ji)^2 / k_jj + c_ii ]        (3b)

where b^e_ji is the entry in the B-block of the eth finite element referenced globally and ne is the total number of elements in the finite-element mesh. The constant α is chosen to be -4; the motivation comes from the theoretical result developed in [1].
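The diagonal in Eqs. (2)-(3) can be sketched compactly if the global sparse blocks K, B and C are already available (the paper instead accumulates the pressure part element by element, as in Eq. (3b)); this is an illustrative sketch, not the authors' code, with α taken as a parameter:

# Sketch of the generalized Jacobi (GJ) diagonal of Eq. (2):
# P_hat = diag( diag(K), alpha * diag(C + B^T diag(K)^{-1} B) ).
# K, B, C are assumed to be available as SciPy sparse matrices.
import numpy as np
import scipy.sparse as sp

def gj_diagonal(K, B, C, alpha=-4.0):
    dK = K.diagonal()                            # diag(K), Eq. (3a)
    # diag(B^T diag(K)^{-1} B): entry i equals sum_j b_ji^2 / k_jj
    B2 = B.multiply(B)                           # element-wise squares b_ji^2
    bt_dinv_b = np.asarray(B2.T @ (1.0 / dK)).ravel()
    dP = alpha * (bt_dinv_b + C.diagonal())      # pressure block, cf. Eq. (3b)
    return np.concatenate([dK, dP])              # length m + n diagonal

# Applying the preconditioner is then just a division by this diagonal:
# z = r / gj_diagonal(K, B, C)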

2403

are designed for implementation on parallel computers and are not suitable for the PC environment assumed in this paper. Although the GJ-preconditioned SQMR method has been demonstrated to have very good performance in [1,4] in that it converges in a moderate number of iterations compared to the dimension of the linear system, it is highly desirable to find other more sophisticated preconditioners that can cut down the iteration count further while at the same time keeping the cost of each preconditioning step at a reasonable level. To design such a preconditioner, it is useful to keep in mind the following three basic criteria: (a) The preconditioned system should converge rapidly, i.e., the preconditioned matrix should have a good eigenvalue clustering property. (b) The preconditioner should be cheap to construct (this fixed overhead is mainly related to problem size) and to ‘‘invert’’ within each iteration (this variable cost is mainly related to iteration count). (c) Last but not least, the preconditioner should not consume a large amount of memory (a practical constraint for computations on PCs). It is also preferable to avoid massive indirect memory addressing operations to exploit the cache architecture in CPUs. Toh et al. [9] systematically analyzed three types of block preconditioners to solve the linear system given by Eq. (1) and evaluated their practical effectiveness using the first two criteria. It was assumed that sufficient memory is available to store the entire global coefficient matrix A in random access memory (RAM). On a limited set of numerical experiments on problems with dimensions less than 24,000 (so that global A can be stored), the block constrained preconditioner (Pc) considered in that paper was demonstrated to have superior performance (time-wise) compared to the block diagonal and block triangular preconditioners. However, the feasibility of implementing Pc to satisfy criterion (c) was not addressed. This issue is of paramount practical importance if the block constrained preconditioner is to be applicable for solving very large linear system of equations on PC platforms commonly found in most engineering design offices. The first objective of this paper is to address the efficient implementation and memory management of Pc. The second objective is to push the envelope of problem sizes studied by Toh et al. [9] to ascertain the generality of the comparative advantages Pc has over GJ on significantly larger linear systems of equations. The last objective is to explain semi-empirically why a Pc-preconditioned system is expected to converge faster than one that is preconditioned by GJ based on eigenvalue distributions.


2. Block constrained preconditioners

2.1. Overview

Recall that a block constrained preconditioner is a 2 x 2 block matrix of the form

    Pc = [  K̂     B
            B^T  -C  ]                                                       (4)

where K̂ is a symmetric positive definite approximation of K. To solve a very large linear system of equations, the only practical choice for K̂ at the moment is K̂ = diag(K). Approximations based on incomplete Cholesky factorizations of K would be extremely expensive to compute (in terms of storage and time) because K would need to be assembled and stored explicitly. Throughout this paper, we shall use the approximation K̂ = diag(K).

To apply the preconditioner Pc within an iterative solver, it is not necessary to compute its LU factorization explicitly; it can be applied efficiently by observing that its inverse has the following analytical form:

    Pc^(-1) = [ K̂^(-1) - K̂^(-1) B Ŝ^(-1) B^T K̂^(-1)      K̂^(-1) B Ŝ^(-1)
                Ŝ^(-1) B^T K̂^(-1)                         -Ŝ^(-1)         ]     (5)

where Ŝ = C + B^T K̂^(-1) B is the Schur complement matrix associated with Pc. With the expression in Eq. (5), the preconditioning step in each iteration can be implemented efficiently via the following pseudo-code (see [9]):

    Given displacement vector u and pore pressure vector v:
        Compute w = K̂^(-1) u
        Compute z = Ŝ^(-1) (B^T w - v)
        Compute Pc^(-1) [u; v] = [K̂^(-1) (u - Bz); z]                        (6)

Assuming that the sparse Cholesky factorization Ŝ = LL^T has been computed, we readily see that one preconditioning step involves two multiplications of K̂^(-1) with a vector, the multiplications of the sparse matrices B and B^T with a vector, and the solution of two triangular linear systems. The multiplication of K̂^(-1) with a vector can be done efficiently since K̂ is a diagonal matrix, and it involves only m scalar multiplications. In the next subsection, we shall give implementation details on the construction of Ŝ as well as on operations involving the sparse matrices B, B^T, and C. Given the small number of non-zeros in B and Ŝ (see Table 1 for examples), sparsity has to be exploited to maximize computational speed and minimize memory usage.

2.2. Implementation details

For large finite-element problems, the data related to the linear system can be stored in an unassembled form (e.g. [10,11]), and the matrix-vector multiplications involved in the solution process can be carried out using an element-by-element (EBE) implementation. For the GJ preconditioner, the multiplication of A with a vector in each iterative step of SQMR is implemented at the element level as follows:

    Av = ( Σ_{i=1}^{ne} J_i^T A_i^e J_i ) v                                  (7)

where A is the global coefficient matrix, J_i ∈ R^(p x (m+n)) is a connectivity matrix transforming local to global DOFs [12], p is the number of DOFs in one element, ne is the number of elements, and A_i^e ∈ R^(p x p) is the ith element stiffness matrix, which can be expressed further as

    A_i^e = [  K_i^e       B_i^e
               (B_i^e)^T  -C_i^e  ]                                          (8)

In the preconditioning step described in Eq. (6), it would be too expensive from the perspective of CPU time if B^T w and Bz were implemented at the EBE level, because this would require scanning through all the element stiffness matrices for assembly. Given the relatively small dimensions of B, B^T and C compared to K (i.e., n ≪ m), it is more reasonable to assemble these sparse matrices in global form prior to applying the SQMR solver. In addition, the availability of the global matrices B, B^T and C allows the computation of Av to be done efficiently through the following procedure:

    Given [u; v]:
        Compute z1 = Bv,  z2 = B^T u,  z3 = Cv
        Compute w = Ku at the EBE level
        Compute A[u; v] = [w + z1; z2 - z3]                                  (9)

where only the multiplication Ku is done at the EBE level, in a manner analogous to Eq. (7). From Eqs. (6) and (9), we see that in each SQMR step there are two pairs of sparse matrix-vector multiplications involving B and B^T.

In assembling the sparse global matrices B, B^T and C, careful memory management is a must to avoid incurring excessive additional memory allocation. In our implementation, we first store all the non-zero entries of B at the element level in three vectors, where the first and second store the global row and global column indices, and the last stores the non-zero element-level value. Next, we sort the three vectors by the row and column indices, and then add up the values that have the same row and column indices. With the sorted vectors, a final step is performed to store the sparse matrix B in the Compressed Sparse Row (CSR) format with associated 1-D arrays ibr(m + 1), jbr(bnz), csrb(bnz), where bnz is the total number of non-zero entries in the sparse B matrix ([11]).
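The sort-and-sum assembly described above corresponds to the standard coordinate-to-CSR conversion, which sums duplicate (row, column) entries; the sketch below is illustrative only and assumes the element-level triplet arrays rows, cols and vals have already been gathered:

# Sketch: assemble the global sparse B from element-level triplets.
# 'rows', 'cols', 'vals' are assumed 1-D arrays holding, for every non-zero
# element-level entry b^e_ij, its global row index, global column index and
# value (duplicates allowed). The COO -> CSR conversion sums the duplicates,
# which is exactly the sort-and-add step described above.
import scipy.sparse as sp

def assemble_B(rows, cols, vals, m, n):
    B_coo = sp.coo_matrix((vals, (rows, cols)), shape=(m, n))
    B_csr = B_coo.tocsr()      # duplicate (i, j) entries are summed here
    B_csc = B_csr.tocsc()      # column-oriented copy for products with B
    return B_csr, B_csc

# B_csr plays the role of the arrays ibr/jbr/csrb (used for B^T w, Algorithm 1);
# B_csc plays the role of jbc/ibc/cscb (used for B z, Algorithm 2).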

The Compressed Sparse Column (CSC) format of B can be obtained readily from the CSR format (via Sparskit at http://www-users.cs.umn.edu/~saad/software/home.html). We denote the arrays associated with the CSC format of B by jbc(n + 1), ibc(bnz), cscb(bnz). Given the CSR and CSC formats of B, the multiplications B^T w and Bz can then be computed efficiently via Algorithms 1 and 2 described in Appendix A. In our implementation, the global sparse matrices B and B^T are stored implicitly in the CSC and CSR formats of B, respectively, and C is stored in the CSC format. Note that storing these global sparse matrices in the CSC or CSR format requires less RAM than storing their unassembled element versions, because overlapping degrees of freedom and zero entries within element matrices are eliminated.

The next issue we need to address in constructing Pc is the efficient computation of the Schur complement Ŝ = C + B^T K̂^(-1) B. The way this is done is shown in Algorithm 3 of Appendix A. Once the n x n matrix Ŝ is computed, we need to compute its sparse Cholesky factorization so as to compute Ŝ^(-1) v for any given vector v. It is well known that directly factorizing a sparse matrix may lead to excessive fill-ins and hence use up a large amount of memory space. To avoid excessive fill-ins, the matrix is usually re-ordered by the reverse Cuthill-McKee (RCM) or the multiple minimum degree (MMD) algorithms (see [13]) prior to applying the Cholesky factorization. In this paper, we adopt the MMD method, which is an effective re-ordering algorithm that usually leads to a sparser factor than other re-ordering algorithms. For readers who are not familiar with the solution process of a sparse symmetric positive definite linear system, Hx = b, we note that it is generally divided into four stages (e.g. [13,14]):

(a) Re-ordering: Permute symmetrically the columns and rows of the matrix H using one re-ordering method. Suppose the permutation matrix is P.
(b) Symbolic factorization: Set up a data structure for the Cholesky factor L of PHP^T.
(c) Numerical factorization: Perform row reductions to find L so that PHP^T = LL^T.
(d) Triangular solution: Solve LL^T Px = Pb for Px by solving two triangular linear systems. Then recover x from Px.

In applying the Pc preconditioner in the SQMR solver, the re-ordering and Cholesky factorization of Ŝ are performed only once, before calling the SQMR solver. However, in the preconditioning step within the SQMR solver, the triangular solves must be repeated at each iterative step. We use the sparse Cholesky factorization subroutine from the SparseM package (http://cran.r-project.org/). The algorithm was developed by Ng and Peyton [15].

3. Numerical studies

3.1. Convergence criteria

An iterative solver typically produces an increasingly accurate approximate solution as the iteration progresses, and thus can be terminated when the approximate solution is deemed sufficiently accurate. A standard measure of accuracy is based on the residual norm. Suppose x^(i) is the approximate solution at the ith iterative step. Let r^(i) = b - Ax^(i). Given an initial guess of the solution x^(0), an accuracy tolerance stop_tol, and the maximum number max_it of iterative steps allowed, we stop the iterative solver if

    i > max_it    or    ||r^(i)||_2 / ||r^(0)||_2 < stop_tol                 (10)

Here ||.||_2 denotes the 2-norm. In this paper, the initial guess x^(0) is taken to be the zero vector, stop_tol = 10^(-6), and max_it = 2000. More details about various stopping criteria can be found in [16].
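A direct transcription of the stopping test in Eq. (10), written here as a small helper with the paper's default values (an illustrative sketch only):

import numpy as np

def should_stop(r_i, r_0, i, stop_tol=1e-6, max_it=2000):
    """Stopping rule of Eq. (10): terminate when the iteration budget is
    exceeded or the relative residual norm drops below stop_tol."""
    rel_res = np.linalg.norm(r_i) / np.linalg.norm(r_0)
    return (i > max_it) or (rel_res < stop_tol)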

Fig. 2. Finite-element mesh (24 x 24 x 24) of a quadrant symmetric shallow foundation problem (axes X, Y, Z; 10 m x 10 m x 10 m domain; applied pressure 0.1 MPa).

3.2. Problem descriptions

Fig. 2 shows a sample finite-element mesh of a flexible square footing resting on homogeneous soil subjected to a uniform vertical pressure of 0.1 MPa. Symmetry consideration allows a quadrant of the footing to be analysed. Mesh sizes ranging from 12 x 12 x 12 to 24 x 24 x 24 were studied. These meshes result in linear systems of equations with DOFs ranging from about 20,000 to 180,000, respectively.

The largest problem studied by Toh et al. [9] contains only about 24,000 DOFs. Twenty-noded brick elements were used. Each brick element has 60 displacement degrees of freedom (8 corner nodes and 12 mid-side nodes, with 3 spatial degrees of freedom per node) and 8 excess pore pressure degrees of freedom (8 corner nodes with 1 degree of freedom per node). Details of the 3-D finite-element meshes are given in Table 1. The base of the mesh is assumed to be fixed in all directions and impermeable; the side face boundaries are constrained in the transverse direction but free in the in-plane directions (for both displacement and water flux). The top surface is free in all directions and free-draining, with pore pressures assumed to be zero. The ground water table is assumed to be at the ground surface and in a hydrostatic condition at the initial stage. The materials used in the analysis are assumed to behave in a linear elastic manner, with a constant effective Poisson's ratio (ν') of 0.3. The parameters that are varied are the effective Young's modulus (E') and the coefficient of hydraulic permeability (k). The footing load is applied "instantaneously" over the first time step of 1 s. Subsequent dissipation of the pore water pressure and settlement beneath the footing are studied using a backward difference technique with Δt = 1 s. All the numerical studies are conducted using a Pentium IV, 2.0 GHz desktop PC with a physical memory of 1 GB.

3.3. Comparison between GJ and Pc

The performances of GJ and Pc are evaluated over a range of mesh sizes and material properties and summarized in Tables 2 and 3, respectively, based on the following indicators:

(a) RAM usage during iteration.
(b) Number of iterations required to achieve a relative residual norm below 10^(-6) (iteration count).
(c) CPU time spent prior to execution of the SQMR solver (overhead), which includes the time spent on computation of the element stiffness matrices and construction of the preconditioners. For GJ, the construction cost entails assembling the diagonal entries of K from the element stiffness matrices and computing diag{C + B^T [diag(K)]^(-1) B} approximately using Eq. (3b). For Pc, the construction cost is associated with the steps discussed in Section 2.2.
(d) CPU time spent within the SQMR solver (iteration time), which is a function of the iteration count and the time consumed per iteration.

Fig. 3a shows that both GJ and Pc are very efficient for this class of problem from an iteration count point of view. For practical applications, it is worth noting that these preconditioners actually become even more efficient as the problem size increases. For example, the iteration count for GJ decreases from about 3% of the problem dimension for the smaller 12 x 12 x 12 footing problem to only 1% of the problem dimension for the 24 x 24 x 24 footing problem. Despite this efficiency, Pc is able to out-perform GJ on two counts. First, the iteration count for Pc is almost the same for the two material types studied (Fig. 3a). This implies better scaling with respect to the effective Young's modulus and hydraulic permeability. Second, the iteration count for Pc is less than half of that for GJ and, more significantly, this ratio decreases with problem size, as shown in Fig. 3b.

Table 2
Performance of the GJ preconditioner over different mesh sizes and material properties

                                  12x12x12   16x16x16   20x20x20   24x24x24
RAM (MB)                          65         153        297        513

Material 1: E' = 1 MPa, k = 10^(-9) m/s (typical of soft clays)
  Iteration count                 666        1003       1463       1891
  Overhead (s)                    38.8       91.6       178.9      312.9
  Iteration time (s)              247.2      879.4      2519.7     5620.5
  Total runtime (s)               288.5      976.8      2709.8     5952.8
  Total/iteration time            1.16       1.10       1.07       1.06

Material 2: E' = 100 MPa, k = 10^(-6) m/s (typical of dense sands)
  Iteration count                 582        871        1260       1654
  Overhead (s)                    38.8       91.8       179.0      309.7
  Iteration time (s)              214.2      764.9      2161.4     4877.8
  Total runtime (s)               255.5      862.6      2351.7     5206.9
  Total/iteration time            1.18       1.12       1.08       1.06


Table 3
Performance of the Pc preconditioner over different mesh sizes and material properties

                                  12x12x12   16x16x16   20x20x20   24x24x24
RAM (MB)                          68         167        365        610

Material 1: E' = 1 MPa, k = 10^(-9) m/s (typical of soft clays)
  Iteration count                 310        412        515        613
  Overhead (s)                    44.2       115.1      267.5      614.9
  Iteration time (s)              132.7      441.1      1151.9     2580.3
  Total runtime (s)               179.3      561.8      1430.3     3213.9
  Total/iteration time            1.33       1.26       1.23       1.24

Material 2: E' = 100 MPa, k = 10^(-6) m/s (typical of dense sands)
  Iteration count                 307        406        507        606
  Overhead (s)                    43.7       117.0      268.3      614.3
  Iteration time (s)              130.6      430.7      1136.4     2554.6
  Total runtime (s)               176.7      553.3      1415.5     3187.8
  Total/iteration time            1.33       1.27       1.24       1.25


Fig. 3. (a) Iteration count as a percentage of DOFs, and (b) comparison of iteration count between GJ and Pc ("Material 1" and "Material 2" refer to soft clays and dense sands, respectively). Note that the curves for open and filled triangles coincide with each other.

The ratio for the largest problem studied is only about one-third.

Reductions in iteration count do not translate in a straightforward way to savings in total runtime. Overhead and CPU time consumed per iteration are additional factors to consider. For the former, it is not surprising that Pc is more expensive, given the fairly elaborate steps discussed in Section 2.2 (Fig. 4a). Computation of the n x n Schur complement Ŝ = C + B^T K̂^(-1) B and the ensuing sparse Cholesky factorization are significant contributors to this overhead expense. Nevertheless, the increase in overhead with DOFs could have been more onerous than a power of 1.27 if the sparsity of B were not exploited in the computation of Ŝ and in the Cholesky factorization. For fully dense matrices, one expects the overhead to grow with DOFs to the power of 3. As for time per iteration, one Pc preconditioning step involves two multiplications of K̂^(-1) with a vector, the multiplications of the sparse matrices B and B^T with a vector, and the solution of two triangular linear systems. The corresponding GJ step only involves multiplication of P̂_α^(-1) with a vector, which can be done in (m + n) scalar multiplications since P̂_α is a diagonal matrix. Nevertheless, Pc is only marginally more expensive within each SQMR iteration because the sparse matrix-vector multiplications and triangular solves are very efficient (growing with DOFs to the power of 1.11, as shown in Fig. 4b).

Although overhead and time per iteration are more costly for Pc, it is still possible to achieve total runtime savings over GJ, as shown in Fig. 4c. This may be illustrated using the 24 x 24 x 24 footing problem with 184,296 DOFs for material 1. The iteration count for Pc is about 0.32 times that for GJ (Fig. 3b), while the time per iteration and the ratio of total to iteration runtime for Pc are about 1.42 (Tables 2 and 3) and 1.18 (Fig. 4d) times those of GJ, respectively. Hence, the total runtime of Pc relative to GJ can be deduced to be about 0.32 x 1.42 x 1.18 ≈ 0.54, which reproduces the result shown in Fig. 4c. An interesting question is how much savings in total runtime is achievable (if any) when the DOFs increase by one order of magnitude (i.e., into the millions). It is not unreasonable to extrapolate from Figs. 3b, 4b and d that the iteration count, time per iteration, and total/iteration runtime ratio for Pc would then be about 0.3, 1.8 (≈ 1.42 x 10^0.11), and 1.25 times those of GJ, respectively. These crude extrapolated data imply that the total runtime of Pc would be reduced to about two-thirds that of GJ (0.3 x 1.8 x 1.25 ≈ 0.68).

The final practical aspect that needs to be considered is the comparison of RAM usage between Pc and GJ, as shown in Fig. 5. For the largest 24 x 24 x 24 footing problem, Pc requires about 20% more RAM than GJ (Tables 2 and 3). For a problem with one order of magnitude larger DOFs, Pc would probably require about 1.2 x 10^0.08 ≈ 1.45 times the RAM required by GJ. This requirement may be reasonable when PCs with a few GB of RAM become more commonly available in engineering design offices.

Fig. 4. (a) Rate of increase in overhead with DOFs, (b) rate of increase in time/iteration with DOFs, (c) total runtime ratio between Pc and GJ, and (d) total/iteration time ratio between Pc and GJ ("Material 1" and "Material 2" refer to soft clays and dense sands, respectively). Note that in (a) and (b), the curves for open and filled circles coincide with one another.

Fig. 5. Rate of increase in RAM usage during iteration with DOFs.

3.4. Eigenvalue distribution of preconditioned matrices and convergence rates

Fig. 6 shows the convergence history of the relative residual norm ||r^(i)||_2/||r^(0)||_2 as a function of the iteration number i for the solution of the 5 x 5 x 5 footing problem of Fig. 1 using SQMR preconditioned by GJ and Pc. It is clear that the Pc-preconditioned method converges much faster than the GJ-preconditioned counterpart, with the former terminating at 105 steps and the latter at 192 steps. It is possible to explain this difference in convergence rate semi-empirically by looking at the eigenvalue distributions of the preconditioned matrices. Fig. 7 shows these eigenvalue distributions. We note that the real parts of the eigenvalues of the GJ-preconditioned matrix are contained in the interval [0.0132, 5.4476], while those of the Pc-preconditioned matrix are contained in the interval [0.0131, 5.0270]. Thus the real parts of the eigenvalues of both preconditioned matrices have very similar distributions. But there is an important difference when one looks at the imaginary parts of the eigenvalues. While the Pc-preconditioned matrix has only real eigenvalues (this follows from Theorem 2 in [9]), the GJ-preconditioned matrix has 288 eigenvalues with non-zero imaginary parts (these appear as "wings" near the origin in Fig. 7).

Fig. 6. Convergence history of relative residual norms for the SQMR solution of the 5 x 5 x 5 meshed footing problem (Fig. 1).

Based on the eigenvalue distribution, it is possible to get an estimate of the asymptotic convergence rate of the SQMR method when applied to the preconditioned linear system. For the Pc-preconditioned system, the convergence rate is roughly given by:

    ρ_Pc = (√κ - 1) / (√κ + 1) = 0.903                                       (11)

where κ (the condition number) = 5.0270/0.0131 ≈ 383.7. The estimation of the convergence rate associated with the eigenvalue distribution of the GJ-preconditioned system is less straightforward, but is achievable with the help of the Schwarz-Christoffel mapping for a polygonal region. The idea is to approximately enclose the eigenvalues in a polygonal region, and to find the Schwarz-Christoffel map Φ that maps the exterior of the polygonal region to the unit disk. Then the convergence rate for an eigenvalue distribution that has eigenvalues densely distributed throughout the polygon is given by:

    ρ_GJ = Φ(0)                                                              (12)

The polygonal region that we have used is shown in Fig. 8, where the extreme left vertex of the polygon coincides with the eigenvalue 0.0132. Based on that polygonal approximation, the estimated convergence rate is ρ_GJ = 0.958. To compute the mapping Φ, we used the highly user-friendly MATLAB software package developed by Driscoll [17].

We will now use the estimated asymptotic convergence rates to predict how the numbers of required preconditioned SQMR steps would differ. Recall that for a method with convergence rate ρ to achieve

    ||r^(i)||_2 / ||r^(0)||_2 < 10^(-6)                                      (13)

the number of steps i required is given by

    i ≥ -6 / log_10 ρ                                                        (14)
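The arithmetic behind Eqs. (11)-(14) is easy to verify; the short sketch below reproduces ρ_Pc ≈ 0.903 from the reported eigenvalue bounds and the estimated step ratio of about 2.38 (ρ_GJ = 0.958 is taken from the Schwarz-Christoffel computation reported above, not recomputed here):

import numpy as np

kappa = 5.0270 / 0.0131                       # condition number of the Pc-preconditioned matrix
rho_pc = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)   # Eq. (11), ~0.903
rho_gj = 0.958                                # polygonal Schwarz-Christoffel estimate

steps = lambda rho: -6.0 / np.log10(rho)      # Eq. (14): steps to reach a 1e-6 relative residual
print(f"rho_Pc = {rho_pc:.3f}")
print(f"estimated steps: Pc ~ {steps(rho_pc):.0f}, GJ ~ {steps(rho_gj):.0f}")
print(f"estimated step ratio = {np.log10(rho_pc) / np.log10(rho_gj):.2f}")   # ~2.38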

Fig. 7. Eigenvalue distribution of: (a) the GJ-preconditioned matrix and (b) the Pc-preconditioned matrix.

Fig. 8. A polygonal region that approximately contains the eigenvalues of the GJ-preconditioned matrix (real part on the horizontal axis, imaginary part on the vertical axis).

Thus the estimated ratio of the numbers of steps required for convergence of the GJ- and Pc-preconditioned iterations is given by log_10 ρ_Pc / log_10 ρ_GJ = 2.38. The actual ratio we observed is 192/105 = 1.83. Based on the estimated asymptotic convergence rates, there is a strong reason to believe that the significant reductions in iteration count shown in Fig. 3b carry over to problem sizes beyond those presented in this paper.

4. Conclusions

This paper compares the performance of the generalized Jacobi (GJ) preconditioner and the block constrained preconditioner (Pc) for solving large linear systems produced by finite-element discretization of Biot's consolidation equations on commonly available PC platforms. The GJ preconditioner is very cheap to construct and to apply in each iteration because it is a diagonal matrix. The Pc preconditioner uses the same 2 x 2 block structure as the coefficient matrix, but its (1, 1) block is replaced by a diagonal approximation. Due to its non-diagonal nature, it is obviously more costly to construct and to apply in each iteration. However, it is able to out-perform GJ in total runtime, primarily because of significant reductions in iteration count.

Note that GJ is already very efficient from an iteration count point of view. Numerical studies indicate that the iteration count for GJ decreases from about 3% of the problem dimension for the smaller 12 x 12 x 12 footing problem to only 1% of the problem dimension for the 24 x 24 x 24 footing problem. The Pc preconditioner reduces these iteration counts further, to about 50% and 35% of the GJ counts for the small and large footing problems, respectively. In other words, the iteration count for Pc constitutes only 1.3% and 0.3% of the problem dimension for the small and large footing problems, respectively. Based on the asymptotic convergence rates estimated from the eigenvalues of both preconditioned matrices, there is a strong reason to believe that these significant reductions in iteration count carry over to problem sizes beyond those presented in this paper. In addition, the iteration count for Pc is almost the same for the two material types studied, indicating better scaling with respect to the effective Young's modulus and hydraulic permeability.

The key disadvantage of the Pc preconditioner is the additional RAM required for its implementation. This paper presents the application of the Compressed Sparse Column (CSC) and Compressed Sparse Row (CSR) formats for efficient storage of the global sparse matrices appearing in the construction of Pc, together with pseudo-codes for the sparse matrix-vector multiplications and the computation of the sparse Schur complement. Using these sparse formats, numerical results show that overhead costs (construction of the Schur complement and sparse Cholesky factorization), time per iteration (triangular solves in the preconditioning step), and RAM usage grow only with powers of about 1.27, 1.11, and 1.08 of the DOFs, respectively, over the range from about 24,000 to 180,000 DOFs. A crude extrapolation to problem dimensions with one order of magnitude larger DOFs indicates that Pc is still practical and preferable over GJ.

Appendix A

Algorithm 1. Computation of y = B^T x, given the CSR format of B: ibr(m + 1), jbr(bnz), csrb(bnz). Suppose m and n are the numbers of rows and columns of B, respectively.

y = zeros(n, 1)
for i = 1:m
    if x(i) ≠ 0
        for k = ibr(i):ibr(i + 1) - 1
            r = jbr(k)
            y(r) = y(r) + x(i) * csrb(k)
        end
    end if
end

Algorithm 2. Computation of y = Bx, given the CSC format of B: jbc(n + 1), ibc(bnz), cscb(bnz). Suppose m and n are the numbers of rows and columns of B, respectively.


y = zeros(m, 1)
for j = 1:n
    if x(j) ≠ 0
        for k = jbc(j):jbc(j + 1) - 1
            r = ibc(k)
            y(r) = y(r) + x(j) * cscb(k)
        end
    end if
end
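Algorithms 1 and 2 are the standard CSR and CSC matrix-vector kernels; when the blocks are held as SciPy sparse matrices the same products are one-liners, which also provides a convenient correctness check for hand-written kernels (a sketch with a small random B, not the paper's data):

# Sketch: the products computed by Algorithms 1 and 2 expressed with SciPy
# sparse matrices (B_csr stored row-wise, B_csc column-wise), checked against
# dense arithmetic on a small random example.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
m, n = 50, 6
B_csr = sp.random(m, n, density=0.2, format="csr", random_state=0)
B_csc = B_csr.tocsc()

w = rng.standard_normal(m)
z = rng.standard_normal(n)

y1 = B_csr.T @ w        # Algorithm 1: y = B^T w (row-wise storage of B)
y2 = B_csc @ z          # Algorithm 2: y = B z   (column-wise storage of B)

assert np.allclose(y1, B_csr.toarray().T @ w)
assert np.allclose(y2, B_csr.toarray() @ z)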

Algorithm 3. Computation of G = B^T diag(K)^(-1) B + C, given the CSC and CSR formats of B and the CSC format of C. Let d be the diagonal of diag(K)^(-1). Allocate integer arrays rowG, colG and a double array nzG.

set w = zeros(m, 1), z = zeros(n, 1)
for j = 1:n
    (1) initialize w = zeros(m, 1), z = zeros(n, 1)
    (2) extract the jth column of B from the CSC format and put it in w
    (3) for i = 1:m, w(i) = w(i) * d(i), end
    (4) compute z = B^T w
    (5) scan through the vector z to add the jth column of C, and at the same time store the row and column indices and the values of the non-zero elements of z in the arrays rowG, colG, nzG
end
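For completeness, the sketch below strings the same pipeline together in SciPy: forming Ŝ = C + B^T diag(K)^(-1) B as in Algorithm 3, factorizing it once, and applying Pc^(-1) as in Eq. (6). It is illustrative only; SciPy's splu (a sparse LU with fill-reducing ordering) stands in for the sparse Cholesky/MMD routine used in the paper, and K, B, C are assumed to be available as sparse matrices.

# Sketch of the Pc preconditioner setup and application (Eqs. (5)-(6)):
#   S_hat = C + B^T diag(K)^{-1} B                    (Algorithm 3)
#   w = diag(K)^{-1} u,  z = S_hat^{-1} (B^T w - v),
#   Pc^{-1} [u; v] = [ diag(K)^{-1} (u - B z), z ].
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def build_pc_apply(K, B, C):
    dKinv = 1.0 / K.diagonal()                        # diag(K)^{-1}
    Dinv = sp.diags(dKinv)
    S_hat = (C + B.T @ (Dinv @ B)).tocsc()            # Schur complement of Pc
    S_fact = splu(S_hat)                              # factorize once, reuse every iteration
    m = K.shape[0]

    def apply_pc_inv(r):
        u, v = r[:m], r[m:]
        w = dKinv * u                                 # w = diag(K)^{-1} u
        z = S_fact.solve(B.T @ w - v)                 # forward/backward triangular solves
        return np.concatenate([dKinv * (u - B @ z), z])

    return apply_pc_inv

# apply_pc_inv can then be passed to a symmetric Krylov solver as the
# preconditioning step (e.g. wrapped in a LinearOperator).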

References

[1] Phoon KK, Toh KC, Chan SH, Lee FH. An efficient diagonal preconditioner for finite element solution of Biot's consolidation equations. Int J Numer Methods Eng 2002;55(4):377-400.
[2] Zienkiewicz OC, Vilotte JP, Toyoshima S, Nakazawa S. Iterative method for constrained and mixed approximation. An inexpensive improvement of FEM performance. Comput Methods Appl Mech Eng 1985;51:3-29.
[3] Freund RW, Nachtigal NM. A new Krylov-subspace method for symmetric indefinite linear systems. In: Proceedings of the 14th IMACS World Congress on Computational and Applied Mathematics, Atlanta, USA, 1994. p. 1253-6.
[4] Phoon KK, Chan SH, Toh KC, Lee FH. Fast iterative solution of large undrained soil-structure interaction problems. Int J Numer Anal Methods Geomech 2003;27(3):159-81.
[5] Murphy MF, Golub GH, Wathen AJ. A note on preconditioning for indefinite linear systems. SIAM J Sci Comput 2000;21(6):1969-72.
[6] Huckle T. Approximate sparsity patterns for the inverse of a matrix and preconditioning. Appl Numer Math 1999;30:291-303.
[7] Lipitakis EA, Gravvanis GA. Explicit preconditioned iterative methods for solving large unsymmetric finite element systems. Computing 1995;54(2):167-83.
[8] Gravvanis GA. Generalized approximate inverse finite element matrix techniques. Neural Parallel Sci Comput 1999;7(4):487-500.
[9] Toh KC, Phoon KK, Chan SH. Block preconditioners for symmetric indefinite linear systems. Int J Numer Methods Eng, in press.
[10] Van der Vorst H. Iterative methods for large linear systems. Cambridge: Cambridge University Press; 2003.
[11] Saad Y. Iterative methods for sparse linear systems. Boston: PWS Publishing Company; 1996.
[12] Daydé MJ, L'Excellent JY, Gould NIM. Element-by-element preconditioners for large partially separable optimization problems. SIAM J Sci Comput 1997;18(6):1767-87.
[13] George A, Liu J. Computer solution of large sparse positive definite systems. Englewood Cliffs, NJ: Prentice Hall; 1981.
[14] Lee LQ, Siek JG, Lumsdaine A. Generic graph algorithms for sparse matrix ordering. In: Third International Symposium ISCOPE 99. Lecture Notes in Computer Science, vol. 1732. Springer-Verlag; 1999.
[15] Ng EG, Peyton BW. Block sparse Cholesky algorithms on advanced uniprocessor computers. SIAM J Sci Comput 1993;14:1034-56.
[16] Barrett R, Berry M, Chan T, Demmel J, Donato J, Dongarra J, et al. Templates for the solution of linear systems: building blocks for iterative methods. Philadelphia: SIAM Press; 1994.
[17] Driscoll TA. A MATLAB toolbox for Schwarz-Christoffel mapping. ACM Trans Math Software 1996;22:168-86.