MULTIGRID SMOOTHERS FOR ULTRA-PARALLEL COMPUTING: ADDITIONAL THEORY AND DISCUSSION

ALLISON H. BAKER∗, ROBERT D. FALGOUT∗, TZANIO V. KOLEV∗, AND ULRIKE MEIER YANG∗

Abstract. This paper investigates the properties of smoothers in the context of algebraic multigrid (AMG) running on parallel computers with potentially millions of processors. The development of multigrid smoothers in this case is challenging because some of the best relaxation schemes, such as the Gauss-Seidel (GS) algorithm, are inherently sequential. Based on the sharp two-grid multigrid theory from [17, 18], we characterize the smoothing properties of a number of practical candidates for parallel smoothers, including several C-F, polynomial, and hybrid schemes. We show, in particular, that the popular hybrid GS algorithm has multigrid smoothing properties which are independent of the number of processors in many practical applications, provided that the problem size per processor is large enough. This is encouraging news for the scalability of AMG on ultra-parallel computers. We also introduce the more robust ℓ1 smoothers, which are always convergent and have already proven essential for the parallel solution of some electromagnetic problems [23].

1. Introduction. Multigrid (MG) linear solvers are optimal methods because they require O(N) operations to solve a sparse system with N unknowns. Consequently, multigrid methods have good scaling potential on parallel computers, since we can bound the work per processor as the problem size and number of processors are proportionally increased (weak scaling). Near-ideal weak scaling performance has been demonstrated in practice. For example, the algebraic multigrid (AMG) solver BoomerAMG [20] in the hypre software library [21] has been shown to run effectively on more than 125 thousand processors [16, 5].

One critical component of MG is the smoother, a simple iterative method such as Gauss-Seidel (GS). In the classical setting, the job of the smoother is to make the underlying error smooth so that it can be approximated accurately and efficiently on a coarser grid. More generally, the smoother must eliminate error associated with large eigenvalues of the system, while the coarse-grid correction eliminates the remaining error associated with small eigenvalues. Some of the best smoothers do not parallelize well, e.g., lexicographical GS. Others used today, while effective on hundreds of thousands of processors, still show some dependence on parallelism and may break down on the millions of processors expected in next-generation machines (we use the term processor here in a generic sense, and distinguish it from cores only when necessary). One such smoother is the hybrid GS smoother used in BoomerAMG, which applies GS independently on each processor and updates in a Jacobi-like manner on processor boundaries. In practice, hybrid GS is effective on many problems. However, because of its similarity to a block Jacobi method, there is no assurance of obtaining the good convergence of lexicographical GS. In fact, hybrid GS may perform poorly or even diverge on certain problems, and its scalability has often been cited as a concern as the number of blocks increases with increasing numbers of processors or as block sizes decrease (see, e.g., [1, 15, 31]). For these reasons, previous papers have studied alternatives such as using polynomial smoothers [1] or calculating weighting parameters for hybrid GS [32]. Yet despite its shortcomings, hybrid GS remains the default option in hypre because of its overall efficiency and robustness.

∗ Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, P.O. Box 808, L-561, Livermore, CA 94551. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-TR-489114).


Therefore, one of the main purposes of this paper is to better understand the potential of block smoothers like hybrid GS on millions of processors. We show that these hybrid smoothers can in fact exhibit good smoothing properties independent of parallelism, as long as the blocks satisfy certain properties (e.g., the blocks have some minimal size).

There are many other well-known smoothers that exhibit parallelism-independent smoothing properties. In particular, methods like weighted Jacobi (both pointwise and blockwise), red/black GS, Chebyshev, and Krylov-based polynomial methods have been extensively studied in classical works such as [12, 29, 19, 6]. In practice, each of these methods has its drawbacks. For example, weighted Jacobi requires the estimation of an ideal weight [32], and Chebyshev involves estimating an eigenvalue interval [1]. For multi-colored GS [1], the number of parallel communications required per iteration is proportional to the number of colors, hence it tends to be slow, especially on coarser grids in AMG where the number of colors is difficult to control. Therefore, the secondary purpose of this paper is to study and identify smoothers that are practical for AMG in the context of millions of processors. To this end, we analyze a variety of candidates for smoothing, revisiting some of the classics as well, under a common framework based on the recent two-grid theory in [17, 18]. Numerical results complementing the theory can be found in [4].

The structure of the paper is as follows. In Section 2, we introduce our approach for doing smoothing analysis in general, and we then analyze several specific classes of smoothers in Section 3 through Section 6, including C-F, polynomial, hybrid, and ℓ1 smoothers. We make concluding remarks in Section 7.

2. Smoothing Analysis. Our smoothing analysis is based on the two-grid variational multigrid theory from [17], which was developed for general relaxation and coarsening processes. In this section, we first summarize this theory and then describe our general approach for applying it to smoother analysis. We represent the standard Euclidean inner product by ⟨·, ·⟩, with associated norm ‖·‖ := ⟨·, ·⟩^{1/2}. The A-norm (or energy norm) is defined by ‖·‖_A := ⟨A·, ·⟩^{1/2} for vectors, and as the corresponding induced operator norm for matrices.

Consider solving the linear system of equations

(2.1)    Au = f,

where u, f ∈ R^n and A is a symmetric positive definite (SPD) matrix. Define the smoother (relaxation) error propagator by

(2.2)    I − M^{-1}A,

and assume that the smoother is convergent (in the energy norm ‖·‖_A), i.e., assume that M^T + M − A is SPD. Note that we often refer to the matrix M as the smoother. Denote the symmetrized smoother by

(2.3)    M̃ = M^T (M^T + M − A)^{-1} M,

so that I − M̃^{-1}A = (I − M^{-1}A)(I − M^{-T}A). Let P : R^{n_c} → R^n be the interpolation (or prolongation) operator, where R^{n_c} is some lower-dimensional (coarse) vector space of size n_c. The two-grid multigrid error transfer operator with no post-smoothing steps is then given by

(2.4)    E_TG = (I − P(P^T A P)^{-1} P^T A)(I − M^{-1}A),

where P^T is the restriction operator and A_c = P^T A P is the Galerkin coarse-grid operator. Note that coarse-grid correction involves an A-orthogonal projection onto range(P). Let R : R^n → R^{n_c} be any matrix for which RP = I_c, the identity on R^{n_c}, so that PR is a projection onto range(P). We can think of R as defining the coarse-grid variables, i.e., u_c = Ru. Also, let S : R^{n_s} → R^n be any full-rank matrix for which RS = 0, where n_s = n − n_c. Here, the unknowns u_s = S^T u are analogous to the fine-grid-only variables (i.e., F-points) in AMG. In addition, R and S form an orthogonal decomposition of R^n: any e can be expressed as e = Se_s + R^T e_c for some e_s and e_c. The next theorem summarizes one of the main convergence results in [17].

Theorem 2.1. (see Theorem 2.2 in [17])

(2.5)    ‖E_TG‖²_A ≤ 1 − 1/K,    where    K = sup_e ‖(I − PR)e‖²_{M̃} / ‖e‖²_A ≥ 1.

Theorem 2.1 gives conditions that P must satisfy in order to achieve a fast, uniformly convergent multigrid method. It is clear that to make K small, eigenvectors of A belonging to small eigenvalues must either be interpolated accurately by P or else attenuated efficiently by the smoother (since the denominator is small for these eigenvectors). For brevity, we refer to these as small eigenvectors. The choice of which small eigenvectors to eliminate by smoothing and which to eliminate by coarse-grid correction depends on the "localness" of the modes. Essentially, modes that can be eliminated by a local process (i.e., one that is equivalent to applying an operator with a sparse nonzero structure comparable to that of A) should be handled by the smoother.

2.1. Smoothing Analysis with Ideal Interpolation. One approach for using the above theory to do smoothing analysis is to consider the best K in Theorem 2.1 by substituting, for a given R, the P that minimizes

(2.6)    K_⋆ = inf_{P : RP = I_c}  sup_e  ‖(I − PR)e‖²_{M̃} / ‖e‖²_A.

The following theorem evaluates this inf-sup problem.

Theorem 2.2. (see Theorem 3.1 in [17]) Assume that R, S, and P satisfy RS = 0 and RP = I_c as above. Then K_⋆ in (2.6) is given by

(2.7)    K_⋆ = sup_{e_s} ⟨S^T M̃ S e_s, e_s⟩ / ⟨S^T A S e_s, e_s⟩ = 1 / λ_min((S^T M̃ S)^{-1}(S^T A S)),

and the corresponding minimizer is

(2.8)    P_⋆ = (I − S(S^T A S)^{-1} S^T A) R^T.

Equation (2.8) defines the so-called ideal interpolation operator. Notice that, if K_⋆ is uniformly bounded with respect to parameters such as the mesh spacing, then using P_⋆ as the interpolation operator results in a uniformly convergent two-grid method. Since the inverse of S^T A S may not be sparse, this is generally not a good practical choice for interpolation. However, it is reasonable to use P_⋆ (and hence K_⋆) to analyze smoothing.

We will consider two settings in the analysis that follows, depending on the particular smoother. The first is the classical AMG setting, where the coarse-grid variables Ru are a subset of the fine-grid variables:

(2.9)    R^T = [0; I_c],    S = [I_f; 0],    P_⋆ = [−A_ff^{-1} A_fc; I_c]

(blocks stacked by rows, with the F-point block first). The second setting corresponds more closely to the classical smoothing factor analysis [12], where the coarse-grid variables span the space of the n_c "smallest" eigenvectors of A (they do not have to strictly be the smallest, as we discuss later):

(2.10)    R^T = [v_1, . . . , v_{n_c}];    S = [v_{n_c+1}, . . . , v_n];    P_⋆ = R^T.
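For small model problems, the quantities above can be evaluated numerically. The following is a minimal Python sketch (illustrative only; the function names, the model matrix, and the Gauss-Seidel choice are our own assumptions, not code from [17] or hypre) that computes K_⋆ from (2.7) in the C-F setting (2.9) for a 1D Laplacian:

```python
import numpy as np

def k_star(A, M, S):
    """K* = 1 / lambda_min((S^T Mtilde S)^{-1} (S^T A S)); cf. (2.3) and (2.7)."""
    Mtilde = M.T @ np.linalg.solve(M.T + M - A, M)      # symmetrized smoother (2.3)
    SMS = S.T @ Mtilde @ S
    SAS = S.T @ A @ S
    lam = np.linalg.eigvals(np.linalg.solve(SMS, SAS))  # generalized eigenvalues
    return 1.0 / lam.real.min()

# 1D Laplacian; C-points at even indices, F-points at odd indices, cf. (2.9).
n = 64
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f_pts = np.arange(1, n, 2)
S = np.zeros((n, f_pts.size))
S[f_pts, np.arange(f_pts.size)] = 1.0                   # columns select the F-points
M_gs = np.tril(A)                                       # lexicographic Gauss-Seidel
print("K* for GS on the 1D Laplacian:", k_star(A, M_gs, S))
```

Replacing S by a matrix of eigenvectors as in (2.10), or M_gs by one of the smoothers analyzed below, gives the other settings considered in this paper; numbers in the spirit of Table 5.1 can be produced in this way.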

2.2. Comparative Smoothing Analysis. Direct evaluation of K_⋆ in (2.6) is not always straightforward. However, one useful technique that we use below is to compare the K_⋆ for one smoother to that of another with well-known smoothing properties (e.g., Gauss-Seidel). Writing K = K(M) in (2.5) as a function of the smoother (similarly for K_⋆), we articulate this approach in the next lemma.

Lemma 2.3. Suppose that M_1 and M_2 are two convergent smoothers for A that satisfy

(2.11)    ⟨M̃_1 x, x⟩ ≤ c ⟨M̃_2 x, x⟩

for all x, with a fixed constant c. Then, for any choice of the interpolation operator in the two-grid multigrid method, we have that K(M_1) ≤ cK(M_2), and in particular, K_⋆(M_1) ≤ cK_⋆(M_2). In other words, multigrid methods using M_1 and M_2 will have comparable parallel scalability properties, provided c is independent of the problem size and the number of processors. Therefore, when (2.11) holds, we say that M_1 has multigrid smoothing properties comparable to M_2.

Proof. The proof follows immediately from (2.5) and (2.6).

Remark 2.1. Note that the above result can also be analogously stated in terms of the sharp two-grid theory of [18], since we can write the constant K_♯ in that theory as

K_♯ = sup_e ‖(I − π_{M̃})e‖²_{M̃} / ‖e‖²_A = sup_{v ∈ range(I − π_A)}  inf_{w : v = (I − π_A)w}  ⟨M̃ w, w⟩ / ⟨Av, v⟩,

where π_X = P(P^T X P)^{-1} P^T X denotes the X-orthogonal projection onto range(P) for any SPD matrix X.

In some cases, we can directly determine the constant c in Lemma 2.3. However, we can also bound c in terms of a few general, yet insightful, constants, as shown in the theorem below. First, we state a useful lemma, which is of general interest.

Lemma 2.4. Suppose that A is SPD and B is arbitrary. Then

⟨Ax, x⟩ ≤ c⟨Bx, x⟩    implies    ⟨B^{-1}x, x⟩ ≤ c⟨A^{-1}x, x⟩.

Proof. Note that by the given inequality B is invertible, and B^{-1} + B^{-T} is SPD. Using Cauchy-Schwarz and the assumption above, we have

⟨B^{-1}x, x⟩² = ⟨A^{1/2}B^{-1}x, A^{-1/2}x⟩² ≤ ⟨AB^{-1}x, B^{-1}x⟩ ⟨A^{-1}x, x⟩ ≤ c ⟨B^{-1}x, x⟩ ⟨A^{-1}x, x⟩.

Dividing both sides by ⟨B^{-1}x, x⟩ gives the desired result.

The above lemma implies, in particular, that if B is positive definite, i.e., its symmetric part σ(B) = (B^T + B)/2 is SPD, then ⟨B^{-1}x, x⟩ ≤ ⟨σ(B)^{-1}x, x⟩. This inequality has appeared previously and can be found, for example, in [2], Lemma 3.5.

Theorem 2.5. Suppose that M_1 and M_2 are two convergent smoothers. Then,

(2.12)    K(M_1) ≤ (2δΔ² / (2 − ω)) K(M_2),

where Δ, ω, and δ are given by

(2.13)    Δ = ‖σ(M_1)^{-1/2} M_1 σ(M_1)^{-1/2}‖,    ω = λ_max(σ(M_1)^{-1}A),    δ = sup_v ⟨M_1 v, v⟩ / ⟨M_2 v, v⟩.

This also holds with K replaced by K_⋆ and with δ replaced by δ_s = sup_v ⟨S^T M_1 S v, v⟩ / ⟨S^T M_2 S v, v⟩.

Proof. From Lemma 2.3 in [17], and since ⟨σ(M)x, x⟩ = ⟨Mx, x⟩, we have

K(M_1) ≤ (Δ² / (2 − ω)) K_σ(M_1) ≤ (δΔ² / (2 − ω)) K_σ(M_2),

where

K_σ(M) = sup_e ‖(I − PR)e‖²_{σ(M)} / ‖e‖²_A.

Since A is SPD, from (2.3) we have

⟨M̃^{-1}x, x⟩ = ⟨(M^{-1} + M^{-T} − M^{-1}AM^{-T})x, x⟩ ≤ 2⟨M^{-1}x, x⟩.

Hence, Lemma 2.4 implies that

(2.14)    ⟨Mx, x⟩ ≤ 2⟨M̃x, x⟩,

which completes the proof of (2.12). The result for K_⋆ follows similarly from (2.7) and the definition of δ_s.

The quantity Δ in (2.13) measures the deviation of M from its symmetric part, while ω ∈ (0, 2) should be bounded away from two. In particular, ω ≤ 1 is equivalent to

(2.15)    ⟨Ax, x⟩ ≤ ⟨Mx, x⟩,

for all x. When M is symmetric, this is a seemingly natural multigrid smoother condition, since it implies that I − M^{-1}A will damp the (high-frequency) components of the error corresponding to the large eigenvalues of M^{-1}A. This is in contrast with the condition ⟨Ax, x⟩ ≤ 2⟨Mx, x⟩, which is equivalent to M being convergent but allows ω to be close to 2, leading to minimal damping of the corresponding eigenvector. The difference is clearly illustrated in the case of Richardson's smoother M = rI, where r = (λ_min + λ_max)/2 is optimal in terms of convergence, but r = λ_max has significantly better smoothing properties. An inequality like (2.15) also holds when M is not symmetric, in the sense that for any symmetrized smoother M̃ we have

(2.16)    ⟨Ax, x⟩ ≤ ⟨M̃x, x⟩.

This can be seen in a couple of ways, for example, by introducing the SPD matrix D_M = M + M^T − A and noting that

M̃ = M^T D_M^{-1} M = (D_M + A − M) D_M^{-1} (D_M + A − M^T) = A + (A − M) D_M^{-1} (A − M^T).

In particular, (2.15) holds for M defined as two sweeps of any convergent symmetric smoother, such as Jacobi for diagonally dominant and irreducible A. Since two sweeps of Jacobi is really no better as a smoother than one sweep, this example also illustrates the fact that ω alone is not, in general, a good measure of smoothing properties.

2.3. Historical Notes on Smoothing Analysis. Our approach for analyzing smoothers has many similarities with previous approaches. As mentioned in Section 2.1, the idea of measuring (or bounding) the two-grid convergence factor by assuming an ideal interpolation operator is essentially what is done in the classical smoothing factor analysis introduced in [12]. The smoothing factor measures the effectiveness of relaxation on the oscillatory Fourier modes, which is motivated by the assumption that interpolation (our ideal interpolation) eliminates the smooth Fourier modes. An important aspect of this approach is that it is explicitly tied to the (ideal) coarse-grid correction.

The approach described in Section 2.2 is similar to most other smoother analyses, where either weighted Richardson or Jacobi relaxation is used for M_2 in Lemma 2.3 [19, 8, 9, 26, 27, 25, 28, 10, 11]. A general comparison lemma was stated in [25]. One limitation of this approach is that coarse-grid correction is not explicitly taken into account, so in cases such as Maxwell's equations, care must be taken to compare with a suitable smoother. For example, a number of multilevel smoothing conditions for multigrid were considered in Appendix B of [11]. The first smoothing condition there, (SM.1), combines (2.16) with a comparative condition of the form (2.11), where M_2 is a Richardson smoother. Since (SM.1) is the only requirement for the smoother (on each level) in results such as the classical Braess-Hackbusch Theorem 3.1 in [11], it is reasonable to expect that any analysis based on Lemma 2.3 will be applicable to the full multilevel multigrid algorithm (even though (2.11) was motivated by a two-grid theory). Another condition from [11] is (SM.2), which is a weighted version of (2.15). We note that (2.15) is not a new condition, and has been imposed on symmetric smoothers in previously published theories, e.g., [22].

3. The C-F Smoother. In this section, we apply the smoothing analysis theory from the previous section to the so-called C-F smoother. C-F smoothing corresponds

to applying an AMG smoother first to the coarse points (C-points) and then to the fine points (F-points). That C-F smoothers can be effective in practice is evident if one considers, for example, that C-F smoothing with Gauss-Seidel on a structured grid is equivalent to red-black Jacobi. More formally, the C-F smoother is defined by

(3.1)    I − M_CF^{-1} A;    M_CF = [M_ff, A_fc; 0, M_cc]

(a block two-by-two matrix, with block rows separated by semicolons). This smoother converges if and only if the following are convergent:

I_f − M_ff^{-1} A_ff;    I_c − M_cc^{-1} A_cc.

Therefore, one can consider using any of the convergent smoothers discussed in the following sections as the M_ff and M_cc matrices of a C-F smoother. This is typically advantageous since the principal submatrices A_ff and A_cc have better properties than A in terms of conditioning and diagonal dominance. The following theorem shows that C-F smoothing is good if F-relaxation converges quickly.

Theorem 3.1. Define S as in (2.9). Then K_⋆ in (2.6) for the C-F smoother satisfies

K_⋆ = 1 / (1 − ϱ_f²);    ϱ_f = ‖I_f − M_ff^{-1} A_ff‖_{A_ff}.

Proof. Similarly to (2.3), define

(3.2)    M̃_ff = M_ff^T (M_ff^T + M_ff − A_ff)^{-1} M_ff.

From (2.3) and the definition of M_CF above, we have

M̃ = [M_ff^T, 0; A_cf, M_cc^T] · diag{(M_ff^T + M_ff − A_ff)^{-1}, (M_cc^T + M_cc − A_cc)^{-1}} · [M_ff, A_fc; 0, M_cc],

and therefore S^T M̃ S = M̃_ff. This implies, by Theorem 2.2,

K_⋆ = 1 / λ_min(M̃_ff^{-1} A_ff) = 1 / (1 − λ_max[(I − M_ff^{-1} A_ff)(I − M_ff^{-T} A_ff)]).

Let E_ff = I_f − M_ff^{-1} A_ff and let ρ(·) denote the spectral radius of a matrix. Then, using the definition of ϱ_f and the fact that ‖B‖ = ‖B^T‖ for any matrix B, we have

ϱ_f² = ‖E_ff‖²_{A_ff} = ‖A_ff^{1/2} E_ff A_ff^{-1/2}‖² = ‖A_ff^{-1/2} E_ff^T A_ff^{1/2}‖²
     = ρ(A_ff^{1/2} E_ff A_ff^{-1} E_ff^T A_ff^{1/2}) = ρ(E_ff A_ff^{-1} E_ff^T A_ff)
     = ρ[(I − M_ff^{-1} A_ff)(I − M_ff^{-T} A_ff)],

which completes the proof.

From the above, we see that C-F smoothing is a natural smoother to use when coarse grids are selected based on compatible relaxation (CR) [13, 17], because ϱ_f is estimated as part of the CR coarsening algorithm.
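As a concrete illustration (a minimal sketch under our own naming and setup assumptions, not code from the paper), one C-F sweep with pointwise Gauss-Seidel used for both the C- and F-relaxation can be written as follows; on a structured 1D grid with C at the even points and F at the odd points this is exactly the red-black ordering mentioned above:

```python
import numpy as np

def cf_gauss_seidel_sweep(A, b, x, c_pts, f_pts):
    """One C-F sweep as in (3.1): relax the C-points first, then the F-points."""
    for pts in (c_pts, f_pts):
        for i in pts:
            # pointwise Gauss-Seidel update using the most recent values
            x[i] += (b[i] - A[i, :] @ x) / A[i, i]
    return x

# usage on a 1D Laplacian
n = 32
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b, x = np.ones(n), np.zeros(n)
c_pts, f_pts = np.arange(0, n, 2), np.arange(1, n, 2)
for _ in range(5):
    x = cf_gauss_seidel_sweep(A, b, x, c_pts, f_pts)
```

Any of the convergent smoothers discussed in the following sections could replace the pointwise updates here as the M_ff and M_cc blocks of (3.1).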

4. Polynomial Smoothers. Polynomial smoothers are of practical interest for parallel computing for a couple of reasons. First, their application requires only the matrix-vector multiply routine, which is often highly optimized on modern parallel machines. Second, they are unaffected by the parallel partitioning of the matrix, the number of parallel processes, and the ordering of the unknowns. However, as mentioned previously, one drawback is the need to calculate eigenvalue estimates. Unlike the smoothed aggregation variant of AMG, classical AMG does not otherwise need eigenvalue estimates, so this computational cost is extra. We now apply the smoothing analysis from Section 2 to polynomial smoothers.

Let p_ν(x) be a polynomial of degree ν ≥ 0 such that p_ν(0) = 1, and consider the smoother

(4.1)    I − M^{-1}A = p_ν(A).

The following theorem gives conditions for a good polynomial smoother.

Theorem 4.1. Let A = V Λ V^T be the eigen-decomposition of A, with eigenvectors v_k and associated eigenvalues λ_k, and define S as in (2.10). Then K_⋆ in (2.6) for the polynomial smoother satisfies

K_⋆ = (1 − max_{k>n_c} p_ν(λ_k)²)^{-1}.

Minimizing K_⋆ over all p_ν, we have

min_{p_ν} K_⋆ ≤ (1 − (min_{p_ν} max_{x∈[α,β]} |p_ν(x)|)²)^{-1};    α ≤ λ_{n_c+1} ≤ λ_n ≤ β.

Proof. Order the eigenvectors in V so that we can write S = V S_i, S_i = [I_s, 0]^T. Then, since I − M̃^{-1}A = (I − M^{-1}A)(I − M^{-T}A), we have

S^T M̃ S = S^T (A^{-1} − (I − M^{-1}A) A^{-1} (I − M^{-1}A)^T)^{-1} S
         = S_i^T V^T (A^{-1} − p_ν(A)² A^{-1})^{-1} V S_i
         = S_i^T (Λ^{-1} − p_ν(Λ)² Λ^{-1})^{-1} S_i
         = (Λ_s^{-1} − p_ν(Λ_s)² Λ_s^{-1})^{-1}.

Since S^T A S = Λ_s, then

(S^T M̃ S)^{-1}(S^T A S) = I_s − p_ν(Λ_s)²,

and the first result follows from Theorem 2.2. The second result follows trivially from the first, since we are maximizing over a larger set [α, β] containing λ_k, k > n_c.

In the following two subsections, we first discuss the optimal polynomial smoother according to Theorem 4.1 and then briefly overview several other choices of polynomials that may also be good smoothers for AMG in practice.

4.1. Chebyshev Smoothers. The min-max problem in Theorem 4.1 has a classical solution q_ν(x) in terms of Chebyshev polynomials (see, e.g., [2]). Let T_k(t) be the Chebyshev polynomial of degree k defined by the recursion

(4.2)    T_0(t) = 1;    T_1(t) = t;    T_k(t) = 2t T_{k−1}(t) − T_{k−2}(t),  k = 2, 3, . . .

By letting t = cos(ξ) ∈ [−1, 1], it is easy to show that the explicit form of these polynomials is T_k(t) = cos(kξ). The polynomial q_ν(x) is given by

(4.3)    q_ν(x) = T_ν((β + α − 2x)/(β − α)) / T_ν((β + α)/(β − α)),

and has the required property that q_ν(0) = 1. It also satisfies −1 < q_ν(x) < 1 for x ∈ (0, β], which implies that the smoother (4.1) with p_ν = q_ν is convergent as long as the spectrum of A is contained in the interval (0, β]. To show the above inequality with α, β > 0, observe that the Chebyshev polynomial T_ν(x) equals 1 for x = 1 and is strictly monotonically increasing for x > 1 (see, e.g., (5.28) in [2]). Therefore, x ∈ [α, β] implies T_ν((β + α)/(β − α)) > 1 ≥ |T_ν((β + α − 2x)/(β − α))|, while |q_ν(x)| < 1 for x ∈ (0, α] due to (β + α)/(β − α) > (β + α − 2x)/(β − α) ≥ 1.

Since K_⋆ is a measure of the smoothing properties of the smoother (4.1), Theorem 4.1 shows that a good choice for polynomial smoothing is q_ν(x), where the interval [α, β] contains the "large" eigenvalues of A. The upper bound β can easily be estimated using a few iterations of conjugate gradient (CG), but choosing a suitable α is not obvious in general. It is clear that α depends on the coarse-grid size, but it should also depend on the distribution of eigenvalues for the problem and possibly even the nature of the associated eigenvectors. To see this, consider a simple Laplace example on a unit domain discretized by standard finite differences. Assume full coarsening so that n_c/n = 1/2^d, where d is the dimension. We discuss three possible choices for α below.

First, note that the analysis above does not require that R be made up of the strictly smallest eigenvectors of A. Consider instead that R contains the smooth Fourier modes used in standard local Fourier analysis. In this case, it is easy to see from standard Fourier diagrams that α should be chosen such that

(4.4)    α/β = 1/2 (1D),    1/4 (2D),    1/6 (3D).

The resulting Chebyshev polynomial smoothers were first derived almost 30 years ago in [29]. Now consider letting R contain the actual n_c smallest eigenvectors for the Laplace equation. Using Matlab, we get the estimates

(4.5)    α/β ≈ 0.5 (1D),    0.32 (2D),    0.28 (3D).

Consider again letting R contain the actual n_c smallest eigenvectors, but assume that the eigenvalues are distributed uniformly. Then, we have

(4.6)    α/β = 1/2 (1D),    1/4 (2D),    1/8 (3D).

In practice, we set β by estimating λ_max with several iterations of CG and set α = aβ for some fraction 0 ≤ a ≤ 1. We use a = 0.3 in the numerical experiments in [4]. A similar approach is used in [1], but with a smaller a = 1/30. It is not vital to estimate λ_min unless it is large. In that case, no coarse grid is needed, and the smoother should damp all eigenvectors equally well, i.e., α should approximate λ_min.
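For concreteness, the following minimal sketch (our own illustrative code and parameter choices, not hypre's implementation) performs ν Chebyshev smoothing steps on the interval [aβ, β] using the standard three-term recurrence; in exact arithmetic its error propagator is the polynomial q_ν(A) of (4.3) with [α, β] = [aβ, β]. The power iteration used to estimate λ_max is a simple stand-in for the few CG iterations mentioned above:

```python
import numpy as np

def estimate_lambda_max(A, iters=20, seed=0):
    """Crude largest-eigenvalue estimate by power iteration (a CG-based
    estimate, as described in the text, would be used in practice)."""
    v = np.random.default_rng(seed).standard_normal(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return float(v @ (A @ v))

def chebyshev_smooth(A, b, x, nu=2, a=0.3, beta=None):
    """nu Chebyshev smoothing steps targeting the interval [a*beta, beta]."""
    if beta is None:
        beta = 1.1 * estimate_lambda_max(A)   # safer to overestimate lambda_max
    alpha = a * beta
    theta, delta = 0.5 * (beta + alpha), 0.5 * (beta - alpha)
    sigma = theta / delta
    rho = 1.0 / sigma
    r = b - A @ x                             # current residual
    d = r / theta
    for _ in range(nu):
        x = x + d
        r = r - A @ d
        rho_new = 1.0 / (2.0 * sigma - rho)
        d = rho_new * rho * d + (2.0 * rho_new / delta) * r
        rho = rho_new
    return x
```

The default a = 0.3 mirrors the choice used in the numerical experiments in [4]; nothing else about the routine is specific to this paper.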

4.2. Other Polynomial Smoothers. Although the above theory leads naturally to the Chebyshev polynomial in (4.3), there are several other polynomials in the literature that are also good smoothers. We briefly summarize some of the most notable here.

A smoother related to the Chebyshev polynomial in (4.3) is the following shifted and scaled Chebyshev polynomial used in the AMLI method [3]:

(4.7)    q_ν^+(x) = (1 + T_ν((β + α − 2x)/(β − α))) / (1 + T_ν((β + α)/(β − α))).

This has the required property that q_ν^+(0) = 1, but satisfies 0 < q_ν^+(x) < 1 for x ∈ (0, β]. This implies that (2.15) holds, a sometimes desirable property for smoothers.

Another polynomial smoother of interest is used in both the smoothed aggregation (SA) and cascadic multigrid methods [7, 14, 30], and is given by

(4.8)    φ_ν(x) = (−1)^ν (1/(2ν + 1)) (√β/√x) T_{2ν+1}(√x/√β).

Note that (4.8) does not require the estimation of α. It can be shown that φ_ν is the minimizer of

(4.9)    min_{p_ν} max_{x∈[0,β]} |√x p_ν(x)|.

The weak approximation property in (2.5) shows that coarse-grid correction must eliminate eigenvectors with accuracy proportional to the square root of their associated eigenvalue. The √x term in (4.9) serves the role of coarse-grid correction, so the polynomial φ_ν has a certain optimality with respect to the weak approximation property. However, (4.9) does not account for the fact that coarse-grid correction only operates on a subspace of size n_c, so the resulting smoother φ_ν does not damp the largest eigenvectors (i.e., those not damped at all by coarse-grid correction) as much as it otherwise would. It could be modified to satisfy (4.9) over the interval [α, β] to improve its properties as a smoother, but it is not clear that this is better than using the Chebyshev polynomial in (4.3). Note that the polynomial φ_ν makes perfect sense for smoothing a tentative interpolation operator, which is its primary purpose. The MLS smoother in [1] is the product of φ_ν and a complementary (post) smoother of the form

I − (ω / λ_max(φ_ν² A)) φ_ν² A.

It has better overall smoothing properties than φ_ν alone, and it is particularly advantageous when using aggressive coarsening.

The polynomial smoother in [24] minimizes an equation like (4.9) over the interval [α, β], but with √x in the equation replaced by 1/x. This means that the amplitude of the polynomial increases over the interval [α, β]. The polynomial is computed through a three-term recurrence.

The conjugate gradient method is also a good smoother [6]. Note that it converges to the Chebyshev polynomial in (4.3), but over the entire eigenvalue interval [α, β] = [λ_min, λ_max]. Even though this is not a good value for α, it is only relevant asymptotically; for small ν, CG has good smoothing properties. Other Krylov methods such as conjugate residual (also called minimum residual or MINRES) typically have good smoothing properties as well [6].

[Fig. 4.1 (plot not reproduced). Various polynomials of order two (left) and four (right). The CG polynomial was generated by solving a 2D Laplace problem on a 25 × 25 grid with a random initial error. The Chebyshev polynomial (4.3) uses a = 0.3. The SA polynomial is given by (4.8).]

In Figure 4.1 we plot several polynomials over the eigenvalue interval. Focusing on the fourth-order figure, note that the polynomial tails for x > β = 8 turn up steeply. For this reason, it is important not to underestimate λ_max in practice. Note also that the CG polynomial closely approximates Chebyshev. As previously mentioned, the SA polynomial does not damp the large eigenvectors as well as the others. The Richardson polynomial is given by (I − λ_max^{-1} A)^ν. We include it in the figure because it is the simplest smoother to understand and it is used in most classical smoothing analysis. In the interest of keeping the figure readable, we do not plot all of the polynomials in this section. Note, however, that they all have good smoothing properties, with mostly minor differences between them as noted in the text.

5. The Hybrid Smoother. The class of so-called hybrid smoothers can be viewed as the result of the straightforward parallelization of a smoother. For example, the easiest parallelization of GS is to have each process independently use GS on its domain and then exchange information with neighboring processors after each iteration, resulting in a Jacobi-like update at the processor boundaries. As noted in Section 1, hybrid smoothers, hybrid GS in particular, are of interest because they are easy to implement and often quite effective in practice, even though convergence may not be guaranteed. In this section, we first formally define hybrid smoothers and apply the smoothing analysis theory from Section 2. We then discuss two particular hybrid smoothers, hybrid GS and block Jacobi, in more detail, and, finally, we discuss the use of weights with hybrid smoothers.

We define the hybrid smoother to be essentially an inexact block Jacobi method. Specifically, let Ω = {1, . . . , n} and consider the non-overlapping partition of Ω,

Ω = ⋃_{k=1}^{p} Ω_k.

Of particular practical interest in this paper is the case where Ω_k represents the unknowns on processor k, so that p is the total number of processors, but the analysis below is for the general setting. Let A be partitioned into blocks A_kl of size n_k × n_l, where the rows of A_kl are in Ω_k and the columns are in Ω_l. Let I − B_k^{-1} A_kk be a smoother for A_kk. Then, the hybrid smoother is defined by

(5.1)    I − M_H^{-1} A;    M_H = diag{B_k},

where diag{B_k} denotes the block-diagonal matrix with blocks B_k. If B_k = A_kk, then (5.1) is block Jacobi. As p increases, the convergence of block Jacobi approaches that of pointwise Jacobi. However, although (unweighted) pointwise Jacobi is often not a good smoother, we show below that block Jacobi and other hybrid smoothers can have good smoothing properties independent of p, as long as the blocks are sufficiently large. We also show that this threshold block size can be quite small.

We first discuss the convergence properties of the hybrid smoother. Assume that the block smoothers are convergent in the sense of (2.15), that is, ⟨B_k v_k, v_k⟩ ≥ ⟨A_kk v_k, v_k⟩. Then, ⟨(B_k^T + B_k − A_kk)v_k, v_k⟩ ≥ ⟨A_kk v_k, v_k⟩. To show that the hybrid smoother is convergent, we need to show that M_H^T + M_H − A is SPD. With v composed of blocks v_k ∈ R^{n_k}, we have that

⟨(M_H^T + M_H − A)v, v⟩ = Σ_k ⟨(B_k^T + B_k − A_kk)v_k, v_k⟩ − Σ_k Σ_{l≠k} ⟨A_kl v_l, v_k⟩
                        ≥ Σ_k ⟨A_kk v_k, v_k⟩ − Σ_k Σ_{l≠k} ⟨A_kl v_l, v_k⟩.

One class of matrices for which the latter is positive is the class of block red-black matrices, i.e., when A admits the following two-by-two form

A = [A_rr, A_rb; A_br, A_bb],

with block-diagonal matrices A_rr and A_bb. To see this, note that A being SPD implies

Σ_k Σ_l ⟨A_kl v_l, v_k⟩ > 0.

Replacing v_k with ε_k v_k for ε_k = 1 or ε_k = −1, we obtain

Σ_k ⟨A_kk v_k, v_k⟩ > − Σ_k Σ_{l≠k} ε_k ε_l ⟨A_kl v_l, v_k⟩ = Σ_k Σ_{l≠k} ⟨A_kl v_l, v_k⟩,

where the last equality holds by choosing ε_k = 1 for the "red" blocks and ε_k = −1 for the "black" blocks. In that case, ε_k ε_l = −1 for any k ≠ l where A_kl ≠ 0. As a practical example of a block red-black matrix, consider a structured (i.e., topologically Cartesian) partitioning of a 5-point discretization in 2D.

To analyze the smoothing properties of the hybrid smoother, we introduce a constant, θ ≥ 0, which is a measure of the relative size of the block off-diagonal portion of A. First, define the sets

(5.2)    Ω^(i) = {j ∈ Ω_k : i ∈ Ω_k};    Ω_o^(i) = {j ∉ Ω_k : i ∈ Ω_k}.

Hence, Ω^(i) is the set of columns in the diagonal block for row i, while Ω_o^(i) contains the remaining "off-diagonal" columns in row i. Now, with a_ij denoting the coefficients of A, define θ such that

(5.3)    a_ii ≥ θ Σ_{j∈Ω_o^(i)} |a_ij|    for all rows i.

Under weak scaling, θ will quickly stabilize to a value independent of the number of processors. In many applications this value will satisfy θ > 1. This, for example, is the case when A is diagonally dominant and each A_kk has at least two non-zero entries per row (in particular, when the block sizes are large enough). Another example is the 5-point discretization of the Laplacian in 2D, where θ = 2. In general, θ is large whenever most of the strong connections for each i (relatively large |a_ij|) are contained inside its block. For finite element discretizations, better values for θ are obtained when the blocks correspond to an element partitioning (as opposed to a random partitioning of the degrees of freedom; see Section 7.3 in [4]).

5.1. Hybrid Gauss-Seidel. In this section we consider the hybrid Gauss-Seidel smoother M_HGS, which is obtained when the blocks B_k in (5.1) are chosen to be Gauss-Seidel sweeps for A_kk. This smoother is of practical importance, for example, because it is the default option in the BoomerAMG code. Let A = D + L + L^T, where D is the diagonal of A and L and L^T are its strictly lower and upper triangular parts. We first remark that M_HGS is convergent if θ > 1 or if A is red-black both with and without the block partitioning. Indeed, the θ condition implies

(5.4)    ⟨Dv, v⟩ ≤ (θ/(θ − 1)) ⟨(M_HGS^T + M_HGS − A)v, v⟩,

while in the red-black case we have that both regular and block Jacobi are convergent, and therefore

2⟨M_HGS v, v⟩ = Σ_k ⟨A_kk v_k, v_k⟩ + ⟨Dv, v⟩ > ⟨Av, v⟩.
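To make the definition concrete, the following serial sketch (our own illustrative code, not the BoomerAMG implementation) performs one hybrid GS sweep in the sense of (5.1): Gauss-Seidel ordering inside each block, with all values outside the block frozen at their start-of-sweep state, i.e., a Jacobi-like coupling across block boundaries. A helper that evaluates the constant θ of (5.3) for a given partition is included:

```python
import numpy as np

def hybrid_gs_sweep(A, b, x, blocks):
    """One hybrid Gauss-Seidel sweep; `blocks` is a list of index arrays
    partitioning {0, ..., n-1} (one block per processor in the parallel setting)."""
    x_old = x.copy()                  # off-block values are frozen for the whole sweep
    x_new = x.copy()
    for blk in blocks:                # blocks could be processed in parallel
        in_blk = np.zeros(x.size, dtype=bool)
        in_blk[blk] = True
        for i in blk:                 # Gauss-Seidel ordering within the block
            r = b[i] - A[i, in_blk] @ x_new[in_blk] - A[i, ~in_blk] @ x_old[~in_blk]
            x_new[i] += r / A[i, i]
    return x_new

def theta_constant(A, blocks):
    """Largest theta satisfying (5.3): a_ii >= theta * sum of off-block |a_ij|."""
    theta = np.inf
    for blk in blocks:
        in_blk = np.zeros(A.shape[0], dtype=bool)
        in_blk[blk] = True
        for i in blk:
            off = np.abs(A[i, ~in_blk]).sum()
            if off > 0.0:
                theta = min(theta, A[i, i] / off)
    return theta
```

For example, for the 1D Laplacian partitioned into contiguous blocks of size m ≥ 2, theta_constant returns 2 (each block-boundary row has diagonal 2 and a single off-block entry of modulus 1), consistent with the discussion of θ above.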

Note that if A has large positive off-diagonal entries, such as in discretizations of definite Maxwell problems, M_HGS may be divergent, even for large block sizes. This was the motivation in [23] to develop the ℓ1 smoothers considered in the next section.

In the next two theorems, we compare the smoothing properties of M_HGS to those of the standard Gauss-Seidel smoother M_GS = D + L. We first estimate the constants in Theorem 2.5 to do the comparison.

Theorem 5.1. Assume that A is diagonally dominant and M_HGS corresponds to hybrid red-black Gauss-Seidel. Then,

K(M_HGS) ≤ (4(3θ − 1) / (3(θ − 1))) K(M_GS).

Proof. We use Theorem 2.5. First, note that if γ satisfies ⟨Av, v⟩ ≤ γ⟨Dv, v⟩ for all v, then

ω ≤ 2 / (1 + (θ − 1)/(γθ)).

Table 5.1
Convergence factors and constants from Theorem 2.1 for (unweighted) block Jacobi (BJac) and hybrid GS (HGS) for a 1D Laplace problem with m unknowns per block and p blocks.

   m      p    ‖E_TG‖_A (BJac)   ‖E_TG‖_A (HGS)   K_⋆ (BJac)    K_⋆ (HGS)
  512      1        0.00              0.20            1.00          1.25
  256      2        0.50              0.32           65.12          1.81
  128      4        0.50              0.32          110.62          1.81
   32     16        0.51              0.32          418.96          1.81
   16     32        0.53              0.32          834.93          1.81
    4    128        0.56              0.41         3334.24          1.81
    2    256        0.56              0.39         6667.23          2.33
    1    512        1.00              1.00        26664.93      26664.93

This follows from the definition of ω in (2.13), the fact that 2σ(M_HGS) = A + (M_HGS^T + M_HGS − A), and (5.4). From the assumptions, δ ≤ 2 and γ ≤ 2, so ω ≤ (4θ)/(3θ − 1). It is not difficult to show that Δ² ≤ 4/3 for hybrid red-black GS.

Next, we compute the constant in Lemma 2.3 directly. Note that this approach requires fewer assumptions, but also gives a worse estimate when θ is close to one.

Theorem 5.2. Assume that θ > 1. Then

K(M_HGS) ≤ (θ/(θ − 1)) (1 + 2/θ)² K(M_GS).

Proof. Analogous to Theorem 6.2 from the next section, using (5.4) and the fact that

‖(M_HGS − M_GS)x‖²_{D^{-1}} ≤ (1/θ²) ⟨Dx, x⟩ ≤ (4/θ²) ⟨M̃_GS x, x⟩.

By Theorem 5.1 and Theorem 5.2, we can conclude that hybrid Gauss-Seidel will be a convergent smoother with smoothing properties comparable to full Gauss-Seidel provided that θ > 1, e.g., if A is diagonally dominant and each block is large enough to have at least two non-zero entries per row.

5.2. Block Jacobi. As mentioned in the beginning of Section 5, the hybrid smoother can be thought of as an inexact block Jacobi method. Since hybrid GS can be shown to have smoothing properties comparable to GS under certain conditions, it seems plausible that (unweighted) block Jacobi might have even better smoothing properties. In fact, block Jacobi is not a particularly good smoother, though it can exhibit smoothing properties independent of the number of blocks (processors).

As an example, consider again a standard Laplace problem on a unit domain with homogeneous Dirichlet boundary conditions. In Table 5.1, we report ‖E_TG‖_A from Theorem 2.1 for the ideal interpolation operator P_⋆ in (2.9) for a coarsening factor of two in 1D. We also report the corresponding K_⋆. From the table, we see that hybrid GS is a better smoother than block Jacobi, while both methods appear to have p-independent convergence factors for m > 1 (this is easily confirmed by fixing m and increasing p; not shown). At m = 1, both methods degenerate into unweighted

pointwise Jacobi, which is known to have poor smoothing properties. We also see from the table that K_⋆ is stable for hybrid GS but unbounded for block Jacobi (additional numerics show that K_⋆ depends on both m and p). This implies that the theoretical tools in Section 2 are not adequate for analyzing block Jacobi. Although it is not the best smoother choice in practice, we would like to get a deeper understanding of block Jacobi's smoothing properties. One approach might be to base the analysis on the sharp theory in [18], but we have not yet pursued this.

The observations from Table 5.1 also carry over to 2D (we have not done 3D experiments), but they are more pronounced. In particular, the convergence factor for m ≥ (2 × 2) approaches 0.76 for block Jacobi instead of 0.56 as in 1D, while hybrid GS stays at 0.39. Another item worth noting is that the convergence of both methods degrades for larger coarsening factors, as one would expect. In addition, for block Jacobi, the minimum block size needed to yield good smoothing properties increases with increasing coarsening factor. It remains the same for hybrid GS, as indicated by the theory.

5.3. Using Weights in Hybrid Smoothers. While we have shown that for many problems hybrid smoothers converge well, there are various situations where this is not the case; see, e.g., Section 7.3 in [4]. Convergence can be achieved by multiplying M_H with a weight ω as follows:

M_ω = ω M_H.

If M_H is SPD and ω = λ_max(M_H^{-1/2} A M_H^{-1/2}), we immediately get (2.15). In practice, ω can be obtained by the use of Lanczos or CG iterations. For further details on the use of relaxation weights in hybrid smoothers, see [32].

6. The ℓ1 Smoother. While weighted hybrid smoothers are an attempt to fix hybrid smoothers by multiplying them with a suitable parameter, ℓ1 smoothers do so by adding an appropriate diagonal matrix, which also leads to guaranteed convergence. They have the additional benefit of not requiring eigenvalue estimates. The ℓ1 smoother is defined by

(6.1)    I − M_ℓ1^{-1} A;    M_ℓ1 = M_H + D_ℓ1 = diag{B_k + D_k^ℓ1},

where D_ℓ1 is a diagonal matrix with entries

d_ii^ℓ1 = Σ_{j∈Ω_o^(i)} |a_ij|.

Note that with this notation, (5.3) is simply D ≥ θ D_ℓ1. Furthermore, D_ℓ1 has the important property that

(6.2)    ⟨Av, v⟩ ≤ Σ_k ⟨A_kk v_k, v_k⟩ + ⟨D_ℓ1 v, v⟩,

which follows from the Schwarz inequality 2|a_ij v_i v_j| ≤ |a_ij| v_i² + |a_ij| v_j². We first show that M_ℓ1 is A-convergent, i.e., that M_ℓ1^T + M_ℓ1 − A is SPD. In the case where B_k = A_kk, we can actually show more, since (6.2) implies (2.15). In general, if the block smoothers B_k are non-divergent in the A_kk-norm with at least one of them being convergent, then

⟨A_kk v_k, v_k⟩ ≤ ⟨(B_k^T + B_k)v_k, v_k⟩

with strict inequality holding for at least one k. Hence, from (6.2),

⟨Av, v⟩ < Σ_k ⟨(B_k^T + B_k + D_k^ℓ1)v_k, v_k⟩ ≤ ⟨(M_ℓ1^T + M_ℓ1)v, v⟩.
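The assembly of the ℓ1 modification is straightforward. The following minimal sketch (our own illustrative code, not the AMS/hypre implementation) builds D_ℓ1 from the off-block row sums in (6.1) and applies one sweep of the resulting ℓ1 Jacobi smoother; with blocks of size one this is exactly the smoother M_ℓ1J = D + D_ℓ1 of Section 6.1, while replacing the pointwise division by a forward solve with M_HGS + D_ℓ1 gives the ℓ1 Gauss-Seidel smoother of Section 6.2:

```python
import numpy as np

def l1_diagonal(A, blocks):
    """Entries of D_l1 in (6.1): for each row i, the sum of |a_ij| over the
    columns j outside the block containing i."""
    d = np.zeros(A.shape[0])
    for blk in blocks:
        in_blk = np.zeros(A.shape[0], dtype=bool)
        in_blk[blk] = True
        for i in blk:
            d[i] = np.abs(A[i, ~in_blk]).sum()
    return d

def l1_jacobi_sweep(A, b, x, d_l1):
    """One sweep with M = D + D_l1 (with size-one blocks this is the always
    convergent l1 Jacobi smoother)."""
    return x + (b - A @ x) / (np.diag(A) + d_l1)

# blocks of size one recover the point l1 Jacobi smoother of Section 6.1
n = 32
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
d_l1 = l1_diagonal(A, [np.array([i]) for i in range(n)])
x = l1_jacobi_sweep(A, np.ones(n), np.zeros(n), d_l1)
```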

Remark 6.1. The following scaled ℓ1 smoother is also A-convergent:

M_ℓ1 = diag{B_k + (1/2) D_k^ℓ1}.

6.1. ℓ1 Jacobi. We first consider the ℓ1 point Jacobi smoother M_ℓ1J = D + D_ℓ1 with blocks of size one. From above, this smoother is always convergent and satisfies (2.15). In the next theorem we compare M_ℓ1J to standard GS using Theorem 2.5. Note that since the blocks are of size one, θ satisfies a_ii ≥ θ Σ_{j≠i} |a_ij|.

Theorem 6.1. Without any restrictions, we have

K(M_ℓ1J) ≤ 4 (1 + 1/θ) K(M_GS).

In particular, ℓ1 Jacobi has multigrid smoothing properties comparable to full Gauss-Seidel for any A for which θ is bounded away from zero.

Proof. Since M_ℓ1J is symmetric and satisfies (2.15), we can take Δ = 1 and ω = 1. To estimate δ, we observe that

⟨M_ℓ1J x, x⟩ ≤ (1 + 1/θ) ⟨Dx, x⟩ ≤ (1 + 1/θ) 2⟨M_GS x, x⟩.

6.2. ℓ1 Gauss-Seidel. Finally, let M_ℓ1GS = M_HGS + D_ℓ1 be the ℓ1 Gauss-Seidel smoother. This is the default smoother used in the AMS code [23]. As shown earlier in this section, this smoother is always convergent, and we analyze it by directly computing the constant in Lemma 2.3.

Theorem 6.2. Without any restrictions, we have

K(M_ℓ1GS) ≤ (1 + 4/θ)² K(M_GS).

In particular, ℓ1 Gauss-Seidel has multigrid smoothing properties comparable to full Gauss-Seidel for any A for which θ is bounded away from zero, independently of the number of blocks (processors) or the block sizes.

Proof. First, observe that ⟨M_ℓ1GS x, x⟩ ≥ ⟨M_GS x, x⟩ implies ⟨Dx, x⟩ ≤ ⟨(M_ℓ1GS^T + M_ℓ1GS − A)x, x⟩. Therefore,

⟨M̃_ℓ1GS x, x⟩ = ⟨(M_ℓ1GS^T + M_ℓ1GS − A)^{-1} M_ℓ1GS x, M_ℓ1GS x⟩ ≤ ‖M_ℓ1GS x‖²_{D^{-1}}.

By the triangle inequality in the D^{-1} inner product,

‖M_ℓ1GS x‖_{D^{-1}} ≤ ‖M_GS x‖_{D^{-1}} + ‖(M_ℓ1GS − M_GS)x‖_{D^{-1}}.

The first term above is simply ⟨M̃_GS x, x⟩^{1/2}, while the second can be estimated as follows, using the Schwarz inequality (in lines 2 and 4), the symmetry of A (in line 5), and the fact that A is SPD together with (2.14) (in line 6):

‖(M_ℓ1GS − M_GS)x‖²_{D^{-1}} = Σ_i (1/a_ii) ( Σ_{j∈Ω_o^(i)} |a_ij| x_i − Σ_{j∈Ω_o^(i), j<i} a_ij x_j )²