
To cite this article: S. Abide, M. S. Binous & B. Zeghmati (2017): An efficient parallel high-order compact scheme for the 3D incompressible Navier–Stokes equations, International Journal of Computational Fluid Dynamics, DOI: 10.1080/10618562.2017.1326592

Published online: 29 May 2017.


An efficient parallel high-order compact scheme for the D incompressible Navier–Stokes equations S. Abide, M. S. Binous and B. Zeghmati Laboratoire de Mathématiques et de Physique, Université de Perpignan Via Domitia, Perpignan, France

ABSTRACT

This article provides a strategy for solving incompressible turbulent flows which combines compact finite difference schemes and parallel computing. The numerical features of this solver are the semi-implicit time advancement, the staggered arrangement of the variables and the fourth-order compact scheme discretisation. This is the usual framework for accurately solving turbulent incompressible flows. We propose a new strategy for solving the Helmholtz/Poisson equations based on a parallel 2d-pencil decomposition of the diagonalisation method. The compact scheme derivatives are computed with the parallel diagonal dominant (PDD) algorithm, which achieves good parallel performance by introducing a bounded numerical error. We provide a new analysis of its effect on the numerical accuracy and conservation features. Several numerical experiments, including two simulations of turbulent flows, demonstrate that the PDD algorithm maintains the accuracy and conservation features, while preserving good parallel performance up to 4096 cores.

KEYWORDS

Numerical simulation of turbulent flows; compact finite difference schemes; high-performance computing; diagonalisation method; kinetic energy conservation

1. Introduction

Over the last three decades, numerical simulations have contributed to a better understanding of turbulence physics. Nowadays, this approach is hard to avoid, because it has proved its ability to accurately describe turbulence features, and also because of the continuous growth of computational resources. Simulations of turbulent flows rely on the combination of several components such as time and space discretisations or linear/nonlinear solvers. The choices of these methods are motivated by the minimisation of the computational cost. From the abundant available literature, two trends can be drawn: high-order discretisations and/or high-performance computing. Early works were based on pseudo-spectral methods to circumvent the low computational resources (Fox and Orszag 1973), and nowadays some simulations using finite volumes are performed on 10^6 cores (Bermejo-Moreno, Bodart, and Larsson 2013). The combination of high-order schemes and high-performance computing still remains an active research field (Lee and Moser 2015; Oguic, Viazzo, and Poncet 2015; Laizet and Li 2011). Compact finite difference schemes achieve a higher formal accuracy and a better wavenumber resolution for a given stencil size than centred finite difference schemes (Lele 1992). This could explain their popularity in computational aeroacoustics or turbulence simulation, for instance (Knikker 2009;

Laizet and Li 2011; Oguic, Viazzo, and Poncet 2015). Moreover, compact schemes possess a linear algorithmic complexity which is suitable for large-scale computing (Laizet and Li 2011). However, the implicit nature of compact schemes leads to some difficulties with the parallel implementation, and more particularly with the elliptic/parabolic equations arising from discretisations of the incompressible Navier–Stokes equations. This work proposes and assesses a new strategy to parallelise an incompressible Navier–Stokes solver dedicated to the turbulence simulation and using compact scheme discretisation. The projection schemes, which aim to uncouple the velocity and the pressure, still remain a usual approach to simulate turbulent incompressible flows (Moin and Kim 1982). These methods reduce the original mathematical formulation to a Poisson problem associated to the pressure correction if explicit time schemes are considered, and an additional vectorial Helmholtz problem for the provisional velocity in the case of semi-explicit time schemes (Moin and Kim 1982). The higher-order space discretisation of the Poisson and Helmholtz problems leads to linear systems that could be extremely tricky to solve. In addition to the Poisson/Helmholtz problems, an important number of derivatives and interpolations are required to compute nonlinear terms, gradients or divergences of any variables. Within the framework of high-order space discretisation, these elements have to


be considered for an efficient parallelisation. Solving the incompressible Navier–Stokes equations in this way, on a staggered arrangement of variables, ensures a numerical method which conserves the mass and the kinetic energy. Moreover, the use of a semi-explicit time scheme is motivated by the need to relax the time-step constraint. In a similar way as Knikker (2009), these are the basic requirements of the present solver. If the Poisson/Helmholtz equations are discretised with compact schemes, the resulting linear systems are no longer sparse. Few methods are available, and typically they can be classified into direct and iterative methods. In the former case, methods mainly try to preserve the linear algorithmic complexity of the compact finite difference schemes, even if an auxiliary sparse linear system is introduced. The defect correction method belongs to this category (Knikker 2009; Schiestel and Viazzo 1995). It consists in solving the second-order finite difference discretisation of the Poisson equation, and in correcting the residual, which is computed using a higher-order discretisation (Auzinger and Stetter 1982). Hence, efficient linear solvers can be retained to solve the correction equation. Schiestel and Viazzo (1995) successfully considered this method for the simulation of turbulent flows in a biperiodic channel. Knikker (2009) suggested using a defect correction procedure based on a multigrid method, and demonstrated the robustness of this approach. As an alternative to the defect correction method, Brüger et al. (2005) based their work on an iterative Krylov method preconditioned with an incomplete lower-upper (ILU) factorisation. In this approach, the ILU preconditioner is built on the second-order finite difference discretisation of the Poisson equation, while a matrix-free form of the Krylov method is used to take into account the compact scheme discretisation. The analysis of the convergence factor showed a condition number which behaves as o(h−1), h being the grid spacing. Recently, Abide, Chesneau, and Zeghmati (2012) proposed an iterative solver combining a mixed formulation of the Poisson equation and a multigrid preconditioner. The main finding is that the number of iterations is independent of the problem size. Concerning direct methods, the diagonalisation method is used in combination with high-order spatial discretisations (Canuto et al. 1988; Abide and Viazzo 2005; Oguic, Viazzo, and Poncet 2015). A multidomain version is detailed by Abide and Viazzo (2005) for the Cartesian coordinate system and by Oguic, Viazzo, and Poncet (2015) for the cylindrical coordinate system. Laizet and Lamballais (2009) proposed an efficient method based on the equivalence of the derivation in physical and spectral spaces for the compact scheme. Hence, an efficient parallel implementation based on a 2d-pencil decomposition allows them to perform simulations with large core numbers (Laizet and Li 2011).

The numerical evaluation of spatial derivatives and interpolations is also a computationally expensive part of a DNS code. Compact schemes have good accuracy and wavenumber representation, which come at the expense of solving multiple tridiagonal/pentadiagonal linear systems. The solutions of such linear systems are efficiently computed with sequential algorithms. However, particular attention must be paid to their parallelisation. According to Kim (2013), two approaches can be defined: the algorithmic approach and the boundary approximation approach. The boundary approximation approach relies on a specific design of the boundary schemes at the subdomain interfaces combined with an overlap of the subdomains (Gaitonde and Visbal 1999; Sengupta, Dipankar, and Rao 2007; Kim and Sandberg 2012). This approach introduces local modifications of the dispersion and dissipation features of the discretisation at the subdomain interfaces. These can be controlled by the scheme itself, the overlap size at the subdomain interfaces and the eventual use of a filter. Thus, Gaitonde and Visbal (1999) showed that an overlap of several nodes combined with a one-sided compact filter leads to an effective approach. Sengupta, Dipankar, and Rao (2007) improved the procedure by tuning the one-sided compact scheme at the subdomain interfaces. For small subdomain overlaps, Sengupta, Dipankar, and Rao (2007) report spurious oscillations which are avoided by means of a one-sided filter. In the same vein, Kim and Sandberg (2012) introduced an optimisation of the boundary scheme via Fourier analysis and a predictor-corrector step. Then, Kim (2013) proposed further improvements to achieve the accuracy necessary to simulate acoustic phenomena. In the boundary approximation approach, the properties of the finite difference schemes are modified. The algorithmic approach is another way to address the subdomain interface issue. Indeed, regardless of the discretisation, this approach focuses on how to efficiently compute parallel solutions of tridiagonal and pentadiagonal linear systems. The early developments of the ScaLAPACK library provided a versatile tool to deal with distributed linear algebra, and particularly with parallel tridiagonal solvers (Choi et al. 1996). Specific methods have been designed, such as the pipelined Thomas algorithm (PTA) (Povitsky and Morris 2000), the parallel diagonal dominant (PDD) algorithm (Sun 1995) or, more recently, the SPIKE algorithm (Polizzi and Sameh 2007). Povitsky and Morris (2000) combined a sophisticated time-staggering with the PTA to minimise the idle time, which is inherent to this method. The PTA ensures that the parallel solutions are identical to the serial ones. The PDD algorithm has been designed by Sun (1995). The algorithm consists of three steps: the partial solutions are computed in each subdomain, then the first halo node of each subdomain is determined by solving a small


linear system, and next the final solution is assembled. In comparison with the serial algorithm, there is a significant computational overhead. But if judiciously chosen intermediate coefficients are dropped in the second step, the PDD achieves a very good parallel scaling (Sun 1995). The algorithmic error introduced depends on the diagonal dominance of the matrix and on the grid size of the subdomains (Sun 1995). Laizet and Li (2011) proposed a parallel strategy which involves only serial algorithms in the compact scheme evaluations. The fields are distributed over a two-dimensional processor grid (2d-pencil), and thus there is one direction for which the compact scheme is computed in a serial manner. A global data transpose is then required to allow the compact scheme evaluations in the other directions. This approach shows an impressive scalability when the large amount of data communication is properly handled by the network interconnect devices. Ladeinde et al. (2003) compared the boundary approximation and the algorithmic approaches. They show that the one-sided approach of Gaitonde and Visbal (1999) produces accurate results with good parallel performance. They also outline the poor parallel performance of the PTA, and that the PDD is very computationally expensive. Moreover, the authors show that the PDD can provide wrong results (Ladeinde et al. 2003). This can be explained by the poor diagonal dominance of the compact filter that has been used. In this paper, we describe a strategy to parallelise a high-order compact scheme DNS code for incompressible flows. Some basic features are retained. A staggered arrangement of variables is considered to ensure the kinetic energy conservation (Knikker 2009). The diffusive terms are advanced in time with an implicit scheme to relax the time-step restriction. And we consider geometries having three inhomogeneous directions. Due to the staggered layout, all the compact schemes are evaluated with the algorithmic approach, and more precisely the PDD algorithm. Since the PDD algorithm drops intermediate coefficients during the solving stage, its effect on the accuracy, the mass conservation and the kinetic energy conservation is studied here. To our knowledge, this is the first study of the algorithmic approach within the context of incompressible fluid flow. The second aspect of this work relates to the parallelisation of the diagonalisation method. Here, we detail an implementation based on a 2d-pencil decomposition. The remainder of this article is organised as follows. Section 2 is concerned with the numerical methods. The mathematical model and the time discretisation are detailed. The PDD algorithm is developed in the context of compact scheme discretisations, and a discussion of its influence on the mass and the kinetic energy is proposed. Then, the


diagonalisation method is recalled, and its parallel implementation is described. Section 3 is devoted to the numerical experiments. Accuracy and conservation properties are verified with an emphasis on the influence of the PDD algorithm. The parallel performance is also investigated. Simulations of turbulent flows in a channel and in a lid-driven cavity are performed to illustrate the ability and the potential of our strategy to simulate incompressible turbulent flows.

2. Numerical methods

2.1. Mathematical model

An incompressible fluid is considered within a parallelepipedic domain Ω of dimensions Lx × Ly × Lz with boundary Γ = ∂Ω. The governing equations of the flow are the three-dimensional incompressible Navier–Stokes equations. The prescribed boundary conditions can be either periodic, inflow/outflow or wall boundary conditions. The velocity vector is denoted by u = (u, v, w)^T and the pressure by p. The momentum and continuity equations, in dimensionless form, can be written as

∂t u + ∇·(uu) = −∇p + (1/Re) ∇²u + F   in Ω,   (1)

∇·u = 0   in Ω,   (2)

where Re is the Reynolds number, t is the time and F is a given body force. These equations are considered for direct numerical simulation. Large-eddy simulation is also considered in this work. The equations to solve are then obtained by applying a filtering operation featured by a cut-off length. The filtered dimensionless equations read

∂t ū + ∇·(ū ū) = −∇p̄ + (1/Re) ∇²ū + F − ∇·τ   in Ω,   (3)

∇·ū = 0   in Ω.   (4)

Due to its simplicity and robustness in complex configurations, an eddy-viscosity model is retained. In this case, the traceless part of the subgrid-scale (SGS) stress tensor T is defined as

T − (1/3)(T : I) I = −2 νsgs S,   (5)

where S is the rate-of-strain tensor, S = ½ [∇ū + ∇^T ū]. The formulation is closed with the wall-adapting local-eddy (WALE) viscosity model, which gives an algebraic expression for the SGS viscosity νsgs (Nicoud and Ducros 1999).


The SGS viscosity is computed as

νsgs = (Cw Δ)² (V : V)^{3/2} / [ (S : S)^{5/2} + (V : V)^{5/4} ],   (6)

where

V = ½ [ ∇(ū)² + ∇^T(ū)² ] − (1/3) ( ∇(ū)² : I ) I   (7)

and Cw is a constant of the model; the reference length Δ refers to (Δx Δy Δz)^{1/3}. The model constant could depend on the flow configuration, as observed by Ben-Cheikh et al. (2012) in their simulations of the turbulent lid-driven cavity.
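To make the algebra of (5)–(7) concrete, the sketch below evaluates the WALE subgrid viscosity at a single point from a given resolved velocity gradient tensor. It is a minimal NumPy illustration rather than the solver's actual implementation; the function name, the regularisation constant eps and the example gradient are assumptions introduced here.

```python
import numpy as np

def wale_viscosity(grad_u, delta, cw=0.13, eps=1e-30):
    """Evaluate the WALE subgrid viscosity (Nicoud and Ducros 1999) at one point.

    grad_u : 3x3 array, resolved velocity gradient tensor d(u_i)/d(x_j)
    delta  : filter length, e.g. (dx*dy*dz)**(1/3)
    cw     : model constant (0.13 is the value used for the cavity in Section 3.5)
    eps    : small number avoiding a division by zero in laminar regions (assumption)
    """
    g = np.asarray(grad_u, dtype=float)
    g2 = g @ g                                   # square of the velocity gradient tensor
    # V: traceless symmetric part of g2, cf. Eq. (7)
    V = 0.5 * (g2 + g2.T) - (np.trace(g2) / 3.0) * np.eye(3)
    # S: rate-of-strain tensor
    S = 0.5 * (g + g.T)
    VV = np.sum(V * V)                           # V : V
    SS = np.sum(S * S)                           # S : S
    # Eq. (6)
    return (cw * delta) ** 2 * VV ** 1.5 / (SS ** 2.5 + VV ** 1.25 + eps)

# Example: pure shear du/dy = 1; WALE returns zero eddy viscosity for pure shear,
# which is one of the design properties of the model.
print(wale_viscosity([[0.0, 1.0, 0.0],
                      [0.0, 0.0, 0.0],
                      [0.0, 0.0, 0.0]], delta=1.0))
```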

2.2. Time integration

The semi-implicit Runge–Kutta/Crank–Nicolson scheme (Knikker 2009; Le and Moin 1991) is considered for the time advancement of the incompressible Navier–Stokes equations ((1) and (2)). For the sake of clarity, let us denote by L(u) = ∇²u/Re and N(u) = ½ (∇·(uu) + u·∇u) the linear and nonlinear terms, respectively. The semi-implicit Runge–Kutta/Crank–Nicolson scheme deals with implicit linear terms and explicit nonlinear terms such that

(u^{k+1} − u^k)/Δt + σk N(u^k) + γk N(u^{k−1}) = −2βk ∇p^k + βk [ L(u^{k+1}) + L(u^k) ],   1 ≤ k ≤ 3,   (8)

where u^1 = u^n and u^4 = u^{n+1}. The coefficients βk, σk and γk, detailed in Knikker (2009) and Le and Moin (1991), are determined from low-storage and accuracy requirements: second order for the linear terms and third order for the nonlinear terms. In each sub-step, the velocity fulfils the continuity constraint

∇·u^{k+1} = 0.   (9)

This condition is enforced by means of a fractional step, or pressure correction, method. For each step, a provisional velocity u* is computed and corrected:

u^{k+1} = u* − 2βk Δt ∇φ,   (10)

where φ = p^{k+1} − p^k is the pressure correction, and where the provisional velocity fulfils

(u* − u^k)/Δt + σk N(u^k) + γk N(u^{k−1}) = −2βk ∇p^k + βk [ L(u*) + L(u^k) ],   1 ≤ k ≤ 3.   (11)

The boundary conditions are prescribed on u*|Γ. A Poisson equation for the pressure correction is derived by combining the divergence operator with the velocity correction equation (10):

∇·∇φ = 1/(2βk Δt) ∇·u*.   (12)

The boundary conditions associated with the pressure correction equation are homogeneous Neumann boundary conditions, leading to a singular problem. Regardless of the spatial discretisation, the solution method using the pressure correction consists in solving three Helmholtz equations (11) and one Poisson equation (12), and in evaluating the explicit linear/nonlinear terms. Consequently, the numerical stability depends mainly on two parameters: the Courant–Friedrichs–Lewy condition uΔt/Δx < Cmax = √3 (Le and Moin 1991) and the smallest time scales of the turbulent flow. It should be noted that the contribution of the LES model is treated explicitly, in a similar way as the convective terms.
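The structure of one sub-step of the time integration ((8)–(12)) can be summarised by the following sketch. It is organisational pseudocode in Python form: the callables nonlinear, laplacian, gradient, divergence, solve_helmholtz and solve_poisson are hypothetical placeholders for the discrete operators and solvers described in Sections 2.3–2.5, and the keyword arguments of solve_helmholtz are likewise assumptions.

```python
def rk_substep(u, u_prev, p, dt, beta, sigma, gamma,
               nonlinear, laplacian, gradient, divergence,
               solve_helmholtz, solve_poisson):
    """One semi-implicit RK/CN sub-step with pressure correction, Eqs. (8)-(12).

    u, u_prev          : velocity fields at sub-steps k and k-1
    p                  : pressure at sub-step k
    beta, sigma, gamma : the coefficients beta_k, sigma_k, gamma_k of Eq. (8)
    The remaining arguments are callables standing for the discrete operators
    and solvers (hypothetical placeholders for the compact scheme machinery).
    """
    # Right-hand side of the provisional velocity equation (11): explicit
    # nonlinear terms, explicit half of the diffusion and the pressure gradient.
    rhs = (u / dt
           - sigma * nonlinear(u) - gamma * nonlinear(u_prev)
           - 2.0 * beta * gradient(p)
           + beta * laplacian(u))
    # Helmholtz solve for u*:  (1/dt) u* - beta L(u*) = rhs   <->  Eq. (11)
    u_star = solve_helmholtz(rhs, kappa=1.0 / dt, nu=beta)
    # Singular Poisson problem for the pressure correction, Eq. (12)
    phi = solve_poisson(divergence(u_star) / (2.0 * beta * dt))
    # Correction step, Eq. (10), and pressure update phi = p^{k+1} - p^k
    u_new = u_star - 2.0 * beta * dt * gradient(phi)
    p_new = p + phi
    return u_new, p_new
```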

2.3. Spatial discretisation

The spatial approximation of the semi-discrete equation (8) is based on a fourth-order compact scheme discretisation, defined on a staggered grid (Lele 1992; Knikker 2009; Abide and Viazzo 2005). This usual layout is retained because of the associated discrete conservation properties. For instance, the kinetic energy in the inviscid limit is conserved if the convective terms are written in their skew-symmetric form (Knikker 2009; Schiestel and Viazzo 1995). The pressure and the velocity components are shifted in space by half the space step. For example, along the first direction, the grid of the interval (0, Lx) is composed of pressure nodes, x_i = (i − 1/2) hx with 0 ≤ i ≤ nx + 1, and velocity nodes x_{i+1/2} = i hx with 0 ≤ i ≤ nx. The uniform space step is hx = Lx/nx. Non-uniform grids are obtained by means of an analytical mesh transformation which introduces three metric coefficients in Cartesian orthogonal 3D domains. They can be computed either analytically or numerically (Knikker 2009). The staggered fourth-order compact scheme approximations of the derivative or the interpolation from the pressure to the velocity nodes read, for the staggered derivation,

α f′_{i−1} + f′_i + α f′_{i+1} = a ( f_{i+1/2} − f_{i−1/2} ) / hx   (13)

or, for the staggered interpolation,

α f_{i−1} + f_i + α f_{i+1} = a ( f_{i+1/2} + f_{i−1/2} ) / 2,   (14)

where the coefficients α and a are calculated from accuracy requirements (Lele 1992). The boundary condition


Table . Compact scheme coefficients. Inner nodes Operators Dcf ξ Dfξc Icf ξ Ifξc

where Iξ stands for the identity matrix associated to direction ξ , and κ > 0 associated to the Runge–Kutta scheme equation (8), and κ = 0 for the pressure correction equation (12). The right-hand side (18) is composed of the explicit temporal terms, pressure gradient, convective terms and diffusive terms, which are approximated by implicit finite differences (14) and (15). Hence, an efficient parallel strategy for a DNS code has to consider two sides:

Boundary nodes

α

a

α

a

b

c

/ / / /

/ / / /

−   

− − / /

  / /

− − −/ /

relations are:  (p) (p) f1 + α f2 = a f−1/2 + b f1/2 + c f3/2 (p) (p) fn + α fn−1 = a fn+1/2 + b fn−1/2 + c fn−3/2

r all the implicit compact derivatives and interpola(15)

(16)

where p stands for derivation p = 1 or interpolation p = 0, and M refers to a square tridiagonal matrix of size n + 1 or n + 2, depending on the considered variable, velocity or pressure variable. The matrix B is a rectangular matrix with the same line numbers than M, and that the number of rows depends, in a same way as M, on the variable arrangement of f. An explicit form is easily derived by computing the inverse of M, f (p) = M −1 B f = D f .

tions arising from the explicit time scheme (11),

r the linear systems (18) inherent to the implicit time scheme.

One of the main feature of this kind of finite difference scheme is the implicit nature of the approximation. Equations (14) and (15) are recast in a matrix form: M f (p) = B f ,

5

(17)

It should be noted that D(p) = M−1 B is a full matrix and that such an explicit evaluation by matrix/vector multiplication requires o(n2 ) operations. In fact, the matrix M has a tridiagonal structure which allows us to compute (17) with o(n) operations. In the following, the notations cf fc Dξ , Dξ refer to the staggered first derivative with respect to ξ . The first superscript is associated with the node location of the known function and the second refers to the location where the numerical evaluation holds. The set of coefficients defining the compact scheme discretisations involved in this work is detailed in Table 1. Once the spatial discretisation defined, the temporal semi-discretisation (8) combined with divergence constraint (9) and decoupled by means of a projection scheme leads to linear systems for the provisional velocity equation (11) and the pressure correction equation (12). The model linear system arising from the provisional velocity equation and written with tensorial notations (Canuto et al. 1988) reads (Dxx ⊗ Iy ⊗ Iz + Ix ⊗ Dyy ⊗ Iz + Ix ⊗ Iy ⊗ Dzz   (18) − κIx ⊗ Iy ⊗ Iz )u = S uk , uk−1 , pk ,

Since the matrices associated to linear system (17) are full matrices, a parallel strategy based on a 2d decomposition domain and global data transpose is considered in Section 2.5. For the computation of the derivatives and interpolations involved in the explicit terms, the retained solution for parallel architectures is based on the parallel diagonal dominant algorithm. The method is detailed, and accuracy, related to simulation incompressible fluid flows, is discussed in the next section.
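Before turning to the parallel treatment, the following sketch illustrates the serial evaluation (16)–(17) for the staggered first derivative (13) on a uniform periodic grid: the tridiagonal matrix M and the right-hand side B f are assembled, and the system M f′ = B f is solved directly. The inner-node coefficient α = 1/22 is the value quoted in Section 2.4; the companion value a = 12/11 is derived here from the fourth-order accuracy conditions and is an assumption, since the entries of Table 1 could not be recovered. Boundary schemes (15) are omitted.

```python
import numpy as np

def compact_staggered_derivative(f_half, h, alpha=1.0 / 22.0, a=12.0 / 11.0):
    """Fourth-order compact staggered derivative on a uniform periodic grid.

    f_half : values at the staggered (half) nodes x_{i+1/2}, i = 0..n-1
    h      : uniform grid spacing
    Returns the derivative at the nodes x_i by solving M f' = B f, Eqs. (16)-(17).
    """
    n = f_half.size
    # Tridiagonal (here circulant) matrix M with stencil alpha, 1, alpha, Eq. (13)
    M = np.eye(n) + alpha * (np.eye(n, k=1) + np.eye(n, k=-1))
    M[0, -1] = M[-1, 0] = alpha                  # periodic wrap-around
    # Right-hand side B f = a * (f_{i+1/2} - f_{i-1/2}) / h
    rhs = a * (f_half - np.roll(f_half, 1)) / h
    return np.linalg.solve(M, rhs)

# Verification on f(x) = sin(x): the computed derivative approaches cos(x)
n = 64
h = 2.0 * np.pi / n
x_nodes = h * np.arange(n)                       # derivative nodes x_i
x_half = x_nodes + 0.5 * h                       # staggered nodes x_{i+1/2}
df = compact_staggered_derivative(np.sin(x_half), h)
print(np.max(np.abs(df - np.cos(x_nodes))))      # fourth-order convergence in h
```

A dense solve is used here only for clarity; in practice M is tridiagonal and the system is solved in o(n) operations, as noted above.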

2.4. The parallel diagonal dominant algorithm and compact scheme discretisations

The PDD algorithm allows us to solve tridiagonal linear systems efficiently on distributed architectures (Polizzi and Sameh 2007; Sun 1995). Based on the diagonal dominance assumption, some matrix entries are dropped during the factorisation procedure in order to increase the number of independent computational tasks and, thus, to improve the parallel efficiency. The accuracy of this approximate solver depends on the degree of diagonal dominance (Sun 1995). Therefore, it is appropriate to assess the effect of this approximation, first, on the accuracy of the compact finite differences (14) and, second, on the other DNS-code features such as the conservation properties. These issues are addressed hereafter. Within the framework of the simulation of turbulent flows outlined in Sections 2.2 and 2.3, the temporal explicit terms are computed with the implicit finite differences (14). These are tensorial operations which are performed on three-dimensional fields. For the sake of clarity, the one-dimensional framework is considered; the extension to the three-dimensional case is straightforward. In this case, the fourth-order scheme (14) involves solutions of linear systems Mx = b, where M is a tridiagonal matrix of order n, x = (x1, …, xn)^T and b = (b1, …, bn)^T. The PDD algorithm consists in splitting


Figure 1. Partition of a tridiagonal matrix, Mx = (M̃ + ΔM)x = b (a), and the structure of its inverse, M̃⁻¹Mx = (I + M̃⁻¹ΔM)x = M̃⁻¹b (b).

the matrix M into p sub-matrices, where p is the number of available processors. The vectors are partitioned as x = (x^{(1)}, …, x^{(p)})^T with x^{(k)} = (x^{(k)}_1, …, x^{(k)}_m)^T and 1 ≤ k ≤ p (n = mp is assumed). The original matrix M is split into M = M̃ + ΔM, where M̃ is a block-diagonal matrix composed of the tridiagonal matrices M^{(k)}, and ΔM is composed of the off-block-diagonal entries. The matrix partition and the decomposition are depicted in Figure 1(a). The product between the original tridiagonal linear system and the inverse matrix M̃⁻¹ leads to a block identity matrix surrounded by two column vectors v^{(k)} and w^{(k)}, denoted as spikes (Polizzi and Sameh 2007). Figure 1(b) shows the structure of the matrix M̃⁻¹M. The first and the last nodes of each partition, x^{(k)}_1 and x^{(k)}_m, fulfil a pentadiagonal linear system named the reduced system (Sun 1995). Regardless of the diagonal dominance assumption, the parallel algorithm to solve Mx = b can be drawn as follows:

- compute the guessed solution x̃ = M̃⁻¹ b; this step is performed independently on each processor,
- solve the pentadiagonal reduced system to compute the first and last nodes of each partition, x^{(k)}_1 and x^{(k)}_m,
- use the spikes to correct the guessed solution: x^{(k)} = x̃^{(k)} − x^{(k)}_1 v^{(k)} − x^{(k)}_m w^{(k)}.

A detailed and sophisticated mathematical description is given by Sun (1995) or, more recently, by Polizzi and Sameh (2007). Depending on the degree of diagonal dominance, a truncated version of this algorithm delivers a higher level of parallelism. In fact, in this case, the entries of the spikes v^{(k)} and w^{(k)} decrease as we move away from the diagonal. By dropping the entries of the spikes v^{(k)} and w^{(k)} furthest from the diagonal, the reduced system degenerates into independent linear systems of dimension 2, involving only the unknowns x^{(k−1)}_m and x^{(k)}_1

at each interface of the decomposition. Sun (1995) gives an estimate of the error when the reduced systems are simplified in this way. Applied to the compact scheme formulation (14), the error estimate of the PDD is

‖x − x̃‖ / ‖x‖ ≤ b^m / [ α (α − b)(a − 1) ],   (19)

where the coefficients a and b are given by

a = ½ [ 1/α + √(1/α² − 4) ],   b = ½ [ 1/α − √(1/α² − 4) ].   (20)

As demonstrated by Sun (1995), the error introduced by the PDD algorithm behaves as a power of the number of nodes per subdomain m. This number can be estimated so as to achieve a user-defined accuracy. For instance, the accuracy ‖x − x̃‖/‖x‖ = 10⁻¹⁵ is ensured by the PDD algorithm with m around 15 for the staggered derivative (α = 1/22), and around 20 nodes for the staggered interpolation (α = 1/6). Figure 2(a) shows the error versus the size m of the subdomain. From a practical point of view, this first estimate is sufficiently small to be in line with a real application of a parallel DNS code. Based on this error analysis, the parameter α of the scheme (14) can be tuned so that machine accuracy is reached for a given subdomain size m. However, the fourth-order accuracy will be maintained only if an alternative formulation (Lele 1992) is considered:

α f^{(p)}_{i−1} + f^{(p)}_i + α f^{(p)}_{i+1} = a ( f_{i+1/2} − f_{i−1/2} ) / hx + b ( f_{i+3/2} − f_{i−3/2} ) / (2hx).   (21)

Figure 2(b) shows the influence of the parameter α on the error estimate of the PDD algorithm.


Figure . Error estimate of the PPD algorithm, (a) versus the node numbers m = n/p in each subdomain, (b) versus the value α associated to compact scheme definition ().

concluded that the accuracy machine is rapidly achieved when the number of node per subdomain increases. This analysis concerns with accuracy, but accurate simulations of turbulent flows do not rely only on this side of the discretisation. For instance, conservation of kinetic energy is known to be a suitable property when turbulent flows are simulated. In this work, the skew-symmetric formulation on a staggered grid is retained to satisfy the conservation of the kinetic energy (Schiestel and Viazzo 1995; Knikker 2009). A straight derivation of the error bound for the skew-symmetric formulation appears to be a cumbersome task. But, it should be argued that if the accuracy machine is required to each staggered derivative and interpolation, then it may be reasonable to assume that there is no effect on the computation of the convective terms. The same argument can be raised for the velocity correction ((16)–(18)) which governs the threshold of the final divergence. To conclude this section, the error analysis of the PDD applied to the fourth-order compact schemes discretisation (14) shows that a subdomain size of around 20 nodes should be effectless on the computed derivatives and interpolations. Moreover, under this assumption, we speculate that important properties such as energy or mass conservation are nonetheless fulfilled. This point is addressed in the Section 3 where several numerical simulations are performed to confirm that the approximations introduced by the PDD algorithm do not deteriorate the solver properties.
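The sketch below is a serial NumPy emulation of the truncated PDD solve for the constant-coefficient tridiagonal systems generated by schemes (13)–(14), following the three steps listed above with the reduced system simplified to independent 2 × 2 interface systems. The loop over subdomains stands in for what would be independent processes in the parallel code; SciPy's banded solver plays the role of the local Thomas algorithm. Function and variable names are choices made only for this illustration.

```python
import numpy as np
from scipy.linalg import solve_banded

def pdd_tridiag_solve(alpha, b, p):
    """Truncated PDD solve of the tridiagonal system with constant stencil
    (alpha, 1, alpha), as produced by scheme (14).

    alpha : off-diagonal coefficient (e.g. 1/22 or 1/6)
    b     : right-hand side, length n = m*p
    p     : number of subdomains (processes in the parallel code)
    """
    n = b.size
    m = n // p
    assert m * p == n
    x = np.empty(n)
    xt = np.empty((p, m))            # block-local guessed solutions ~x^(k)
    v = np.empty((p, m))             # right spikes
    w = np.empty((p, m))             # left spikes
    ab = np.zeros((3, m))            # banded storage of one tridiagonal block
    ab[0, 1:] = alpha
    ab[1, :] = 1.0
    ab[2, :-1] = alpha
    e1, em = np.zeros(m), np.zeros(m)
    e1[0], em[-1] = alpha, alpha
    for k in range(p):               # step 1: independent local solves
        xt[k] = solve_banded((1, 1), ab, b[k * m:(k + 1) * m])
        v[k] = solve_banded((1, 1), ab, em)   # spike towards the next block
        w[k] = solve_banded((1, 1), ab, e1)   # spike towards the previous block
    # step 2: truncated reduced system -> one independent 2x2 system per interface
    right = np.zeros(p)              # x^(k+1)_1 seen from block k
    left = np.zeros(p)               # x^(k-1)_m seen from block k
    for k in range(p - 1):
        A = np.array([[1.0, v[k, -1]], [w[k + 1, 0], 1.0]])
        y = np.linalg.solve(A, [xt[k, -1], xt[k + 1, 0]])
        right[k], left[k + 1] = y[1], y[0]
    # step 3: correct the guessed solutions with the spikes
    for k in range(p):
        x[k * m:(k + 1) * m] = xt[k] - right[k] * v[k] - left[k] * w[k]
    return x

# Compare with a direct solve for the interpolation scheme (alpha = 1/6)
rng = np.random.default_rng(0)
alpha, m, p = 1.0 / 6.0, 20, 8
n = m * p
b = rng.standard_normal(n)
M = np.eye(n) + alpha * (np.eye(n, k=1) + np.eye(n, k=-1))
x_ref = np.linalg.solve(M, b)
x_pdd = pdd_tridiag_solve(alpha, b, p)
print(np.max(np.abs(x_pdd - x_ref)) / np.max(np.abs(x_ref)))   # tiny, cf. bound (19)
```

With α = 1/6 and m = 20 nodes per block, the printed relative difference with the direct solve is at machine-precision level, consistent with the estimate quoted after Equation (20).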

2.5. A parallelised matrix diagonalisation method

The semi-implicit Runge–Kutta/Crank–Nicolson scheme combined with a projection scheme leads to three Helmholtz equations and one singular Poisson equation. The implicit nature of the compact scheme discretisation

precludes the use of the usual linear solvers. Thus, specific linear solvers have to be designed for this family of discretisations. In this work, the linear systems deduced from the Helmholtz/Poisson equation discretisations are addressed by the diagonalisation method. The parallelisation of the solver is considered by means of the 2d-pencil decomposition. Let us consider the Poisson equation (12), for which the compact scheme discretisation leads to the general form

( D̄xx ⊗ Iy ⊗ Iz + Ix ⊗ D̄yy ⊗ Iz + Ix ⊗ Iy ⊗ D̄zz ) Φ = F,

(22)

where D̄ξξ stands for a square matrix of dimension nξ, nξ being the number of inner nodes in the ξ-direction. These matrices correspond to the fourth-order discretisation of the second derivative, modified in order to eliminate the boundary nodes by means of the discrete boundary conditions. The boundary contributions also modify the right-hand side, denoted F, of Equation (22). These procedures are detailed by Abide and Viazzo (2005). The diagonalisation method consists in a diagonalisation of each modified second derivative. The matrices D̄xx, D̄yy and D̄zz are diagonalised and expressed as follows:

D̄xx Px = Px Λx,   D̄yy Py = Py Λy,   D̄zz Pz = Pz Λz,

(23)

where Λx, Λy and Λz are the diagonal matrices filled with the eigenvalues of the matrices D̄xx, D̄yy and D̄zz. The matrices Px, Py and Pz are the right eigenvectors of the modified second derivatives. Now, applying the diagonalisation process to the linear system (22) is equivalent to

Λ Φ̃ = F̃,

(24)


Figure . D domain decomposition example using a  ×  processor grid: (a) X-pencil; (b) Y-pencil; (c) Z-pencil (from Li and Laizet ).

where Λ is a diagonal matrix,

Λ = ( Px⁻¹ ⊗ Py⁻¹ ⊗ Pz⁻¹ ) A ( Px ⊗ Py ⊗ Pz ) = Λx ⊗ Iy ⊗ Iz + Ix ⊗ Λy ⊗ Iz + Ix ⊗ Iy ⊗ Λz,   (25)

and with the modified fields,

Φ̃ = ( Px⁻¹ ⊗ Py⁻¹ ⊗ Pz⁻¹ ) Φ,   F̃ = ( Px⁻¹ ⊗ Py⁻¹ ⊗ Pz⁻¹ ) F.   (26)

The solution of the modified linear system (24) can easily be written in a component-wise form as

φ̃_{ijk} = − f̃_{ijk} / ( λ_{x,i} + λ_{y,j} + λ_{z,k} ),   1 ≤ i, j, k ≤ n.   (27)

Then, the solution Φ at the inner nodes is recovered by reversing relation (26), and the boundary nodes are computed from the boundary conditions (Abide and Viazzo 2005). The algorithm of this method can be drawn as follows:

(1) Preprocessing step: compute and store {Λx, Px, Px⁻¹, Λy, Py, Py⁻¹, Λz, Pz, Pz⁻¹} from the modified derivative matrices.
(2) Computational step:
 (a) compute the source term F̃ = ( Px⁻¹ ⊗ Py⁻¹ ⊗ Pz⁻¹ ) F,
 (b) compute the solution Φ̃ = Λ⁻¹ F̃ using relation (27),
 (c) recover the solution Φ at the inner nodes from Φ = ( Px ⊗ Py ⊗ Pz ) Φ̃,
 (d) compute the boundary nodes from the boundary conditions.

The preprocessing step is the most time-consuming stage, which restricts the diagonalisation method to elliptic equations with time-independent coefficients, for which this step is performed only once. In step 2, tasks (a) and (c) require intensive computations, since the calculation of F̃ and Φ involves the evaluation of three tensor products of the form

W = ( Ax ⊗ Ay ⊗ Az ) V.   (28)

To outline the important computational cost of this operation, we recall that in component-wise form such a tensor product is computed as

w_{ijk} = Σ_{r,s,t} A^x_{ir} A^y_{js} A^z_{kt} v_{rst},   1 ≤ i, j, k ≤ n.   (29)
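The following sketch applies the diagonalisation algorithm (23)–(29) to a small model problem. For brevity, simple symmetric second-order difference matrices stand in for the modified fourth-order compact operators of (22) (an assumption; the general, non-symmetric case uses a full eigendecomposition with an explicit inverse), and the directional tensor products (28)–(29) are evaluated with einsum one direction at a time, which is precisely the operation that the 2d-pencil transposes make local in the parallel version. Sign conventions are kept self-consistent within the example.

```python
import numpy as np

def second_derivative_matrix(n, h):
    """Dirichlet second-derivative matrix (second order), used here as a
    stand-in for the modified compact operators of Eq. (22)."""
    return (np.eye(n, k=1) - 2.0 * np.eye(n) + np.eye(n, k=-1)) / h**2

def apply_xyz(Ax, Ay, Az, V):
    """W = (Ax x Ay x Az) V, Eqs. (28)-(29), one direction at a time."""
    W = np.einsum('ir,rjk->ijk', Ax, V)   # x-direction (X-pencil in parallel)
    W = np.einsum('js,isk->ijk', Ay, W)   # y-direction (Y-pencil)
    W = np.einsum('kt,ijt->ijk', Az, W)   # z-direction (Z-pencil)
    return W

n = 8
h = 1.0 / (n + 1)
D = second_derivative_matrix(n, h)
lam, P = np.linalg.eigh(D)               # Eq. (23): D P = P Lambda (symmetric here)
Pinv = P.T

rng = np.random.default_rng(1)
F = rng.standard_normal((n, n, n))       # source term

# Step 2(a): modified source term, Eq. (26)
Ft = apply_xyz(Pinv, Pinv, Pinv, F)
# Step 2(b): component-wise division by the eigenvalue sums, cf. Eq. (27)
denom = lam[:, None, None] + lam[None, :, None] + lam[None, None, :]
Phit = Ft / denom
# Step 2(c): back to physical space
Phi = apply_xyz(P, P, P, Phit)

# Check against the Kronecker form of Eq. (22)
I = np.eye(n)
A = (np.kron(np.kron(D, I), I) + np.kron(np.kron(I, D), I) + np.kron(np.kron(I, I), D))
print(np.max(np.abs(A @ Phi.ravel() - F.ravel())))   # small residual: both forms agree
```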

From a computational point of view, this kind of operation performs well because of the regular data structure. Even so, the algorithmic complexity still remains o(n⁴). The parallelisation of this solver exploits the regular data structure; this is described hereafter. The parallelisation of the diagonalisation method relies on a 2d-pencil decomposition. This means that the 3D array is distributed over a 2d grid of MPI processes, leaving one of the three directions unaffected. Figure 3 shows the three possible states, denoted as pencils. This has been implemented using the open-source library 2DECOMP&FFT (Li and Laizet 2010), which provides collective communications to transpose data from one pencil to another. Thus, serial one-dimensional operations, such as an FFT or a tridiagonal solve, can be performed simultaneously. Since the diagonalisation method acts on the data array one direction at a time, the 2d-pencil decomposition is well suited to its parallelisation. A parallel version of the diagonalisation method is then straightforward, and is detailed hereafter.


Figure . Numerical accuracy versus the time step (a), the mesh size (b) and the absolute difference between the serial and parallel solutions (c).

(1) Compute the operator diagonalisations in each direction, {Λx, Px, Px⁻¹, Λy, Py, Py⁻¹, Λz, Pz, Pz⁻¹}, and store them on each process of the 2D decomposition.
(2) Compute the solution U from the source term F:
 (a) Compute the modified source term F̃ = ( Px⁻¹ ⊗ Py⁻¹ ⊗ Pz⁻¹ ) F:
  (i) perform the product F̃ = ( Px⁻¹ ⊗ Iy ⊗ Iz ) F and transpose the result to the Y-pencil state,
  (ii) perform the product F̃ = ( Ix ⊗ Py⁻¹ ⊗ Iz ) F̃ and transpose F̃ to the Z-pencil state,
  (iii) perform the product F̃ = ( Ix ⊗ Iy ⊗ Pz⁻¹ ) F̃.
 (b) Compute simultaneously on each processor the solution Ũ = Λ⁻¹ F̃ in the Z-pencil state.
 (c) Compute the inner nodes of the solution U = ( Px ⊗ Py ⊗ Pz ) Ũ:
  (i) perform the product U = ( Ix ⊗ Iy ⊗ Pz ) Ũ and transpose U to the Y-pencil state,
  (ii) perform the product U = ( Ix ⊗ Py ⊗ Iz ) U and transpose U to the X-pencil state,
  (iii) perform the product U = ( Px ⊗ Iy ⊗ Iz ) U.
 (d) Compute the boundary nodes from the boundary discretisation:
  (i) update the boundary nodes for the direction ox, transpose X-pencil to Y-pencil,
  (ii) update the boundary nodes for the direction oy, transpose Y-pencil to Z-pencil,
  (iii) update the boundary nodes for the direction oz, transpose Z-pencil to Y-pencil and then to X-pencil.

It should be noted that the X-pencil state is the natural data distribution. This algorithm involves eight

global transposes to compute the final solution of a Helmholtz/Poisson equation. The advantage of this scheme lies in the simplicity of obtaining a parallel version of the diagonalisation method. But, unlike other decomposition methods (Abide and Viazzo 2005; Oguic, Viazzo, and Poncet 2015), this approach does not modify the large algorithmic complexity associated with the implicit nature of compact schemes.
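To give a concrete picture of what a global transpose involves, the toy program below redistributes a matrix from a row-distributed (slab) layout to a column-distributed layout with a single all-to-all exchange, written with mpi4py (an assumed dependency). It is a simplified 1D analogue of the pencil transposes provided by 2DECOMP&FFT, not the library's actual interface.

```python
# Run with, e.g.:  mpiexec -n 4 python transpose_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()
n = 8 * p                                       # global matrix size, divisible by p
m = n // p

# Row-distributed block: this rank owns rows [rank*m, (rank+1)*m) of an n x n array
rows = np.arange(rank * m, (rank + 1) * m)
local = (rows[:, None] * n + np.arange(n)[None, :]).astype(np.float64)

# Pack the m x n block into p contiguous chunks of shape (m, m): chunk j goes to rank j
send = np.ascontiguousarray(local.reshape(m, p, m).transpose(1, 0, 2))
recv = np.empty_like(send)
comm.Alltoall(send, recv)                       # the global transpose (all-to-all exchange)

# After the exchange this rank owns columns [rank*m, (rank+1)*m), all n rows
cols_block = np.concatenate(recv, axis=0)       # shape (n, m)
expected = (np.arange(n)[:, None] * n
            + (rank * m + np.arange(m))[None, :]).astype(np.float64)
assert np.array_equal(cols_block, expected)
if rank == 0:
    print("row-to-column redistribution verified on", p, "ranks")
```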

3. Numerical results

3.1. Order of accuracy

The accuracy of the present numerical method has been validated against analytical solutions in a square domain Ω = (−1, 1)². With an appropriate source term F derived for a viscosity ν = 1/10, the following analytical divergence-free velocity and pressure fields fulfil the Navier–Stokes equations:

u(t, x, y) = cos(γt) sin(πx) cos(πy),
v(t, x, y) = −cos(γt) cos(πx) sin(πy),
p(t, x, y) = cos(γt)² [ cos(2πx) + cos(2πy) ] / 4,   (30)

where γ is a parameter which defines the steady (γ = 0) or unsteady (γ = 5) solutions. The orders of accuracy are determined from the absolute numerical error ε, which is the maximum point-wise error, computed with several meshes of increasing size for the space accuracy tests, or with decreasing time steps for the time accuracy tests. First, the accuracy of the temporal discretisation is assessed. In this test case, a periodic domain is considered and the numerical error is computed with a time step ranging from Δt = 2 × 10⁻² to 5 × 10⁻³, and


Figure . Kinetic energy error versus the time step (a), divergence (b) and kinetic energy difference (c) versus the number of nodes per subdomain.

with a mesh size of h = 1/256. The maximum of the absolute errors over 10 periods is retained to check the temporal scheme accuracy. The numerical errors for velocity and pressure, versus the time step, are plotted in Figure 4(a). The numerical errors decrease with a −2 slope, which is the expected order of accuracy of the Runge–Kutta/Crank–Nicolson temporal scheme (Le and Moin 1991). Next, the order of accuracy of the spatial discretisation is assessed by considering the steady state of Equation (30), for which γ = 0. The solutions are computed on several non-uniform meshes with sizes ranging from N = 16 to 256. The numerical errors for velocity and pressure versus the mesh size are plotted in Figure 4(b). We can observe that the numerical errors decrease as the mesh size decreases, with a slope close to −4. In order to outline the effect of the PDD algorithm, steady solutions computed on partitioned domains are compared with the unpartitioned one. The mesh is composed of n = 90 nodes in both directions, which are distributed over p intervals, leading to p² square subdomains. The parameter p is such that the subdomain grid size n/p is between 5 and 45. Thus, for each run, the absolute differences of the velocity and pressure fields between the parallel and serial solutions are computed. The data reported in Figure 4(c) indicate that the maximum difference, around 10⁻⁶, is reached for the finest partition, i.e. five nodes per subdomain. The absolute difference decreases rapidly with respect to the number of nodes per subdomain. Hence, beyond 18 nodes per subdomain, the absolute difference is around 10⁻¹³, and it can be considered that the PDD algorithm does not deteriorate the accuracy of the present scheme. Moreover, an exponential decay rate of the absolute difference is observed. This feature may be attributed to the decay rate of the error bound (19), and so to the intrinsic nature of the compact schemes.

3.2. Periodic inviscid flow

As mentioned in Section 1, the simulation of turbulent flows requires discretisations able to conserve mass and energy. Because of the truncation introduced by the PDD algorithm, its influence on these two important properties is investigated. For an incompressible and inviscid flow, the kinetic energy remains constant in time. Such a flow is considered in a cubical periodic domain of length 2π. Based on a uniform mesh of size n³ partitioned into p² subdomains, the kinetic energy and the divergence are computed while a turbulent flow develops from the initial pressure and velocity fields given by the Taylor–Green vortex (Duponcheel, Orlandi, and Winckelmans 2008). Simulations are advanced in time up to tsim = 8. The total kinetic energy and the divergence are computed as

K = (1/|Ω|) Σ_{i,j,k} ½ u·u = (1/(2|Ω|)) Σ_{i,j,k} ( u^x_{ijk} u^x_{ijk} + v^y_{ijk} v^y_{ijk} + w^z_{ijk} w^z_{ijk} )   (31)

and

div = max_{i,j,k} |∇·u| = max_{i,j,k} | δx u_{ijk} + δy v_{ijk} + δz w_{ijk} |.   (32)

The total kinetic energy should be conserved in time (K = 0.125) and the divergence should be zero at each time step.
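As a concrete reading of (31)–(32), the snippet below samples the Taylor–Green initial field on a uniform periodic staggered grid and evaluates the total kinetic energy and the maximum cell divergence. Plain second-order averaging and differencing stand in for the compact interpolation and derivative operators (an assumption made for brevity); the sampled field reproduces K = 0.125 and a discrete divergence at round-off level.

```python
import numpy as np

n = 64
h = 2.0 * np.pi / n
c = (np.arange(n) + 0.5) * h                      # cell-centre coordinates
f = np.arange(n) * h                              # face coordinates

# Taylor-Green initial field sampled at the staggered face locations
X, Y, Z = np.meshgrid(f, c, c, indexing='ij')
u = np.sin(X) * np.cos(Y) * np.cos(Z)             # u at x-faces
X, Y, Z = np.meshgrid(c, f, c, indexing='ij')
v = -np.cos(X) * np.sin(Y) * np.cos(Z)            # v at y-faces
w = np.zeros((n, n, n))                           # w at z-faces

def to_centres(q, axis):
    """Second-order average to cell centres (stand-in for the compact interpolation)."""
    return 0.5 * (q + np.roll(q, -1, axis=axis))

def ddx(q, axis):
    """Second-order staggered difference (stand-in for the compact derivative)."""
    return (np.roll(q, -1, axis=axis) - q) / h

# Eq. (31): total kinetic energy from the centre-interpolated components
K = 0.5 * np.mean(to_centres(u, 0)**2 + to_centres(v, 1)**2 + to_centres(w, 2)**2)
# Eq. (32): maximum cell divergence
div = np.max(np.abs(ddx(u, 0) + ddx(v, 1) + ddx(w, 2)))
print("K =", K, "max div =", div)   # K is close to 0.125; div is at round-off level
```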

Figure 5(a) shows the kinetic energy computed for several time steps and without domain decomposition (p = 1). The kinetic energy is not exactly conserved but decreases with respect to the time step with a −3 slope. This feature is associated with the low numerical diffusion introduced by the third-order Runge–Kutta scheme (Knikker 2009), and does not originate from the compact scheme discretisation. This shows that the present spatial scheme conserves the kinetic energy. In this case, the divergence has been found to be around 10⁻¹³ at each time step. To highlight the effect of the partition size n/p on the kinetic energy error and the divergence, both quantities are considered with respect to n/p. Simulations are performed with a time step Δt = 2 × 10⁻² on a grid of size n = 140. The number of partitions p is such that n/p lies between 5 and 140. The divergence versus the partition size n/p is shown in Figure 5(b). For the low end of n/p, the divergence is around 10⁻⁶, and it rapidly decreases to machine accuracy. It seems reasonable to postulate that beyond 10 nodes per subdomain, the PDD algorithm does not introduce error. This agrees with the predictions of the error bound (19). In the same way, for each partition size, the kinetic energy at the simulation time tsim = 8 is compared to that computed for p = 1. As expected, the main discrepancy, around 10⁻⁴, is found for n/p = 5 and decreases to machine accuracy beyond n/p = 20 (Figure 5(c)). The decay rate seems to be exponential. Thus, combined with the present fourth-order discretisation, the PDD algorithm has no noticeable effect on the kinetic energy conservation and divergence for partition sizes beyond around 15–20 nodes. This validity range seems to be in accordance with an implementation of this algorithm on parallel architectures.

3.3. Strong scaling

As detailed in Section 2, the proposed numerical method relies on the evaluation of the explicit terms and the solution of the Helmholtz/Poisson problems. Due to the compact scheme discretisation, the computation of the explicit terms involves solutions of multiple tridiagonal linear systems, which are handled with the PDD algorithm detailed in Section 2.4. The Helmholtz/Poisson problems are solved with the diagonalisation method, whose parallel version, based on a 2d-pencil decomposition, is described in Section 2.5. Here, the performance of the parallelisation is investigated by means of a strong scaling analysis. The solver is configured to simulate the lid-driven cavity flow as detailed in Section 3.5. This configuration is considered because the three directions are non-periodic, which corresponds to the most expensive computations. The mesh size ranges from 128³ up to 1024³, and the number of CPU cores reaches 4096. The scaling analysis is performed with the Bull supercomputer 'occigen' hosted at the CINES. The cluster is composed of 50,544 cores spread over 2016 nodes. Each node includes two 12-core Intel E5-2690 processors with a


Figure . Strong scaling of the different parts of the Navier–Stokes solver.

clock frequency of 2.6 GHz. The communications rely on an InfiniBand (IB 4xFDR) fat-tree interconnect. The Intel compiler (16.0) and the Math Kernel Library (MKL 16.0) are chosen for the compilation work-flow. The Message Passing Interface library is Intel MPI 5.0. The profiling of the code concerns the time measurements of the explicit term evaluations and of the Helmholtz/Poisson problem solutions during one time-step advancement. It should be noted that one Runge–Kutta step consists of three sub-steps, each one involving a standard one-step method similar to an Adams–Bashforth/Crank–Nicolson time step. Figure 6 provides the strong scalability of the present DNS code. The colour denotes the mesh size and the filled square symbols refer to the runtime per time step. Empty circle and diamond symbols are associated with the evaluation of the temporal explicit terms (PDD) and the solutions of the Helmholtz/Poisson problems (2d-pencil), respectively. First, the scalability of the present algorithm for each mesh size is assessed. For the low end of the processor core numbers, the scalability is good, but at the high end a loss of scalability is observed. Thus, the strong scalabilities of the PDD algorithm and of the 2d-pencil decomposition are assessed separately. This can be seen in Figure 6, where the computational times of the evaluation of the explicit terms (PDD algorithm) and of the implicit terms (2d-pencil) are plotted. It can be observed that the 2d-pencil decomposition does not provide good scaling at the high end of the core counts. This probably originates from the large amount of communication involved in the global data transposes. The PDD algorithm shows a very good scalability up to 4096 cores. Thus, for the largest mesh, the computational time


with 576 cores is 15.5 s, and only 2.1 s with 4096 cores. In this case, the relative efficiency is around 100%. This result has to be associated with the communication map, which involves only neighbouring processes. It should be noted that these results have been obtained without deep code optimisation, which leads us to think that the scalability could be further improved.

3.4. Turbulent channel flow

In this section, the proposed algorithm is used to simulate a turbulent flow to outline the influence of the PDD algorithm on the results. Thus, several simulations of a plane channel flow have been carried out. The Reynolds number based on the friction velocity uτ and the channel half-width h is Reτ = uτ h/ν = 180, ν being the kinematic viscosity of the fluid. This corresponds to a Reynolds number based on the mean velocity of about Reb = um 2h/ν ≈ 5600. Here, the purpose is to check that the PDD algorithm does not degrade the data deduced from turbulent flow simulations, such as turbulence statistics or spectra. The recent article of Vreman and Kuerten (2014) gives a comprehensive review of several databases of turbulent channel flows at this Reynolds number. Their geometrical configuration is retained to ease comparisons with their results. The physical domain is a rectangular box of size 4π × 2h × 4/3π in the streamwise, wall-normal and spanwise directions, respectively. The mesh is uniform in the periodic streamwise and spanwise directions, and refined in the vicinity of the walls. To this end, a hyperbolic tangent function (Abide and Viazzo 2005) is used. Regardless of the domain partition, the simulations have been performed with three different mesh sizes. Table 2 presents the computational details of the simulations. The first node in the wall-normal direction is located at y+ = 0.12, while the largest space steps correspond to the homogeneous directions and range between 11 and 23 wall units. A statistically steady turbulence state is established over a time interval of length 20h/uτ, which has been reduced by a specific flow initialisation: a laminar flow is perturbed with artificial streaks located in the vicinity of the wall (Schoppa and Hussain 2000). All the simulations are performed with a constant CFL = 1.0, and the statistics are gathered over the time interval 20h/uτ ≤ Tstat ≤ 70h/uτ.

Table 2. Computational details of the turbulent channel flow simulations: mesh size nx × ny × nz, processor grid Prow × Pcol, grid spacings Δx+ × Δz+ and wall-normal spacing y+min–y+max for meshes (a), (b) and (c) (the numerical values could not be recovered from the source layout).

The computational grid is

partitioned into a 5 × 4 processor grid. This partition applied to the coarser mesh (a) leads to 12 nodes per subdomain, while for the finer mesh (c) it leads to 24 nodes per subdomain. The coarser mesh (a) does not fulfil the PDD accuracy requirement previously demonstrated. This under-resolution permits evaluating the reliability of the simulation in such a configuration. Regardless of the mesh size (a), (b) or (c) reported in Table 2, the mean velocity follows the linear velocity distribution u+ = y+ in the viscous sublayer (Figure 7(a)). On the other hand, the mean velocity profile follows the log-law u+ = 2.5 log y+ + 5.5. The computed solutions show a grid dependency which demonstrates that the mean velocity converges to the benchmark of Vreman and Kuerten (2014). The Reynolds stress tensor entries are plotted in Figure 7(b). The grid dependency of the Reynolds stresses on the mesh size is noted, and the results converge to those of Vreman and Kuerten (2014). Whatever the mesh, the grid partition has no noticeable impact on the computed second-order statistics. It can be concluded that the low level of error introduced by the PDD algorithm has no noticeable effect on the computed statistics, at least with a reasonably coarse partition. Now, we focus on the spectral analysis to assess the numerical method. The streamwise spectra are shown in Figure 8 for u, v, w and p. These spectra are computed at y+ ≈ 30 in order to make comparisons with those computed by Vreman and Kuerten (2014) by finite difference and spectral methods. The symbols Euu, Evv and Eww represent the standard velocity spectra. For each velocity spectrum and mesh size, the present simulations are able to reproduce the data of Vreman and Kuerten (2014) at the low wavenumbers. Some discrepancies are noticeable at the high wavenumbers. This feature is associated with the grid cut-off and, despite this, the low frequencies still remain well described. The spectra do not show any erroneous behaviour despite the truncation introduced by the PDD. These remarks still remain true with the finest partition of 12 nodes per subdomain.

3.5. Large-eddy simulation of a lid-driven cavity

Several works have been devoted to the turbulent lid-driven cavity flow. The interesting features of this flow are the simplicity of the boundary conditions and the three non-periodic directions. This test case is investigated here to show the ability of the present method to tackle this kind of flow. The flow is considered in a cubical cavity and is driven by a moving wall. The Reynolds number is based on the lid velocity u0 and the length L of the cubical domain. Experiments of Prasad and Koseff (1989) indicate that beyond a Reynolds number Re = 10⁴ the flow


Figure . Turbulent channel flow: (a) mean velocity profile and (b) velocity fluctuations and Reynolds shear stress profiles computed with different grid sizes.

is turbulent, and is dominated by large-scale vortices. This Reynolds number is retained, because results provided by numerical simulations (Ben-Cheikh et al. 2012; Bouffanais et al. 2006) or experiments of Prasad and Koseff (1989) are available. The WALE large eddy simulation model (6) is considered for the present simulations. Ben-Cheikh et al. (2012) have shown that the results strongly depend on the value of the constant Cw , which leads them to propose an ‘optimal’ model constant

to this configuration. More particularly, at Reynolds number Re = 10⁴, they find an optimal value for the WALE model of Cw^opt = 0.13. To ease comparisons, the same model constant is used. A numerical simulation is performed on a mesh of 64 × 64 × 64, refined near the walls by means of a hyperbolic mesh transformation. The computational domain is partitioned into a 2d MPI processor grid of size 5 × 4. The time step is dynamically updated in order to ensure a CFL number equal to 1. After

Figure . Turbulence spectra: (a) streamwise, (b) transversal, (c) spanwise velocity.


Figure . Mean velocity profiles (a) and iso-surface of the Q criterion iso-surfaces coloured by x-vorticity magnitude (b).

a dimensionless time of integration of about 500, the statistically steady state is reached. The statistical moments are collected over the time interval 500 ≤ t ≤ 1000. The mean velocity fields U and W and the Reynolds stress components Urms = 10 √⟨u′u′⟩, Wrms = 10 √⟨w′w′⟩ and UWrs = 500 ⟨u′w′⟩ are computed in the symmetry plane y = 0.5, and considered along the vertical and horizontal centrelines. Figure 9 shows the dimensionless average velocity components U and W at the centrelines of the symmetry plane. As seen here, the computed solutions are in good agreement with the experimental results of Prasad and Koseff (1989). Although the same model as

Ben-Cheikh et al. (2012) has been considered, some discrepancies are noticeable, in particular for the maximum values in the boundary layers. Figure 10 displays the turbulence intensities Urms and Wrms and the Reynolds stress component UW. The computed turbulence statistics obtained from the present method are in agreement with the experimental data of Prasad and Koseff (1989) and the recent results of Ben-Cheikh et al. (2012). The present results outline the reliability of the compact scheme discretisation combined with the PDD algorithm for simulating turbulent flows.

Figure . Comparisons of rms u (a) and w (b) velocity and Reynolds shear stress (c) profiles between present and LES or experimental results (Ben-Cheikh et al. ; Prasad and Koseff ).


4. Conclusion

In this paper, a strategy to compute solutions of the incompressible Navier–Stokes equations discretised with compact schemes has been developed. This strategy is based on two main features which allow the computation of fully non-homogeneous turbulent flows on distributed computer architectures. The approximate tridiagonal solver, the PDD algorithm, permits computing the compact scheme approximations of derivatives and interpolations with a perfect parallel scaling. The implicit diffusive terms and the pressure correction equation are solved by means of the full three-dimensional diagonalisation method. A parallel version based on a 2d data decomposition and global data transposes has been successfully implemented. This strategy benefits from a good scalability up to 4096 MPI tasks, while conserving the main features of a high-order DNS code, such as a divergence-free velocity field and kinetic energy conservation. Several numerical experiments have also been performed to outline the accuracy and the reliability of the present method. The spatial and temporal accuracy has been checked against steady and unsteady analytical solutions. The influence of the approximate tridiagonal solver PDD on the divergence and the kinetic energy has been assessed. Simulations of a turbulent channel flow have been performed, showing the capability of the code to compute accurately the flow spectrum and statistics regardless of the domain decomposition. The turbulent flow in a lid-driven cavity has also been simulated using the WALE LES model. An excellent agreement with experimental and numerical data has been found, showing the ability of the proposed method to deal with three non-periodic directions. This method could have some interesting perspectives on massively parallel machines, particularly with a hybrid OpenMP/MPI programming model. However, the large algorithmic cost involved in the solution of the Helmholtz/Poisson problems still remains a topical issue.

Acknowledgements

Simulations were performed using an allocation on the OCCIGEN system, based in France at CINES.

Disclosure statement

No potential conflict of interest was reported by the authors.

References

Abide, Stéphane, Xavier Chesneau, and Belkacem Zeghmati. 2012. "A Fourth-Order Iterative Solver for the Singular


Poisson Equation.” In Numerical Analysis and Its Applications, edited by Ivan, Dimov, István Faragó and Lubin Vulkov, 143–150. Berlin Heidelberg: Springer. Abide, Stéphane, and Stéphane Viazzo. 2005. “A 2D Compact Fourth-Order Projection Decomposition Method.” Journal of Computational Physics 206(1): 252–276. Auzinger, Winfried, and Hans J. Stetter. 1982. “Defect Corrections and Multigrid Iterations.” In Multigrid Methods, 327– 351. Berlin: Springer. Ben-Cheikh, Nader, Faycel Hammami, Antonio Campo, and Brahim Ben-Beya. 2012. “A Dynamic Sub-Grid Scale Model for Large Eddy Simulation of Turbulent Flows in a Lid-Driven Cubical Cavity.” Comptes Rendus Mécanique 340(10): 721–730. Bermejo-Moreno, I., J. Bodart, and J. Larsson. 2013. “Scaling Compressible Flow Solvers on the IBM Blue Gene/Q Platform on up to 1.97 Million Cores.” Center for Turbulence Research, Annual Research Briefs, 343–358. Bouffanais, Roland, Michel O. Deville, Paul F. Fischer, Emmanuel Leriche, and Daniel Weill. 2006. “Large-Eddy Simulation of the Lid-Driven Cubic Cavity Flow by the Spectral Element Method.” Journal of Scientific Computing 27(1): 151–162. Brüger, Arnim, Bertil Gustafsson, Per Lötstedt, and Jonas Nilsson. 2005. “High Order Accurate Solution of the Incompressible Navier–Stokes Equations.” Journal of Computational Physics 203(1): 49–71. Canuto, Claudio, M. Yousuff Hussaini, Alfio Quarteroni, and Thomas A. Zang. 1988. Spectral Methods in Fluid Dynamics. Tech. Report. Berlin Heidelberg: Springer. Choi, J., J. Demmel, Inderjit S. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley. 1996. “ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers, Design Issues and Performance.” Computer Physics Communications 97: 1–15. Duponcheel, M., P. Orlandi, and G. Winckelmans. 2008. “TimeReversibility of the Euler Equations as a Benchmark for Energy Conserving Schemes.” Journal of Computational Physics 227(19): 8736–8752. Fox, Douglas G., and Steven A. Orszag. 1973. “Pseudospectral Approximation to Two-Dimensional Turbulence.” Journal of Computational Physics 11(4): 612–619. Gaitonde, Datta V., and Miguel R. Visbal. 1999. “Further Development of a Navier–Stokes Solution Procedure Based on Higher-Order Formulas.” AIAA Paper 99-0557, January 1999. http://dx.doi.org/10.2514/6.1999-557 Kim, Jae Wook. 2013. “Quasi-Disjoint Pentadiagonal Matrix Systems for the Parallelization of Compact FiniteDifference Schemes and Filters.” Journal of Computational Physics 241: 168–194. Kim, Jae Wook, and Richard D. Sandberg. 2012. “Efficient Parallel Computing with a Compact Finite Difference Scheme.” Computers and Fluids 58: 70–87. Knikker, Ronnie. 2009. “Study of a Staggered Fourth-Order Compact Scheme for Unsteady Incompressible Viscous Flows.” International Journal for Numerical Methods in Fluids 59(10): 1063–1092. Ladeinde, F., X. Cai, M.R. Visbal, and D. Gaitonde. 2003. “Parallel Implementation of Curvilinear High-order Formulas.” International Journal of Computational Fluid Dynamics 17: 467–485. Laizet, Sylvain, and Eric Lamballais. 2009. “High-Order Compact Schemes for Incompressible Flows: A Simple and


Efficient Method with Quasi-Spectral Accuracy.” Journal of Computational Physics 228(16): 5989–6015. Laizet, Sylvain, and Ning Li. 2011. “Incompact3d: A Powerful Tool to Tackle Turbulence Problems with up to O (105) Computational Cores.” International Journal for Numerical Methods in Fluids 67(11): 1735–1757. Le, Hung, and Parviz Moin. 1991. “An Improvement of Fractional Step Methods for the Incompressible Navier-Stokes Equations.” Journal of Computational Physics 92(2): 369– 379. Lee, Myoungkyu, and Robert D. Moser. 2015. “Direct Numerical Simulation of Turbulent Channel Flow up to Reτ = 5200.” Journal of Fluid Mechanics 774: 395–415. Lele, Sanjiva K. 1992. “Compact Finite Difference Schemes with Spectral-Like Resolution.” Journal of Computational Physics 103(1): 16–42. Li, Ning, and Sylvain Laizet. 2010. “2DECOMP&FFT–A Highly Scalable 2D Decomposition Library and FFT Interface.” In Cray User Group 2010 Conference, Edinburgh, 2010 May 24–27. Moin, Parviz, and John Kim. 1982. “Numerical Investigation of Turbulent Channel Flow.” Journal of Fluid Mechanics 118: 341–377. Nicoud, Franck, and Frédéric Ducros. 1999. “Subgrid-Scale Stress Modelling Based on the Square of the Velocity Gradient Tensor.” Flow, Turbulence and Combustion 62(3): 183– 200. Oguic, Romain, Stéphane Viazzo, and Sébastien Poncet. 2015. “A Parallelized Multidomain Compact Solver for

Incompressible Turbulent Flows in Cylindrical Geometries.” Journal of Computational Physics 300: 710–731. Polizzi, Eric, and Ahmed Sameh. 2007. “SPIKE: A Parallel Environment for Solving Banded Linear Systems.” Computers & Fluids 36(1): 113–120. Povitsky, Alex, and Philip J. Morris. 2000. “A Higher-Order Compact Method in Space and Time Based on Parallel Implementation of the Thomas Algorithm.” Journal of Computational Physics 203: 182–203. Prasad, Ajay K., and Jeffrey R. Koseff. 1989. “Reynolds Number and End-Wall Effects on a Lid-Driven Cavity Flow.” Physics of Fluids 1: 208. Schiestel, R., and S. Viazzo. 1995. “A Hermitian-Fourier Numerical Method for Solving the Incompressible Navier-Stokes Equations.” Computers & Fluids 24(6): 739–752. Schoppa, Wade, and Fazle Hussain. 2000. “Coherent Structure Dynamics in Near-Wall Turbulence.” Fluid Dynamics Research 26(2): 119–139. Sengupta, T.K., A. Dipankar, and A. Kameswara Rao. 2007. “A New Compact Scheme for Parallel Computing Using Domain Decomposition.” Journal of Computational Physics 220: 654–677. Sun, Xian -He. 1995. “Application and Accuracy of the Parallel Diagonal Dominant Algorithm.” Parallel Computing 21(8): 1241–1267. Vreman, A.W., and J.G.M. Kuerten. 2014. “Comparison of Direct Numerical Simulation Databases of Turbulent Channel Flow at Reτ = 180.” Physics of Fluids 26(1): 015102.