P-CSI v1.0, an accelerated barotropic solver for the high-resolution ocean model component in the Community Earth System Model v2.0

In the Community Earth System Model (CESM), the ocean model is computationally expensive for high-resolution grids and is often the least scalable component for high-resolution production experiments. The major bottleneck is that the barotropic solver scales poorly at high core counts. We design a new barotropic solver to accelerate the high-resolution ocean simulation. The novel solver adopts a Chebyshev-type iterative method to reduce the global communication cost, in conjunction with an effective block preconditioner to further reduce the number of iterations. The algorithm and its computational complexity are theoretically analyzed and compared with other existing methods. We confirm the significant reduction of the global communication time with a competitive convergence rate using a series of idealized tests. Experimental results obtained with the CESM 0.1° global ocean model show that the proposed approach results in a factor of 1.7 speed-up over the original method with no loss of accuracy, achieving 10.5 simulated years per wall-clock day on 16,875 cores.

Previous studies have shown that increasing the ocean models' resolution to the eddy-resolving level helps capture the positive correlation between sea surface temperature and surface wind stress and improve the asymmetry of the ENSO cycle in the simulation.
In the High-Resolution Model Intercomparison Project (HighResMIP) for the Coupled Model Intercomparison Project phase 6 (CMIP6), global model resolutions of 25 km or finer at mid-latitudes are proposed for the Tier-1 and Tier-2 experiments (Eyring et al., 2015). Because all CMIP6 climate models are required to run for hundreds of years, tremendous computing resources are needed for high-resolution production simulations. To run high-resolution climate models practically, additional algorithm optimization is required to efficiently utilize large-scale computing resources. In CESM, the ocean component is the Parallel Ocean Program (POP), whose solution is split into a baroclinic mode and a barotropic mode (Smith et al., 2010). The baroclinic mode describes the three-dimensional dynamic and thermodynamic processes, and the barotropic mode solves the vertically integrated momentum and continuity equations in two dimensions.
The barotropic solver is the major bottleneck of the POP within the high-resolution CESM because it dominates the total execution time on a large number of cores (Jones et al., 2005). This results from the implicit calculation of the free-surface height in the barotropic solver, which scales poorly at high core counts due to a global communication bottleneck inherent in the algorithm. The implicit solver allows a large time step to efficiently handle the fast gravity wave mode but requires a large elliptic system of equations to be solved. The conjugate gradient (CG) method and its variants are popular choices in implicit free-surface ocean solvers, such as MITgcm (Adcroft et al., 2014), FVCOM (Lai et al., 2010), MOM3 (Pacanowski and Griffies, 1999), and OPA (Madec et al., 1997).
However, the standard CG method has heavy global communication overhead in the existing POP implementation (Worley et al., 2011). The Chronopoulos-Gear (ChronGear) variant of the CG algorithm (D'Azevedo et al., 1999) is currently used in the POP to reduce the number of global reductions. A good overview of reducing global communication costs for the CG method can be found in the work of Ghysels and Vanroose (2014). Recent efforts to improve the performance of the CG method include a variant that overlaps the global reduction with the matrix-vector computation via a pipelined approach (Ghysels and Vanroose, 2014). However, the improvement is still limited when a very large number of cores is used because of the remaining global reduction operations. This paper extends our earlier work presented at the International Conference for High Performance Computing, Networking, Storage and Analysis (SC2015) to emphasize the theoretical analysis of the computational complexity, the convergence of P-CSI and the high-resolution POP results.
The remainder of this paper is organized as follows. Section 2 reviews the existing barotropic solver in the POP. Section 3 details the design of the P-CSI solver, followed by an analysis of the computational complexity and convergence rate of P-CSI in Section 4. Section 5 further compares the high-resolution performance of the existing solvers and the P-CSI solver. Finally, conclusions are given in Section 6.

Barotropic solver background
We briefly describe the governing equations to formally derive the new P-CSI solver in the POP. The primitive momentum and continuity equations are expressed as Eq. (1) and Eq. (2), where L(α) = ∂(uα)/∂x + ∂(vα)/∂y + ∂(wα)/∂z, which is equivalent to the divergence operator when α = 1; x, y, and z are the horizontal and vertical coordinates; u = [u, v]ᵀ is the horizontal velocity; w is the vertical velocity; f is the Coriolis parameter; p and ρ₀ represent the pressure and the water density, respectively; and F_H and F_V are the horizontal and vertical dissipative terms, respectively (Smith et al., 2010). In particular, we emphasize the two-dimensional barotropic mode in the time-splitting scheme, where the P-CSI is implemented.
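For orientation, a standard form of these two equations, consistent with the definitions above and with the POP formulation of Smith et al. (2010), is sketched below; the notation in the original Eqs. (1)-(2) may differ slightly.

$$\frac{\partial \mathbf{u}}{\partial t} + L(\mathbf{u}) + f\,\mathbf{k}\times\mathbf{u} = -\frac{1}{\rho_0}\nabla p + F_H(\mathbf{u}) + F_V(\mathbf{u}), \qquad L(1) = 0,$$

where the first relation is the horizontal momentum equation and the second expresses continuity (incompressibility) through the operator L defined above.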

Barotropic mode
The governing equations for the barotropic mode can be obtained by vertically integrating Eq. (1) and Eq. (2) from the ocean bottom topography to the sea surface, yielding Eq. (3) and Eq. (4), where U is the vertically integrated barotropic velocity, g is the acceleration due to gravity, and η is the sea surface height (defined as p_s/(ρ₀g), where p_s is the surface pressure). Discretizing these equations in time with an implicit scheme gives Eq. (5) and Eq. (6), where τ is the time step associated with the time advance scheme. By replacing the barotropic velocity in Eq. (6) with the barotropic velocity at the next time step from Eq. (5), an elliptic system for the sea surface height η is obtained in Eq. (7). For simplicity, we can rewrite the elliptic Eq. (7) as Eq. (8), where ψ represents a function of the current state of η.
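The essential structure of the resulting implicit free-surface problem, written here only to fix ideas (implicitness factors and forcing terms are absorbed into ψ; the exact Eqs. (7)-(8) follow Smith et al., 2010), is

$$\left[\nabla\cdot(H\nabla) - \frac{1}{g\tau^{2}}\right]\eta^{n+1} = \psi\left(\eta^{n},\mathbf{U}^{n}\right),$$

so the coefficient matrix A combines a Laplacian-like part, scaled by the ocean depth H, with a diagonal shift proportional to 1/(gτ²); this shift reappears below as the factor φ in the discrete system.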
Spatially, the POP utilizes the Arakawa B-grid on the horizontal grid (Smith et al., 2010) with the following nine-point stencil to discretize Eq. (8), as written in Eq. (9) (see Fig. 1), where δ_ξ (ξ ∈ {x, y}) are finite differences and ∆_ξ (ξ ∈ {x, y}) are the associated grid lengths; the finite-difference δ_ξ(ψ) and average notations are defined as in Smith et al. (2010). Because the POP uses general orthogonal grids, the coefficient matrix varies in space. To demonstrate the properties of the sparse matrix used in the POP, we can simplify Eq. (9) using a special case with uniform grids and a constant ocean depth H, which gives the coefficients in Eq. (13), where S_{i,j} = ∆x∆y, and A^χ_{i,j} (χ ∈ Q = {O, NW, NE, SW, SE, W, E, N, S}) are the coefficients between grid point (i, j) and its neighbors in the nine-point stencil discretization (9), determined by ∆x, ∆y, τ and H through the aspect ratio α = ∆y/∆x and β = 1/α. The full discretization of Eq. (8) for any given grid point (i, j) can then be written as Eq. (14), where φ = S_{i,j}/(gτ²H) is a factor of the time step. Therefore, the elliptic Eq. (7) leads to a linear system in η, i.e., Ax = b, where A is a block tridiagonal matrix composed of the coefficients A^χ_{i,j} (χ ∈ Q). The simplified Eqs. (13) and (14) show that A is mainly determined by the horizontal grid sizes, the ocean depth and the time step; these impacts are further discussed in Section 4.1. Note that Eq. (14) also indicates that the sparsity pattern of A comes directly from the nine nonzero elements in each row (Fig. 2).
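To make the sparsity structure concrete, the following Python sketch assembles a matrix with the nine-point stencil pattern of Eq. (14) using SciPy. It is illustrative only: the coefficient arrays passed in are placeholders standing in for the A^χ_{i,j} of Eq. (13), not the actual POP coefficients.

    import numpy as np
    import scipy.sparse as sp

    def assemble_nine_point(nx, ny, coeff):
        """Assemble a nine-point stencil operator as a sparse matrix.

        coeff maps the stencil positions 'O','N','S','E','W','NE','NW','SE','SW'
        to (ny, nx) arrays of coefficients (placeholders for the A^chi_{i,j}).
        """
        offsets = {'O': (0, 0), 'E': (0, 1), 'W': (0, -1), 'N': (1, 0), 'S': (-1, 0),
                   'NE': (1, 1), 'NW': (1, -1), 'SE': (-1, 1), 'SW': (-1, -1)}
        idx = lambda j, i: j * nx + i              # row-major global index of grid point (i, j)
        rows, cols, vals = [], [], []
        for j in range(ny):
            for i in range(nx):
                for chi, (dj, di) in offsets.items():
                    jj, ii = j + dj, i + di
                    if 0 <= jj < ny and 0 <= ii < nx:   # skip neighbors outside the domain
                        rows.append(idx(j, i))
                        cols.append(idx(jj, ii))
                        vals.append(coeff[chi][j, i])
        return sp.csr_matrix((vals, (rows, cols)), shape=(nx * ny, nx * ny))

    # Example with constant placeholder coefficients and a diagonal shift phi.
    nx, ny, phi = 32, 16, 0.1
    coeff = {chi: -np.ones((ny, nx)) for chi in ('N', 'S', 'E', 'W')}
    coeff.update({chi: -0.5 * np.ones((ny, nx)) for chi in ('NE', 'NW', 'SE', 'SW')})
    coeff['O'] = (6.0 + phi) * np.ones((ny, nx))        # diagonally dominant center coefficient
    A = assemble_nine_point(nx, ny, coeff)
    print(A.nnz)                                        # each interior row has nine nonzeros

Every interior row contains exactly nine nonzero entries, producing the block tridiagonal pattern sketched in Fig. 2.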

Barotropic solvers
The barotropic solver in the original POP uses the PCG method with a diagonal preconditioner M = Λ(A) because of its efficiency at small-scale parallelism (Dukowicz and Smith, 1994) (see Appendix A1 for the details). To mitigate the global communication bottleneck, ChronGear, a variant of the CG method proposed by D'Azevedo et al. (1999), was later introduced as the default solver in the POP. It combines the two separate global reductions of single scalars into one global communication (see Appendix A2). Through this rearrangement, the ChronGear method achieves a one-third latency reduction in the POP. However, the scaling bottleneck still exists in the high-resolution POP using this solver, particularly with a large number of cores (Fig. 3).
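The communication side of this rearrangement can be illustrated with a small mpi4py sketch (hypothetical variable names, not the POP implementation): the two scalar inner products needed in an iteration are packed into one buffer so that a single MPI_Allreduce is issued instead of two; the algebraic reordering that makes both quantities available at the same point of the iteration is omitted here.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    def fused_dots(r_local, s_local, q_local):
        """Return (r.r, s.q) using a single global reduction instead of two."""
        local = np.array([np.dot(r_local, r_local),    # local part of the first inner product
                          np.dot(s_local, q_local)])   # local part of the second inner product
        glob = np.empty(2)
        comm.Allreduce(local, glob, op=MPI.SUM)        # one latency-bound collective, not two
        return glob[0], glob[1]

    # Toy usage with a random local chunk on each rank.
    rng = np.random.default_rng(comm.Get_rank())
    r = rng.standard_normal(1000)
    s = rng.standard_normal(1000)
    q = rng.standard_normal(1000)
    rr, sq = fused_dots(r, s, q)

Since each collective carries a latency that grows with the number of processes, reducing the number of reductions per iteration directly attacks the term that dominates the solver time at scale.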
To accurately profile the parallel cost of the barotropic solvers, we clearly separate the timing for computation, boundary communication and global reduction. Operations such as scalar computations and vector scalings are categorized as pure computation, which is relatively cheap because the operations on each process are independent. An extra boundary communication is required for each process to update the boundary values from its neighbors (Fig. 1) after the matrix-vector multiplication. This boundary communication usually costs more than the computation when a large number of cores is used (due to the decreasing problem size per core). The global reduction, which is needed by the inner products of vectors, is even more costly (Hu et al., 2013). Worley et al. (2011) and Dennis et al. (2012) specifically indicated that the global reduction in the POP's barotropic solver is the main scaling bottleneck for high-resolution ocean simulation.

Design of the P-CSI solver
The CG-type solver converges rapidly in the sequential computation (Golub and Van Loan, 2012).
However, the bottleneck of global communication embedded in ChronGear still limits the large-scale parallel performance. Here, we design a new solver with the goal of reducing global communication so that the parallel efficiency remains as close to ideal as possible when a large number of cores is used.

Classical Stiefel Iteration method
The CSI is a special type of Chebyshev iterative method (Stiefel, 1958). Saad et al. (1985) proposed a generalization of CSI on linearly connected processors and claimed that this approach outperforms the CG method when the eigenvalues are known. This method was revisited by Gutknecht and Röllin (2002) and shown to be ideal for massively parallel computers. In the procedure of preconditioned CSI (P-CSI, details are provided in Appendix A3), the iteration parameters, which control the search directions in each iteration step, are derived from a stretched Chebyshev function of the two extreme eigenvalues (Stiefel, 1958). We demonstrate in Section 4.2 that the stretched Chebyshev function in P-CSI provides a series of preset parameters for the iteration directions. As a result, P-CSI requires no inner product operation, thus potentially avoiding the bottleneck of global reduction (see the workflow of ChronGear and P-CSI in Fig. 4). This makes P-CSI more scalable than ChronGear on massively parallel architectures. However, it requires a priori knowledge about the spectrum of the coefficient matrix A (Gutknecht and Röllin, 2002). It is well known that obtaining the eigenvalues of a linear system of equations is as hard as solving it. Fortunately, the coefficient matrix A and its preconditioned form in the POP are both positive-definite real symmetric matrices. An approximate estimation of the largest and smallest eigenvalues, µ and ν, respectively, of the preconditioned coefficient matrix is sufficient to ensure the convergence of P-CSI.
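A minimal serial sketch of a preconditioned Chebyshev (CSI-type) iteration is shown below, following the textbook formulation (e.g., Saad, Iterative Methods for Sparse Linear Systems) rather than the parallel POP code; it assumes estimates nu and mu of the extreme eigenvalues of M⁻¹A are already available.

    import numpy as np

    def pcsi(A, b, Minv, nu, mu, x0=None, tol=1e-6, maxit=500, check_every=10):
        """Preconditioned Chebyshev iteration for a symmetric positive-definite system.

        A    : callable returning A @ x
        Minv : callable returning M^{-1} r (preconditioner application)
        nu,mu: estimates of the smallest/largest eigenvalues of M^{-1} A
        """
        x = np.zeros_like(b) if x0 is None else x0.copy()
        theta = 0.5 * (mu + nu)            # center of the eigenvalue interval
        delta = 0.5 * (mu - nu)            # half-width of the eigenvalue interval
        sigma = theta / delta
        rho = 1.0 / sigma
        r = b - A(x)
        d = Minv(r) / theta                # first search direction; no inner product needed
        for k in range(maxit):
            x = x + d
            r = r - A(d)
            rho_new = 1.0 / (2.0 * sigma - rho)
            d = rho_new * rho * d + (2.0 * rho_new / delta) * Minv(r)
            rho = rho_new
            # In a parallel implementation this residual norm, evaluated only every
            # check_every iterations, would be the sole global reduction.
            if (k + 1) % check_every == 0 and np.linalg.norm(r) < tol * np.linalg.norm(b):
                break
        return x

All search parameters are preset by nu and mu, so the main loop contains no inner products; this mirrors how P-CSI confines global reductions to the occasional convergence check.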
To efficiently estimate the extreme eigenvalues of the preconditioned matrix M⁻¹A (where M is the preconditioner), we adopt the Lanczos method (Paige, 1980) (see the algorithm in Appendix B).
Initial tests indicate that only a small number of Lanczos steps is necessary to estimate the extreme eigenvalues of M⁻¹A accurately enough for near-optimal P-CSI convergence (Hu et al., 2015). Therefore, the computational overhead of the eigenvalue estimation is very small in our algorithm.

A block EVP preconditioner
Block preconditioning is quite promising in the POP because the parallel domain decomposition is ideal for a block structure. A block preconditioner based on the EVP method was proposed and detailed in Hu et al. (2015) to improve the parallel performance of the barotropic solver in the POP. To the best of our knowledge, the EVP method and its variants are among the least costly algorithms for solving elliptic equations in serial computation (Roache, 1995), and they have been used in several different ocean models (Dietrich et al., 1987; Sheng et al., 1998; Young et al., 2012). A parallel EVP solver was also implemented by Tseng and Chien (2011). The standard EVP is actually a direct solver, which requires two solution stages: preprocessing and solving. In the preprocessing stage, the influence coefficient matrix and its inverse are computed, involving a computational complexity of C_pre = (2n − 5)·9n² + (2n − 5)³ = O(26n³), which is intensive but computed only once at the beginning.
The solving stage is computed at every time step and requires only C_evp = 2·9n² + (2n − 5)² = O(22n²) (Hu et al., 2015), which is a much lower computational cost than other direct solvers such as LU.
The EVP method is efficient for solving elliptic equations. However, a major drawback of the standard EVP is that, without applying additional modifications, it cannot be used for a large domain due to its global error propagation, which will cause arithmetic overflow in the marching process (Roache, 1995). The fact that the EVP is not well-suited for large domains is not an issue for large-scale parallel computing, where a larger number of processors typically results in smaller domains.
Thus, the serial disadvantage becomes an advantage in parallel computing, making the EVP ideal for parallel block preconditioning on a large number of cores. Although the EVP preconditioning may increase the required computation for each iteration, the barotropic solver can greatly benefit from the resulting reduction in iterations, particularly at very large numbers of cores when communication costs dominate the total costs (Hu et al., 2015). We will further illustrate this advantage in Section 5.2.3.
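The structure of such a block preconditioner can be sketched in a few lines of Python. For simplicity, the sketch below factorizes each local block exactly once and reuses the factorization, i.e., a block-Jacobi-style stand-in; the actual preconditioner of Hu et al. (2015) instead solves each block approximately with the EVP marching procedure, and in the parallel code each MPI process owns exactly one block.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    class BlockPreconditioner:
        """Apply M^{-1} block by block, one block per subdomain.

        blocks : list of index arrays, one per subdomain (mimicking the
                 domain decomposition owned by each process).
        """
        def __init__(self, A, blocks):
            self.blocks = blocks
            Acsr = sp.csr_matrix(A)
            # Factorize each diagonal block once, analogous to the EVP
            # preprocessing stage performed before time stepping begins.
            self.solvers = [spla.splu(Acsr[idx, :][:, idx].tocsc()) for idx in blocks]

        def __call__(self, r):
            z = np.zeros_like(r)
            for idx, lu in zip(self.blocks, self.solvers):
                z[idx] = lu.solve(r[idx])      # independent local solves, no communication
            return z

Applying M⁻¹ then costs only a local solve per block per iteration, and the payoff is the reduced iteration count discussed in Section 5.2.3.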

Algorithm analysis and comparison
The extreme eigenvalues of the coefficient matrix are critical to determine the convergence of the iterative solvers (such as P-CSI, PCG and ChronGear). Here, the characteristics of P-CSI are investigated in terms of the associated eigenvalues and their connection with the convergence rate. The computational complexity is also addressed.

Spectrum and condition number
Because the coefficient matrix A in POP is symmetric and positive definite (Smith et al., 2010), its eigenvalues are positive real numbers (Stewart, 1976). We denote by S the spectrum of A, i.e., the set of eigenvalues λ_i (1 ≤ i ≤ N, where N is the size of A) (Golub and Van Loan, 2012). The condition number, defined as κ = λ_max/λ_min, is determined by the extent of the spectrum. Using the Gershgorin circle theorem (Bell, 1965), we know that for any λ ∈ S there exists a pair (i, j) such that λ lies within the Gershgorin interval determined by the coefficients A^χ_{i,j} and the factor φ = S_{i,j}/(gτ²H) defined in Section 2.1. With the definition of the coefficients in (13), we obtain the bounds in (16). To quantitatively evaluate the impacts of the condition number, we set up a series of idealized test cases to solve Eq. (8) in which the coefficient matrices are derived from Eqs. (13) and (14) on an idealized cylinder with an earth-size perimeter of 2πR (radius R = 6372 km) and a height of πR. A uniform grid of size N × M is used, where the grid sizes along the perimeter and the height are ∆x = 2πR/N and ∆y = πR/M, respectively. The depth H is 4 km.
The inequalities (16) suggest that the lower bound of eigenvalues is mostly determined by φ.
This indicates that for a given ocean configuration and grid size, the lower bound of the eigenvalues will decrease with increasing time step, resulting in a larger condition number. Figure 5 shows the condition numbers obtained for different grid configurations. Consistent with the theoretical bounds on the extreme eigenvalues in Eq. (16), the condition number in Fig. 5 is smallest when the grid aspect ratio is close to unity (i.e., the 64 × 32 decomposition). When the aspect ratio of the horizontal grid cell is close to unity, the upper (lower) bound of the largest (smallest) eigenvalue decreases (increases), leading to a narrower spectral interval [λ_min, λ_max]. This implies that the condition number is also reduced. When the aspect ratio equals unity (i.e., α = ∆y/∆x = 1), we obtain λ_max ≤ 4 + φ and λ_min ≥ φ. Figure 6 shows the condition number versus the aspect ratio for a fixed grid size N = 2048. Three different time step sizes are tested: 1.0×10⁵ s, 5.0×10⁵ s and 10.0×10⁵ s. Under this configuration, it is clear that the condition number reaches its minimum when the aspect ratio is unity, regardless of the time step size.
When the time step is sufficiently large, the foregoing analysis indicates that the spectrum is confined to (φ, 4 + φ) if the aspect ratio is 1, regardless of the grid size. However, the condition number may still vary greatly: when the grid size N increases, the largest eigenvalue remains close to 4, whereas the smallest eigenvalue approaches φ. The previous discussion implies that the condition number is significantly affected when the aspect ratio is far from unity, so to focus on the impact of the number of grid points we choose a constant aspect ratio. Because the time step size may also play an important role as the grid size increases, we assume that the time step satisfies the Courant-Friedrichs-Lewy (CFL) condition (Courant et al., 1967), that is, τ ≤ ∆x/v, where v is the supported barotropic velocity. Three different configurations of the time step based on v = 2 m/s, v = 20 m/s and v = 200 m/s are chosen. Figure 7 shows that the condition number increases monotonically with increasing grid size. It also shows that the time step (specified by a different propagating speed) has a large impact on the condition number.
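The Gershgorin bounds and the growth of the condition number are easy to check numerically. The sketch below uses a simple shifted five-point Laplacian as a stand-in for the barotropic operator (placeholder coefficients, not the POP matrix of Eq. 13), computes the Gershgorin interval, and estimates the extreme eigenvalues with SciPy.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def gershgorin_bounds(A):
        """Return (lower, upper) Gershgorin bounds on the spectrum of A."""
        A = sp.csr_matrix(A)
        diag = A.diagonal()
        radius = np.abs(A).sum(axis=1).A1 - np.abs(diag)   # off-diagonal row sums
        return (diag - radius).min(), (diag + radius).max()

    def shifted_laplacian(n, phi):
        """Stand-in operator: 2-D five-point Laplacian plus a phi shift on the diagonal."""
        L1 = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
        I = sp.identity(n)
        return (sp.kron(I, L1) + sp.kron(L1, I) + phi * sp.identity(n * n)).tocsr()

    for n in (16, 32, 64):
        A = shifted_laplacian(n, phi=0.1)
        lo, hi = gershgorin_bounds(A)
        lam_max = spla.eigsh(A, k=1, which='LA', return_eigenvectors=False)[0]
        lam_min = spla.eigsh(A, k=1, which='SA', return_eigenvectors=False)[0]
        print(n, lo, hi, lam_max / lam_min)    # condition number grows with the grid size

For this stand-in the spectrum stays inside the Gershgorin interval [phi, 8 + phi], while the smallest eigenvalue approaches phi as the grid is refined, so the condition number grows with the number of grid points, in line with Fig. 7.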

Convergence rate
The convergence rate of an elliptic solver relies heavily on the condition number of the preconditioned coefficient matrix A′. Both PCG and ChronGear have the same theoretical convergence rate because they implement the same numerical algorithm in different ways (D'Azevedo et al., 1999).
Their relative error in the k-th iteration, measured in the A-norm, has the following upper bound (Liesen and Tichý, 2004):

$$\frac{\|x_k - x_*\|_{A}}{\|x_0 - x_*\|_{A}} \le \min_{P_k \in \mathcal{P}_k,\, P_k(0)=1}\ \max_{\lambda \in S(A')} |P_k(\lambda)|, \qquad (17)$$

where x_k is the solution vector after the k-th iteration, x_* is the solution of the linear system (i.e., x_* = A⁻¹b), λ represents an eigenvalue of the preconditioned matrix A′, and 𝒫_k is the vector space of polynomials with real coefficients and degree less than or equal to k. Applying the Chebyshev polynomials of the first kind to estimate this min-max approximation, we obtain

$$\frac{\|x_k - x_*\|_{A}}{\|x_0 - x_*\|_{A}} \le 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k}, \qquad (18)$$

where κ = κ₂(A′) is the condition number of the matrix A′ with respect to the l₂-norm.
Equation (18) indicates that the theoretical convergence rate of PCG deteriorates as the condition number increases: PCG converges faster for a well-conditioned matrix (i.e., one with a small condition number) than for an ill-conditioned one.
We now show that P-CSI has the same order of convergence rate as PCG and ChronGear, with the additional advantage of requiring fewer global reductions in parallel computing. With ν and µ the estimated smallest and largest eigenvalues of the preconditioned coefficient matrix, the residual of the P-CSI algorithm satisfies

$$\frac{\|r_k\|_2}{\|r_0\|_2} \le \max_{\zeta \in [\nu,\mu]} |P_k(\zeta)|,$$

where P_k(ζ) = τ_k(β − αζ)/τ_k(β) for ζ ∈ [ν, µ], with α = 2/(µ − ν) and β = (µ + ν)/(µ − ν) (Stiefel, 1958). Here τ_k(ξ) is the Chebyshev polynomial of the first kind,

$$\tau_k(\xi) = \frac{1}{2}\left[\left(\xi + \sqrt{\xi^2 - 1}\right)^{k} + \left(\xi - \sqrt{\xi^2 - 1}\right)^{k}\right],$$

which, when ξ ∈ [−1, 1], has the equivalent form τ_k(ξ) = cos(k arccos ξ), clearly showing that |τ_k(ξ)| ≤ 1 when |ξ| ≤ 1. P_k(ζ) is therefore a polynomial satisfying P_k(0) = 1 and max_{ζ∈[ν,µ]} |P_k(ζ)| ≤ 1/|τ_k(β)|. Assume that A′ = QᵀΛQ, where Λ is a diagonal matrix with the eigenvalues of A′ on its diagonal and Q is a real orthogonal matrix whose columns are eigenvectors of A′. We then obtain

$$\frac{\|r_k\|_2}{\|r_0\|_2} \le \frac{1}{|\tau_k(\beta)|} \le 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k},$$

which is of the same order as the PCG bound in Eq. (18).
The foregoing analysis applies to cases in which a nontrivial preconditioner is used. Assume that the preconditioned coefficient matrix is A′ = M⁻¹A. It is worth mentioning that the preconditioned matrix in the PCG, ChronGear and P-CSI algorithms is actually M^{-1/2}A(M^{-1/2})ᵀ, which is symmetric and has the same set of eigenvalues as M⁻¹A (Shewchuk, 1994). Thus, the condition number of the preconditioned matrix is κ = κ₂(M^{-1/2}A(M^{-1/2})ᵀ), which is usually smaller than the condition number of A. The closer M is to A, the smaller the condition number of M⁻¹A. When M is the same as A, κ₂(M⁻¹A) = 1.
Because the convergence rate of P-CSI is of the same order as that of PCG and ChronGear, the performance of P-CSI and the CG-type solvers should be comparable when a small number of cores is used. When a large number of cores is used for the high-resolution ocean model, P-CSI should be significantly faster than PCG or ChronGear per iteration due to the bottleneck in the CG-type methods. This is shown in the following analysis of computational complexity.

Computational complexity
To analyze the computational complexity of P-CSI and compare it with that of ChronGear, we assume that p is the number of processes and N is the number of grid points, following the same definitions as in Hu et al. (2015). The time of both the ChronGear and the P-CSI solver can then be divided into three parts: computation, boundary communication and global reduction. Within the global reduction, the cost of the masking operation decreases with increasing p, whereas the cost of MPI_Allreduce monotonically increases, so the global reduction complexity satisfies T_g = O(2N/p + log p). The execution time of one diagonally preconditioned ChronGear solver step, accumulating these three parts over all iterations, is then given by Eq. (25),

where K_cg is the number of iterations, which does not change with the number of processes (Hu et al., 2015). The complexity of P-CSI with a diagonal preconditioner takes the analogous form with K_pcsi iterations, except that no per-iteration global reduction term appears; global reductions occur only in the periodic convergence checks.
Equation (25) indicates that the computation and boundary update times decrease with an increasing number of processes, whereas the time required for the global reduction increases. Therefore, we can expect the execution time of the ChronGear solver to increase once the number of processes exceeds a certain threshold. Our analysis shows that P-CSI has a lower computational complexity than ChronGear because it lacks the log p term associated with global communication. We further consider the computational complexity of preconditioning. The EVP preconditioning adds a cost of O(22 N/p) per iteration. Thus, with the EVP preconditioning, the per-iteration computational complexities of ChronGear and P-CSI become O(39 N/p) and O(33 N/p), respectively, and the total complexities of ChronGear and P-CSI with EVP preconditioning follow by substituting these terms into the expressions above. Although the computation time in each iteration roughly doubles with the EVP preconditioning, the total time may still decrease if the number of iterations is reduced. Indeed, with EVP preconditioning, the iteration count K_pcsi-evp decreases by almost one-half (see Fig. 11), so the total cost of the solve is still reduced.
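The qualitative scaling behaviour implied by these expressions can be illustrated with a toy cost model; the constants below are arbitrary placeholders chosen for illustration, not measured values, and the iteration and flop counts are stand-ins rather than the actual POP figures.

    import numpy as np

    def solver_time(p, N, iters, flops_per_point, n_red,
                    t_flop=1e-9, t_bnd=5e-6, t_red=2e-5):
        """Toy per-step cost model: local work + boundary exchange + global reductions."""
        compute = iters * flops_per_point * N / p * t_flop                     # decreases with p
        boundary = iters * 4 * np.sqrt(N / p) * t_bnd                          # halo exchange, decreases with p
        reduction = iters * n_red * (2 * N / p * t_flop + np.log2(p) * t_red)  # log(p) term grows with p
        return compute + boundary + reduction

    N = 3600 * 2400                       # roughly the number of points of a 0.1-degree grid
    for p in (512, 2048, 8192, 32768):
        t_cg = solver_time(p, N, iters=200, flops_per_point=17, n_red=1.0)     # ChronGear-like
        t_csi = solver_time(p, N, iters=230, flops_per_point=11, n_red=0.1)    # P-CSI-like: rare reductions
        print(p, round(t_cg, 4), round(t_csi, 4))

With these placeholder constants, the log p reduction term eventually dominates the ChronGear-like curve while the P-CSI-like curve keeps decreasing, which is the qualitative behaviour analyzed above and observed in Section 5.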

Numerical experiments
To evaluate the new P-CSI solver, we first demonstrate its characteristics and compare it with PCG (and thus ChronGear) using an idealized test case. The actual performance of P-CSI in CESM POP is then evaluated and compared with the existing solvers using the 0.1° high-resolution simulation.

Condition number and convergence rate
To confirm the theoretical analysis of the convergence in Section 4.2, we created a series of matrices with the idealized setting illustrated in Section 4.1. Instead of a cylindrical grid, we choose a spherical grid with two polar continents (the ocean latitude varies from 80°S to 80°N). A uniform latitude-longitude grid is used, in which the grid size along the longitude varies with the latitude coordinate θ, that is, ∆x = 2πR cos θ/N. The time step size is set to τ = ∆x/v, where v = 2 m/s is the barotropic velocity of the ocean water, as used in Section 4.1. These cases differ in the number of grid points, so the condition numbers vary. We compare the results using the PCG and P-CSI solvers with no preconditioning, diagonal preconditioning and EVP preconditioning, respectively. Here, the block size in the EVP preconditioning is set to 5 × 5, and the convergence tolerance is tol = 10⁻⁶. We note that the theoretical convergence rates of ChronGear and PCG are identical, so the results here apply to ChronGear as well.
As shown in Fig. 8, as the problem size increases, the coefficient matrix becomes more poorly conditioned, and all solvers require more iterations to reach the same relative residual. For both PCG and P-CSI, the convergence rate varies with the preconditioner. For the same problem size, the solvers without preconditioning require the most iterations, and those using the EVP preconditioning require the fewest. This confirms that, with the EVP preconditioning, the matrix becomes better conditioned than with no preconditioning or with diagonal preconditioning.
It also shows that with the same preconditioning, P-CSI has a slower convergence rate than PCG. It is worth mentioning that the diagonal preconditioner improves the convergence only slightly in our idealized cases because the grid is uniform and the ocean depth is constant in this configuration.

Experiment platform and configuration
We evaluate the performance of P-CSI in CESM1.2.0 on the Yellowstone supercomputer, located at the NCAR-Wyoming Supercomputing Center (NWSC) (Loft et al., 2015). Yellowstone uses Intel Xeon E5-2670 (Sandy Bridge) processors (16 cores per node at 2.6 GHz, hyperthreading enabled, 20 MB shared L3 cache) and provides a total of 72,576 cores connected by a 13.6 GBps InfiniBand network. More than 50% of Yellowstone's cycles are consumed by CESM; therefore, the ability to accelerate the parallel performance on Yellowstone is critical to support CESM production simulations. To emphasize the advantage of P-CSI, we use the finest 0.1° grid with 60 vertical levels in POP and the "G_NORMAL_YEAR" configuration, which uses active ocean and sea ice components (i.e., the atmosphere and land components are replaced by pre-determined forcing data sets). I/O optimization is another important issue for the high-resolution POP (Huang et al., 2014) but is not addressed here.

The choice of the ocean block size and layout has a large impact on performance for the high-resolution POP because it directly affects the distribution of the workload among processors. To remove the influence of different block distributions on our results, we carefully specify block decompositions with the same ratio for each core count. The time step is set to the default of 172.8 seconds.
For a fair comparison among solvers, the convergence is checked every 10 iterations for all tests.

The impacts of CSI and the EVP preconditioner are evaluated separately using several different numerical experiments.

Performance of CSI
The first experiment compares the parallel performance of the three solvers using the same diagonal preconditioner: PCG, ChronGear and P-CSI. Figure 9 compares their convergence rates (relative residual versus the number of iterations). Although the order of computation in ChronGear differs slightly from that in PCG, their convergence rates are almost identical, as expected. P-CSI converges slightly more slowly than PCG and ChronGear in the first and last iteration steps, which is related to the uneven distribution of the coefficient matrix's eigenvalues. However, all of these solvers exhibit slopes of very similar order, supporting the common upper bound on the convergence rate discussed in Section 4.2.
Through a number of experiments, we set the Lanczos convergence tolerance to 0.15 to balance a fast convergence rate against a reasonable relative residual. Generally, we can estimate the largest and smallest eigenvalues in no more than 50 Lanczos steps, which yields near-optimal P-CSI convergence. Figure 10 further evaluates the timing of the different phases in the solvers. It is clear that P-CSI outperforms ChronGear primarily because it requires only a few global reductions for the convergence checks. No obvious difference can be found for the boundary update and computation phases when large core counts are used. The reduction in global communications will also significantly reduce the sensitivity of the ocean model component to operating system noise (Ferreira et al., 2008) by increasing the time interval between global synchronizations.

Performance of EVP preconditioner
The second experiment evaluates the performance of the EVP preconditioner used in the P-CSI solver by comparing the CSI solvers with no preconditioner, the diagonal preconditioner and the EVP preconditioner. Figure 11 shows that the preconditioners effectively reduce the number of iterations required for convergence.

Parallel performance
The last experiment compares the simulation speeds of P-CSI and ChronGear on a range of core counts, from 470 to 16,875 cores. When the timing refers to the barotropic mode calculation only, we find that the performance of the ChronGear solver begins to degrade after approximately 2700 cores, whereas the execution time of P-CSI remains relatively flat beyond that core count regardless of the preconditioner (Fig. 13). Using the EVP preconditioner, P-CSI accelerates the barotropic calculation from 19.0 s to 4.4 s per simulation day on 16,875 cores. Dennis et al. (2012) indicated that 5 simulated years per wall-clock day is the minimum requirement for running long-term climate simulations. For the complete POP simulation, Fig. 14 indicates that P-CSI achieves 10.5 simulated years per wall-clock day on 16,875 cores, whereas ChronGear with a diagonal preconditioner achieves only 6.2 simulated years per wall-clock day using the same number of cores. In Section 2, we demonstrated that the percentage of the POP execution time required by the barotropic solver increases with increasing number of cores when the original ChronGear solver is used. In particular, ChronGear with diagonal preconditioning accounts for approximately 50% of the total execution time on 16,875 cores (see Fig. 3). In contrast, Fig. 15 shows that by using the scalable P-CSI solver, the barotropic calculation constitutes only approximately 16% of the total execution time on 16,875 cores. Finally, note that verification results for the 1° POP with an ensemble-based statistical method in Hu et al. (2015) indicate that our new solver does not introduce statistically significant errors into the simulation results.

Code availability
The present P-CSI solver v1.0 is available at https://zenodo.org/record/56705 and https://github.com/hxmhuang/PCSI. This solver is also included in the upcoming CESM public release v2.0. For the older CESM versions 1.2.x, the user should follow the steps indicated in the Readme.md file: (1) Create a complete case or an ocean component case.
(2) Copy our files into the corresponding case path and build this case.
(3) Add two lines at the end of the user_nl_pop2 file to select the new solver.

(4) Execute the preview_namelists script to activate the solver.
Users are welcome to visit the website mentioned above for more details and to use the configuration files to repeat our experiments.

A1 PCG algorithm
The procedure of PCG is as follows (Smith et al., 2010); a schematic serial version is sketched below. Operations such as β_k/β_{k−1} in line (3) are scalar computations, whereas α_k s_k in line (6) is a vector scaling. As_k in line (4) is a matrix-vector multiplication. The inner products of vectors are r_{k−1}ᵀ r_{k−1} in line (2) and s_kᵀ s_k in line (5).
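A compact serial sketch of such a diagonally preconditioned CG loop, with step numbers in the comments matching the operations cited above, is given below (a generic textbook implementation in Python, not the POP Fortran code).

    import numpy as np

    def pcg(A, b, Minv, x0, tol=1e-6, maxit=1000):
        """Preconditioned conjugate gradient for a symmetric positive-definite system.

        A and Minv are callables returning A @ v and M^{-1} v, respectively.
        """
        x = x0.copy()
        r = b - A(x)
        beta_old, s = 1.0, np.zeros_like(b)
        for k in range(1, maxit + 1):
            rp = Minv(r)                        # (1) preconditioned residual
            beta = np.dot(r, rp)                # (2) inner product -> global reduction in parallel
            s = rp + (beta / beta_old) * s      # (3) scalar beta/beta_old, then vector update
            q = A(s)                            # (4) matrix-vector product (needs a halo update)
            alpha = beta / np.dot(s, q)         # (5) inner product -> second global reduction
            x = x + alpha * s                   # (6) vector scaling and solution update
            r = r - alpha * q                   # (7) residual update
            beta_old = beta
            if np.linalg.norm(r) < tol * np.linalg.norm(b):
                break
        return x

The two inner products per iteration (lines (2) and (5)) are the global reductions that ChronGear rearranges and that P-CSI avoids altogether.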

Appendix B: Eigenvalue Estimation
The procedure of the Lanczos method used to estimate the extreme eigenvalues of the matrix M⁻¹A is as follows. In step (9), T_m is a tridiagonal matrix that contains α_j (j = 1, 2, ..., m) as its diagonal entries and β_j (j = 1, 2, ..., m − 1) as its off-diagonal entries.
δ₂(m) gradually converges to zero; thus, the estimation of the eigenvalues of M⁻¹A is transformed into solving for the eigenvalues of T_m.
Step (8) of the eigenvalue estimation employs the Gershgorin circle theorem to estimate the largest eigenvalue of T_m, that is, µ = max_{1≤i≤m} Σ_{j=1}^{m} |T_ij| = max_{1≤i≤m} (α_i + β_i + β_{i−1}). The efficient QR algorithm (Ortega and Kaiser, 1963), with a complexity of Θ(m), is used to estimate the smallest eigenvalue ν in step (9).
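A serial Python sketch of this estimation procedure is given below for illustration; the step numbers correspond only loosely to the appendix listing, and for a non-trivial preconditioner the inner products should strictly be M-weighted, which is omitted here for brevity.

    import numpy as np

    def lanczos_extreme_eigs(A, Minv, n, m=50, tol=0.15, seed=0):
        """Estimate the extreme eigenvalues of M^{-1} A with at most m Lanczos steps.

        A and Minv are callables returning A @ v and M^{-1} v; n is the problem size.
        The largest eigenvalue is bounded with the Gershgorin theorem applied to T_m,
        and the smallest is taken from the spectrum of T_m.
        """
        rng = np.random.default_rng(seed)
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)
        v_old = np.zeros(n)
        alphas, betas = [], [0.0]
        nu_old = None
        for j in range(1, m + 1):
            w = Minv(A(v)) - betas[-1] * v_old          # three-term Lanczos recurrence
            alpha = np.dot(w, v)
            w = w - alpha * v
            beta = np.linalg.norm(w)
            alphas.append(alpha)
            betas.append(beta)
            T = np.diag(alphas) + np.diag(betas[1:-1], 1) + np.diag(betas[1:-1], -1)
            mu = max(np.abs(T).sum(axis=1))             # Gershgorin bound on the largest eigenvalue
            nu = np.linalg.eigvalsh(T)[0]               # smallest eigenvalue of T_j
            if beta < 1e-12 or (nu_old is not None and abs(nu - nu_old) < tol * abs(nu)):
                break                                   # loose tolerance, cf. the 0.15 used in Sect. 5.3
            nu_old = nu
            v_old, v = v, w / beta
        return nu, mu

In practice only a few tens of steps are needed before nu and mu stabilize, which is why the overhead of this estimation is negligible in the full solver.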