Metos3D: the Marine Ecosystem Toolkit for Optimization and Simulation in 3-D – Part 1: Simulation Package v0.3.2

We designed and implemented a modular software framework for the offline simulation of steady cycles of 3-D marine ecosystem models based on the transport matrix approach. It is intended for parameter optimization and model assessment experiments. We defined a software interface for the coupling of a general class of water column-based biogeochemical models, with six models being part of the package. The framework offers both spin-up/fixed-point iteration and a Jacobian-free Newton method for the computation of steady states. The simulation package has been tested with all six models. The Newton method converged for four models when using standard settings, and for two more complex models after alteration of a solver parameter or the initial guess. Both methods delivered the same steady states (within a reasonable precision) on convergence for all models employed, with the Newton iteration generally operating 6 times faster. The effects on performance of both the biogeochemical and the Newton solver parameters were investigated for one model. A profiling analysis was performed for all models used in this work, demonstrating that the number of tracers had a dominant impact on overall performance. We also implemented a geometry-adapted load balancing procedure which showed close to optimal scalability up to a high number of parallel processors.


Introduction
In the field of climate research, simulations of marine ecosystem models are used to investigate the carbon uptake and storage of earth's oceans. The aim is to identify those processes that play a role in the global carbon cycle. For this purpose, coupled simulations of ocean circulation and marine biogeochemistry are required. In this context, marine ecosystems are treated as extensions of biogeochemical systems (cf. Fasham, 2003; Sarmiento and Gruber, 2006). Both terms are therefore used synonymously in this paper. The equations and variables of ocean dynamics are well understood. However, descriptions of biogeochemical or ecological sinks and sources still contain uncertainties with regard to the number of components and to parameterization (cf. Kriest et al., 2010).
To improve this situation, a wide range of marine ecosystem models needs to be validated, i.e., assessed as to their ability to reproduce real-world data. This involves a thorough discussion of simulation results and, before this, an estimation of optimal model parameters for preferably standardized data sets (cf. Fennel et al., 2001; Schartau and Oschlies, 2003).
As a rule, hundreds of model evaluations are required for optimization. Therefore, any optimization environment for marine ecosystems, which our software framework is intended to supply (as suggested by its name), first and foremost has to provide a fast and flexible simulation framework. In this paper we will concentrate on this prerequisite and present the simulation package of Metos3D. An optimization package will be released subsequently.
For any fully coupled simulation, i.e., simultaneous and interdependent computations of ocean circulation, tracer transport and the biogeochemical sources and sinks in three spatial dimensions, very high computational effort is needed even at low resolution. Computational complexity increases still more if annual cycles are investigated, since each model evaluation then involves a long-time integration (the so-called spin-up) until an equilibrium state is reached under the given forcing (cf. Bernsen et al., 2008).

Published by Copernicus Publications on behalf of the European Geosciences Union.
Several strategies have been developed to accelerate the computation of periodic steady states in biogeochemical models driven by a 3-D ocean circulation (cf. Bryan, 1984; Danabasoglu et al., 1996; Wang, 2001). We have combined three of them in our software, namely so-called offline simulation, the optional use of Newton's method for computing steady annual cycles (as an alternative to spin-ups) and spatial parallelization with high scalability.
Offline simulation affords fundamentally reduced computational costs combined with an acceptable loss of accuracy. The principle is to pre-compute transport data for passive tracers. This approach was adopted by Khatiwala et al. (2005) when introducing the so-called transport matrix method (TMM). The authors used matrices to store the results of a general circulation model, which were then applied to the biogeochemical tracer variables. This method proved to be sufficiently accurate to gain first insights into the behavior of biogeochemical models at a global basin scale (cf. Khatiwala, 2007). We refer to the software implementation used therein as the TMM framework from now on. It is available at Khatiwala (2013).
From the mathematical point of view, a steady annual cycle is a periodic solution of a system of (in this case) nonlinear parabolic partial differential equations. This periodic solution is a fixed point of the mapping that integrates the model variables over 1 year of model time. Seen in this light, a spin-up is a fixed-point iteration. Using an uncomplicated procedure, this fixed-point problem can be transformed equivalently into the problem of finding the root(s) of a nonlinear mapping.
Newton-type methods (cf. Dennis and Schnabel, 1996, chap. 6) are well known for their superlinear convergence when applied to problems of this kind. When combined with a Krylov subspace approach, a Jacobian-free scheme can be realized that is based on evaluations of just 1 model year (cf. Knoll and Keyes, 2004; Merlis and Khatiwala, 2008; Bernsen et al., 2008).
Whether fixed-point or Newton iteration is used, high-performance computing will be needed for running multiple simulations over 1 year of model time of a 3-D marine ecosystem. Parallel software employing transport matrices and targeting a multi-core distributed-memory architecture requires appropriate data types and linear algebra operations. The specific geometry of oceans with their varying numbers of vertical layers poses an additional challenge for standard load-balancing algorithms, but also offers a chance of developing adapted versions that will improve overall simulation performance. Except for these adaptations, our implementation is based on the freely available Portable, Extensible Toolkit for Scientific Computation library (PETSc; Balay et al., 1997, 2012b), which in turn is based on the Message Passing Interface standard (MPI; Walker and Dongarra, 1996).
The objective of this work is to combine three performance-enhancing techniques (offline computation via transport matrices, the Newton method, and highly scalable parallelization) in order to produce a software environment that offers rigorous modularity and complete open-source accessibility. Modularity entails separating data pre-processing and simulation as well as the possibility of implementing any water column-based biogeochemical model with minimal effort. For this purpose we have defined a model interface that permits the use of any number of tracers, parameters, and boundary and domain data. To demonstrate its flexibility we employed an existing biogeochemical model (Dutkiewicz et al., 2005), part of the MITgcm ocean model, as well as a suite of more complex models, which is included in our software package. Our software offers optional use of spin-up/fixed-point iteration or the Newton method; for the latter some tuning options were studied. As a result, the work of Khatiwala (2008) could be extended by numerically showing convergence for all six models mentioned above without applying preconditioning. Moreover, a detailed profiling analysis of the simulation when using different biogeochemical models demonstrated how the number of tracers impacts overall performance. Finally, an adapted load balancing method is presented. It shows scalability that is close to optimal and in this respect is superior to other approaches, including the TMM framework (Khatiwala, 2013).
This paper is structured as follows: in Sects. 2 and 3, the model equations are described and the transport matrix approach is recapitulated. In Sect. 4, both options for computing steady cycles/periodic solutions (fixed-point and Newton iteration) are summarized, and for the latter some tuning options to achieve better convergence are discussed. In Sects. 5 and 6, the design and implementation of our software package are described, while Sect. 7 offers a number of numerical results to demonstrate its applicability and performance. Section 8 presents our conclusions, and Sect. 9 explains how to obtain the source code.
The Appendix contains all model equations as well as the parameter settings used for this work; these are available at the same location as the simulation software.

Model equations for marine ecosystems
We will consider the following tracer transport model, which is defined by a system of semilinear parabolic partial differential equations (PDEs) of the form
$$\frac{\partial y_i}{\partial t} = \nabla \cdot (\kappa \nabla y_i) - \nabla \cdot (v\, y_i) + q_i(y, u, b, d), \quad i = 1, \ldots, n_y, \tag{1}$$
on a time interval $I := [0, T]$ and a spatial domain $\Omega \subset \mathbb{R}^3$ with boundary $\Gamma = \partial\Omega$. Here $y_i : I \times \Omega \to \mathbb{R}$ denotes a single tracer concentration, and $y = (y_i)_{i=1}^{n_y}$ is the vector of all tracers.
Since we are interested in long-time behavior and steady annual cycles, we will assume that the time variable is scaled in years.
For brevity's sake we have omitted the dependency on time and space coordinates (t, x) in our notation.
The transport of tracers in marine waters is determined by diffusion and advection, which are reflected in the first two linear terms on the right-hand side of Eq. (1).
The diffusion mixing coefficient $\kappa : I \times \Omega \to \mathbb{R}$ and the advection velocity field $v : I \times \Omega \to \mathbb{R}^3$ may either be regarded as given data, or else have to be simulated by an ocean model along with Eq. (1). Molecular diffusion of tracers is regarded as negligible compared to turbulent mixing diffusion. Thus κ and both transport terms are the same for all y_i.
The functions represented by q_i will often be nonlinear and depend on several tracers, thereby coupling the system. We will refer to the set of functions $q = (q_i)_{i=1}^{n_y}$ as "the biogeochemical model". Typically this model will also depend on parameters. In the software presented in this paper these parameters are assumed to be constant w.r.t. space and time; i.e., we have a constant parameter vector $u \in \mathbb{R}^{n_u}$. For the general setting of Eq. (1) this assumption is not necessary. Boundary forcing (e.g., insolation or wind speed, defined on the ocean surface $\Gamma_s \subset \Gamma$) and domain forcing functions (e.g., salinity or temperature of the ocean water) may also enter into the biogeochemical model. These are denoted by $b = (b_i)_{i=1}^{n_b}$ and $d = (d_i)_{i=1}^{n_d}$, respectively. For tracer transport models, Neumann conditions for the tracers y_i on the boundary are appropriate. They may be either homogeneous (when no tracer fluxes across the boundary are present) or inhomogeneous (to account for flux interactions with the atmosphere or sediment, e.g., deposition of nutrients and riverine discharges). In the inhomogeneous case, the necessary data have to be provided as boundary data in b. In Khatiwala (2007, Sect. 3.5) it is shown how the case of tracers with prescribed surface boundary conditions (i.e., Dirichlet conditions) can be treated using the TMM. Then, an appropriate change of the transport matrices is necessary and an additional boundary vector has to be added in every time step.

Offline simulation using transport matrices
The transport matrix method (Khatiwala et al., 2005) allows fast simulation of tracer transport, assuming that the forcing data, diffusivity κ and advection velocity v, are given. This method is based on a discretized counterpart of Eq. (1). We introduce the following notation: let the domain $\Omega$ be discretized by a grid $(x_k)_{k=1}^{n_x} \subset \mathbb{R}^3$ and 1 year in time by
$$0 = t_0 < \ldots < t_j < t_j + \Delta t_j =: t_{j+1} < \ldots < t_{n_t} = 1.$$
This means that there are n_t time steps per year. For time instant t_j, $y_i^j = (y_i(t_j, x_k))_{k=1}^{n_x}$ denotes the vector of the values of the ith tracer at all grid points, and $y^j = (y_i^j)_{i=1}^{n_y} \in \mathbb{R}^{n_y n_x}$ denotes the vector of the values of all tracers at all grid points, appropriately concatenated.
We use analogous notations $b^j$, $d^j$, and $q^j$ for boundary and domain data and for the biogeochemical terms at the jth time step. Only the corresponding grid points are incorporated for boundary data.
The linear operators $\mathbf{L}_{exp,j}$, $\mathbf{L}_{imp,j}$ represent those parts of the transport term in Eq. (1) that are discretized explicitly or implicitly w.r.t. time. One year of tracer integration then consists of the n_t steps
$$y^{j+1} = \varphi_j(y^j) := \mathbf{L}_{imp,j} \left( \mathbf{L}_{exp,j}\, y^j + \Delta t_j\, q^j(y^j, u, b^j, d^j) \right), \quad j = 0, \ldots, n_t - 1. \tag{2}$$
These operators depend on the given transport data κ, v and thus on time. The biogeochemical term is treated explicitly in Eq. (2) using an Euler step.
Since transport affects each tracer individually and is identical for all of them, both $\mathbf{L}_{exp,j}$ and $\mathbf{L}_{imp,j}$ are block-diagonal matrices with n_y identical blocks $A_{exp,j}, A_{imp,j} \in \mathbb{R}^{n_x \times n_x}$, respectively. Khatiwala et al. (2005) describe how these matrices can be computed by running one step of an ocean model employing an appropriately chosen set of basis functions for the tracer distribution. The operator splitting scheme used in this ocean model therefore determines the partitioning of the transport operator in Eq. (1) into an explicit and an implicit matrix. Diffusion (or some part of it) is usually discretized implicitly; in our case this applies only to vertical diffusion. By this procedure we obtain a set of matrix pairs $(A_{exp,j}, A_{imp,j})_{j=0}^{n_t - 1}$, which usually are sparse. To reduce storage requirements and increase feasibility, only a small number of averaged matrices are stored; in our case monthly averages were used. Starting from these matrices, for any time instant t_j an approximation of the matrix pair is computed by linear interpolation.
Thus the integration of the tracers over 1 model year only involves sparse matrix–vector multiplications and evaluations of the biogeochemical model. In fact, the implicit part of the time integration is now pre-computed and contained in $A_{imp,j}$, which is the benefit of this method. The approximation error of this method, when compared to a directly coupled computation, is determined by the interpolation of the transport matrices, by the linearization of possibly nonlinear discretization schemes (e.g., flux limiters), and by discounting the reverse influence of ocean biogeochemistry on the circulation fields.
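To make the structure of such a step concrete, the following Python/SciPy sketch applies one offline time step of the form of Eq. (2) to a toy setup. The identity "transport" blocks and the linear decay term are made-up stand-ins for real transport matrices and a real biogeochemical model; this is an illustration, not the Metos3D implementation.

```python
import numpy as np
import scipy.sparse as sp

def tmm_step(A_exp, A_imp, y, dt, q):
    """One offline time step: explicit transport, an Euler step for the
    biogeochemical term, then the pre-computed implicit transport."""
    return A_imp @ (A_exp @ y + dt * q(y))

# toy example: 5 grid boxes, simple linear decay as "biogeochemistry"
n = 5
A_exp = sp.identity(n, format="csr")   # stands in for the explicit block
A_imp = sp.identity(n, format="csr")   # stands in for the implicit block
y0 = np.ones(n)
y1 = tmm_step(A_exp, A_imp, y0, dt=1.0 / 2880, q=lambda y: -0.1 * y)
```

With 2880 time steps per year (a common choice for a 3 h step), a full model year is simply 2880 such steps with the interpolated matrix pairs.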
Geosci. Model Dev., 9, 3729–3750, 2016, www.geosci-model-dev.net/9/3729/2016/

Steady annual cycles

The purpose of the software presented in this paper is to allow fast computation of steady annual cycles for the marine ecosystem model under consideration. A steady annual cycle is defined as a periodic solution of Eq. (1) with a period length of 1 (year), thus satisfying
$$y(t + 1) = y(t).$$
Obviously, the forcing data functions b, d need to be periodic as well.
To apply the transport matrix method, we assume that a set of matrices for 1 model year (generated using this kind of periodic forcing) is available, and that these have been interpolated to obtain pairs $(A_{exp,j}, A_{imp,j})$ for all time steps $j = 0, \ldots, n_t - 1$. In the discrete setting, a periodic solution will satisfy
$$y^{n_t + j} = y^j, \quad j = 0, \ldots, n_t - 1.$$
Assuming that the discrete model is completely deterministic, it is sufficient if this equation is satisfied for just one j. In this section we will compare the solutions at the first time instants of 2 succeeding model years. Defining
$$y^\ell := y^{(\ell - 1) n_t} \in \mathbb{R}^{n_y n_x}, \quad \ell = 1, 2, \ldots,$$
as the vector of tracer values at the first time instant of model year ℓ, a steady annual cycle satisfies
$$y^{\ell + 1} = \phi(y^\ell) = y^\ell, \tag{3}$$
where $\phi := \varphi_{n_t - 1} \circ \ldots \circ \varphi_0$ is the mapping that performs the tracer integration of Eq. (2) over 1 year. All arguments except for y have been omitted in the notation. A steady annual cycle therefore is a fixed point of the nonlinear mapping φ.
Since condition (3) will never be satisfied exactly in a simulation, we measure periodicity using norms on $\mathbb{R}^{n_y n_x}$ for the residual of Eq. (3). We use the weighted Euclidean norm
$$\|z\|_{2,w} = \left( \sum_{i=1}^{n_y} \sum_{k=1}^{n_x} w_k\, z_{ik}^2 \right)^{1/2}, \quad w_k > 0, \quad k = 1, \ldots, n_x, \tag{4}$$
with $z \in \mathbb{R}^{n_y n_x}$ indexed as $z = ((z_{ik})_{k=1}^{n_x})_{i=1}^{n_y}$. This corresponds to our indexing of tracers; see Sect. 3. If $w_k = 1$ for all k, we obtain the Euclidean norm, denoted by $\|z\|_2$. A stronger correspondence to the continuous problem Eq. (1) is achieved by using the discretized counterpart of the $L^2(\Omega)^{n_y}$ norm, where $w_k$ is set to the volume $V_k$ of the kth grid box. We denote this norm by $\|z\|_{2,V}$. Other settings of weights are possible. All these norms are equivalent in the mathematical sense; i.e., it holds that
$$\min_k \sqrt{w_k}\, \|z\|_2 \le \|z\|_{2,w} \le \max_k \sqrt{w_k}\, \|z\|_2$$
for all $z \in \mathbb{R}^{n_y n_x}$ and all weight vectors $w = (w_k)_{k=1}^{n_x}$ satisfying the positivity condition in Eq. (4).

A spin-up signifies repeated application of iteration step (3), in other words, integration in time with fixed forcing until convergence is reached. Based on Banach's fixed-point theorem (cf. Stoer and Bulirsch, 2002) it is well known that, assuming φ is a contractive mapping satisfying
$$\|\phi(y) - \phi(z)\| \le L\, \|y - z\|$$
with $L < 1$ in some norm, this iteration will converge to a unique fixed point for all initial values $y^0$. This result holds under weaker assumptions as well (cf. Ciric, 1974). This method is quite robust, but shows only linear convergence, which is especially slow for $L \approx 1$. An estimation of $L = \max_y \|\phi'(y)\|$ is difficult, since it involves the Jacobian $q_j'(y^j)$ of the nonlinear biogeochemical model at the current iterate. Typically, thousands of iteration steps (i.e., model years) are needed in order to reach a steady cycle (cf. Bernsen et al., 2008). Moreover, this method offers only restricted options for convergence tuning, the only straightforward one being to choose different time steps $\Delta t_j$. For this, all transport matrices have to be re-scaled accordingly. The obvious stopping criterion is the reduction of the difference between two succeeding iterates,
$$\varepsilon_\ell := \|y^\ell - y^{\ell - 1}\|,$$
measured in some, optionally weighted, norm.
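The spin-up can be sketched in a few lines of Python. Here phi is an arbitrary contractive toy mapping and the unweighted Euclidean norm is used for the stopping criterion; this is a minimal illustration of the iteration scheme, not the Metos3D implementation.

```python
import numpy as np

def spinup(phi, y0, tol=1e-8, max_years=10000):
    """Fixed-point (spin-up) iteration: apply the 1-year mapping phi
    until the difference between succeeding iterates drops below tol."""
    y = y0
    for l in range(max_years):
        y_next = phi(y)
        eps = np.linalg.norm(y_next - y)   # optionally a weighted norm
        y = y_next
        if eps < tol:
            break
    return y, l + 1

# toy contraction with L = 0.5 and fixed point 2.0
y_star, years = spinup(lambda y: 0.5 * y + 1.0, np.array([0.0]))
```

With L = 0.5 the iteration converges in a few dozen "years"; for the realistic case L ≈ 1 described above, thousands of iterations are typical.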

Computation by the inexact Newton method
By defining $F(y) := y - \phi(y)$, the fixed-point problem Eq. (3) can be equivalently transformed into the problem of finding a root of $F : \mathbb{R}^{n_y n_x} \to \mathbb{R}^{n_y n_x}$. This problem can be solved by Newton's method (cf. Dennis and Schnabel, 1996; Kelley, 2003; Bernsen et al., 2008). We apply a damped (or globalized) version that incorporates a line search (or backtracking) procedure and (under certain assumptions) provides superlinear and locally even quadratic convergence.
Starting from an initial guess $y_0$, in each step the linear system
$$F'(y_m)\, s_m = -F(y_m) \tag{5}$$
has to be solved, followed by an update $y_{m+1} = y_m + \rho\, s_m$. Here, $\rho > 0$ denotes the step size, which is chosen iteratively in such a way that a sufficient reduction in $\|F(y_m + \rho s_m)\|_2$ is achieved (cf. Dennis and Schnabel, 1996, Sect. 6.3). Note that, regarding the Newton solver, the Euclidean norm is used. This is determined by the PETSc implementation.
The Jacobian $F'(y_m)$ of F at any current iterate contains the derivative of 1 model year; thus, it is not as sparse as the transport matrices themselves. Therefore a matrix-free version of Newton's method is applied: the linear system (5) is solved by an iterative, so-called Krylov subspace method, which only requires the evaluation of matrix–vector products $F'(y_m)s$. Since $F'(y_m)$ cannot be expected to be symmetric or definite, we use the generalized minimal residual method (GMRES; Saad and Schultz, 1986). The matrix–vector products needed for this can be interpreted as directional derivatives of F at the point $y_m$ in the direction of s. They may be approximated by a forward finite difference
$$F'(y_m)\, s \approx \frac{F(y_m + \delta s) - F(y_m)}{\delta}. \tag{6}$$
The finite difference step size δ is chosen automatically as a function of $y_m$ and s (cf. Balay et al., 2012a). An alternative method would be an exact evaluation of the derivative using the forward mode of algorithmic differentiation (cf. Griewank and Walther, 2008). This approximation of the Jacobian or directional derivative is one reason to call this method inexact. The second reason is the fact that the inner linear solver has to be stopped and therefore also is not exact. We use a convergence control procedure based on the technique described by Eisenstat and Walker (1996) for this purpose. Stopping occurs when the Newton residual at the current inner iterate s satisfies
$$\|F(y_m) + F'(y_m)\, s\|_2 \le \eta_m\, \|F(y_m)\|_2.$$
The factor $\eta_m$ is determined by
$$\eta_m = \gamma \left( \frac{\|F(y_m)\|_2}{\|F(y_{m-1})\|_2} \right)^{\alpha}.$$
This approach avoids so-called over-solving, i.e., wasting inner steps while the current outer Newton residual $\|F(y_m)\|_2$ is still relatively big. The latter typically occurs at the beginning of the Newton iteration. Parameters γ and α can be used to avoid over-solving by adjusting the inner accuracy depending on the outer accuracy in a linear or nonlinear way, respectively. Moreover, both parameters provide a subtle way to tune the solver.
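The two key ingredients, the finite-difference matrix–vector product of Eq. (6) and the Eisenstat–Walker forcing term, can be sketched as follows. This is a Python illustration with a deliberately simplified step-size rule; PETSc's actual choice of δ and its parameter defaults differ, and the toy F below is hypothetical.

```python
import numpy as np

def fd_jacobian_vector(F, y, s, Fy=None):
    """Forward-difference approximation of the directional derivative
    F'(y) s, as used by the matrix-free Newton-Krylov solver."""
    if Fy is None:
        Fy = F(y)
    # simplified step-size heuristic; PETSc derives delta from y and s
    delta = np.sqrt(np.finfo(float).eps) * (1.0 + np.linalg.norm(y)) \
            / max(np.linalg.norm(s), 1e-30)
    return (F(y + delta * s) - Fy) / delta

def ew_forcing(norm_now, norm_prev, gamma=0.9, alpha=2.0):
    """Eisenstat-Walker forcing eta_m = gamma (|F_m|/|F_{m-1}|)^alpha,
    capped so the inner GMRES tolerance stays meaningful."""
    return min(0.9, gamma * (norm_now / norm_prev) ** alpha)

# check the FD matvec against the exact Jacobian of F(y) = y**2 - c,
# whose directional derivative is 2*y*s
F = lambda y: y**2 - np.array([4.0, 9.0])
y = np.array([1.0, 2.0])
s = np.array([1.0, -1.0])
approx = fd_jacobian_vector(F, y, s)
```

In a full solver, `fd_jacobian_vector` would be wrapped as a linear operator and handed to GMRES with the relative tolerance set to the forcing term of the current outer step.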
In contrast to a fixed-point iteration, Newton's method, even in its damped version, may converge only for an appropriately chosen initial guess $y_0$. In a high-dimensional problem such as ours (in $\mathbb{R}^{n_y n_x}$), it is a non-trivial task to find such an initial guess if the standard one used for the spin-up (i.e., a constant tracer distribution) proves unsuccessful.
In cases where the Newton iteration proceeds slowly and the criterion described above yields only a few inner iterations, it may be advisable to increase their number by either decreasing γ or increasing α.Below we will give some examples of how convergence may be made possible using this strategy.
In order to estimate the total computational effort needed for the inexact Newton solver, and to compare its efficiency with that of the spin-up method, note that one evaluation of F basically corresponds to one application of φ, i.e., to 1 model year. Each Newton step requires one evaluation of F for the right-hand side of Eq. (5). The initial guess for the inner linear solver iteration is always set to s = 0; thus no computation is required for the first inner step. Each following inner iteration requires one evaluation of F to compute the finite difference in the numerator of Eq. (6). The line search may also require additional evaluations of F. Taken together, the overall number of inner iterations plus the overall number of evaluations for the line search determine the number of evaluations of F necessary for this method, which may then be compared to the number of model years needed for the spin-up.

Software description
Our software is divided into four repositories, namely metos3d, model, data and simpack. The first comprises the installation scripts, the second the biogeochemical model source codes, and the third all data preparation scripts as well as the data themselves. The last repository contains the simulation package, i.e., the transport driver, which is implemented in C and based upon the PETSc library. While we have often used 1-indexed arrays within this text for convenience, within the source code C arrays are 0-indexed and Fortran arrays are 1-indexed. All data files are in PETSc format.

Implementation structure
The implementation of the simulation package is structured in layers, as shown in Fig. 1. The layers are organized hierarchically; i.e., each layer provides routines for the layers above. The foundation of the implementation is the PETSc library with its data types and its implementation of the Newton–Krylov solver.
The bgc model layer initializes tracer vectors, parameters and boundary and domain data. It is responsible for the interpolation of forcing data and the evaluation of the biogeochemical model (cf. Sect. 5.3). The transport layer is responsible for reading in the transport matrices, interpolating them to the current time step and applying them to the tracer vectors. The main integration routine φ (cf. Algorithms 1 and 2) is located at the time stepping layer. On top resides the solver layer, which contains the spin-up implementation and the call to the Newton–Krylov solver.

Algorithm 2: PhiStep (ϕ_j)
Input: point in time t_j, time step Δt, implicit matrices A_imp, explicit matrices A_exp, current state y_in, parameters u ∈ R^m, boundary data b, domain data d
Output: next state y_out

A call graph for the computation of a steady annual cycle is shown in Fig. 2. Note that loops are not explicitly shown therein. Calls to initialization and finalization routines are gathered at the beginning and end of a simulation run. The former are responsible for memory allocation and storage of data used at run time. The latter are employed to free memory and delete all vectors and matrices.
The dimensions of the vectors and matrices used depend on the underlying geometry (cf. Sect. 5.2). The distribution of the work load for a parallel run is determined during initialization (cf. Sect. 5.5).

Geometry information and data alignment
Geometry information is provided as a 2-D land–sea mask plus a designation of the number of vertical layers, i.e., the depth of the different water columns (or profiles; cf. Fig. 3). This can be understood as a sparse representation of a land–sea cuboid including only wet grid boxes. Hence, the length n_x of a single tracer vector (at a fixed time) is the sum of the lengths of all profiles, i.e.,
$$n_x = \sum_{k=1}^{n_p} n_{x,k},$$
where $n_p$ is the total number of profiles in the ocean and $(n_{x,k})_{k=1}^{n_p}$ is the set of profile lengths. Each profile corresponds to a horizontal grid point. Due to the locally varying ocean depth, the profile lengths depend on the horizontal coordinate, i.e., on the index k.
We denote by $y_{i,k} \in \mathbb{R}^{n_{x,k}}$ the values of the ith tracer corresponding to the kth profile at a fixed time step. Then the vector of all tracers at a fixed time, here denoted by y omitting the time index, can be represented in two ways: either by first collecting all profiles for each tracer and then concatenating all tracers, namely
$$y = \left( (y_{i,k})_{k=1}^{n_p} \right)_{i=1}^{n_y}, \tag{9}$$
or vice versa, i.e.,
$$y = \left( (y_{i,k})_{i=1}^{n_y} \right)_{k=1}^{n_p}. \tag{10}$$
In order to multiply matrices by tracer vectors, the first variant is preferable. In order to evaluate a water-column-based biogeochemical model, the second one is appropriate. As a result, all tracers need to be copied from representation Eq. (9) to Eq. (10) after a transport step. After the evaluation of the biogeochemical model, we reverse the alignment for the next transport step.
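The copy between the two layouts can be illustrated as follows. This Python sketch uses a made-up three-profile geometry; the actual implementation performs these copies in C on PETSc vectors.

```python
import numpy as np

# toy geometry: 3 profiles ("water columns") with different depths
profile_lengths = [4, 2, 3]                     # n_x,k
offsets = np.cumsum([0] + profile_lengths)      # start of each profile
n_x, n_y = sum(profile_lengths), 2              # grid size, tracer count

def tracer_major_to_profile_major(y):
    """Regroup from the Eq. (9) layout (all profiles of tracer 1, then
    tracer 2, ...) to the Eq. (10) layout (all tracers of profile 1,
    then profile 2, ...)."""
    out = []
    for k in range(len(profile_lengths)):       # loop over profiles
        for i in range(n_y):                    # loop over tracers
            start = i * n_x + offsets[k]
            out.append(y[start:start + profile_lengths[k]])
    return np.concatenate(out)

y = np.arange(n_y * n_x, dtype=float)           # tracer-major input
z = tracer_major_to_profile_major(y)
```

The reverse copy after the biogeochemical step is the inverse permutation of the same index arithmetic.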
The situation is similar for domain data. Again, we group all domain data profiles by their profile index k, i.e.,
$$d = \left( (d_{i,k})_{i=1}^{n_d} \right)_{k=1}^{n_p},$$
where $d_{i,k}$ denotes a single domain data profile. However, no reverse copying is required here. Boundary data have to be treated in a slightly different way. Here we align boundary values, which are associated with the surface of one water column each, i.e.,
$$b = \left( (b_{i,k})_{i=1}^{n_b} \right)_{k=1}^{n_p},$$
where $b_{i,k}$ denotes a single boundary data value as opposed to a whole profile. As with domain data, no reverse copying is required.

Biogeochemical model interface
One of our main objectives in this work is to specify a general coupling interface between the transport induced by the ocean circulation and the biogeochemical tracer model. We wish to provide a method to couple any biogeochemical model implementation using any number of tracers, parameters and boundary and domain data to the software that computes the ocean transport. Despite the fact that we consider only offline simulation using transport matrices in this paper, the interface shall not be restricted to this case. This coupling shall furthermore fit into an optimization context, and it shall be compatible with algorithmic differentiation techniques (cf. Sect. 7).
The only restriction we make for the tracer model is that it operates on each single water column (or profile) separately. This means that information on exactly one profile is exchanged via the coupling interface. For models that require information on other profiles (e.g., in the horizontal vicinity) for internal computations, a redefinition of the interface and some internal changes would be necessary. In fact, most of the relevant non-local biogeochemical processes take place within a water column (cf. Evans and Garçon, 1997).
The evaluation of a water-column-based biogeochemical model for any fixed time t consists of separate model evaluations for each profile (corresponding to a horizontal spatial coordinate), i.e., for profile index k:
$$(q_{i,k})_{i=1}^{n_y} = q\left(t, \Delta t, (y_{i,k})_{i=1}^{n_y}, u, (b_{i,k})_{i=1}^{n_b}, (d_{i,k})_{i=1}^{n_d}\right).$$
Here, $(y_{i,k})_{i=1}^{n_y}$ is an input array of n_y tracer profiles according to Eq. (10), each with a length or depth of $n_{x,k}$. The vector u contains n_u parameters. Boundary data $(b_{i,k})_{i=1}^{n_b}$ are given as a vector of n_b values, and domain data $(d_{i,k})_{i=1}^{n_d}$ as an input array of n_d profiles. The results of the biogeochemical model are stored in the output array $(q_{i,k})_{i=1}^{n_y}$, which also consists of n_y profiles.
Formally speaking, this tracer model is scaled from the outside by the (ocean circulation) time step. However, we have integrated Δt into the interface as a concession to the common practice of refining the time step within the tracer model implementation (cf. Kriest et al., 2010). As a consequence, the responsibility for scaling the results before returning them to the transport driver software rests with the model implementer.
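The resulting calling convention can be sketched as follows. This is a Python illustration with a hypothetical two-tracer toy model (`bgc_toy`); the real interface is the Fortran subroutine shown in Listing 1, and the driver loop runs in C.

```python
import numpy as np

def bgc_toy(dt, y_profile, u):
    """Hypothetical water-column model: linear loss of tracer 1 into
    tracer 2. Results are scaled by dt inside the model, as the
    interface convention requires."""
    loss = u[0] * y_profile[0]                  # per-layer loss rate
    return dt * np.stack([-loss, loss])

def evaluate_bgc(dt, y_profiles, u):
    """Driver-side loop: evaluate the model separately per profile."""
    return [bgc_toy(dt, y_k, u) for y_k in y_profiles]

# two profiles of different depth, n_y = 2 tracers each
profiles = [np.ones((2, 4)), np.ones((2, 2))]
q = evaluate_bgc(dt=0.1, y_profiles=profiles, u=[0.5])
```

Note that each output has the same (n_y, n_x,k) shape as its input profile, mirroring the q and y arrays of the Fortran interface.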
Listing 1 shows a realization of the biogeochemical model interface as a Fortran 95 subroutine called metos3dbgc. The arguments are grouped by data type. The list begins with variables of type integer, i.e., $n_y$, $n_{x,k}$, $n_u$, $n_b$ and $n_d$. These are followed by real*8 (double precision) arguments, i.e., Δt, q, $t_j$, y, u, b and d. For clarity we have omitted the profile index k and the time index j in our notation. Moreover, we have used dt as a textual representation of Δt.
A model initialization and finalization interface is also specified. The former is named metos3dbgcinit and the latter metos3dbgcfinal. These routines are called at the beginning of each model year, i.e., at $t_0$, and after the last step of the annual iteration, respectively. Both routines employ the same argument list as metos3dbgc. They are not shown here. The names of all three routines are arbitrary and can be altered using pre-processor variables that are defined within the Makefile.

subroutine metos3dbgc(ny, nx, nu, nb, nd, dt, q, t, y, u, b, d)
    integer :: ny, nx, nu, nb, nd
    real*8  :: dt, q(nx, ny), t, y(nx, ny), u(nu), b(nb), d(nx, nd)
end subroutine

Listing 1. Fortran 95 implementation of the coupling interface for biogeochemical models.

Interpolation
Transport matrices as well as boundary and domain data vectors are provided as sets of files.The number of files in each set is arbitrary, although most of the data we use in this work represent a monthly mean.
However, the number of time steps per model year is generally much higher than the number of available data files. For this reason, matrices and vectors are linearly interpolated to the current time step during the iteration. All files of a specific data set are interpreted as averages over the time intervals they represent. We therefore interpolate between the centers of the associated intervals. The appropriate weights and indices are computed on the fly using Algorithm 4.
With regard to boundary and domain forcing, we denote the data files by $((b_{i,j})_{j=1}^{n_{b,i}})_{i=1}^{n_b}$ and $((d_{i,j})_{j=1}^{n_{d,i}})_{i=1}^{n_d}$. Here, $n_b$ is the number of distinct boundary data sets, and $n_{b,i}$ is the number of data files provided for the ith set. In the same way, $n_d$ denotes the number of domain data sets and $n_{d,i}$ the number of data files of a particular set.
For every index i and its corresponding boundary data set $(b_{i,j})_{j=1}^{n_{b,i}}$, we compute the appropriate weights α, β as well as indices $j_\alpha$, $j_\beta$ and then form the linear combination
$$b_i = \alpha\, b_{i,j_\alpha} + \beta\, b_{i,j_\beta}.$$
The same applies to domain data; i.e., for every domain data set $(d_{i,j})_{j=1}^{n_{d,i}}$, we compute
$$d_i = \alpha\, d_{i,j_\alpha} + \beta\, d_{i,j_\beta}.$$
We use the PETSc routines VecCopy, VecScale and VecAXPY for this process.
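Algorithm 4 itself is not reproduced here; the following Python sketch shows one plausible way such weights and indices can be computed on the fly for n files interpreted as averages over n equal sub-intervals of the year. This is our own reconstruction for illustration, not the Metos3D source.

```python
import math

def interp_weights(t, n):
    """Interpolation weights and indices for n files that are averages
    over n equal intervals of [0, 1): interpolate between the interval
    centers, wrapping around at the turn of the year."""
    w = 1.0 / n
    # shift so that pos measures the distance past the previous center
    pos = (t + 0.5 * w) % 1.0
    j_alpha = (math.floor(pos / w) - 1) % n      # file at or below t
    j_beta = (j_alpha + 1) % n                   # next file (cyclic)
    beta = (pos % w) / w                         # weight of later file
    return 1.0 - beta, j_alpha, beta, j_beta
```

For example, with 12 monthly files, t = 0 lies exactly between the December and January interval centers, so both files receive weight 0.5.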
With regard to transport, we have $(A_{imp,j})_{j=1}^{n_{imp}}$ and $(A_{exp,j})_{j=1}^{n_{exp}}$ as data files, where $n_{imp}$ and $n_{exp}$ specify the number of implicit and explicit matrix files, respectively. Analogously to the interpolation of vectors, we first interpolate the user-provided matrices to the current point in time $t_j$; i.e., we assemble $A = \alpha A_{j_\alpha} + \beta A_{j_\beta}$ using the appropriate α, β and $j_\alpha$, $j_\beta$. We use the matrix variants MatCopy, MatScale and MatAXPY for this purpose. The technical details of this process have been discussed in depth in Siewertsen et al. (2013).
To avoid redundant storage, we do not assemble both (block-diagonal) system matrices during the simulation. Instead, we use the provided matrices to build just one block for each matrix type. The transport step is then applied as a loop over the individual tracer vectors.
Unlike vector interpolation and vector operations in general, each matrix operation has a significant impact on computational time.In Sect.6.2 we will present results from profiling experiments showing detailed information on the time usage of each operation.

Load balancing for spatial parallelization
For spatial parallelization, the discrete tracer vectors have to be distributed among the available processes. Since biogeochemical models operate on whole water columns, profiles cannot be split without message passing. However, due to the locally varying ocean depth, a tracer vector is a collection of profiles of different lengths. Thus, a load balancing that takes into account only the number of profiles, but not their respective lengths, would be suboptimal.
The PETSc library provides no load balancing algorithm suitable for this case. We therefore use an approach inspired by the idea of space filling curves presented by Zumbusch (1999).
For each profile we compute its computational weight, i.e., the position of its midpoint within the tracer vector in relation to the overall computational effort, i.e., the total vector length. We then project this ratio onto the available number of processes; i.e., we multiply it by the process count, round the result down to an integer and use it as the index of the process the profile belongs to. Using this information, the profiles can then be assigned consecutively to the processes involved.
For 0-indexed arrays this calculation is described by Algorithm 3. Its theoretical and actual performance is discussed in Sect. 6.3, where a comparison between Metos3D and the TMM framework is shown.
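A minimal sketch of this midpoint-based assignment, assuming the weight computation described above; the function name is hypothetical and the actual Algorithm 3 may differ in detail:

```python
def balance_load(profile_lengths, n_procs):
    # Assign whole water-column profiles to processes so that the summed
    # profile lengths (not just the profile counts) are roughly equal.
    n_total = sum(profile_lengths)
    owner = []
    offset = 0
    for length in profile_lengths:
        mid = offset + length / 2.0   # midpoint of this profile in the vector
        # project the ratio mid/n_total onto the process count, round down
        owner.append(int(n_procs * mid / n_total))
        offset += length
    return owner
```

Note how a single deep profile can occupy one process on its own, which a count-only balancing would miss.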

Results
In this section we will present results from our numerical experiments to verify the software. For these experiments the interface described in this paper has been used to couple the transport matrix driver with a suite of biogeochemical models. We will also inspect the convergence behavior of both solvers included in the software. A profiling of the main parts of the algorithm will complement the verification. In a second step we have performed speed-up tests to analyze the load distribution implemented in our software and compared it with the TMM framework. We will also investigate the convergence control settings of the Newton-Krylov solver and examine the solver's behavior within parameter bounds.
The experimental setup is described in Appendix A in more detail.

Solver
We begin our verification by computing a steady annual cycle for every model, using both solvers. For the spin-up we set no tolerance and let the solver iterate for 10 000 model years. The Newton approach is set to the line search variant and the Krylov subspace solver to GMRES. All other settings are left at their defaults, so the overall absolute tolerance is 10^-8 and the maximum number of inner iterations is 10 000.
The parameter values used for the MITgcm-PO4-DOP model are listed in Table 2 under the heading u_d. Table 3 lists the parameter values used for the N, N-DOP, NP-DOP, NPZ-DOP and NPZD-DOP model hierarchy. If not stated otherwise, the initial value is set to 2.17 mmol P m^-3 for N or PO4 and to 0.0001 mmol P m^-3 for all other tracers.
A comparison of convergence towards a steady annual cycle for both solvers, applied to the MITgcm-PO4-DOP model, is shown in Fig. 4. We observe that both solvers reach the same difference between consecutive iterations at the end. Table 4 shows the differences between both solutions in Euclidean and volume-weighted norms; cf. Eq. (4). Figure 5 depicts the difference between both solutions for one tracer at the surface layer. Except for numerical error, both solvers evidently compute the same solution.
Figures 6 and 7 show the convergence behavior of both solvers for the N and N-DOP models, respectively. Again, both solvers end with approximately the same accuracy and produce similar results. This impression is confirmed by an inspection of Figs. 8 and 9 as well as Table 4.
However, in Fig. 10 the convergence behavior of the Newton-Krylov solver for the NP-DOP model reveals a peak every 30 model years, which results from the settings of the inner solver: GMRES is set to perform a restart every 30 iterations, each of which corresponds to one model year here. This option is chosen to reduce the internal storage requirement, but may lead to stagnation for indefinite matrices; cf. Saad (2003, Sect. 6.5.6). It is likely that the Jacobian at some Newton step becomes indefinite, and we assume that this is the case here. Figure 11 and Table 4 do not indicate any influence on the solution, however.
For the NPZ-DOP and NPZD-DOP models, the Newton solver shows a different behavior. For both models the solver does not converge if default settings are used, as depicted in Figs. 12 (top) and 13 (top). The reduction of the residual per step is quite low, which results in a huge number of iterations. In this case the solver was stopped after 50 iterations (the default setting), which is already quite high for Newton's method. This behavior is caused by the fact that convergence of this method, even in the globalized or damped version used here, may still depend on the initial guess y_0. We therefore used a different one, which was successful for the NPZD-DOP model; see Fig. 13 (middle). For the NPZ-DOP model, this procedure still did not work; see Fig. 12 (middle).
However, there is a second and much simpler way to achieve convergence; its result can be seen in Figs. 12 (bottom) and 13 (bottom). If the last Newton step led to only a small reduction of the residual, which was obviously the case here, the stopping criterion Eq. (8) for the inner iterations of the Newton solver becomes less restrictive. If this criterion is sharpened instead, the number of inner iterations increases and thus the accuracy of the Newton direction improves. This somewhat contradicts the idea formulated in Eisenstat and Walker (1996). Sharpening can easily be achieved by decreasing γ, in this case to γ = 0.3. This tuning led to convergence; see Figs. 12 (bottom) and 13 (bottom). When using these settings, the same solutions are obtained as with the spin-up, if numerical errors are neglected (see Figs. 14 and 15). This result is confirmed by evaluating the differences in norm; see Table 4.
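Eq. (8) itself is not reproduced in this excerpt. Assuming it follows the common Eisenstat-Walker forcing-term choice, η_k = γ (‖F(y_k)‖ / ‖F(y_{k-1})‖)^α, the role of γ can be sketched as follows (function name and default values hypothetical):

```python
def forcing_term(res_k, res_km1, gamma=1.0, alpha=1.618):
    # eta_k = gamma * (||F(y_k)|| / ||F(y_{k-1})||)^alpha.
    # The inner (Krylov) solve stops once its relative residual falls
    # below eta_k; decreasing gamma tightens ("sharpens") this tolerance
    # and thus improves the accuracy of the Newton direction.
    return gamma * (res_k / res_km1) ** alpha
```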
It can be observed that, as a rule, the inner Krylov solver does not reach the default tolerance within the last Newton step and iterates unnecessarily for up to 10 000 model years. From now on we will therefore limit the inner Krylov iterations to 200.
For our next investigations using the MITgcm-PO4-DOP model, we will also alter the convergence settings to get rid of the over-solving observed before. More detailed experiments on this subject are presented in Sect. 6.4.

Profiling
In the next two sections we will investigate some technical aspects of the implementation more closely. We will first look at the distribution of computational time among the main operations during 1 model year.
Table 4. Differences in the Euclidean (‖·‖_2) and volume-weighted (‖·‖_{2,V}; cf. Eq. 4) norms between the spin-up (y_S) and the Newton (y_N) solutions for all models. The total volume of the ocean used here is V ≈ 1.174 × 10^18 m^3. Solutions for the NPZ-DOP and NPZD-DOP models were produced by experiments with altered inner accuracy or initial value, respectively.

For this purpose we perform a profiled sequential run for each model, iterating for 10 model years. An analysis of our profiling results is shown in Figs. 16-18. When using the MITgcm-PO4-DOP model, for instance, the biogeochemical model takes up 40 % of the computational time. Interpolation of matrices (MatCopy, MatScale and MatAXPY) amounts to approximately one-third. Matrix-vector multiplication (MatMult) takes up a quarter of all computations, and all other operations amount to 0.5 %.

Our data also suggest that the greater the number of tracers involved, the more dominant matrix-vector multiplication becomes. The MatMult operation takes up 19.8 % of the computational time for the N model, but 56.7 % for the NPZD-DOP model. The absolute timings are listed in Table 5. Siewertsen et al. (2013) also made use of this profiling capability when porting the software to an NVIDIA graphics processing unit (GPU). The authors investigated the impact of the accelerator's hardware on the simulation of biogeochemical models. Their work comprises a detailed discussion of peak performance and memory bandwidth and includes a count of floating point operations.

Speed-up
In this section we will investigate in detail the performance of the load balancing algorithm and compare our results with the scalability provided by the TMM framework. We compile both drivers using the same biogeochemical model, MITgcm-PO4-DOP, with the same time step, initial condition as well as boundary and domain data.
Our tests are run on hardware located at the computing center of Kiel University: an Intel® Sandy Bridge EP architecture with Intel Xeon® E5-2670 CPUs, providing 16 cores per node at 2.6 GHz. We perform 10 tests for our implementation, using 1 to 256 cores.
Each test consists of a simulation run of 3 model years, where each year is timed separately. For the TMM framework we use 1 to 192 cores and run five tests for each number of cores. Here we use the given output, which shows the timing for one whole run.
To calculate speed-up and efficiency we use the minimum timing for a specific number of cores. All timings are related to the timing of a sequential run. For a set of computational times (t_i)_{i=1}^N measured during our experiments, with N = 192 or N = 256, respectively, we calculate speed-up as s_i = t_1 / t_i and efficiency as e_i = 100 × s_i / i.
To investigate the load distribution implemented by us (cf. Sect. 5.5), we compute the best possible ratio between a sequential and a parallel run. Using Algorithm 3, we first compute the load distribution for all numbers of processes, i.e., i = 1, ..., 260, and then retrieve the maximum (local) length n_{i,max}. To calculate speed-up we divide the vector length by this value, i.e., s_i = n_y / n_{i,max}, and to calculate efficiency we again use e_i = 100 × s_i / i.
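The measured and the theoretical performance figures above can be sketched in a few lines; the function names are hypothetical:

```python
def speedup_efficiency(timings):
    # timings[i]: minimum measured wall time on i + 1 cores;
    # timings[0] is the sequential reference, so s_i = t_1 / t_i
    # and e_i = 100 * s_i / i (with i counted from 1).
    s = [timings[0] / t for t in timings]
    e = [100.0 * si / (i + 1) for i, si in enumerate(s)]
    return s, e

def theoretical_speedup(profile_lengths, owner, n_procs):
    # Best possible speed-up for a given assignment of profiles to
    # processes: the sequential vector length n_y divided by the
    # largest per-process share n_max (perfect hardware assumed).
    local = [0] * n_procs
    for length, p in zip(profile_lengths, owner):
        local[p] += length
    return sum(profile_lengths) / max(local)
```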
Figure 19 depicts ideal, theoretical and actual data for speed-up and efficiency. Here, the term "ideal" refers to a perfectly parallelizable program and perfect hardware with no delay in memory access or communication. Regarding the load distribution implemented by us, a good (theoretical) performance can be observed over the whole range of processes. This again refers to perfect hardware, except that a collection of profiles of different lengths is distributed here.
The data also show that a parallel run of Metos3D on the Intel hardware achieves close to perfect performance when using between 100 and 140 cores. Efficiency is at about 95 % in this range, and speed-up nearly corresponds to the number of processes. In fact, speed-up may rise further to slightly over 160, but a minimum of 200 processes is required to achieve this.
In comparison, the scalability of the TMM framework is not optimal. Efficiency drops off immediately, and speed-up never rises above 40. For 120 cores and above, Metos3D is at least 4 times faster. Interestingly, for low numbers of processes a significant drop in performance can be observed for both drivers. The implications of this are discussed briefly in Sect. 7. We did not investigate this effect any further, however, since the results presented here already provide a good guideline.

Convergence control
After this basic verification and the review of some technical aspects of our implementation, we will now investigate the settings that control convergence of the Newton-Krylov solver. Once again we use only the MITgcm-PO4-DOP model. Our intention here is to eliminate the over-solving we observed during the first 200 iterations as shown in Fig. 4. This effect occurs if the accuracy of the inner solver is significantly higher than the resulting Newton residual requires (cf. Eisenstat and Walker, 1996). The relation between the two is controlled by the parameters γ and α used in Eq. (8).
To investigate the influence of these parameters on convergence, we compute the reference solution described in Sect. 6.1 using different values of γ and α. We set the overall tolerance to the difference measured between consecutive states after 3000 model years of spin-up, i.e., approximately 9.0 × 10^-4. γ is varied from 0.5 to 1.0 and α from 1.1 to 1.6, each in steps of 0.1. This makes for a total of 36 model evaluations.
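The 6 × 6 grid of parameter values described above can be written down as a small sketch (hypothetical, not the actual experiment driver):

```python
# gamma in 0.5 ... 1.0 and alpha in 1.1 ... 1.6, each in steps of 0.1,
# giving 36 (gamma, alpha) combinations, i.e., 36 model evaluations.
gammas = [round(0.5 + 0.1 * i, 1) for i in range(6)]
alphas = [round(1.1 + 0.1 * i, 1) for i in range(6)]
samples = [(g, a) for g in gammas for a in alphas]
```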
Figure 20 depicts the number of model years and Newton steps required as a function of γ and α. We observe that the overall number of years decreases as the two parameters tend towards 1.0 and 1.1, respectively. In contrast, the number of Newton steps increases; i.e., the Newton residual is computed more often and the inner steps become shorter.
Consequently, since the computation of one residual is negligible in comparison to the simulation of 1 model year, we focus on decreasing the overall number of model years. A detailed inspection of the results reveals that for γ = 1.0 and α = 1.2 the solver reaches the tolerance set above after approximately 450 model years, which is significantly fewer than the 600 years needed with the default settings.
We therefore use these values for our next experiment. As before, we set the overall tolerance to a value comparable to 3000 spin-up iterations and let the Newton solver compute a solution for each parameter sample.

Parameter samples
Figure 21 shows histograms of the total number of model years and Newton steps required to solve the model equations. We observe that most computations converge after 400 to 550 model years and require 10 to 30 Newton steps. Interestingly, there is a high peak around 15 Newton steps and a smaller one around 12. We also find some outliers in both graphs. Nevertheless, all model evaluations we started converged towards a solution within the desired tolerance.

Conclusions
We designed and implemented a simulation framework for the computation of steady annual cycles of a generalized class of 3-D marine ecosystem models, driven by transport matrices pre-computed in an offline mode. Our framework allows the computation of steady cycles by spin-up or by a globalized Newton method. The software has been realized as open-source code throughout.
We also introduced a software interface for water-column-based biogeochemical models. We demonstrated the applicability and flexibility of this interface by coupling the biogeochemical component used in the MITgcm general circulation model to the simulation framework. To test the general usability of the interface, we then coupled our own implementations of five different biogeochemical models of varying complexity (already used in Kriest et al., 2010) to the framework. The source code of these models is also available as part of the software package and may serve as a template for the implementation or adaptation of other models.
We implemented a transient solver based on the transport matrix approach, where all matrix operations and evaluations of the biogeochemical models are spatially parallelized via MPI using the PETSc library. The transport matrices needed for this process are available directly and require no pre-processing.
We realized both a spin-up (or fixed-point iteration) and a globalized Newton solver for the computation of steady cycles. We compared the performance of both solvers and made the following observations: both delivered the same results (up to a reasonable precision) on convergence. The spin-up converged when using standard sets of parameters, which were taken from Kriest et al. (2010), and equally distributed values for all tracers. The Newton solver did the same for the four models of lower complexity. It did not converge for the other two models when using standard parameter settings and an initial distribution of tracers as described above. For both of these more complex models, convergence could be achieved by increasing the number of inner iterations of the Newton solver, which is realized by decreasing the parameter γ in Eq. (8). For one of these models, convergence could also be achieved by choosing a different initial guess. With regard to performance, the Newton solver was about 6 times faster for all models. It can be concluded that for complex models the Newton method requires more attention to solver parameter settings, but is then superior to the spin-up, at least when using parameter sets as described above. In a next step we investigated how the performance of the Newton method is influenced by the two solver parameters α and γ in Eq. (8), using one model as an example.
Employing the optimal choice derived from these experiments (and one model parameter set), we then studied the number of Newton iterations and overall model years needed for 100 Latin hypercube model parameter samples. This is an important test for the usability of the Newton method in various kinds of optimization runs, for example if model parameters are varied by the optimizer. As it turned out, there was a certain variance in the number of steps needed and thus in the overall effort, but there were no extreme outliers. Our conclusion is that the Newton method is appropriate for optimization, at least for this model, and faster than the usually robust spin-up.
We further analyzed what proportion of the computational time is used by the different parts of our software during the simulation of 1 model year. Our experiments showed that with an increasing number of tracers the matrix-vector operations start to dominate the process, thus offering the greatest potential for further performance tuning. This was the case even though the transport operator was the same for every tracer. In contrast, the biogeochemical interactions contained in the nonlinear coupling terms q_j, which are mostly spatially local, become less performance-relevant as the number of tracers increases.
Finally, we implemented a load balancing mechanism that exploits the fact that water columns in the ocean vary in depth, resulting in data vectors of variable length. Using this balancing method, a close to optimal speed-up by spatial parallelization was achieved up to the relatively high number of 140 processes. This results in an acceleration factor of 4 compared to the TMM framework. The factor even increases to 5 if 200 processes are used; however, in that case 20 % of the computational resources are already wasted.
To summarize, the software framework presented here offers high flexibility with respect to models and steady cycle solvers. The implemented load balancing scheme results in a significant improvement in parallel performance. In particular, the Newton solver employed can be tuned to converge for all six biogeochemical models.

Code availability
Name of software: Metos3D (Simulation Package v0.3.2)
Developer: Jaroslaw Piwonski
Year first available: 2012
Software required: PETSc 3.3
Program languages: C, C++, Fortran
Size of installation: 1.6 GB
Availability and costs: free software, GPLv3
Software homepage: https://metos3d.github.com/metos3d
The toolkit is maintained using the distributed revision control system Git. All source code is available at GitHub (https://github.com). The current versions of simpack and model are tagged as v0.3.2. The data repository is tagged as version v0.2. All experiments presented in this work were carried out using these versions. Associated material is stored in the 2016-GMD-Metos3D repository.
To install the software, users should visit the homepage and follow the instructions. Future installations will reflect the state of development at that point in time, but users may still retrieve the versions used in this work by invoking git checkout v0.3.2 in the simpack and model repositories, as well as git checkout v0.2 in the data repository.


Figure 3 .
Figure 3. Land-sea mask (geometry data) of the numerical model used. Shown is the number of layers per grid point. Note that the Arctic has been filled in.


Figure 4 .
Figure 4. MITgcm-PO4-DOP model: convergence towards an annual cycle. Spin-up: norm of the difference between the initial states of consecutive model years (solid line). Newton-Krylov: residual norm at each Newton step (diamond) and norm of the GMRES residual during solving (solid line in between).

Figure 5 .
Figure 5. MITgcm-PO4-DOP model: difference between the phosphate concentration of the spin-up and the Newton solution at the first layer (0-50 m) in the Euclidean norm. Units are mmol P m^-3.

Table 3 .
Parameter values used for the solver experiments with the N, N-DOP, NP-DOP, NPZ-DOP and NPZD-DOP model hierarchy. (Column headers: Parameter, N, N-DOP, NP-DOP, NPZ-DOP, NPZD-DOP, Unit.)

Figure 6 .
Figure 6. N model: convergence towards an annual cycle using the spin-up and the Newton-Krylov solver.

Figure 7 .
Figure 7. N-DOP model: convergence towards an annual cycle using the spin-up and the Newton-Krylov solver.

Figure 8 .
Figure 8. N model: difference between the phosphate concentration of the spin-up and the Newton solution at the first layer (0-50 m) in the Euclidean norm. Units are mmol P m^-3.

Figure 9 .Figure 10 .
Figure 9. N-DOP model: difference between the phosphate concentration of the spin-up and the Newton solution at the first layer (0-50 m) in the Euclidean norm. Units are mmol P m^-3.

Figure 11 .
Figure 11. NP-DOP model: difference between the phosphate concentration of the spin-up and the Newton solution at the first layer (0-50 m) in the Euclidean norm. Units are mmol P m^-3.

Figure 14 .Figure 15 .
Figure 14. NPZ-DOP model: difference between the phosphate concentration of the spin-up and the Newton solution at the first layer (0-50 m) in the Euclidean norm. Units are mmol P m^-3.

Figure 16 .Figure 17 .
Figure 16. Distribution of computational time among main operations during integration of 1 model year. Left: MITgcm-PO4-DOP model. Right: N model.

Figure 18 .
Figure 18. Distribution of computational time among main operations during integration of 1 model year. Left: NPZ-DOP model. Right: NPZD-DOP model.

Figure 19 .
Figure 19. MITgcm-PO4-DOP model: ideal and actual speed-up factors and efficiency of parallelized computations. The term "theoretical" here refers to the use of the load distribution introduced in Sect. 6.3.

Figure 20 .
Figure 20. MITgcm-PO4-DOP model: number of model years and Newton steps required for the computation of the annual cycle y(u_d) as a function of the convergence control parameters α and γ (cf. Eq. 8).

Figure 21 .
Figure 21. Distribution of the number of model years and Newton steps required for the computation of one annual cycle using 100 random parameter samples (cf. Sect. 6.5).

Table 2 .
Parameters of the MITgcm-PO4-DOP model. Specified are the values used by Dutkiewicz et al. (2005) and the values used for the computation of the reference solution (u_d). Shown are furthermore the lower and upper (b_u) boundaries used for the parameter samples experiment.

Table 1 .
Vertical layers of the numerical model, in meters.