Porting marine ecosystem model spin-up using transport matrices to GPUs

We have ported an implementation of the spin-up for marine ecosystem models based on transport matrices to graphics processing units (GPUs). The original implementation was designed for distributed-memory architectures and uses the Portable, Extensible Toolkit for Scientific Computation (PETSc) library, which is based on the Message Passing Interface (MPI) standard. The spin-up computes a steady seasonal cycle of ecosystem tracers with climatological ocean circulation data as forcing. Since the transport is linear with respect to the tracers, the resulting operator is represented by matrices. Each iteration of the spin-up involves two matrix-vector multiplications and the evaluation of the used biogeochemical model. The original code was written in C and Fortran. On the GPU, we use the Compute Unified Device Architecture (CUDA) standard, a customized version of PETSc and a commercial CUDA Fortran compiler. We describe the extensions to PETSc and the modifications of the original C and Fortran code that were necessary. Here we make use of freely available libraries for the GPU. We analyze the computational effort of the main parts of the spin-up for two exemplary ecosystem models and compare the overall computational time to that needed on different CPUs. The results show that a consumer GPU can compete with a significant number of cluster CPUs without further code optimization.


Introduction
This work is motivated by the usually huge effort that is needed when computing steady annual cycles (or, mathematically speaking, periodic solutions) of spatially three-dimensional marine ecosystem models. In most cases this is done by "spinning up" the model, i.e. by using a time-stepping algorithm with climatological, periodic forcing data until the steady cycle is reached, at least up to a certain tolerance. This can take a huge number of iterations, in typical cases about 3000 to 5000 model years, each of which involves thousands of time steps (e.g. 2880 steps for a three-hour step size). Thus the overall number of iterations may be in the range of 10^6 to 10^7. When aiming at parameter optimization or sensitivity studies, the spin-up process has to be repeated several times, and thus in these cases a reduction of the computational time of a single spin-up run is even more important.
There are several strategies to reduce this computational effort. The following ones are more or less independent of each other: one of them is of course parallelization, usually by domain decomposition methods. The second one is the usage of precomputed transport matrices (see Khatiwala, 2007) that represent the (possibly linearized) tracer transport scheme applied in an ocean model. Monthly averaged matrices for the explicit and the implicit parts of the ocean tracer transport operator are usually used. In the ecosystem spin-up, these "climatological" matrices are then interpolated accordingly in every time step. With this method, the transport part of the ecosystem model reduces to matrix-vector multiplications, whereas the biogeochemical source-minus-sink terms are evaluated separately. A third way to reduce computational effort is to replace the standard spin-up (which, in mathematical terms, is a fixed-point iteration) by variants of Newton's method, which have higher convergence rates.

In this work we start from an implementation of a spin-up that applies the first two strategies. In order to drive the biogeochemical tracers, the software handles transport matrices that are stored in a common sparse format. Moreover, it uses routines of the Portable, Extensible Toolkit for Scientific Computation (PETSc; Balay et al., 1997, 2012) library to perform matrix-vector multiplications in parallel. The main advantages of this toolkit are that all Message Passing Interface (MPI; Walker and Dongarra, 1996) calls are hidden in built-in functions, and that optimized functions for matrix-vector operations (and more) already exist. The resulting software can be coupled with a wide range of biogeochemical models, as long as they conform to a rather flexible and general interface.
The main focus of this work is to describe the necessary changes to the software to port it to GPU hardware and to determine the resulting speed-up. High-performance computing on GPUs or other special, highly parallel hardware is becoming more and more attractive in climate and geophysical research as well (e.g. Hanappe et al., 2011; Horn, 2012). To our knowledge there is no publication about using GPUs for marine ecosystem simulations. Since sparse matrix-vector multiplication (SpMVM) is an integral part of our spin-up implementation, this work is clearly motivated by the performance gains (up to a speed-up of 24) achieved by the algorithms presented by Bell and Garland (2008). Moreover, we are interested in the behavior of the incorporated biogeochemical models ported to the GPU. For this purpose, we take here two examples with two tracers each. One of them is a simple linear model, describing for example the radioactive decay of two compounds. The second one is a well-known biogeochemical model that serves as a basis for more complex descriptions of the interplay of ocean biota and its major nutrients. It was used for numerical experiments by Parekh et al. (2005) and Kriest et al. (2010), for example.
Since we want to explicitly show what steps were necessary for the mentioned CPU-to-GPU port, we start by describing the original software for the ecosystem spin-up and the used biogeochemical models in Sect. 2. Afterwards we describe the standards, tools and libraries used for GPU programming in Sect. 3. We then show which GPU-adapted software can be used and what kind of adaptation we additionally had to make in Sect. 4. We then show numerical results in Sect. 5 for the two models, both on CPU and GPU hardware. Finally, we conclude our work and give an outlook in Sect. 6.

Coupled marine tracer transport simulation using transport matrices
A marine ecosystem is usually modeled as a system of equations for the ocean circulation and the transport of temperature, salinity and the incorporated biogeochemical tracers, including their interactions. A fully coupled simulation, reflecting the fact that tracers are advected by the ocean circulation, that their diffusion is dominated by the turbulent mixing of marine water, and that, vice versa, a tracer concentration may affect the ocean circulation, is computationally expensive. Even on high-performance hardware, such a coupled (also called "online") simulation in three spatial dimensions is restricted to single model evaluations only, especially if steady annual cycles, which require long-term spin-ups, are under investigation.
In contrast, a so-called "offline" computation is a simplified approach for tracers that are (or are regarded as) "passive", i.e. they do not affect the ocean physics, or this influence is neglected.This results in a one-way coupling from the ocean circulation to the tracer dynamics only, where the precomputed circulation data (advection velocity vector field v, mixing coefficient κ, temperature, and optionally salinity) enter the tracer transport equations as forcing.
With these data given, a marine ecosystem model considered in an offline computation consists of the following system of parabolic partial differential equations (here for n tracers y_i, summarized in the vector y = (y_i)_{i=1,...,n}):

    ∂y_i/∂t = ∇ · (κ ∇y_i) − ∇ · (v y_i) + q_i(y),   i = 1, ..., n,   (1)

in the space-time cylinder Ω × [0, T], with Ω ⊂ R^3 being the spatial domain (i.e. the ocean) and [0, T], T > 0, the time interval. Here, we neglect the additional dependency on the space and time coordinates (x, t) in the notation for brevity. Additionally, homogeneous Neumann boundary conditions on Γ = ∂Ω are imposed for all tracers y_i. The source-minus-sink or coupling terms q_i are in general nonlinear and represent growth, dying, and tracer interaction. Each of them does not necessarily depend on all tracers in y, but usually on more than y_i itself. The q_i also include model parameters (such as growth and dying rates, sinking velocities, etc.) that are often subject to identification or estimation. They are usually spatially and temporally constant and are not mentioned explicitly here.

Transport matrices
Since in an offline simulation the ocean circulation data are only used as pre-computed input for the tracer transport equations (Eq. 1), the spatial differential operators therein can be represented by a linear operator and the equations can be formally written as

    ∂y_i/∂t = L(κ, v, t) y_i + q_i(y),   i = 1, ..., n.   (2)

Here, L(κ, v, t) is a linear operator comprising the whole transport, i.e. diffusion and advection, for the given ocean circulation data κ and v. It is time-dependent since the circulation data also depend on time, both in the case of a transient simulation and when a steady annual cycle driven by climatological data is sought. The operator L is identical for all tracers if the molecular diffusion of the tracers is small compared to the turbulent mixing, which is a reasonable simplification.
The idea of the Transport Matrix Method (TMM) introduced in Khatiwala et al. (2005) is to compute or approximate the matrices that represent an appropriate discretization of L. This is done by running time steps of the ocean model that has produced the circulation data v, κ, etc., with special, only locally non-zero initial distributions for one tracer. By varying the support of the initial distributions over the whole spatial domain, an approximation for one or several time steps can be obtained, which can then be used to build up a matrix representation of L. A comprehensive discussion of the temporal and spatial discretization as well as the process of evaluating transport matrices, especially in combination with operator splitting schemes, can be found in Khatiwala et al. (2005). For our results we used twelve implicit and twelve explicit transport matrices, which represent monthly averaged diffusion and advection. The matrices are interpolated linearly to the corresponding discrete time step during the simulation.
As a result, we obtain the following fully discrete (in time and space) scheme, where we now denote by y_j the appropriately arranged vector of the values of all n tracers at all spatial grid points at time step j. In the same way, we denote by q_j the vector of the discretized source-minus-sink terms at all spatial grid points in time step j. Using the TMM with a fixed time step size τ, the time integration scheme for Eq. (2) reads

    y_{j+1} = A_imp,j (A_exp,j y_j + τ q_j(y_j)) =: ϕ_j(y_j).   (3)

Here n_τ is the total number of time steps and A_imp,j, A_exp,j are the implicit and explicit transport matrices at time step j = 0, ..., n_τ − 1. The matrices are block-diagonal and sparse and depend on the used time-stepping scheme: if, as a simple and unrealistic example, the whole system were solved explicitly by an Euler step, A_imp,j would be the identity and A_exp,j would be the discrete counterpart of I + τ L(κ, v, t_j). Summarizing, starting from a vector y_0 of initial values, each step in the time integration scheme (Eq. 3) to solve the tracer transport equations (Eq. 1) consists of the evaluation of the source-minus-sink term and two matrix-vector multiplications per tracer.
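To make the structure of Eq. (3) concrete, the following minimal sketch shows one time step for a single tracer as it could be written with PETSc. It is not the original code: evalBGC is a placeholder for the evaluation of the source-minus-sink term, and Aexp and Aimp are assumed to already hold the interpolated explicit and implicit transport matrices for time step j.

    #include <petscmat.h>

    extern void evalBGC(Vec y, Vec q);  /* hypothetical wrapper around the biogeochemical model */

    /* One TMM time step for a single tracer (sketch):
     *   y <- A_imp ( A_exp y + tau * q(y) )                                   */
    static PetscErrorCode TimeStepTMM(Mat Aexp, Mat Aimp, PetscScalar tau,
                                      Vec y, Vec q, Vec ytmp)
    {
      evalBGC(y, q);                /* placeholder: q = q_j(y)                  */
      MatMult(Aexp, y, ytmp);       /* explicit step: ytmp = A_exp * y          */
      VecAXPY(ytmp, tau, q);        /* add scaled source term: ytmp += tau * q  */
      MatMult(Aimp, ytmp, y);       /* implicit step: y = A_imp * ytmp          */
      return 0;
    }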
Table 1 shows typical values for the sizes and sparsity of transport matrices generated by the MIT General Circulation Model (MITgcm; Marshall et al., 1997) for two spatial resolutions; see Khatiwala et al. (2005) and Piwonski and Slawig (2012). Since we deal with square matrices and the sparsity patterns remain the same throughout the whole spin-up process, a characterization of the used matrices by the number of rows (nrows) and the number of non-zero elements (nnz) is sufficient for our purpose. Figure 1 shows the sparsity patterns. The matrix entries are stored as double-precision values.

Computation of steady annual cycles
Computing a periodic solution of the discretized system (Eq. 3) means looking for a fixed point of the mapping Φ that maps an initial value y_0 to the end of the trajectory (y_j)_{j=0,...,n_τ} generated by Eq. (3), i.e.

    Φ(y_0) := y_{n_τ} = (ϕ_{n_τ−1} ∘ ... ∘ ϕ_0)(y_0).   (4)

Thus one application of the mapping Φ corresponds to the computation of one year of model time (or model year). The time step used in our computations was 3 h, which corresponds (taking 360 days a year) to n_τ = 2880. The discretization of the biogeochemical terms q_i may include shorter time steps (typically 8 per outer 3-h step).
The whole iteration to compute a steady cycle (or fixed point) now consists of a repeated application of the mapping Φ:

    y^{l+1} = Φ(y^l),   l = 0, ..., n_l − 1,   (5)

where y^l is the vector of discretized tracers after l model years, i.e. y^l = y_{l·n_τ}, and n_l is the total number of model years necessary to reach a steady annual cycle. The resulting structure of the spin-up is sketched in Algorithm 1.
From several computations it can be observed that after about n_l = 3000 iterations, a numerical steady solution (up to an accuracy of about 10^−2 in the discrete L^2(Ω)^n norm) is obtained. Thus we refer to this as a "converged steady annual cycle". This value of n_l was also used in Kriest et al. (2010). The residual can be further decreased by using a higher number n_l of model years.
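A hedged sketch of how such a stopping criterion could be checked with PETSc is given below; it is not the original code, runModelYear() is a placeholder for one application of the mapping Φ (i.e. n_τ time steps of Eq. 3), and yold and diff are preallocated work vectors.

    #include <petscvec.h>

    extern PetscErrorCode runModelYear(Vec y);  /* hypothetical: applies Eq. (3) n_tau times */

    /* Spin-up loop with a simple convergence check (sketch):
     * stop once the change of the tracer vector over one model year is small. */
    static PetscErrorCode SpinUp(Vec y, Vec yold, Vec diff, PetscInt nl)
    {
      PetscReal res;
      for (PetscInt l = 0; l < nl; l++) {
        VecCopy(y, yold);                 /* remember the state at year start   */
        runModelYear(y);                  /* placeholder: y = Phi(y)            */
        VecWAXPY(diff, -1.0, yold, y);    /* diff = y - yold                    */
        VecNorm(diff, NORM_2, &res);      /* discrete L2 norm of the change     */
        if (res < 1.0e-2) break;          /* "converged steady annual cycle"    */
      }
      return 0;
    }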

Applying parallel algorithms using the PETSc library
Obviously, a parallelization of the matrix-vector multiplication occurring in every time step can significantly speed up the process of computing the steady annual cycle by the pseudo-time stepping (or fixed-point iteration) described above. In the CPU setting (e.g. Piwonski and Slawig, 2012) the parallelization is carried out on a multi-processor, distributed-memory architecture. In order to avoid the direct implementation of MPI directives, we make use of the PETSc library.
It is a collection of data structures and algorithms for the parallel solution of numerical problems and provides interfaces (APIs) to programming languages such as Fortran, C, C++, Python, and MATLAB®. The main advantages of PETSc for our application are the parallelized matrix-vector multiplication routines and the usage of an efficient sparse matrix storage format, in our case the default PETSc format, namely the "AIJ" or "Yale sparse" or CSR (compressed sparse row) format.
In our original implementation, the biogeochemical part (Algorithm 1, line 4) is implemented in Fortran, whereas the remainder of the code is realized in C. There is a difference with respect to the access of the tracer data that becomes important later on the GPU: for the biogeochemical computations (line 4), the values of the separate tracers and also of different spatial grid points (compare Eq. 6) are needed simultaneously. In contrast, the matrix-vector products (lines 6, 7) are executed separately for each tracer, thus allowing us to store and work with one block of the transport matrices only. Each matrix-vector product is computed by one call to the PETSc routine MatMult().
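As an illustration of this per-tracer treatment, the following hedged sketch (not the original code, with placeholder names) loops over the n tracers and applies the same block of the interpolated explicit matrix to each tracer vector with MatMult().

    #include <petscmat.h>

    /* Explicit transport step applied tracer by tracer (sketch):
     * every tracer vector is multiplied with the same matrix block Aexpwork. */
    static PetscErrorCode ExplicitStepAllTracers(Mat Aexpwork, PetscInt n,
                                                 Vec y[], Vec ytmp[])
    {
      for (PetscInt i = 0; i < n; i++) {
        MatMult(Aexpwork, y[i], ytmp[i]);   /* ytmp_i = A_exp * y_i */
      }
      return 0;
    }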
For the interpolation step in line 5, three other PETSc routines (MatCopy(), MatScale(), and MatAXPY(), applied to the explicit and implicit matrices separately) are used to compute the appropriately weighted matrices. These three routines together compute a linear interpolant or convex combination of two succeeding monthly averaged matrices, which are stored in the array A starting at index i_alpha and i_beta, respectively. Thus the interpolation computes A_work = alpha * A[i_alpha] + beta * A[i_beta], which gives the desired interpolated matrix in A_work, if alpha, i_alpha and beta, i_beta are chosen correctly with respect to the time step j.
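A minimal sketch of this interpolation with the named PETSc routines could look as follows. It assumes that all matrices in A share the same non-zero pattern and that alpha, beta and the indices have already been computed for time step j; everything beyond the names taken from the description above is an assumption, not the original code.

    #include <petscmat.h>

    /* Interpolate two monthly averaged transport matrices (sketch):
     * Awork = alpha * A[ialpha] + beta * A[ibeta]                            */
    static PetscErrorCode InterpolateMatrix(Mat A[], PetscInt ialpha, PetscInt ibeta,
                                            PetscScalar alpha, PetscScalar beta,
                                            Mat Awork)
    {
      MatCopy(A[ialpha], Awork, SAME_NONZERO_PATTERN);      /* Awork  = A[ialpha]        */
      MatScale(Awork, alpha);                               /* Awork *= alpha            */
      MatAXPY(Awork, beta, A[ibeta], SAME_NONZERO_PATTERN); /* Awork += beta * A[ibeta]  */
      return 0;
    }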

Ecosystem and biogeochemical model examples
We use two simple models to test the computational gain possible with the GPU hardware. Each of them has two tracers (i.e. n = 2 in Eq. 1 and thereafter). Source codes for both models are available at Piwonski and Slawig (2012).
The first one is a simple radioactive decay model, which is uncoupled and has the autonomous source-minus-sink terms

    q_i(y) = −λ_i y_i,   i = 1, 2.

The parameters λ_1, λ_2 > 0 are the decay rates of the two radioactive elements. We chose iodine I-131 with λ_1 ≈ 44.88 and caesium Cs-137 with λ_2 ≈ 0.0331. This uncoupled model is used in order to test the gain in CPU time for the pure matrix-vector multiplication and interpolation in the TMM.
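For illustration only, the per-grid-point evaluation of this decay model amounts to little more than a scaling of the two tracer values; a hedged C sketch (the function name and argument layout are our own, not the original Fortran interface) is:

    /* Source-minus-sink term of the I-Cs decay model at one grid point (sketch):
     * q_i = -lambda_i * y_i for the two tracers.                               */
    static void decaySourceTerm(const double lambda[2], const double y[2], double q[2])
    {
      q[0] = -lambda[0] * y[0];   /* e.g. I-131  */
      q[1] = -lambda[1] * y[1];   /* e.g. Cs-137 */
    }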
The second model is a typical biogeochemical model, including both coupling and nonlinearities. It is based on the N-DOP model described in Parekh et al. (2005), which was also used in Kriest et al. (2010), from which we basically take the notation. The model incorporates phosphate (nutrients, N, y_1) and dissolved organic phosphorus (DOP, y_2). The source-minus-sink term is split up into the upper, sun-lit or productive euphotic zone Ω_1 with depth z′, and the lower, aphotic zone Ω_2, z being the vertical coordinate. The biological production is calculated as a function of nutrients y_1 and light I. The dependence on the latter is omitted here in the notation for brevity. The production is limited by a half-saturation function, also known as Michaelis-Menten kinetics, and a maximum production rate parameter α. Light is modeled as a portion of shortwave radiation I_SWR, which is computed as a function of latitude and season following the astronomical formula of Paltridge and Platt (1976). The portion depends on the photosynthetically available radiation σ_PAR = 0.4, the ice cover σ_ice, and the exponential attenuation of water.

Algorithm 1: Marine ecosystem spin-up using the TMM.
Require: set of monthly averaged transport matrices A_imp, A_exp, initial tracer distribution y_0, time step τ
Ensure: at the end, y is a tracer distribution (at one point in time) of a steady annual cycle
 1  y = y_0
 2  repeat
 3      for j = 0, ..., n_τ − 1 do
 4          compute biogeochemical source-minus-sink terms: ỹ = q_j(y)
 5          interpolate the monthly averaged transport matrices to the current time step j
 6          perform explicit step: ŷ = A_exp,j y
 7          perform implicit step: y = A_imp,j (ŷ + τ ỹ)
 8      end
 9  until steady annual cycle is reached
A fraction σ of the biological production remains suspended in the water column as dissolved organic phosphorus, which remineralizes with a rate λ. The remainder of the production sinks as particulate matter to depth, where it is remineralized according to the empirical power-law relationship determined by Martin et al. (1987). Similar modeling of biological production can be found, for example, in Dutkiewicz et al. (2005). Algorithm 2 sketches the implementation of the N-DOP model, whereas the model parameters are given in Table 2.

GPU computing with CUDA
In this section we describe the basic architecture of GPUs and give an overview of some useful libraries. We concentrate on NVIDIA's Compute Unified Device Architecture (CUDA; NVIDIA Corporation, 2012). One alternative is, for example, OpenCL (The Khronos Group, 2012). NVIDIA, as one of the leading producers of graphics cards, has developed its own parallel architecture for executing computationally expensive code on GPUs. By exploiting the architecture of graphics cards as well as the increased memory bandwidth, it is possible to perform a far greater number of floating point operations per second (FLOPS) than on CPUs.
While CPUs have about one to eight cores, each with up to 4 GHz clock rate, GPUs nowadays have a lower clock rate but hundreds of cores, which can run multiple threads simultaneously.
The basic unit of the CUDA programming model is called a kernel. A kernel is a piece of program code invoked on the CPU host and executed on the GPU device by threads. These threads are organized in a "grid" of thread "blocks". A call to kernel<<<gridSize, blockSize>>>(); creates gridSize blocks of blockSize threads ready for execution, whereas the order of processing the blocks depends on the hardware. The GPU hardware consists of several Streaming Multiprocessors (SMs). Each SM has its own buffer memory, registers, and a number of cores. The cores have their own units for integer and floating-point calculation. For example, the GeForce GTX 480 used here has 15 SMs with 32 cores each, i.e. a total of 480 cores. The smallest unit scheduled for execution on an SM is a "warp", which consists of 32 threads. The total number of threads that can run simultaneously on a multiprocessor depends on the Compute Capability (CC) of the graphics chip. For the GTX 480 the limit is 1536 threads, which results in a maximum number of concurrent threads for the entire GPU of 15 × 1536 = 23040 (NVIDIA Corporation, 2011, p. 159).
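As a brief illustration of the kernel, grid and block terminology (this is not code from the ported software, and all names and sizes are placeholders), the following CUDA C sketch launches one thread per vector element to scale a tracer vector on the device:

    #include <cuda_runtime.h>

    /* Minimal CUDA kernel: each thread scales one element of a vector. */
    __global__ void scaleKernel(double *y, double lambda, int n)
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index     */
      if (i < n) y[i] *= lambda;                       /* guard against overrun   */
    }

    int main(void)
    {
      const int n = 1 << 20;
      double *d_y;
      cudaMalloc(&d_y, n * sizeof(double));            /* allocate device memory  */
      /* ... copy data from host to device with cudaMemcpy ... */
      int blockSize = 160;                             /* threads per block       */
      int gridSize  = (n + blockSize - 1) / blockSize; /* blocks in the grid      */
      scaleKernel<<<gridSize, blockSize>>>(d_y, 0.5, n);
      cudaDeviceSynchronize();                         /* wait for the kernel     */
      cudaFree(d_y);
      return 0;
    }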
The device memory on the GPU is divided into three types of physical and virtual portions. First, a thread has access to its own private memory, which is, depending on the CC, between 16 kB and 512 kB. Second, threads within one block have access to a shared memory of between 16 kB and 48 kB. Finally, all threads have access to a shared global memory whose size is limited by the total amount of memory of the GPU. In order to run kernel code on the GPU, all data must be transferred from the host memory of the CPU to the device memory on the GPU.
NVIDIA provides a compiler (nvcc) that translates C code into the CUDA instruction set (called PTX) and behaves similarly to the C compiler (gcc) included in the GNU Compiler Collection (GCC). A port of the GNU debugger gdb is also included in the CUDA toolkit.

Libraries
We make use of libraries that provide basic algorithms for working with GPUs. The first one is Thrust (Bell and Hoberock, 2011), a C++ collection of generic algorithms, similar to the C++ Standard Template Library (STL), that exploits the parallelism of the GPU in a transparent way. Using Thrust, many problems can be solved without even writing code for the GPU. For documentation and sample code we refer to Hoberock and Bell (2012).
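A minimal, hedged Thrust example (illustrative only, not part of the ported code) that sums a vector on the GPU without an explicit kernel could look like this:

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>

    int main(void)
    {
      thrust::device_vector<double> y(1000, 1.0);   /* vector stored in GPU memory   */
      /* The reduction runs on the GPU; no hand-written kernel is needed. */
      double sum = thrust::reduce(y.begin(), y.end(), 0.0);
      return (sum > 0.0) ? 0 : 1;
    }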
The second library is Cusp (Bell and Garland, 2010), which provides data types for sparse matrices and algorithms for basic linear algebra operations on them. All data structures in Cusp have a template parameter that determines whether they are stored in CPU or GPU memory. Operations on the data then take place in the respective storage area. For our application, the structure cusp::csr_matrix for the CSR format and the matrix-vector multiplication routine cusp::multiply, which uses the algorithm described in Bell and Garland (2008, 2009) that was specially developed for GPUs, are of particular importance. Documentation and sample code can be found at Bell and Garland (2010).
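The following hedged sketch illustrates the two Cusp building blocks mentioned above; the matrix contents are placeholders (a small test matrix from Cusp's gallery, not a transport matrix) and the code is not taken from the ported software.

    #include <cusp/csr_matrix.h>
    #include <cusp/array1d.h>
    #include <cusp/multiply.h>
    #include <cusp/gallery/poisson.h>

    int main(void)
    {
      /* Sparse matrix in CSR format, stored in GPU memory. */
      cusp::csr_matrix<int, double, cusp::device_memory> A;
      cusp::gallery::poisson5pt(A, 64, 64);               /* placeholder test matrix */

      cusp::array1d<double, cusp::device_memory> x(A.num_cols, 1.0);
      cusp::array1d<double, cusp::device_memory> y(A.num_rows, 0.0);

      cusp::multiply(A, x, y);                             /* y = A * x on the GPU    */
      return 0;
    }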
The third library we used was the preliminary implementation of PETSc for the CUDA architecture presented in Minden et al. (2010). With the help of the Thrust and Cusp libraries, a large part of the PETSc Vector and some parts of the Matrix class have been implemented. The fundamental problems of interaction of PETSc with the GPU have been resolved, but only the routines that were necessary for the example treated in Minden et al. (2010) have been implemented. Basically, this "PETSc GPU" extends the built-in structures by a flag that indicates in which memory the most recent data are stored. This guarantees that the correct data are available in (and if necessary copied to) the memory that is currently used. Here, we employed the developer PETSc library version 3.2-p5.

Port of the marine ecosystem simulation onto the GPU
We now describe the necessary modifications and extensions of the original program, which was running on a multi-processor CPU cluster, in order to perform the simulation on a GPU. Basically, these modifications are extensions of PETSc GPU, modifications necessary to use the CUDA Fortran compiler for the biogeochemical model code, and some routines for conversions between different data alignments.

Necessary extensions of PETSc GPU
The preliminary PETSc GPU implementation was designed to solve systems of equations, and thus not all functions necessary for our application were included. To avoid any copying of data between CPU and GPU storage that would have destroyed the speed-up, we had to extend the library. In our case, the three PETSc routines MatCopy(), MatScale(), and MatAXPY() mentioned in Sect. 2.2 had to be modified. If sparse matrices are used with PETSc on GPUs, the PETSc wrapper function MatCopy(Ain, Aout, ...); accesses MatCopy_SeqAIJCUSP() to copy the values from matrix Ain to Aout. Here, it is theoretically possible that both matrices currently reside in the GPU memory, in the CPU memory, or in both. For a complete and correct implementation, it would have been necessary to cover all these cases and accordingly select the memory the matrices are actually copied to. For our application it was sufficient to cover only the case where both matrices are in the GPU memory, thus only this case was implemented. Therefore, an additional call to MatCUSPCopyToGPU() in MatCopy_SeqAIJCUSP() ensures that both matrices are in the GPU memory.
The PETSc routines MatScale() and MatAXPY() implement typical linear algebra subproblems, which are only performed on the non-zero matrix elements. Consequently, they could be completely realized using the Cusp BLAS library for the GPU.
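Since the matrices involved share their sparsity pattern, scaling and adding them reduces to BLAS-1 operations on their value arrays. A hedged sketch of how such routines could be realized with Cusp (the function names are placeholders, not the actual PETSc GPU internals) is:

    #include <cusp/csr_matrix.h>
    #include <cusp/blas.h>

    typedef cusp::csr_matrix<int, double, cusp::device_memory> DeviceCSR;

    /* Sketch: MatScale/MatAXPY-like operations acting only on the stored
     * non-zeros, assuming both matrices have identical sparsity patterns. */
    void matScaleGPU(DeviceCSR& A, double alpha)
    {
      cusp::blas::scal(A.values, alpha);            /* A.values *= alpha            */
    }

    void matAXPYGPU(DeviceCSR& Y, double alpha, const DeviceCSR& X)
    {
      cusp::blas::axpy(X.values, Y.values, alpha);  /* Y.values += alpha * X.values */
    }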

PGI CUDA-Fortran
Many biogeochemical models are implemented in Fortran. Mostly, they are part of software that has evolved over decades (e.g. the MITgcm). Since we want to use them on GPUs without any modification of the original source code, we need a Fortran compiler and the appropriate libraries. At the time of this work there was only one candidate, namely the PGI CUDA Fortran compiler (The Portland Group, 2012). It extends the language by constructs for calling kernels as well as the CUDA API functions. The syntax of a kernel call in Fortran is call kernel<<<gridSize, blockSize>>>() and is thus similar to CUDA C++. There are some extensions compared to CUDA C++, but also some restrictions. For details we refer to the manual (The Portland Group, 2011a, p. 14).

Other extensions to the implementation on the CPU
As mentioned in Sect. 2.3, there are two different data alignments useful for the spin-up using the TMM: one for the biogeochemical source-minus-sink terms, where all tracers of a water column are kept in a contiguous piece of memory, and another one for the multiplication with the transport matrices, where every water column of a tracer is kept together to reduce the storage requirements for the matrices. Thus a copy operation between these two data alignments is necessary in every step of the algorithm. For the use on the GPU, three copying functions in the original code were additionally modified using the Thrust library.
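To illustrate the two layouts, the following sketch (our own illustration with made-up names, not the modified Thrust routines; the exact ordering within a column is an assumption) converts the tracer-separated storage into the interleaved per-water-column layout needed by the biogeochemical step, for one column of nz levels and n tracers:

    /* Convert from tracer-separated storage (one array per tracer, as used for
     * the matrix-vector products) to an interleaved per-water-column buffer
     * (as needed by the biogeochemical model).
     * yTracer[i][k]    : value of tracer i at vertical index k of this column
     * yColumn[k*n + i] : the same value in the interleaved column buffer      */
    void packColumn(int n, int nz, double **yTracer, double *yColumn)
    {
      for (int k = 0; k < nz; k++)
        for (int i = 0; i < n; i++)
          yColumn[k * n + i] = yTracer[i][k];
    }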

The compilation process for the GPU
Here we briefly sketch the overall compilation and linking process of the resulting code for the use on the GPU. The process is visualized in Fig. 2.
In a first step (top right in Fig. 2) the Fortran source code of the biogeochemical model is combined with the driver routine driver.CUF and processed to driver_model.CUF by the preprocessor of the C++ compiler pgcpp. The Fortran compiler pgfortran then generates the object file driver_model.o. The driver routine driver.CUF has two tasks: first, the Fortran compiler requires that all functions which shall run on the GPU are marked with the device attribute; see The Portland Group (2011b). Since the compiler has no ability to set default attributes for all functions, it is necessary to integrate them through a preprocessor macro, in which the Fortran keyword subroutine is replaced by attributes(device) subroutine. Second, the driver provides support functions for the three entry points into the biogeochemical model, namely (i) the evaluation of the source-minus-sink term, (ii) the initialization and (iii) the deinitialization of the model. These three functions need corresponding kernels for the GPU. This approach ensures that the original Fortran interface of the biogeochemical model remains unaltered.
In a second step (top left of Fig. 2) the original, unmodified C code is compiled with the MPI wrapper of the GNU C compiler, mpicc, while the CUDA extensions are translated with nvcc. Finally, all object code files are linked against the PGI Fortran libraries, which results in the final executable.

Numerical results
In this section we compare the performance of the spin-up on our CPU/GPU test hardware. We use the two models described in Sect. 2.4. A special emphasis lies on the time needed for the individual parts, namely the evaluation of the biogeochemical source-minus-sink term, the matrix interpolation and the matrix-vector multiplication. Moreover, we contrast the best GPU result for the N-DOP model with results from three different distributed-memory architectures.

Setup
The CPU/GPU test hardware consists of two GeForce GTX 480 graphics cards and two Intel® Xeon® E5520 CPUs running at 2.27 GHz. However, the following tests were performed on one GPU and one core of the CPU only. No display was connected to the graphics card, and computations on the GPU were performed in double precision, which is natively supported by the GTX 480. The theoretical peak performance of the GPU is 168 GFlop s^−1 and its internal bandwidth is 177 GB s^−1. The performance of one core of the CPU system is 9.08 GFlop s^−1, its bandwidth 21.2 GB s^−1.
To test a specific biogeochemical model, the software is compiled with the according source code and run for 100 model years. In detail, when the executable starts, the data (matrices, initial vectors, etc.) are copied into the CPU or GPU memory and 100 iterations of 2880 time steps each are performed consecutively. In the case of a GPU run, the results are copied back to CPU memory at the end.
Thus, the whole data set has to fit into the memory of the device (or host). This is the case if the 2.8125° horizontal resolution is used. Here, the 1.5 GB RAM of a GTX 480 (or 40 GB of the CPU system) is enough for about 1 GB of data. However, a monthly averaged set of transport matrices based on a 1° resolution (approximately 13 GB) is too large for the used GPU system. Such an amount of data requires a different approach (see Sect. 6). Hence, we focus on the 2.8125° resolution and omit profiling of data transfers between CPU and GPU memory.
When processing the source codes, the mpicc, mpif90 and nvcc compilers are switched to -O (i.e. optimize). For pgfortran no optimization flags are used. To perform time measurements, the profiling system of PETSc is applied. No further source code optimization is performed regarding the GPU.

Results
We start by examining the block size parameter for the Fortran kernel calls of the biogeochemical model. The block size describes the number of vertical profiles (or water columns) that are processed within a block. While the grid and block dimensions are calculated automatically when using Thrust or Cusp, for example, a suitable value for the Fortran kernel must be determined experimentally for the time being. For all tests we use just 100 model years (instead of the 3000 or more needed in practice, see Sect. 2.2) to render the numerical experiments feasible, especially when simulating the N-DOP model on the CPU, which still takes about 17 h. Figure 3 depicts the mean computational time over 100 model years spent on the GPU for the biogeochemical model steps, depending on the block size. In both models, strong fluctuations of up to 100 % occur. However, both graphs show a similar occurrence of minima and maxima. We suppose this is due to the unbalanced distribution of water columns (see Sect. 6). However, the absolute minimum (I-Cs: 0.38 s, N-DOP: 12.6 s) is obtained for a block size of 160.
This value is used for the subsequent test, in which every year is timed separately. Table 3 shows the minimum, maximum and mean of the computational time for one model year spent on the CPU and GPU. The standard deviation is small on the CPU (I-Cs: 0.47, N-DOP: 0.54) and marginal on the GPU (I-Cs: 0.002, N-DOP: 0.003). The overall reduction in computational time is about a factor of 10 for the simpler I-Cs model and about 22 for the more complex N-DOP model, a difference we investigate further.
Thus, the next tests focus on the individual steps within the repeat-until loop of Algorithm 1, corresponding to one annual cycle. The invoked routines are listed in Table 4. Their individual performance gain is depicted in Table 5. Regarding MatCopy, MatScale, MatAXPY and MatMult, we see a similar relative performance gain for both models of about 9 to 13. In contrast, BGCStep shows a speed-up of about 10 for the I-Cs model, whereas for N-DOP a ratio between the CPU and GPU of 36 can be observed. In addition, in Fig. 4 we see that 75 % of the overall time on the CPU, which is spent on the evaluation of the N-DOP model, is sped up by this factor on the GPU. This explains the overall ratio of 22. Note that the slightly higher average computational times in Fig. 4 (compared to those in Table 3) are due to the higher granularity of profiling. Moreover, we see that the computational effort for the I-Cs model, which is just a scaling of the tracer vector, is smaller than 3 % on both architectures. Here, the overall speed-up is dominated by the matrix operations.
Concerning the latter, we pick MatMult for a detailed view on performance and bandwidth and compare our results with those reported by Bell and Garland (2008). We calculate the number of floating point operations for one model year as

    n_ops = n_τ · n · 2 · (nnz_exp + nnz_imp),

which is the number of time steps per year times the number of tracers times the cost of the explicit plus implicit sparse matrix-vector multiplications, which is exactly twice the number of non-zeros. We consider the results from the N-DOP model and divide n_ops by 58.19 s (CPU) and 5.87 s (GPU), respectively. We obtain a performance of approximately 1.2 GFlop s^−1 for the CPU and 11.9 GFlop s^−1 for the GPU. This is about 13 % of the theoretical peak performance of one CPU core (9.08 GFlop s^−1) and about 7 % of 168 GFlop s^−1 for the GPU. The poor performance is due to the bandwidth limitation, which is typical for sparse matrix-vector multiplications. Following Bell and Garland (2008), we multiply n_ops by 10 Byte Flop^−1 (CSR vector kernel) and relate the result to the computational time spent on the CPU and GPU, respectively. We obtain 56.8 % (12 GB s^−1) of the theoretical bandwidth for the CPU and 67.4 % (119.4 GB s^−1) for the GPU. These figures in turn are satisfying and confirm a good performance of the CSR vector kernel used by MatMult. However, they also show that a sparse matrix-vector multiplication on a GTX 480, which is two generations ahead of the GTX 280 used by Bell and Garland (2008), is only slightly faster. Here, we refer to the 10 GFlop s^−1 achieved by the GTX 280 for "unstructured" matrices, compared to the 11.9 GFlop s^−1 achieved by the GTX 480 for the transport matrices. This is obviously due to the only slightly increased memory bandwidth from 141.7 GB s^−1 (GTX 280) to 177 GB s^−1 (GTX 480).
Nevertheless, motivated by the overall speed-up, we perform simulations of the N-DOP model on three different CPU clusters and relate them to the best performance on the GPU as a last comparison. Figure 5 shows that a GTX 480 can compete with approximately 56 Barcelona, 28 Westmere, and 17 Gainestown processors.

Conclusions
In order to port our existing implementation of the spin-up of marine ecosystem models using transport matrices from CPU to GPU hardware, modifications of our own code and extensions to the used libraries were necessary. This work required knowledge of the computing architecture, of the used CUDA programming framework and of the PETSc, Thrust and Cusp libraries. In order to compile Fortran code for the GPU, a commercial compiler was necessary.
Concerning the computational gain of the used biogeochemical models, we were surprised by the good performance of the N-DOP implementation. Here, we can only speculate about the reasons and see a need for a more detailed investigation. Considering the complexity of Algorithm 2, however, such an effort was out of the scope of this work. We thus only reported the results here.
Regarding MatMult, we observed a similarly good utilization of the memory bandwidth by the CSR vector kernel for the transport matrices as reported by Bell and Garland (2008) for "unstructured" matrices. Moreover, all matrix operations showed a satisfactory performance gain.
Our results motivate us to investigate other biogeochemical models and to get to the bottom of the significantly higher speed-up of the N-DOP model compared to the other operations. Additionally, we are eager to prepare the code for the usage of multiple GPUs and/or techniques of simultaneous copying and computing.

Fig. 1. One block of the explicit (left) and implicit (right) transport matrices A_exp, A_imp computed using the MITgcm for a 2.8125° resolution (output of MATLAB®'s spy command).



Fig. 2. Compilation and linking process of the spin-up for usage on the GPU.


Fig. 3. Computational time needed for the I-Cs (left) and N-DOP (right) model within one model year, depending on the block size.


Fig. 4. Fraction of the computational time needed for the individual parts in one year of the spin-up (Algorithm 1 and Table 4) for the I-Cs (top) and the N-DOP (bottom) model on the CPU (left) and GPU (right).


Fig. 5. Comparison between CPU clusters and the used GPU for one model year of the N-DOP model ("rzcluster" refers to the Kiel University cluster, "HLRN" to the cluster of the North-German Supercomputing Alliance).



Table 1. Resolution, sizes and sparsity of one block of the explicit and implicit transport matrices for two resolutions computed with the MITgcm.

Table 2. Parameters in the N-DOP model.

Table 3. Minimum, maximum, average and standard deviation of the computational time for one model year spent on the CPU and GPU. Shown are the results of 100 model years, each year timed separately. BGCStep block size: 160.

Table 4. The three main portions in every time step of the spin-up.

Table 5. Mean computational time within one model year and performance gains of the individual routines depicted in Table 4.