The present work aims at evaluating the scalability of a high-resolution global ocean biogeochemistry model (PELAGOS025) on massively parallel architectures and the resulting benefits in terms of time-to-solution reduction. PELAGOS025 is an on-line coupling between the Nucleus for the European Modelling of the Ocean (NEMO) physical ocean model and the Biogeochemical Flux Model (BFM). Both models use a parallel domain decomposition along the horizontal dimension, and the parallelisation is based on the message passing paradigm. The performance analysis has been carried out on two parallel architectures: an IBM BlueGene/Q at the Argonne Leadership Computing Facility (ALCF) and an IBM iDataPlex with Intel Sandy Bridge processors at the CMCC (Euro-Mediterranean Center on Climate Change). The outcome of the analysis demonstrates that the lack of scalability is due to several factors, such as the I/O operations, memory contention, the load imbalance caused by the memory structure of the BFM component and, on the BlueGene/Q, the absence of a hybrid parallelisation approach.

Nowadays, the study of climate change requires high-resolution simulations as
one of the possible strategies to reduce uncertainty in climate predictions.
In addition, the interaction of the physical components of the climate system
with Earth biogeochemistry and socio-economic aspects implies that multiple
dynamical models are coupled together in the so-called Earth system models

Community climate models have to be carefully analysed in order to
identify the scalability bottlenecks, which may differ across
architectures. Moreover, the implemented parallel approaches and
the available alternatives have to be investigated to select the best
strategy. Computational scientists have to decide whether the model must
be re-designed from scratch or whether it can be optimised to
exploit next-generation architectures. The performance could also be improved
by using optimised numerical libraries

As an example of this assessment of multi-component Earth system models, we
focused on an implementation that is likely to be standard in the next
generation of climate models. We considered two components that are usually
computationally demanding, the ocean physics and ocean biogeochemistry. As in
most cases, ocean biogeochemical models are tightly linked to the
ocean physics computational cores, as they share the same grid and numerical
schemes. In particular, the present work aims at analysing the computational
performance of the Nucleus for the European Modelling of the Ocean (NEMO)
oceanic model at 0.25

PELAGOS (PELAgic biogeochemistry for Global Ocean Simulations;

The coupling between NEMO and the BFM occurs at every time step and each
processing element (PE) resolves the integration of both physical and
biogeochemical model equations. The memory layout of the BFM can be defined
by construction as zero- or one-dimensional, and the latter is used for the
coupling with NEMO by considering only the ocean points of the model
subdomain. This implies that each BFM variable is a one-dimensional array,
with all the land points stripped out from the three-dimensional domain of
NEMO, and the remapping into the ocean grid is done only when dealing with
transport processes. This operation is done for every subdomain of the grid
decomposition. A thorough description of the NEMO–BFM coupling is given
in

NEMO uses a horizontal domain decomposition based on a pure MPI (message
passing interface) approach. Once the number of cores has been chosen, the
number of subdomains along the two horizontal directions (hereinafter
jpni and jpnj) are consequently defined. The numerical
discretisation used in NEMO is based on finite differences. According to this
method, the communication pattern among the parallel tasks is based on the
five-point cross stencil. The decomposition strategy that minimises the
communication overhead is therefore to select jpni and jpnj so that the
subdomains are as close to square as possible. However, when the
biogeochemical component is coupled, the number of ocean points in each
subdomain becomes a crucial factor, since the BFM, unlike NEMO, performs
its computation only on these points. A pre-processing tool has been
written to find the domain decomposition that minimises the number of
ocean points in the biggest subdomain. In addition, a NEMO feature allows
one to exclude the subdomains containing land points only.
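The logic of such a pre-processing step can be illustrated with a small sketch (the function, grid size, and mask below are illustrative, not the actual tool): for a given core count it enumerates the candidate jpni × jpnj factorisations, counts the ocean points in each subdomain, discards land-only subdomains, and keeps the decomposition whose most-loaded subdomain is smallest.

```python
import numpy as np

def best_decomposition(mask, ncores):
    """Among all jpni x jpnj grids with jpni * jpnj == ncores, pick the
    one whose most-loaded subdomain has the fewest ocean points.
    `mask` is a 2-D boolean array, True over ocean."""
    nj, ni = mask.shape
    best = None
    for jpni in range(1, ncores + 1):
        if ncores % jpni:
            continue
        jpnj = ncores // jpni
        if jpni > ni or jpnj > nj:
            continue
        # Ocean points in each subdomain of the jpni x jpnj block decomposition.
        i_edges = np.linspace(0, ni, jpni + 1).astype(int)
        j_edges = np.linspace(0, nj, jpnj + 1).astype(int)
        counts = [mask[j0:j1, i0:i1].sum()
                  for j0, j1 in zip(j_edges[:-1], j_edges[1:])
                  for i0, i1 in zip(i_edges[:-1], i_edges[1:])]
        counts = [c for c in counts if c > 0]   # drop land-only subdomains
        heaviest = int(max(counts))
        if best is None or heaviest < best[2]:
            best = (jpni, jpnj, heaviest, len(counts))
    return best  # (jpni, jpnj, max ocean points, active subdomains)

# Toy example: a 12 x 8 grid with two land columns.
mask = np.ones((8, 12), dtype=bool)
mask[:, 0:2] = False
print(best_decomposition(mask, 4))  # → (1, 4, 20, 4)
```

On this toy grid the strip decomposition wins because it splits the land columns evenly among all subdomains instead of concentrating them in one column of subdomains.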

The PELAGOS model was tested in this work at the highest available horizontal
resolution of 0.25

The analysis of the strong scalability of the code has been performed on two
architectures: the first is a BlueGene/Q (named VESTA), located at the
Argonne Leadership Computing Facility (ALCF/ANL); the second is the
ATHENA system, an iDataPlex equipped with Intel Sandy Bridge processors,
available at the CMCC (Euro-Mediterranean Center on Climate Change). The
activity has been conducted in collaboration with the ALCF/ANL. Details about
the systems are reported in Table

Example of PELAGOS025 decomposition on 54 subdomains. There are five subdomains with land points only (marked by an X). These subdomains are not included in the computation.

Architecture parameters of the BlueGene/Q (named VESTA), located at the Argonne Leadership Computing Facility (ALCF/ANL), and of the iDataPlex equipped with Intel Sandy Bridge processors (named ATHENA), located at the CMCC.

Table

Domain decompositions used for the experiments on the Sandy Bridge (ATHENA) and BG/Q (VESTA) architectures. The first two columns report the number of subdomains along the two horizontal directions; the third column shows the total number of processes, excluding the land-only ones. The next column indicates the number of nodes used to run the experiment, while the last columns show the average execution time, in seconds, for one time step of the simulation on both machines.

The performance analysis started from the evaluation of the parallel scalability. Two definitions of parallel scalability can be considered: the strong and weak scalability. The former is defined as the computational behaviour of the application when the number of computing elements increases for a fixed problem size; the latter describes how the execution time changes with the number of computing elements for a fixed grain size. This means that the computational work assigned to each processor is fixed, and hence the problem size grows with the number of processes. The weak scalability is relevant when a parallel architecture is used for solving problems with a variable size, and the main goal is to improve the solution accuracy rather than to reduce the time-to-solution. The strong scalability is relevant for applications with a fixed problem size, and hence the parallel architecture is used to reduce the time-to-solution. The PELAGOS025 coupled model can be considered a problem with a fixed size and the main goal is to use computational power to reduce the time-to-solution.
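Strong-scaling metrics of this kind can be computed directly from the measured per-time-step execution times; a minimal sketch (the timings are made up, not the paper's measurements):

```python
def strong_scaling(baseline_procs, baseline_time, runs):
    """Speedup and parallel efficiency relative to the smallest run.
    `runs` maps process count -> average time per model time step (s)."""
    out = {}
    for p, t in sorted(runs.items()):
        speedup = baseline_time / t
        ideal = p / baseline_procs           # ideal speedup for this count
        out[p] = (speedup, speedup / ideal)  # (speedup, efficiency)
    return out

# Illustrative per-time-step timings.
runs = {512: 4.0, 1024: 2.2, 2048: 1.4}
for p, (s, e) in strong_scaling(512, 4.0, runs).items():
    print(p, round(s, 2), round(e, 2))
```

Efficiency below 1 at the largest counts is exactly the signature of the bottlenecks analysed in the rest of the section.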

Scalability of the PELAGOS025 configuration: comparison between the results obtained on ATHENA and VESTA. The red line represents the speedup of the model on ATHENA, the blue line on VESTA. The dashed line represents the ideal speedup.

Scalability of the PELAGOS025 configuration: comparison between the results obtained on ATHENA and VESTA. The red line represents the execution time for a time step of the model on ATHENA, the blue line on VESTA.

Scalability of the PELAGOS025 configuration: comparison between the results obtained on ATHENA and VESTA. The red line represents the simulated years per day of the model on ATHENA, the blue line on VESTA.

The charts in Figs.

Figure

The MPI communication time decreases for two main reasons: the first one relates to the communication type that can be classified as a neighbourhood collective, which means that each process communicates only with its neighbours and no global communication happens, so the number of messages per core does not change when the number of processes increases; the second reason involves the amount of data exchanged between processes, which becomes smaller when the local subdomain shrinks.
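The second effect can be quantified with a back-of-the-envelope sketch: for a five-point stencil each process exchanges one halo row or column per side, so the exchanged volume scales with the subdomain perimeter rather than its area. The grid size, decompositions, and variable count below are illustrative:

```python
def halo_bytes_per_process(ni, nj, jpni, jpnj, nlev, nvars, word=8):
    """Bytes sent per process in one halo exchange for a five-point stencil:
    one halo row/column on each of the four sides of the local subdomain,
    over all vertical levels and exchanged variables."""
    li, lj = ni // jpni, nj // jpnj           # local subdomain size
    halo_points = 2 * (li + lj) * nlev        # four edges
    return halo_points * nvars * word

# Illustrative global grid (roughly 0.25-degree sized): 1442 x 1021 x 50.
for procs, (jpni, jpnj) in {512: (32, 16), 2048: (64, 32)}.items():
    print(procs, halo_bytes_per_process(1442, 1021, jpni, jpnj, 50, 1))
```

Quadrupling the process count halves each subdomain edge, so the per-process halo volume roughly halves, which is consistent with the observed decrease in communication time.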

The optimisation process of a code requires the analysis of the bottlenecks
that limit the scalability. The investigation methodology used in the present
work is based on the analysis at the routine level. Two different reference
decompositions have been taken into account and the execution times of the
main routines for the two decompositions have been analysed in order to
evaluate the speedup of each single routine. The

As with many codes in this domain, NEMO has a broad, flat execution profile
with no single routine accounting for more than 20

MPI communication time for two configurations for each architecture: with 1344 and 2048 processes on ATHENA and with 2048 and 4096 processes on VESTA.

Code profiling on BG/Q and Intel Sandy Bridge. The data have been
taken with the

Name and description of the routines selected during the code profiling analyses. The routines identified as belonging to the BFM originate from NEMO but have been modified to handle the BFM memory structure.

Some of the most time-consuming routines are related to the advection
(

In addition, on the BG/Q machine, an in-depth analysis using the Hardware
Performance Monitoring (HPM,

Code profiling by applying the HPM (Hardware Performance Monitoring) tool on the BG/Q cluster. The performance values do not include the start-up or the I/O operations. The first column reports the measured parameters, while the other ones show the values on two reference decompositions, respectively, on 2048 and 4096 cores.

The profiling at routine level helps to discover the model bottlenecks. The
code profiling has been performed with 2048 and 4096 cores. The most
time-consuming routines have been selected in both cases.
Figure

Analysis of scalability of the main routines on the BG/Q cluster in terms of speedup. The speedup is evaluated as the ratio between the execution time on 2048 and 4096 cores. Hence the ideal scalability is reached with speedup equal to 2. The red circles indicate the routines whose speedup is far from the ideal value.

Table

The analysis of routine scalability on the iDataPlex architecture has been
performed on two other reference decompositions respectively on 1344 and
2048 cores. Figure

Analysis of scalability of the main routines on the iDataPlex cluster in terms of speedup. The speedup is evaluated as the ratio between the execution time on 1344 and 2048 cores. Hence the ideal scalability is reached with a speedup equal to 1.52. The red circles indicate the routines whose speedup is far from the ideal value. The data, in this case, have been taken using the NEMO profiling support which provides information at a higher level.

In this section we analyse in depth the differences between the data structures adopted in NEMO and in the BFM, and we evaluate which one is preferable. NEMO uses a three-dimensional matrix data structure. Each matrix also includes the points over land; this is the natural implementation of subdomains defined as regular meshes by the finite difference numerical scheme. Even if this data structure introduces some overhead due to the computation and storage of the land points, it preserves the topology required by the numerical scheme. The finite difference scheme requires each point to be updated considering its six neighbours, which establishes a topological relationship among the points of the domain. With a three-dimensional matrix, this relationship is maintained, and the topological position of a point in the domain can be derived directly from its three indices in the matrix. Changing this data structure would require additional information to represent the topology, with a negative impact on performance due to indirect memory references, the introduction of cache misses and a reduction of the loop vectorisation level.

The BFM instead uses a one-dimensional array data structure with all the land
points stripped out from the three-dimensional
domain. The BFM is zero-dimensional by construction, so the new value of
a state variable at a point depends only on the other state variables at the
same point, and no relationship among the points is needed. The transport
of the pelagic variables is delegated to NEMO, and
this requires a remapping from a one-dimensional to a three-dimensional data
structure and vice versa at each coupling step. In this section we
evaluate whether the adoption of the three-dimensional matrix data
structure for the BFM can improve the performance of the whole model. Three
main aspects will be evaluated: the number of floating point operations, the
load balancing and the main memory allocation. The evaluation has been
conducted by choosing numbers of processes that make each subdomain of the
PELAGOS025 configuration exactly square.
Figure
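The remapping between the two layouts amounts to packing and unpacking a three-dimensional field through an ocean mask; a minimal sketch (the mask and field are synthetic, with the boolean mask playing the role of NEMO's land-sea mask):

```python
import numpy as np

# Synthetic land-sea mask: True over ocean.
nk, nj, ni = 3, 4, 5
rng = np.random.default_rng(0)
mask = rng.random((nk, nj, ni)) > 0.3

def pack(field3d, mask):
    """3-D NEMO field -> 1-D BFM array holding ocean points only."""
    return field3d[mask]

def unpack(field1d, mask, fill=0.0):
    """1-D BFM array -> 3-D field for the transport step; land points filled."""
    out = np.full(mask.shape, fill)
    out[mask] = field1d
    return out

field = rng.random((nk, nj, ni))
ocean = pack(field, mask)               # one value per ocean point
back = unpack(ocean, mask)
assert ocean.ndim == 1 and ocean.size == mask.sum()
assert np.allclose(back[mask], field[mask])   # ocean values survive the round trip
```

The pack step is what saves memory and floating point work in the BFM; the unpack step is the price paid once per coupling step, when the transport terms are computed on the full grid.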

The number of floating point operations is directly proportional to the
number of points included in the subdomain. Since a parallel application is
driven by the most loaded process in the pool, we will evaluate how the
number of points changes at different decompositions for the process with the
biggest domain considering the two data structures. Figure

Relationship between the number of processes along the

Ratio between the number of floating point operations when using the three-dimensional and one-dimensional data structures.

The load balancing is measured by evaluating how many points are assigned to
each process. Optimal load balancing is reached when each process
handles the same number of points. Even if some alternative and efficient
balancing approaches have been proposed for a multi-core-aware partitioning
of the NEMO domain
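The imbalance overhead can be estimated from the per-process ocean-point counts; a minimal sketch, assuming the overhead is defined as (max − mean)/max, i.e. the fraction of the most-loaded process's work spent waiting for the others (the counts below are illustrative):

```python
def imbalance_overhead(points_per_process):
    """Estimated fraction of time the most-loaded process would save
    under perfect balancing: (max - mean) / max of the point counts."""
    mx = max(points_per_process)
    avg = sum(points_per_process) / len(points_per_process)
    return (mx - avg) / mx

# Illustrative ocean-point counts for four subdomains.
print(imbalance_overhead([1200, 900, 1000, 500]))  # → 0.25
```

A value of 0.25 means the heaviest process does 25 % more work than the average, so up to a quarter of its time is lost to waiting at synchronisation points.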

Load balancing for one-dimensional

Histogram of the ocean point distribution using the one-dimensional data structure for two different configurations, with 1026 and 9882 processes.

The BFM is quite sensitive to the amount of allocated memory since it handles
tens of state variables. For simulations at high resolution the memory could
be a limiting factor. Figure

To conclude, the one-dimensional data structure performs better than or, in the
worst case, equal to the three-dimensional one in terms of floating point
operations. Moreover, the one-dimensional data structure requires the minimum
amount of memory, since it stores only the ocean points, while the
three-dimensional approach increases the memory footprint by a large
factor, making high-resolution simulations
prohibitive. Finally, even if the workload is not
balanced, a better balance is not obtained by using the
three-dimensional data structure. An ad hoc policy to redistribute the ocean
points among the processes could ideally bring a performance improvement of
more than 30

Load balancing when adopting the three-dimensional or
one-dimensional data structure. The first column reports the number of
processes, followed by the dimensions of the biggest domain. The Max and
Avg columns report the maximum number of grid points (i.e. the number of
grid points in the biggest domain) and the average value over all the
domains. The Unbal. columns give an estimate of the overhead due to the
imbalance. It is computed as (Max

Amount of memory allocated using three- and one-dimensional data structures. The values refer to the minimum amount of memory allocated in a sequential run.

The presence of the BFM component in the coupled model produces a workload imbalance due to the different numbers of ocean points assigned to the processes. We have already stated that a better load balancing policy would notably improve the performance; nevertheless, an optimal mapping of the processes onto the computing nodes can bring a slight improvement without changing the application code. The load imbalance affects both the number of floating point operations and the amount of memory allocated by each process. The local resource manager of a parallel cluster (LSF – Load Sharing Facility, PBS – Portable Batch System, etc.) typically maps the processes of a parallel application onto the cores of each computing node without any specific criterion, simply following the cardinal order of the MPI ranks. This generates an unbalanced allocation of memory across the nodes: some nodes can saturate their main memory while others use only a small part of it. The amount of allocated memory is also an indirect measure of the memory accesses, as the larger the allocated memory, the higher the number of memory accesses. On nodes whose memory is fully allocated, the contention among processes degrades the overall performance. A fairer distribution of processes over the computing nodes can better balance the allocated memory, reducing the memory contention.

In this section we describe a mathematical model to estimate the amount of memory required by each process. The memory model can be used to choose an optimal domain decomposition (i.e. a decomposition such that the memory footprint of the heaviest process is minimal), or to evenly map the processes onto the computing nodes using the amount of memory per node as the criterion.
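A memory-aware placement of processes onto nodes, as opposed to rank-order placement, can be sketched as a greedy longest-processing-time heuristic (the function and the per-rank memory figures are illustrative, not the paper's method):

```python
import heapq

def memory_aware_mapping(mem_per_rank, nodes, cores_per_node):
    """Assign MPI ranks to nodes so that per-node memory is balanced:
    place the heaviest remaining rank on the least-loaded node that
    still has a free core (greedy LPT heuristic)."""
    # Min-heap of (allocated memory, free cores, node id).
    heap = [(0.0, cores_per_node, n) for n in range(nodes)]
    heapq.heapify(heap)
    placement = {}
    for rank in sorted(range(len(mem_per_rank)),
                       key=lambda r: mem_per_rank[r], reverse=True):
        mem, free, node = heapq.heappop(heap)
        placement[rank] = node
        if free > 1:
            heapq.heappush(heap, (mem + mem_per_rank[rank], free - 1, node))
    return placement

# Illustrative per-rank memory estimates (GB): 8 ranks, 2 four-core nodes.
mem = [3.0, 1.0, 2.5, 0.5, 2.0, 1.5, 1.0, 0.5]
mapping = memory_aware_mapping(mem, 2, 4)
loads = [sum(mem[r] for r, n in mapping.items() if n == node) for node in (0, 1)]
print(loads)  # → [6.0, 6.0]
```

With rank-order placement the first node would receive 7.0 GB and the second 5.0 GB; the greedy heuristic evens the two nodes out, which is the effect the memory model below is meant to enable.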

The model was built considering the peculiarities of the data structures used in NEMO and the BFM as discussed in the previous section. In general, the memory allocated by each process is given by a term directly proportional to the subdomain size (according to the data allocated in NEMO), a term directly proportional to the number of ocean points in the subdomain (according to the data allocated in BFM) and a constant quantity of memory related to the scalar variables and the data needed for parallel processes management.

The memory model can be formalised by the following equation:
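Based on the three contributions just described, the model can be written as follows (the symbols are illustrative, chosen here for exposition):

```latex
% M_i : memory allocated by process i
% S_i : size of subdomain i (grid points, land included) -> NEMO term
% O_i : number of ocean points in subdomain i            -> BFM term
% \gamma : constant term (scalars, parallel management data)
M_i = \alpha\, S_i + \beta\, O_i + \gamma
```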

The test configuration used to evaluate the coefficients was executed on 672
processes and, for each process, the total amount of allocated memory was
measured. The
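The coefficients can be estimated from such a run by an ordinary least-squares fit of the measured memory against the subdomain size and ocean-point count of each process; a sketch with synthetic data standing in for the 672 measured processes (the coefficient values are invented):

```python
import numpy as np

# Synthetic calibration data: one row per process with
# (subdomain size S, ocean points O, measured memory M in GB).
rng = np.random.default_rng(1)
S = rng.integers(2000, 4000, size=672).astype(float)
O = (S * rng.uniform(0.3, 0.9, size=672)).round()
true_alpha, true_beta, true_gamma = 1.2e-4, 5.0e-4, 0.15   # invented values
M = true_alpha * S + true_beta * O + true_gamma

# Least-squares fit of M ~ alpha*S + beta*O + gamma.
A = np.column_stack([S, O, np.ones_like(S)])
(alpha, beta, gamma), *_ = np.linalg.lstsq(A, M, rcond=None)
print(alpha, beta, gamma)
```

With real measurements the residuals of this fit give exactly the RMSE figures reported in the accuracy evaluation below.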

The relationship between the number of ocean points belonging to a subdomain and the memory footprint needed to process that subdomain. The chart shows the data extracted from a reference run on 672 processes (hence 672 subdomains) on the ATHENA cluster. The data have been used to evaluate the memory model coefficients.

Table

Estimation of the memory model coefficients. The evaluation has been
experimentally performed considering a decomposition made up of

Comparison between the memory model trend (red line) and the
experimental values (blue line) for a reference configuration on
160 processes. The decomposition is made up of

Evaluation of the memory model accuracy. The first column reports the examined decompositions, the last one shows the root mean square error (RMSE), expressed in gigabytes, while the second one shows the relative RMSE, i.e. the root mean square error divided by the average of the examined sample.

Estimation of the memory footprint using the memory model for an increasing number of processes. The red and blue lines respectively indicate the maximum and minimum allocated memory among the processes involved.

Figure

The present work aimed at analysing the computational performance of the
PELAGOS coupled model at 0.25

The I/O management. Before starting the scalability analysis, some tests were performed on the two architectures using the model complete with all of its features. The management of I/O becomes inefficient as the number of processes increases, because the number of files read and written is proportional to the number of processes. On the one hand this allows the I/O operations to be parallelised (each process can read/write its inputs/outputs independently); on the other hand, the I/O management becomes prohibitive with thousands of processes. For this reason, the I/O has been omitted from the performance analysis, which focuses only on the computational aspects. In the future, the adoption of a more efficient I/O strategy will be necessary (e.g. the use of the XIOS tool for I/O management).

The memory usage balancing. The presence of the BFM component introduces a load imbalance due to the different numbers of ocean points belonging to each subdomain. Since the memory allocated by each process is related to the number of ocean points, a strategy that balances the memory allocated on each node would improve the performance. In this context, some strategies for mapping the processes onto the physical cores could be taken into account.

The communication overhead. PELAGOS is based on a pure MPI parallelisation. When the number of processes increases, the ratio between computation and communication decreases; beyond a certain limit, the communication overhead becomes unsustainable. Possible solutions are to parallelise along the vertical direction or to overlap communication with computation. A hybrid parallelisation strategy could also be considered, for example adding OpenMP to MPI; this would allow a better exploitation of many-core architectures. Moreover, a further level of parallelism over the state variables handled by the BFM could be introduced.

The work has also demonstrated that the one-dimensional data structure used
in the BFM does not degrade the performance compared with the
three-dimensional data structure used in NEMO. The workload in the BFM is
unbalanced, since the global domain is divided among the processes following
a block decomposition that does not take into account the number of ocean points
falling in each subdomain. The adoption of a smarter domain decomposition, e.g.
one based on the number of ocean points, could lead to a significant improvement
in the performance at lower process counts. When the number of processing
elements is greater than 1024, the difference between the two strategies is
negligible (see Fig.

Finally, the current version of PELAGOS025 is still far from being ready to scale on many-core architectures. A constructive collaboration between computational scientists and application domain scientists is a key step toward substantial improvements and the full exploitation of next-generation computing systems.

The PELAGOS025 software is based on NEMO v3.4 and BFM v5.0, both available
for download from the respective distribution sites
(

The authors thankfully acknowledge the computer resources, technical expertise and assistance provided by the Argonne Leadership Computing Facilities, namely Paul Messina. The authors acknowledge the NEMO and BFM consortia for the use of the NEMO system and the BFM system.

This work was partially funded by the EU Seventh Framework Programme within the IS-ENES project (grant number 228203) and by the Italian Ministry of Education, University and Research (MIUR) with the GEMINA project (grant number DD 232/2011). Edited by: A. Ridgwell