Towards convection-resolving , global atmospheric simulations with the Model for Prediction Across Scales ( MPAS ) v 3 . 1 : an extreme scaling experiment

The Model for Prediction Across Scales (MPAS) is a novel set of Earth system simulation components and consists of an atmospheric model, an ocean model and a land-ice model. Its distinct features are the use of unstructured Voronoi meshes and C-grid discretisation to address shortcomings of global models on regular grids and the use of limited area models nested in a forcing data set, with respect to parallel scalability, numerical accuracy and physical consistency. This concept allows one to include the feedback of regional land use information on weather and climate at local and global scales in a consistent way, which is impossible to achieve with traditional limited area modelling approaches. Here, we present an in-depth evaluation of MPAS with regards to technical aspects of performing model runs and scalability for three medium-size meshes on four different high-performance computing (HPC) sites with different architectures and compilers. We uncover model limitations and identify new aspects for the model optimisation that are introduced by the use of unstructured Voronoi meshes. We further demonstrate the model performance of MPAS in terms of its capability to reproduce the dynamics of the West African monsoon (WAM) and its associated precipitation in a pilot study. Constrained by available computational resources, we compare 11-month runs for two meshes with observations and a reference simulation from the Weather Research and Forecasting (WRF) model. We show that MPAS can reproduce the atmospheric dynamics on global and local scales in this experiment, but identify a precipitation excess for the West African region. Finally, we conduct extreme scaling tests on a global 3 km mesh with more than 65 million horizontal grid cells on up to half a million cores. We discuss necessary modifications of the model code to improve its parallel performance in general and specific to the HPC environment. We confirm good scaling (70 % parallel efficiency or better) of the MPAS model and provide numbers on the computational requirements for experiments with the 3 km mesh. In doing so, we show that global, convection-resolving atmospheric simulations with MPAS are within reach of current and next generations of high-end computing facilities.


Introduction
The weather-and climate-modelling community is currently seeing a shift in paradigm from limited area models towards novel approaches involving global, complex and irregular meshes.Yet, regional models are commonly used in numerical weather prediction and to study past, current and future climate at high spatial and temporal resolution over areas of specific interest.A wealth of such models exists currently, which differ in their discretisation of the computational grid, their implementation of the numerical solvers, their parameterisation of physical processes and most notably in their Published by Copernicus Publications on behalf of the European Geosciences Union.

D. Heinzeller et al.: MPAS: an extreme scaling experiment
simulation results (e.g.Smiatek et al., 2009;Nikulin et al., 2012).Despite these differences, these models share the common principle of nested modelling: regional climate information is generated by supplying a set of initial conditions as well as time-varying lateral boundary conditions (LBCs; large-scale atmospheric fields such as wind, temperature, geopotential height and hydrometeors) and lower boundary conditions (sea surface temperature, sea ice) to the regional model.The idea behind this approach is that the LBCs keep the regional climate model (RCM) solution consistent with the forcing atmospheric circulation, while small-scale patterns are generated with higher accuracy due to the increase in temporal and spatial resolution.Sub-grid scale processes in the RCM are included through parameterisations, which can be entirely different from those of the forcing global circulation model (GCM).
Supplying lateral boundary conditions to nested models can cause severe problems, up to the point where the RCM solution becomes inconsistent with the forcing data (Davies, 1983;Warner et al., 1997;Harris and Durran, 2010;Park et al., 2014).Starting off as an initial-value problem, the RCM solution gradually becomes a boundary value problem, which from a mathematical point of view represents a fundamentally ill-posed boundary value problem (Staniforth, 1997;Laprise, 2003).This is less of an issue in the context of numerical weather prediction (NWP), where typical model run times are 3 to 15 days, than in seasonal forecasting (weeks to months) and in regional climate modelling (years to centuries).
A common problem for both short-and long-term forecasting is that the solution of the RCM seems to vary with the size of the computational domain, as well as location and season (Caya and Biner, 2004;Leduc and Laprise, 2008;Caron, 2013).Several authors have shown that nudging techniques (grid or spectral nudging) towards the large-scale features of the forcing model can reduce these adverse effects, but that they can also hide model biases (von Storch et al., 2000;Miguez-Macho et al., 2004).
Further, the technique of grid nesting introduces discontinuities in spatial resolution between the regional model and the coarser-grid driving model, as well as between the nests within the regional model itself.For a typical refinement ratio of 3, two-thirds of the spatial wave-number spectrum present in the fine mesh are absent in the coarse mesh (Park et al., 2014).This implies that (a) these features must be spunup for inflows into the higher-resolution domain, and that (b) these wave numbers are reflected at the domain boundary for outflows from the high-resolution domain to the coarser domain.To address the latter issue, filters that are efficient over a large range of wave numbers are required.The temporal interpolation required by nesting can introduce further numerical artefacts, in particular when interpolating forcing LBCs, usually available at 3-6 h time steps, to the model integration time step of the high-resolution domain (typically 6 s per 1 km grid size).
One obvious solution is to avoid using LBCs and nesting by running a global model at the resolution required for the area of interest.This, however, is prohibitively expensive or simply not feasible, even on the latest generations of supercomputers.An intermediate approach therefore is to run a global model at a moderate resolution and use a smooth mesh transition on a variable-resolution grid, where filters are efficient at the local scale of the corresponding grid cell (Ringler et al., 2011).Beside the here-discussed Model for Prediction Across Scales (MPAS)1 model, few other recent developments such as ICON (icosahedral non-hydrostatic model; Zaengel et al., 2015) adopt this strategy.Applying such models for mid-and long-term regional climate simulations has only recently become possible and requires substantial computational power.
The MPAS is a recent numerical modelling framework that includes an atmospheric model MPAS-A (Skamarock et al., 2012), an ocean model MPAS-O (Ringler et al., 2013) and a land-ice model MPAS-LI.The MPAS model is a collaborative project, led by the National Center for Atmospheric Research (NCAR) and the Los Alamos National Laboratory (LANL).The three components of MPAS in principle form a so-called Earth system model (ESM), but a coupler between them is not yet available.A common feature of the three MPAS constituents are unstructured, centroidal Voronoi meshes (spherical centroidal Voronoi tessellations, SCVTs; Du et al., 2003), which allow for the generation of global, irregular, variable-resolution meshes with smooth transitions between areas of different refinement.
The atmospheric model MPAS-A is a global, fully compressible non-hydrostatic model using finite-volume numerics.Based on the Voronoi mesh, the model uses a C-grid staggering for the state variables (i.e.wind components are modelled at the faces of every cell, and the prognosed component of the wind is orthogonal to the cell face) as described in Thuburn et al. (2009) and Ringler et al. (2010) (see Fig. 1, upper panels, for illustration).The governing equations can then be cast in a way such that energy, momentum and water vapour content are conserved (more precisely: potential temperature, mass and scalar mass; see Skamarock et al., 2012, Sect. 2).The MPAS-A model builds on existing, well-established techniques of the Advanced Research Weather Research and Forecasting model (WRF-ARW; Skamarock et al., 2008), for example the split-explicit time integration scheme for the treatment of gravity waves and horizontally propagating acoustic waves.It also contains a subset of WRF's physics parameterisations that are suitable for climate modelling purposes.While WRF uses a terrainfollowing hydrostatic pressure coordinate for the vertical discretisation, MPAS employs a height-based terrain-following vertical coordinate.The latter discretisation reduces artificial circulations caused by inaccuracies in the horizontal pressure gradient term (Klemp, 2011).It should be noted here that because MPAS was developed out of a regional climate modelling context, the typical model tops in MPAS-A are around 30 to 42 km, while current Earth system models work with model tops of up to 100 km.At the time of writing, work is underway to increase the model tops in MPAS-A for future applications as an Earth system modelling component.
Until recently, advances in computational power following Moore's law were mainly driven by transistor speed and energy scaling, as well as by micro-architecture advances.Physical limitations and practical energy concerns will create new challenges for continued performance scaling in the coming decades.In consequence, large-scale parallelism and the use of accelerators will be required to achieve performance and energy efficiency (Borkar and Chien, 2011).Hence, a key aspect of modern numerical codes is their ability to scale on massively parallel systems.The quasiuniform centroidal Voronoi meshes used by MPAS are similar to icosahedral (hexagonal) meshes and can provide nearly uniform resolution over the globe, as opposed to latitudelongitude grids that require polar filtering to overcome the issue of converging grid lines at the poles.Grids requiring polar filtering or spherical transform methods do not scale very well on massively parallel systems (Skamarock et al., 2012).With MPAS, an efficient parallelisation can be achieved by aligning all grid cells in a 1-D array, with the vertical coordinate stacked on top as the second dimension (MacDonald et al., 2011).Good scaling has been achieved in early weak and strong scaling tests.However, a thorough investigation of the scalability of MPAS on parallel and massively parallel systems has not yet been conducted.
In this study, we investigate the performance of MPAS for different problem sizes on four high-performance computing (HPC) facilities in Europe.For each problem, strong scaling tests are conducted on all four platforms, which cover a range of different architectures to reflect the large variety of computational systems available for research.Additionally, we conduct extreme scaling tests using a 3 km global mesh to study the scalability of MPAS up to nearly half a million tasks and to demonstrate that global, convectionresolving simulations are becoming possible.We explore the limits of the MPAS model when its parallel efficiency breaks down and identify opportunities for improvement.We further derive estimates on the feasibility to conduct longer runs at convection-resolving resolution on current HPC facilities.
We also assess the quality of the MPAS model output in terms of its accuracy for climate modelling.We have chosen to study the particular problem of reproducing the characteristics of the West African monsoon (WAM).The WAM is the most prominent feature of the West African climate and accounts for the majority of the annual precipitation.Differential heating of the ocean and the land surface cause a seasonal change of the large-scale wind systems during boreal summer, which results in the migration of the inter-tropical convergence zone (ITCZ) and the associated rain band northwards over the West African continent.The WAM is driven by a complex and not yet fully understood interplay of various dynamical features (e.g.Sultan et al., 2003;Grist and Nicholson, 2001;Nicholson and Webster, 2007).GCMs often fail to reproduce this annual movement of the ITCZ due to their limited temporal and spatial resolution (e.g.Hourdin et al., 2010;Sylla et al., 2010).Despite their deficiencies discussed above, regional climate models can improve the representation of precipitation in comparison to their forcing data set (Nikulin et al., 2012;Klein et al., 2015).Variableresolution meshes allow for resolving the region of interest (greater West Africa in this case) at high resolution, while keeping the model aligned with large-scale features outside of this area.It is hoped that this will lead to an improvement of the representation of the WAM.It also opens up the possibility to study processes such as the teleconnection between the Indian monsoon and the WAM (Rodwell and Hoskins, 1996), or the impact of land use changes on weather and climate in a consistent approach.
The paper is organised as follows: in Sect.2, we introduce the HPC facilities used for this study, provide details about the MPAS-A code and present the scaling experiments with moderate problem sizes.We continue in Sect. 3 with an analysis of the physical accuracy of the MPAS model in comparison to observational data and data from our own regional climate modelling experiments.Section 4 is devoted to the extreme scaling tests, and Sect. 5 summarises our findings and gives an outlook on future modelling activities.Lastly, Sect.H provides information on how to access the model code and the test cases presented in Sects. 2 and 3.

Scaling experiments for moderate problem sizes
We perform strong scaling experiments for three different meshes and on four HPC facilities in Europe.The problem sizes range from a regular 120 km mesh with 40 962 cells as the smallest problem to a large, variable-resolution 60-12 km mesh with 535 554 grid cells.An intermediate test case with a variable-resolution 100-25 km mesh with 163 842 grid cells is studied as well.

HPC facilities
Two of the four HPC systems, the Très Grand Centre de Calcul (TGCC) Curie and the Forschungszentrum Jülich (FZJ) Juqueen, belong to the largest machines in Europe and are part of the PRACE Tier-0 pool2 .The technical specifications for each of the systems are summarised in Table 1, a brief summary is given in the following.TGCC Curie:3 the TGCC Curie went into service in 2012 and consists of 360 fat nodes and 16 hybrid nodes, not used in this study, and 5040 thin nodes with 2 eight-core Intel Sandy Bridge CPUs.An InfiniBand QDR network is used for both the compute network and the I/O (input and output) to the global LUSTRE file system.With three different node types, Curie addresses a wide range of scientific challenges.
FZJ Juqueen:4 the FZJ Juqueen is an IBM Blue Gene/Q system and was installed in 2012/13.It hosts 28 racks with 1024 nodes per rack and 16 cores per node, which totals to 458 752 physical cores.Simultaneous multi-threading is supported by the hardware, but not used in this study due to the lack of threading in MPAS-A (see below).A 5-D Torus interconnect is used as compute network, while I/O is redirected to dedicated I/O nodes using a 10 Gb Ethernet to connect to the General Parallel File System (GPFS) file system.With a large number of relatively slow CPUs and a small memory per core, Juqueen most resembles the future massively parallel systems described above.As such, porting and scaling experiments of numerical codes onto this architecture are as challenging as instructive for future applications.
SCC ForHLR1:5 the ForHLR1 is the most recent addition to the SCC's HPC systems, installed in September 2014.It hosts different types of nodes to cater for the various needs of the modelling community: the workhorse for parallel applications, and used in this study are 512 thin nodes with 2 10-core Intel Ivy Bridge CPUs, connected via an Infiniband FDR (fourteen data rate) interconnect.A central LUSTRE filesystem is attached to the nodes, using the same Infiniband interconnect for I/O as for the compute network.
FZJ Juropatest:6 the FZJ Juropatest cluster is a small but cutting-edge prototype system and consists of 70 T-Platform V210s blades with 2 14-core Intel Haswell processors.The Message Passing Interface (MPI) communication is realised over Infiniband FDR, and the I/O to the central GPFS file system is routed via 10 Gb Ethernet.While optimising the MPAS model for the Haswell features and instruction sets is beyond the scope of this study, it will become an inevitable step in future model development and tuning.
On Juropatest, we conduct two sets of runs for each of the test cases: for the first set (Jtest-half in the following), we use only one of the two available 14-core CPUs in each node, which implies a similar number of cores per node for Curie and Juropatest or, in other words, a similar number of nodes for the same total number of tasks.In this configuration, each task is bound to one core on the node.For the second set (Jtest-full in the following), we use both CPUs, i.e. 28 cores on each node to exploit the capabilities of the Juropatest system and the possible memory bandwidth limitations of MPAS-A.

MPAS-A code
For the strong-scaling studies in this paper, we use MPAS-A v3.1, released on 24 November 2014.This release of the model employs a horizontal domain decomposition for parallel execution, and parallelisation is implemented using MPI only; in this version of the code, no threading is used.The MPAS code is written almost entirely in Fortran 2003, with a few minor parts written in C. The MPAS build system uses only the make utility, with settings for different compilers and architectures described as different targets in the top-level Makefile; see Appendix A for the compiler flags used in this study.
The optimal parallelisation and distribution of the cells of the Voronoi mesh for a given number of tasks is treated as a graph partitioning problem.The dual mesh of a Voronoi tessellation is a Delaunay triangulation, which immediately provides the connectivity graph for the primal (i.e.Voronoi) cells in the mesh.In MPAS, the graph partitioning is computed as a separate pre-processing step for which the METIS software is used 7 .An optimal partitioning distributes equal work (by proxy, the number of cells) to each task while minimising the edge cut (assumed to model the communication between tasks).METIS uses a multilevel k-way partitioning scheme, which produces partitions of comparable quality to traditional multilevel bisection algorithms and is about 2 orders of magnitude faster (Karypis and Kumar, 1998).The resulting graph partitioning can be critical for the model performance due to, for example, a large overhead of communication and computational imbalances between the individual partitions.
At start-up, the MPAS-A model reads a file that assigns Voronoi cells to each of the MPI tasks according to a partitioning produced by METIS.The set of cells assigned to an MPI task is referred to as a block, and the cells in this assignment are referred to as the owned cells.The dynamical solver in MPAS-A requires stencils of cells in order to apply various operators, and as part of the model start-up, referred to internally as the bootstrapping process, a pre-determined number of layers of halo cells (sometimes referred to as ghost cells in other modelling systems) are added around each block.
7 http://glaros.dtc.umn.edu/gkhome/views/metisAlthough the number of halo cells can vary between different MPAS models, as illustrated in Fig. 1 (lower left panel), MPAS-A adds two layers of halo cells around each block of cells.A lower bound for the number of halo cells N h for a given number of owned cells N o can be estimated as At points in the MPAS-A dynamical solver where current values of fields in halo cells are required, values are communicated between tasks, from owned cells to ghost cells, with point-to-point MPI communications.An important aspect and common bottleneck in numerical weather prediction and climate modelling is disk I/O, since large 4-D fields such as temperature, geopotential height or wind components need to be written to the disk frequently.In MPAS v3.1, I/O is facilitated by the parallel I/O library (PIO) v1.7.1, a wrapper with an easy-to-use application programme interface (API) that encapsulates the complexity of parallel I/O for a number of supported formats: binary, serial NetCDF8 , Parallel-NetCDF9 and recently (since v.1.9.14) parallel NetCDF-4 through PHDF5 10 (Dennis et al.,  2013).PIO is compiled without further optimisation (standard settings) on all four machines.The HDF5, NetCDF and Parallel-NetCDF libraries are provided as modules on all four systems.
Unless stated otherwise, all experiments are conducted with double precision floating point precision, 41 atmospheric levels, 4 soil levels, a full suite of physics and dynamics (see Appendix B for details), and standard disk I/O.Each experiment is run for 24 h model time, during which an initial conditions file is read (init.nc),diagnostic output files are written every 3 h (diag.nc),and a final restart file and a comprehensive output file are written at the end (restart.nc,output.nc).The model integration time step depends on the grid resolution and is mentioned in the individual sections below.Note that for variable-resolution meshes, the global time step is determined by the smallest grid size.By default, all tasks participate in the parallel I/O.Each experiment is repeated once or twice, depending on how close the measured run times are, to account for fluctuations of single experiments.

Regular 120 km grid
The first and smallest test case consists of a global, regular 120 km mesh with 40 962 grid cells, which is roughly comparable in resolution to a 284×142 (latitude-longitude) grid.It is thus in the range of current Earth system models.A model integration time step of 150 s is adopted.For a resolution of 120 km, this is an extremely conservative setting (1.25 s per km grid size) for MPAS-A, which implies 576 integration time steps for a 24 h test run, compared to 120-144 integration time steps for typical values of 5-6 s per km grid size.This increases the time spent for the actual parallel integration relative to that for model initialisation and disk I/O.The decomposition of the 120 km grid for 64 tasks is illustrated as an example in Fig. 1 (lower right panel), while Fig. 2 (left panel) shows the scaling plots on the four HPC facilities described above.For an easier comparison of the scalability of the different test cases, the scaling is displayed as a parallel efficiency (i.e. the ratio of real scaling and ideal scaling) vs. the number of tasks (bottom horizontal axis) and number of cells owned per task (top horizontal axis).Table D1 provides further details about the scaling, whereas Table H1 summarises the size of the files to be read and written during one model run.
Previous scaling tests on the NCAR-Wyoming Supercomputing Center's (NWSC) Yellowstone machine 11 suggest that for regular meshes, the parallel efficiency of MPAS-A is correlated with the number of cells owned per task.Considering the time required for the solver only, i.e. neglecting the set-up costs and the disk I/O, a parallel efficiency of close to 70 % is obtained for more than 160 cells per task.Here, we include the set-up costs (bootstrapping and reading of initial conditions file) and the output to the disk in the scaling to emphasise the importance of all aspects of the systemfrom filesystem performance to compute performance to the speed of the interconnect -and to estimate the necessary resource requirements.It should be noted that this can have a negative and noticeable influence on the parallel efficiency, depending on the performance of the parallel I/O operations and the ratio of the time spent for the set-up of the model and the actual time integration.Hence, the threshold of 160 cells owned per task for the breakdown of the parallel efficiency should be considered as a lower limit.In Sect.2.7, we pro-11 https://www2.cisl.ucar.edu/resources/yellowstonevide a detailed analysis of the costs of the individual steps for the Jtest-full system.
On the three Linux-cluster type systems, test runs are conducted on single nodes up to 32 (Curie), 20 (ForHLR1), 15 (Jtest-full) and 30 (Jtest-half) nodes.In the following, we refer to good scaling as a parallel efficiency of ≈ 70 % or more.Good scaling is achieved up to 6 nodes on Curie (96 tasks, or 427 grid cells per task), 8 nodes on ForHLR1 (160 tasks, or 256 grid cells per task), 8 nodes on Jtest-full (224 tasks, or 183 grid cells per task) and 15 nodes on Jtest-half (210 tasks, or 195 grid cells per task).Comparing Curie and Jtest-half, it is evident that (a) a single Haswell CPU with 14 cores outperforms two Sandy Bridge CPUs with 2 × 8 cores on the same board, and that (b) the parallel efficiency decreases faster with the number of nodes on Curie.This is probably related to the interconnect: while Juropatest (as well as Yellowstone) uses Infiniband FDR Full Fat Tree technology (FDR, theoretical effective, aggregated throughput 56 Gb s −1 = 7 GB s −1 ) for the inter-process communication (MPI) and a separate 10 Gb Ethernet connection for I/O operations, Curie uses QDR Full Fat Tree technology (quad data rate, theoretical effective, aggregated throughput 32 Gb s −1 = 4 GB s −1 ) for the inter-process communication and for I/O operations.The transition zone for the breakdown of the parallel efficiency between 600 and 150 owned cells per task is indicated as shaded blue area in Fig. 2. A comparison of the absolute run times on Jtest-half and Jtest-full shows that runs with 28 cores per node are 5-15 % slower than runs with 14 cores per node, which is presumably due to memory bandwidth bottlenecks.An exception here is the 420-task run (30 nodes on Jtest-half, 15 nodes on Jtest-full), for which the increase in inter-node communication is the limiting factor (see also the discussion in Sect.2.4).
The minimum job (and block) size on Juqueen is 32 nodes or 512 tasks, which corresponds to only around 80 cells owned by each task.The parallel efficiency drops rapidly with an increasing number of nodes, since this problem size is simply too small for application on Juqueen.The dashed lines correspond to an exponential relation with a power-law index of 0.52.For large numbers of tasks, a runaway growth can be seen for the 120 km and to some extent also for the 100-25 km mesh.
panel) displays the communication volume as a function of the number of tasks, which follows a power law with an index of 0.52 up to about 1000 tasks (approx.40 owned cells per task).A runaway growth can be seen for larger numbers of tasks.At the same time, the graph partitions become increasingly non-contiguous (not displayed).It should be noted that the communication volume and number of non-contiguous partitions are computed by METIS and as such are a function of the partitioning of the cells only, which essentially assumes a single layer of ghost cells.Since MPAS-A exchanges two layers of ghost cells at maximum, the actual number of ghost cells, edges and vertices can be slightly different.A detailed study of the impact of these graph properties will be given in the following section.Table H2 lists the cheapest model runs in terms of CPUh spent per 24 h model integration and the fastest runs in terms of real time per 24 h model integration for the four HPC sites.It is important to remember that while the Jtest-half runs use only 50 % of the available cores on each node, the computational costs for the full node (28 CPUh per node per hour real time) are charged for the model run, since the node is not available for other users or jobs.Also, a one-to-one relation of CPUh between Linux cluster-type machines and an IBM Blue Gene is not meaningful.By comparing typical calls for proposals for the different HPC systems, a conversion factor of 1 : 16 seems to be reasonable, i.e. to consider one entire node with 16 cores on Juqueen as equivalent to one core on the other systems.However, since applications for computing resources usually demand estimates for the required amount of CPUh, we list the actual CPUh here.

Variable 100-25 km grid
The second test case employs an irregular mesh with a variable resolution ranging from 100 km for most parts of the globe to 25 km for a circular area spanning about 60 • , and centred on West Africa (lat = 12.5 • N, long = 0 • E).An integration time step of 120 s is used.The mesh as well as the frequency distribution of cell sizes are displayed in Fig. 4 (top panels), the scaling is illustrated in Fig. 2 (middle panel) and summarised in Table E1.
Test runs are conducted on single nodes up to 192 nodes on Curie (3072 tasks, 53 owned cells per task), 60 nodes on ForHLR1 (1200 tasks, 137 owned cells per task), 50 nodes on Jtest-full (1400 tasks, 117 owned cells per task) and 55 nodes on Jtest-half (770 tasks, 213 owned cells per task).Good scaling is achieved up to 32 nodes on Curie (512 tasks, 320 owned cells per task), 20 nodes on ForHLR1 (400 tasks, 410 owned cells per task), 25 nodes on Jtest-full (700 tasks, 234 owned cells per task) and 35 nodes on Jtest-half (490 tasks, 334 owned cells per task).The parallel efficiency on average drops from about 90 to 40 % within the transition zone (shaded area in Fig. 2, 600 to 150 cells owned per task), similar to the first test case on the regular 120 km mesh.
Notably different to the previous test case are the measured run times for Jtest-full and Jtest-half: for small numbers of tasks, the Jtest-full runs show a worse performance due to the aforementioned memory bandwidth limitations.For large numbers of tasks (nodes), the increase in inter-node MPI communication, which impacts the Jtest-half runs more than the Jtest-full runs, becomes the limiting factor and decreases the performance of the Jtest-half runs below that of the Jtest-full runs.With respect to the remaining HPC systems, a clear separation of the parallel efficiency by interconnect technology as for the 120 km test case cannot be seen here, due to the following reasons.
First, the disk I/O demand scales with the number of grid cells and is larger by a factor of 4 for this mesh (see Table H1).As we will see in the following sections, in particular for the extreme scaling experiments in Sect.4, the disk I/O becomes increasingly important for larger problem sizes and can consume a significant part of the total run time.The I/O is routed differently at the four HPC sites, the central storage systems have different bandwidths and block sizes, and the parallel I/O libraries might perform differently, depending on the compilers and compilation flags.Additionally, in this test case we adopt an integration time step of 120 s (4.8 s per km grid size), which implies a smaller fraction of the total time spent for the actual time integration relative to the disk I/O.
Second, the graph partitioning adds another layer of complexity and variability to the performance diagnostics.Figure 3 (middle panel) shows that the above-mentioned powerlaw relationship between the communication volume and the number of tasks also holds for the variable 100-25 km mesh up to about 40 owned cells per task (4000 tasks).Contrary to the 120 km regular mesh, METIS tends to create more non-contiguous partitions for complex mesh structures (not displayed here).This variability introduced by the graph partitioning is unpredictable and may have a significant impact on the model performance.If not instructed otherwise, the graph partitioning tool METIS aims at minimising the edge cut (communication volume), which potentially comes at the cost of having non-contiguous partitions.As discussed earlier in Sect.2.2, halo cells are added around each patch of the individual tasks, for which communication with the neighbouring tasks is required.The additional amount of communication caused by halo cells around non-contiguous partitions can be substantial, in particular if the number of cells owned per task is small (i.e. the ratio of halo cells to owned cells is large).
To investigate the effect of non-contiguous partitions on the parallel efficiency, we analyse one partition with 300 tasks for the 100-25 km mesh on ForHLR1 (546 owned cells per task), for which METIS by default produces a noncontiguous partition.We create an additional, contiguous partition using the command line arguments -contig -minconn for METIS, which results in an increase of the edge cut from 58 031 to 58 870 (1.4 %). Figure 5 displays the total number of cells (nCellsByTask), the number of owned cells (nCellsSolveByTask) and the number of halo cells per task (nHaloCellsByTask) as a ratio between the noncontiguous and the contiguous partition.Since the communication volume increases for the contiguous partition, the total number of cells per task on average is larger, too.The average number of owned cells is identical, since the number of cells of the graph does not change.Notably, task 200 has a 1.3 times larger number of halo cells for the non-contiguous partition, since its partition consists of two separate patches, which implies a larger number of neighbouring tasks and surrounding halo cells.
To eliminate the influence of the disk I/O on the run times for the two partitions in this test, we switch off the output to the disk.We find that the measured run times for the model integration is practically identical for the two runs tition using the command line arguments -contig -minconn for METIS, which results in an increase of the edge cut from 58031 to 58870 (1.4%).Figure 7 displays the total number of cells (nCellsByTask), the number of owned cells (nCellsSolveByTask), and the number of halo cells per 370 task (nHaloCellsByTask) as ratio between the non-contiguous and the contiguous partition.Since the communication volume increases for the contiguous partition, the total number of cells per task on average is larger, too.The average number of owned cells is identical, since the number of cells of the graph does not change.Notably, task 200 has a 1.3 times larger number of halo cells for the non-contiguous partition, since its partition consists of two separate patches, which implies a larger 375 number of neighbouring tasks and of surrounding halo cells.To eliminate the influence of the disk I/O on the runtimes for the two partitions, we switch off the output to disk.We find that the measured runtimes for the model integration is practically identical for the two runs (251 s non-contiguous vs. 255 s contiguous).For 300 tasks, the average ratio of halo cells to owned cells is 1 : 2.8, which might be too small to see the effect of the additional halo cells in the non-contiguous partition.We there-380 fore repeat the test for non-contiguous and contiguous partitions for 2520 tasks (65 owned cells per task), with a corresponding ratio of 1.2 : 1 halo cells to owned cells.Even in this case, the measured runtimes for the model integration are nearly identical (45.2 s contiguous vs. 45.6 s non-contiguous).
We conclude therefore that the impact of non-contiguous partitions on the runtime is negligible for any reasonable number of tasks for a given mesh.

385
Although the number of grid cells is 4 times larger for this test case than for the regular 120 km mesh, the problem size is still too small for application on Juqueen.The two smallest possible parallel runs with 512 and 1024 tasks correspond to 320 and 160 cells owned per task, for which the decrease in parallel efficiency is 20%.Runs with larger number of tasks all have parallel efficiencies of less than 60%.Table 6 in G lists the cheapest and fastest model runs for the four HPC sites.(251 s non-contiguous vs. 255 s contiguous).For 300 tasks, the average ratio of halo cells to owned cells is 1 : 2.8, which might be too small to see the effect of the additional halo cells in the non-contiguous partition.We therefore repeat the test for non-contiguous and contiguous partitions for 2520 tasks (65 owned cells per task), with a corresponding ratio of 1.2 : 1 halo cells to owned cells.Even in this case, the measured run times for the model integration are nearly identical (45.2 s contiguous vs. 45.6 s non-contiguous).We conclude therefore that the impact of non-contiguous partitions on the run time is negligible for any reasonable number of tasks for a given mesh.
Although the number of grid cells is 4 times larger for this test case than for the regular 120 km mesh, the problem size is still too small for application on Juqueen.The two smallest possible parallel runs with 512 and 1024 tasks correspond to 320 and 160 cells owned per task, for which the decrease in parallel efficiency is 20 %.Runs with a larger number of tasks all have parallel efficiencies of less than 60 %.Table H2 lists the cheapest and fastest model runs for the four HPC sites.

Variable 60-12 km grid
The third moderately sized scaling test consists of a variableresolution mesh with maximum grid spacing of 60 km and minimum grid spacing of 12 km.The refinement area is an approximate ellipse, illustrated in Fig. 4 (lower panels), and encompasses the whole of north and central Africa, extending as far as India in the east and covering a large part of the Atlantic Ocean in the west.This particular mesh is useful for studying the teleconnection between the Indian and Atlantic oceans and the monsoon systems in East and West Africa.The total number of grid cells is 535 554, which corresponds to 1034×517 grid points on a regular latitude-longitude grid and thus is in the range of current re-analyses.A time step of 72 s (6 s per km grid size) is adopted.
Figure 2 (right panel) and Table F1 summarise the scaling of this mesh on the four systems.As for the variable 100-25 km mesh, a separation of the parallel performance by in-terconnect cannot be detected, due to the increasing variability introduced by the disk I/O (see Table H1).While the number of tasks is sufficiently large to obtain a constant powerlaw relationship between the communication volume and the number of tasks up to 16 384 tasks on Juqueen (Fig. 3, right panel), the complexity of the mesh leads to more noncontiguous partitions (not displayed).
Tests runs are conducted on single nodes up to 384 nodes on Curie (6144 tasks, 87 owned cells per task), 120 nodes on ForHLR1 (2400 tasks, 223 owned cells per task), 55 nodes on Jtest-full (1540 tasks, 348 owned cells per task), and 55 nodes on Jtest-half (770 tasks, 696 owned cells per task).Good scaling is achieved up to 48 nodes on Curie (768 tasks, 697 owned cells per task), 45 nodes on ForHLR1 (900 tasks, 595 owned cells per task) and up to the maximum number of 55 nodes on Jtest-full and Jtest-half.In the transition zone between 600 and 150 owned cells per task (shaded in blue), the parallel efficiency on Curie and ForHLR1 drops off quickly from about 80 % to as low as 30 %.
Juropatest shows the best (although most irregular) scaling of the three Linux-cluster systems.With a maximum of 55 available nodes on the system, the parallel performance is better than 70 % for both Jtest-half and Jtest-full.The parallel efficiencies for ForHLR1 and Curie follow a more regular trend, although the number of non-contiguous partitions is highly variable for ForHLR1, but not for Curie.This adds further support to our conclusion that the impact of noncontiguous partitions on the run time is negligible for any reasonable number of tasks for a given mesh.With 535 554 grid cells, this test case also shows good scaling on Juqueen up to 128 nodes (2048 tasks, 262 owned cells per task).For larger number of tasks, the parallel efficiency drops rapidly and constantly and the number of non-contiguous partitions becomes highly variable.
Table H2 lists the cheapest model runs in terms of CPUh spent per 24 h model integration and the fastest runs in terms of real time per 24 h model integration for the four HPC sites.
Figure 6.Total run times (s) for the 120 km regular mesh with 40 962 grid cells, the variable 100-25 km mesh with 163 842 grid cells and the variable 60-12 km mesh with 535 554 grid cells.Indicated are time ratios between the Linux-cluster systems Curie, ForHLR1 and Juropatest, and the Blue Gene system Juqueen.Fits to the total run times are obtained separately for the Linux-cluster systems and the Blue Gene system, using a modified version of Amdahl's law (see Eq. 2 and Table 2), and are displayed as dashed black lines.
Table 2. Fit coefficients of modified Amdahl's law (Eq.2) for the three test cases and the two classes of architectures, with a common value for the exponent X = 2.85.For the Blue Gene system, only the fit to for the 60-12 km mesh should be used to estimate the required computing times.

Comparison of HPC systems
We further address the question how the total run times compare on the various HPC systems.Figure 6 displays the total run times for the 24 h model runs on the four systems.For all test cases, the three Linux-type systems line up quite well, which means that the absolute run times are very close for similar parallel decompositions and that differences in the processor architectures do not translate into model speedup.Due to the large differences in tasks between the Linuxcluster systems and the Blue Gene system, the problem size must be large enough for a fair comparison.Among the three test cases, only the 60-12 km mesh test case satisfies this requirement.
A common feature across the three test cases is a minimum of the total run times on the three Linux-type systems around 150 owned cells per task.The reason for the increase in total run times for smaller numbers of owned cells per task will be discussed in the following section.Here, we attempt to fit a modified version of Amdahl's law (Amdahl, 1967), which also accounts for the observed increase in run times when the parallel performance breaks down: Here, n denotes the number of parallel tasks and T (n) the total run time in seconds.The first term on the right-hand side of Eq. ( 2) represents the serial part of the code, the second term the part that scales perfectly and the third term the part that takes longer when more tasks are used.This allows one to quickly estimate the required resources for the test cases presented here for Linux-type systems (and Blue Gene/Q systems) with similar specifications.The coefficients A and B are allowed to vary across the three test cases and two different classes of architectures.The power-law index X, on the other hand, is derived as the best fit for the three test cases on the Linux-type systems and the 60-12 km on the Blue Gene system.This results in a value of X = 2.85 with a good fit to all four curves, which indicates that the increase in total run times below a certain number of owned cells per task occurs for the same reason (see also the discussion in Sect.2.7).
Table 2 summarises the coefficients obtained from fitting Eq. (2) to each of the six curves.Note that the fits to the 120 km mesh and the 100-25 km mesh on the Blue Gene system are listed here as well, but should be used with precaution.The minimum of T (n) is obtained by setting ∂T (n)/∂n = 0| n=n min and solving for the number of owned cells per task N o, min = N cells /n min .This leads to minimum values of N o, min = {143, 146, 182} owned cells per task and T min = {116, 177, 402} s for the three Linux-type systems on the 120 km mesh, the 100-25 km mesh and the 60-12 km mesh, respectively.For the Blue Gene system, only the fit for the 60-12 km mesh test case can be considered as realistic From the discussion in this section, we infer that a lower limit of about 150 owned cells per task is a good and quick estimate for the Linux-type systems below which the parallelisation becomes highly inefficient.The shift of the minimum value to larger numbers of tasks (lower numbers of owned cells per task) on the Blue Gene system can be attributed to the performance of the faster 5-D Torus interconnect relative to that of the slower CPU cores.The resulting fitting curves are also displayed in Fig. 6.

Breakdown of parallel performance
In the following, we investigate the reasons for the breakdown of the parallel performance.In the previous scaling tests, we include the model set-up -in principle, bootstrapping and reading of the initial conditions file -as well as the parallel output to the disk in the measurements.Here, we split up the total computational costs into three different steps: time integration, model set-up and disk I/O.Over longer runs, the model initialisation costs are amortised and the parallel performance is determined by the numerical solver (time integration) and the parallel I/O (disk output, reading of boundary updates).Accordingly, Fig. 7 displays the total computational costs, the costs for the three steps and a combination of time integration and disk I/O for the Jtest-full system and all test cases.Each time, the parallel efficiency of the combined time integration and disk I/O, indicated with orange stars, starts to decrease for less than 600 owned cells per task.The time integration alone shows nearly ideal scaling down to 150 owned cells per task, while the model set-up and disk I/O are only weakly dependent on the number of tasks.
According to Figs. 6 and 7, the absolute run times for all three test cases start to increase for less than 150 owned cells per task on the three Linux-cluster systems.For the Blue Gene system, this increase in absolute run times takes place at smaller numbers of owned cells per task, but the picture is less clear due to the small problem sizes.However, given the faster interconnect on the Blue Gene system and the fact that the model initialisation and disk I/O depend only weakly on the number of tasks, this suggests a breakdown of the parallel efficiency of the solver (i.e. the time integration) itself.
To exploit the limiting factors of the solver for a large number of tasks, we use the parallel debugging and profiling tools Scalasca12 and Score-P13 .These tools are available as modules on Juqueen; hence, we focus on three particular runs on the 60-12 km mesh on this system.We analyse in detail the 2048-task run (261 owned cells per task), the 4096-task run (130 owned cells per task) and the 8192-task run (65 owned cells per task), for which the latter takes longer than the former two runs (Fig. 6).
Halo cells are added around each patch of the individual tasks, for which communication with the neighbouring tasks is required.The larger the number of tasks, the smaller is the number of owned cells per tasks, and the larger the ratio between halo cells and owned cells.A lower bound for the number of halo cells N h for a given number of owned cells N o is given by Eq. ( 1) and corresponds to a ratio of N o : N h = 261 : 127 = 2.1 for the 2048-task run, 131 : 94 = 1.4 for the 4096-task run and 65 : 70 = 0.94 for the 8192-task run.
Table G1 summarises the percentages of time spent during the time integration step for both runs.Leaving aside the model initialisation (initial bootstrapping, reading of initial conditions), the total time spent for communication increases from 35 % for 2048 tasks to 70 % for 8192 tasks.A detailed analysis of the Scalasca report reveals that most of this time is wasted in MPI_WAIT during the exchange of 2-D and 3-D halo fields.About 16 % of the time is spent for parallel I/O, namely for building and writing output streams and for D. Heinzeller et al.: MPAS: an extreme scaling experiment reading boundary updates (sea surface temperature, sea ice fraction).It should be noted here that the selection of fields written to the disk is customisable at run time using simple ASCII files.Table H1 reports on the number of 3-D and 4-D fields written to the disk in our tests.
The Scalasca reports also reveal that computational imbalances in the parallelisation can have serious impacts.For the example of the 2048-task run on the 60-12 km mesh, the average time required to update the boundary information (sea surface temperature, sea ice fraction) is about 11.8 s per task.However, one single task takes 19.3 s for the same action and blocks all remaining tasks in their execution.Since the model synchronisation takes place when exchanging halo cell data, the different computing times of the model physics appear in the time spent for communication.
These results highlight the importance of an efficient parallelisation and fast interconnects between the compute nodes and to the central storage, in particular for future applications on massively parallel systems and for the extreme scaling tests in Sect. 4.

Reproducing the dynamics of the West African monsoon
The second important aspect of a numerical weather prediction and climate model is its accuracy in reproducing observed meteorological conditions on global, regional and local levels.As described in the introduction, the WAM has turned out to be a notoriously difficult problem in climate modelling, since it is a complex interplay of various dynamical, microphysical and surface-related processes across scales.The common understanding is that the intensity of the monsoon and the associated precipitation on local scales are regulated by the driving, large-scale atmospheric circulation.This picture is challenged and complicated by a recent study of Klein et al. (2015), who found that processes on a local scale, such as mesoscale convection and precipitation events, can have a noticeable influence through feedback effects on the entire monsoon system.Examples are therefore enhanced moisture transport and circulation, and strengthening of westward travelling disturbances (African easterly waves).
In this study, we attempt a first and brief evaluation of the ability of MPAS-A to reproduce the dynamics of the WAM.Due to computational limitations for this short comparison, we are restricted to 11-month long model runs.We focus on the onset of the monsoon season in June/July 1982 for two of the meshes presented above, namely the regular 120 km mesh and the variable 60-12 km mesh.Both models are initialised in September 1981 using Climate Forecast System Reanalysis (CFSR) data (NCEP Climate Forecast System Reanalysis; Saha et al., 2010) at 0.5 • × 0.5 • resolution as initial conditions.Daily updates of the sea surface temperature are taken from the NOAA Optimum Interpolation Sea Surface Tem-perature Analysis (NOAA OI SST; Reynolds et al., 2002) at 0.25 • ×0.25 • resolution.The period from September 1981 to the beginning of 1982 is considered as spin-up time for the model, in particular for the soil conditions.An analysis of the soil properties over the 11-month runs is undertaken to investigate whether such a short spin-up time is sufficient.The model output is compared to different sets of observational data to account for the large uncertainty in the gridded observational products in the data-sparse region of West Africa (see, for example, Lorenz and Kunstmann, 2012;Sylla et al., 2013).
For near-surface air temperature and precipitation, we refer to (1) the Climate Research Unit (CRU) high-resolution gridded time-series data set v. 3.22 (Harris et al., 2014), and (2) the University of Delaware (UDEL) air temperature and precipitation long-term monthly means V3.01 (Willmott and Matsuura, 2014) at 0.5 • ×0.5 • .Additional observational data for near-surface air temperature are obtained from the Global Historical Climate Network (GHCN) gridded 2 m temperature data set V2 (Fan and van den Dool, 2008) at 0.5 • × 0.5 • .For precipitation, we also use the Global Precipitation Climatology Centre (GPCC) full data re-analysis version 6 monthly means (Schneider et al., 2011), also at 0.5 • × 0.5 • .We also compare the MPAS-A model output to reference data obtained from a set of novel regional climate simulations over West Africa within the WASCAL program14 .these data are produced using the regional climate model WRF at 12 km resolution with initial and lateral boundary conditions provided by the ERA-Interim re-analysis (Dee et al., 2011, 80 km resolution).The WRF model uses a set-up that is optimised for the region of West Africa, following a detailed analysis of the monsoon dynamics for different WRF model configurations by Klein et al. (2015).An extensive documentation and analysis of this reference data set will be given in a forthcoming paper.The WRF model run covers a region from 25 • W to 25 • E and 5 • S to 25 • N and is initialised in January 1979.Spectral nudging is applied to the WRF limited area model to keep it on track with the large-scale features of the driving ERA-Interim re-analysis.The sea surface temperature is updated every 6 h from the ERA-Interim SST data.
The WRF model domain lies entirely within the refinement zone of the variable 60-12 km mesh.Hence, the resolution of the MPAS-A runs is 120 km for the regular mesh (MPAS-120r hereafter) and 12 km for the variable mesh (MPAS-12v hereafter).Figure 8 shows the model topography of the WRF reference model at 12 km resolution (WRF-12r hereafter) and the MPAS-A models.Also indicated is a classification of the land area into five agro-climatical zones, which will be used in the further analysis.Naturally, the topography is nearly identical for WRF-12r and MPAS-12v.Minor differences can be seen along the coastline and inland water bodies, which are caused by the different grids and by the need to re-grid the MPAS model output onto a regular lat-long grid.This re-gridding is performed with an NCL 15 script mpas_to_latlon.py(L.Fowler, personal communication, 2014), which adds another step to the postprocessing tasks and is computationally quite demanding; for example, a minimum of 1.5 GB of memory and 10 s run time is required to re-grid one 3-D variable of MPAS-12v for one time step.For MPAS-120v, the terrain is smoothed out and the coastline (often referred to as landmask in the models) is ill represented.This has negative effects when verifying the model output over regions such as Guinea, as we will see below.
In Fig. 9, we analyse the spatial distribution of the nearsurface temperature for July 1982 (top panel) and its annual cycle between September 1981 and July 1982 (bottom panel).The observations provide data over land only and show noticeable spatial differences in the position and intensity of the Saharan heat low (SHL), in particular between CRU/UDEL and GHCN, and minor differences over Ghana and along the coastline.Regarding the model runs, WRF-12r matches the position and temperature of the SHL best and reproduces the observed temperatures.Both MPAS models show a slightly colder surface temperature distribution for July 1982, and the cold bias in MPAS-120r is larger on average.The sea surface temperature distribution is similar for the two MPAS runs, since they are using the same SST data for their daily updates.However, MPAS-120r shows strong artefacts along the coastline of the Gulf of Guinea, which is due to its inaccurate landmask.The WRF-12r SST, which is updated every 6 h from ERA-Interim data, is colder over the Golf of Guinea, but otherwise shows the same patterns.With respect to the temporal evolution, the annual cycle is reproduced well for both MPAS models; however, a significant cold bias is detected over most of the land area from the time of model initialisation in September 1981 up to May 1982.From June to July 1982, this bias seems to nearly vanish with the exception of the coarser MPAS-120v over Guinea 15 http://http://www.ncl.ucar.edudue to the limited resolution of the coastline.The observational data sets displayed in the lower panel agree in general, but show small differences for the Saharan region (only CRU and GHCN are displayed for clarity; UDEL is very close to CRU).
Figure 10 likewise displays the spatial distribution of the monthly precipitation for July 1982 (top panel) and its annual cycle (bottom panel) for the three observational data sets and the three models.At the onset of the monsoon season, the rain band is centred over 10 • N for all three observational data sets.Areas of high precipitation are found over Guinea, Sierra Leone and the Senegal in the west, and over Nigeria, Cameroon and the Central African Republic in the east.While the spatial distribution hardly shows any difference between the observational data sets, the temporal evolution deviates for the Saharan and Guinea regions.For the Saharan region, small absolute values and a very small number of actual observations lead to large relative uncertainties.For Guinea, the spatial interpolation along the coastline leads to differences in the timing of the maximum precipitation (April 1982 for CRU, May 1982 for GPCC; as for temperature, UDEL follows CRU closely).
The three models successfully reproduce the location of the rain band, a fact that should not be taken for granted.In the case of WRF, Klein et al. (2015) demonstrated that the representation of the monsoon dynamics is largely determined by the microphysics and the planetary boundary layer schemes, while the cumulus scheme seems to play more of a role on daily timescales and for the actual amount of precipitation triggered by mesoscale convection.The WRF model configuration chosen here uses the WSM5 microphysics scheme, the ACM2 planetary boundary layer parameterisation and the Grell-Freitas cumulus scheme (see Wang et al., 2014, for a summary of the WRF physics options and further references).It is highly optimised for the region and thus not only matches the location of the rain band but also reproduces the observed precipitation patterns over the Soudano, the Sahel and the Sahelo regions which consists of the WSM6 microphysics scheme (similar to WSM5 with graupel as additional hydrometeor), the YSU (Yonsei University) planetary boundary layer parameterisation and the Kain-Fritsch cumulus scheme.This particular combination produces excessive precipitation during the peak of the monsoon in WRF due to a non-linear response of convective precipitation events to the dynamics, and it seems to exhibit the same behaviour in the two MPAS models.
With respect to the annual cycle of precipitation, the WRF model shows excessive precipitation for all months and an early onset of the monsoon season for the Soudano, Sahel and Sahelo regions.The excess in rainfall over all land area is mainly caused by an overestimation over Guinea, which receives the most rainfall over the year (more than 3000 mm year −1 , compared to less than 500 mm year −1 in the Sahelo region).The out-of-the-box MPAS models match the observations better, in particular the timing of the onset of the rainy season and the precipitation over Guinea.The coarser MPAS-120r is closer to the observed precipitation over the inland areas than the higher-resolution MPAS-12v during the onset of the monsoon in June/July 1982.This, however, is not a signal of a higher accuracy of the coarser model: at 120 km resolution, the Kain-Fritsch cumulus scheme is less active than at 12 km resolution, and consequently its wet bias is reduced.To support this further, Fig. 11a displays the mean sea level pressure map for July 1982.WRF-12r and MPAS-12r both show a distinct area of low pressure, the SHL, alongside the south-north pressure gradient that is causing the movement of the ITCZ and the associated monsoon rain band.This SHL is much less pronounced in MPAS-120r.In both MPAS models, the region of low pressure is displaced by about 10 • to the east compared to WRF.Lavaysse et al. (2009) derived a climatological mean position of 3 • W, 23 • N from reanalysis data between 1979 and 2001, which coincides well with WRF.
However, the SHL is not only a region of low surface pressure but also of high surface temperatures.Comparing Figs. 9 and 11a, it is apparent that the areas of low pressure and high temperature are co-located for WRF-12r.The hightemperature regions of MPAS-120r and MPAS-12v match the WRF SHL position albeit with a cold bias of about 2 • C, while the low-pressure region is shifted to the east.This shift is caused by the overall cold bias in the two MPAS models: the SHL, like any other non-frontal thermal low, is formed by rapid solar heating of the land surface and the near-surface layers, which leads to a rising of hot and dense air and to the formation of a stationary low-pressure area.Thus, the cold bias in the MPAS models implies a less intense deepening of the pressure system at the expected location of the SHL.As discussed above, the general cold bias over all land areas  persists from the onset of the MPAS model runs in September 1981 and only starts to vanish from June to July 1982 on.This is reflected in the temporal evolution of the mean sea level pressure (Fig. 12, left panel), integrated over all land areas, which is the highest for MPAS-120r, the lowest for WRF-12r, in-between for MPAS-12v and merging towards each other from June 1982 on.
A final investigation of the soil properties in Figs.11b  and c and 12 (middle and right panel) reveal the root cause of the cold temperature bias and the associated displacement of the low-pressure system.Starting from the MPAS model initialisation in September 1981, the soil temperature is about 2 • C lower then for the WRF model run, which is initialised in January 1979 and thus has more than 2 years to spin up the land surface model (community Noah land surface model for both WRF and MPAS).Over the course of the simulated 11 months, the MPAS soil temperatures start to converge towards the WRF soil temperature: in July 1982, the soil temperatures distributions in Fig. 11b are already quite similar.However, the relative soil moisture maintains a constant bias for most of the time and still shows spatial differences in July 1982.This is due to the fact that between the MPAS model initialisation in September 1981 and the onset of the monsoon season in 1982, virtually no precipitation occurs over large fractions of the land area and that the inherently low soil moisture simply has no possibility to adapt.
We therefore conclude from this section that the globalto-variable-resolution mesh concept and its implementation in MPAS-A have the potential to reproduce the dynamics of the WAM and its associated precipitation.In this single, 11-month experiment, the out-of-the-box set-up of MPAS can compete with the optimised WRF set-up with respect to the timing of the onset of the rainy season.The highresolution MPAS-12v model shows a better performance than the coarse-resolution MPAS-120r model in general, but in particular for the coastal regions.Deficiencies in the location of the SHL and a cold bias in the surface temperature are apparent in both MPAS model runs, which are caused to a large extend by an insufficient spin-up time of the models.Detailed investigations of the WRF reference run, not discussed here, suggest that a spin-up time of at least 1 year is required in order to adjust the soil moisture.The MPAS runs also tend to overestimate the monsoon precipitation, which we attribute to the combination of physical parameterisations of the out-of-the-box set-up.

Extreme scaling experiment at very high resolution
In this section, we evaluate the scalability and limitations of the MPAS-A model for a very large mesh on a massively parallel system.This experiment is motivated by the following two aspects: 1. Future HPC environments will most likely be massively parallel systems, with the number of cores per node and the number of nodes increasing much faster than the speed of the individual cores.Models such as MPAS-A have to be able to scale on these systems in order to be used successfully in the future.
2. The typical model resolution of global and regional NWP and climate models has increased continuously over the past few decades.Currently, the European Centre for Medium-Range Weather Forecasts (ECMWF) is leading the field with an operational global NWP model at 14 km resolution16 , and limited area models have been taken down to sub-kilometre resolution.With this experiment, we want to demonstrate that convectionresolving (below about 4 km grid size; see Weisman et al., 1997;Prein et al., 2013, for example), global atmospheric simulations are possible on current HPC environments.
To create a large-enough problem for these tests at a convection-resolving resolution, we use a regular, global 3 km mesh with more than 65 million horizontal grid cells and 41 vertical levels.Up to now, only a few real-data simulations on this mesh have been conducted on NWSC's Yellowstone supercomputer using a maximum of 16 384 MPI tasks (approx.4000 owned cells per task); a set of scaling benchmarks based on an idealised case have also been run on up to 131 072 MPI tasks on Edison, a Cray XC30 at the National Energy Research Scientific Computing Center17 .Among the four HPC sites presented earlier, only the FZJ Juqueen offers a large enough number of cores to conduct this extreme scaling experiment.With a maximum of 458 752 cores, these tests require scaling out to the full system, which is not possible during normal operations.However, following the 3rd Juqueen Tuning and Porting Workshop in February 2015, a few selected applicants were invited to conduct extreme scaling tests on the full machine during a period of 24 h.The results presented in the following were obtained during this event, which is summarised in a technical report by Brömmel et al. (2015).

Model configuration and experiment preparation
Several modifications of the MPAS-A code are required to conduct this experiment.First, it is no longer possible to use the standard NetCDF large-file format (CDF-2), since the maximum number of elements for some of the 4-D variables exceeds the internal limit of 2 billion values for any particular record.One possible solution is to use the CDF-5 extension of CDF-2, which is supported by the Parallel-NetCDF library.However, only very few applications understand this format, and initial testing on Juqueen revealed problems reading correct data in massively parallel read operations.Another solution, which is adopted here, is to use the newer NetCDF-4 format, which supports parallel I/O through PHDF5.This requires upgrading the parallel I/O library PIO from v1.7.1 to 1.9.15, and modifying the I/O framework of the MPAS-A model code.Motivated by this study, MPAS-A v4.0, released at the time of writing, supports NetCDF-4 I/O without any need to modify the software framework.
Further, the generation of the model input data becomes a large computational problem, which cannot be fit on a single machine due to time constraints and memory limitations.In MPAS release v3.1, the pre-processing of the data is partly a serial process running on one CPU core only.Hence, a parallelisation of the pre-processing is required in addition to the above changes to the I/O routines.These changes are applied to the basic MPAS-A v3.1 code, and this modified version is used in the following scaling experiments.
Another difference from the default model configuration presented in Sect.2.2 is that each test is run for 1 h model time only.This implies that the update of the surface data (sea surface temperature, sea ice fraction), which occurs every 24 h in the above moderate scaling tests, is dropped.Also, with a global resolution of 3 km, it is generally agreed that no convection scheme is required, since the microphysics scheme is able to generate the convective precipitation systems on the grid scale.These modifications are reflected in the model namelist in Appendix C.
The pre-processing consists of two steps.First, a static data set is produced (static.nc),which maps invariants such as terrain height, landmask and land use classification onto the 3 km mesh.These invariant fields are required as input for the subsequent generation of the initial conditions (init.nc), in part because the terrain field is necessary for the generation of the height-based vertical grid.The parallel pre-processing steps require special mesh partitions, different from the partitions generated by METIS and used for the model runs.Here, both steps are conducted with 576 cores, spread across 80 nodes on the ForHLR1 to provide sufficient memory for each task.Each step takes about 1 h 15 min real time and requires around 4.6 TB of the available 5.1 TB of memory.The resulting initial conditions file has a size of 1.2 TB and is transferred to Juqueen over the 10 Gb s −1 internet connection between the two HPC sites.Since both preprocessing steps are only required once, no further investigation or optimisation of the run times is attempted.

First attempts and optimisations
Initial test runs at 3 km resolution revealed previously unknown problems on the system.As described in Sect.2.2, a bootstrapping step is required during the model initialisation to set up the grid and instruct individual tasks with whom to share information about neighbouring grid cells.In the MPAS code, this is implemented using hash tables.In order to complete the bootstrapping in a reasonable time, the hash table size (parameter TABLESIZE) is increased from the default value 27 183 to 6 000 000.After this adjustment, the bootstrapping step takes between 18 and 29 min on Juqueen, depending on the number of MPI tasks (see Table 3).A second bottleneck is the reading of the initial conditions file.Performance improvements for this step With these optimisations, the parallel reading of the initial conditions file improves slightly with the number of tasks, since they are located in different racks.Conversely, the bootstrapping step takes longer for larger numbers of tasks.Hence, the overall model initialisation takes approximately 45 min for the 3 km mesh and varies only slightly with the number of nodes.One notable exception here is the run on 8192 nodes, for which the initial I/O is only 50 % of that of the other runs.The exact reasons for this behaviour need to be investigated, but we think that this combination of file size and I/O tasks is a sweet spot on Juqueen.

Execution of extreme scaling tests
The substantial memory requirements for the 3 km mesh do not allow one to run the model on 1024 or 2048 nodes.The baseline for our scaling experiment is therefore the run on 4096 nodes (4 racks, 65 536 MPI tasks, 512 I/O tasks, 65 TB memory).Contrary to the model initialisation, the time integration step scales very well up to the entire machine, with a parallel efficiency of 87 % for 24 576 nodes (393 216 tasks, 167 cells per task) compared to the baseline (see Table 3).The test run on the full system with 28 672 nodes (458 752 tasks, 143 cells per task) shows a lower performance than the run on 24 racks and a parallel efficiency of nearly 70 %, which confirms the above-mentioned limit of approximately 150 owned cells per task, below which the parallelisation becomes inefficient.
All scaling tests are conducted with an 18 s model integration time step for a 1 h model time.However, we find that in order to keep the model stable when starting off the 3 km mesh from initial conditions derived from a 48 km reanalysis data set (CFSR), a more conservative time step is required.MPAS currently lacks a dynamical initialisation system (e.g.digital filters, adaptive time-stepping), which could avoid this issue.Model instabilities leading to NaNs can affect the performance in different ways: (1) the performance might increase in cases where if-NaN-tests causes the code to return early from computationally intensive physics routines, or (2) the performance might decrease due to the continuous generation of floating-point exceptions.We note that on Blue Gene systems, MPAS does not by default include the -qflttrap compiler flag to trap floating-point exceptions, and, consequently, the model will continue to run even when the simulation generates NaNs.We therefore repeat the runs for 4096, 8192 and 16 384 nodes with a 12 s time step to obtain a stable model run.The measured real times for the three 12 s runs are very close to 1.5 times the real times for the corresponding 18 s runs, which gives us confidence that we can scale the results for 24 576 and 28 672 nodes with an 18 s time step to a 12 s time step, despite the fact that the runs with an 18 s time step produced NaNs.
Table 3 and Fig. 13 summarise the required times of the individual steps of the 3 km runs with an 18 s integration time step.Due to wall-time constraints, we only conduct runs without writing output to the disk.The last column in Table 3 estimates how many hours the 3 km model can be advanced within 24 h wall time, and is calculated as follows: a 12 s model integration time step is assumed, and the real time required is scaled from the 18 s runs by a factor of 1.5 for 24 576 and 28 672 nodes, for which no 12 s runs are conducted.For a typical production run, diagnostic output files of 13 GB size are written every 3 h model time, while comprehensive output files of approximately 250 GB size are written every 24 h model time.A restart file of 2.1 TB size is written at the end of the model run.Based on a parallel write performance of 0.6 GB s −1 , we make a conservative estimate that roughly 2 hours of the 24 h wall time will be used up by writing these files to the disk.Tables H1 and H2 list the file sizes and the cheapest and fastest model runs for the 3 km mesh on Juqueen.
We conclude from this extreme scaling test that the dynamical solver of MPAS scales on massively parallel systems out to hundreds of thousands of cores.Our results confirm that the model behaves similar for the 3 km mesh than for the significantly smaller problem sizes and that the parallel efficiency is limited by the same factors, namely the increasing number of halo cells and the amount of communication for a large number of tasks.This occurs around 150 owned cells per task, which corresponds to roughly 27 300 nodes for the 3 km mesh.However, we find that the model initialisation and the disk I/O become increasingly important and at the same time difficult to improve for extremely large test cases.Compared to the model integration, the time D. Heinzeller et al.: MPAS: an extreme scaling experiment required for the model initialisation and for reading and writing data is largely independent of the number of tasks.For a maximum wall time of 24 h on Juqueen, these steps consume up to 3 h or 10-15 % of the total job time -in other words, hundreds of thousands of CPUh.The cheapest run on Juqueen utilises 4096 nodes (4 racks, 65 536 tasks), consumes 1.3 Mio CPUh for a 24 h model integration and gives a speed-up of 1.2 × real time.For 24 576 nodes (24 racks, 393 216 tasks), a speed-up of 6.3 × real time is achieved at the slightly higher expense of 1.5 Mio CPUh.

Conclusions
In this study, we analyse the atmospheric model MPAS-A in detail for its numerical performance and for its physical accuracy.We conduct scaling tests for three medium-size problems using regular and variable meshes of different complexity on four different HPC facilities.We confirm an overall good scaling ( 70 % parallel efficiency) of MPAS across all systems and find that a value 150 cells owned by each task provides a good and quick estimate for the breakdown of the parallel performance when set-up and I/O costs are included in the scaling.This limit depends weakly on the interconnect of the system (absolute but also relative to the core speed), with faster interconnects corresponding to lower values, but also on the I/O performance.Accordingly, it is slightly lower for the Blue Gene system (between 100 owned cells per task for the 60-12 km mesh and 150 owned cells per task for the 3 km mesh).Taking into account that the set-up costs are amortised over longer runs, MPAS-A maintains a parallel efficiency of 80 % or better for more than 150 owned cells (100 owned cells on Juqueen) per task.Based on these findings, we provide numbers on the typical file sizes and optimal model configurations for conducting research and operational runs on the different HPC systems.
An in-depth analysis of the properties of different graph partitions for one of the meshes shows that the impact of non-contiguous graph partitions in form of changes in the number of halo cells is negligible for any reasonable number of tasks for a given problem size.We further employ the parallel profiling tools Scalasca and Score-P to identify the bottlenecks in the MPAS-A code when the parallel performance breaks down.Our findings confirm that most of the time in such cases is spent waiting during the communication with neighbouring tasks, but also demonstrate the negative impact that computational imbalances can have on the model performance.
We also study the accuracy of MPAS-A for one common and challenging problem in climate research, namely the capability to reproduce the dynamics of the WAM and its associated precipitation.We conduct 11-month simulations for two meshes, a regular 120 km mesh and a variable 60-12 km mesh, and compare the model output to a number of observation data sets, selected re-analyses and a reference model run.The reference model run is chosen from a novel set of regional climate simulations over West Africa within the framework of the WASCAL programme and employs the regional climate model WRF, from which MPAS inherits several aspects of the dynamical solver and all of its physical parameterisation schemes.
We find that MPAS-A is able to model the monsoon dynamics and the northwards movement of the monsoon rain band in this pilot study.Despite using an out-of-the-box configuration of the model, both runs reproduce the timing of the onset of the monsoon season better than the optimised WRF reference run.We find that the precipitation in the early monsoon season is overestimated, which we attribute to the choice of physics parameterisations.The MPAS model runs also show a cold bias in the near-surface temperature and consequently fail to place the SHL at the correct location, which we believe stems from a too short spin-up time of the model.We would like to stress that our pilot study aims primarily at investigating the feasibility to conduct climate modelling research with MPAS for the West African region.In order to be able to draw general conclusions on the ability of MPAS to reproduce observed climatological properties, an ensemble of model runs over longer time periods with sufficiently long spin-up times is required.
In the last part of this study, we conduct extreme scaling tests on a global 3 km mesh with more than 65 million grid cells on up to 458 752 cores on Juqueen, the IBM Blue Gene/Q at the Forschungszentrum Jülich.We describe the issues that arise when attempting such an experiment for the first time -up to now, MPAS-A has been run for real-data cases, which include a full physics suite, on a maximum number of 16 384 tasks on Yellowstone -and provide solutions that allow one to conduct the scaling test in the first place and improve the model performance.We find that the model scales very well up to the entire machine with a parallel performance of nearly 70 % for 458 752 tasks.We confirm that the limitations and rules to estimate the scaling, derived for moderate problem sizes, are also valid for the extremely large test case.This gives confidence for planning model experiments and estimating required run times, storage and computational resources.
Furthermore, we identify additional aspects in the model that become increasingly relevant for larger problem sizes: the model initialisation and the disk I/O.We describe strategies to improve the performance of the model, which are partly machine dependent.We further give estimates on required run times and resources for conducting scientific experiments with the 3 km mesh on Juqueen.
Our next steps will be to conduct a number of longer simulation experiments on regular and variable-resolution meshes with a moderate number of grid cells.Specifically, we plan to pursue the study on the dynamics of the WAM using a variable-resolution grid such as the 60-12 km mesh and a regular grid with a similarly fine resolution.This will allow us to compare the accuracy of the model after a full spin-up of the soil conditions and to assess the impact of the variable mesh on the model results.It will also allow us to study physical processes such as the teleconnection between the oceans and the African monsoon systems, and investigate the impact of climate and land use changes in a consistent approach.
In conclusion, the MPAS-A model is a novel atmospheric model that scales well on a range of architectures for small up to extremely large numbers of tasks.Based on an unstructured Voronoi mesh, it allows to conduct global simulations with local refinement regions and smooth transition in-between them.This makes it possible to study local-scale processes in regions of interest with a full coupling to the large-scale motions and a physical consistency within the model.This is shown here in a pilot study of the WAM.We also demonstrate that it is possible to conduct global, convection-resolving atmospheric simulations with MPAS on current and future massively parallel systems.However, it is evident that the application of models such as MPAS for extremely large problem sizes and numbers of tasks requires substantial efforts to optimise the model to the problem and to the machine it is run on.In order to do so, interdisciplinary approaches and more intensive training of scientist on hardware, software and programming techniques are paramount.

Figure 1 .
Figure 1.Upper left: sketch of a variable-resolution Voronoi mesh used for the horizontal grid; upper right: C-grid staggering of state variables (adapted from Skamarock et al., 2012); lower left: a block of owned cells (blue) assigned to an Message Passing Interface (MPI) task, along with two layers of halo cells (red, orange); lower right: partitioning of the regular 120 km mesh with 40 962 grid cells for 64 tasks.

Figure 2 .
Figure 2. Parallel efficiency (%) for the 120 km regular mesh with 40 962 grid cells, the variable 100-25 km mesh with 163 842 grid cells and the variable 60-12 km mesh with 535 554 grid cells.

FigureFigure 3 .
Figure 3. Communication volume as function of number of tasks for the three test cases (left: 120 km; middle: 100-25 km; right: 60-12 km).The dashed lines correspond to an exponential relation with a power-law index of 0.52.For large numbers of tasks, a runaway growth can be seen for the 120 km and to some extent also for the 100-25 km mesh.

Figure 4 .
Figure 4. (Left) approximate mesh cell sizes and (right) distribution of mesh cell sizes for (top) the variable 100-25 km mesh with 163 842 grid cells and (bottom) the variable 60-12 km mesh with 535 554 grid cells.

Figure 7 .
Figure7.Ratio of (left) total number of cells, (mid) number of owned cells, and (right) number of halo cells per task between the non-contiguous and the contiguous graph partition of the variable 100-25 km mesh for 300 tasks.

Figure 5 .
Figure5.Ratio of (left) total number of cells, (middle) number of owned cells and (right) number of halo cells per task between the noncontiguous and the contiguous graph partition of the variable 100-25 km mesh for 300 tasks.

Figure 7 .
Figure7.Required time for a 24 h integration (s) split into time integration, model set-up and disk I/O for the three test cases on Jtest-full.Also displayed are the total time and a combination of time integration and disk I/O to reflect the parallel performance of MPAS-A for longer model runs, for which the initial set-up costs are amortised.

Figure 8 .
Figure 8. Model topography (terrain height in metres) for the WRF reference and the two MPAS model runs for the West African WRF domain.The left panel also indicates five distinct agro-climatical zones, following a gradient of decreasing annual precipitation from south to north.

Figure 9 .
Figure 9. (Top) mean near-surface temperature (over land) in • C for July 1982 for the three observational data sets CRU, UDEL, GHCN, the WRF reference and the two MPAS model runs; (bottom) annual cycle of mean near-surface temperature over the entire land area and the five agro-climatical zones depicted in Fig. 8.

Figure 10 .
Figure 10.(Top) monthly precipitation (over land) in millimetres for July 1982 for the three observational data sets CRU, UDEL, GPCC, the WRF reference and the two MPAS model runs; (bottom) annual cycle of monthly precipitation over the entire land area and the five agro-climatical zones depicted in Fig. 8.

Figure 11 .
Figure 11.(Top) mean sea level pressure, (middle) mean soil temperature and (bottom) mean relative soil moisture over land for July 1982 for the WRF reference and the MPAS model runs.

Figure 12 .
Figure 12.Annual cycle of (left) mean sea level pressure in hPa, (middle) mean soil temperature in • C and (right) mean relative soil moisture in % over land for the WRF reference and the MPAS model runs.

Figure 13 .
Figure 13.Required times for individual steps of the 3 km test runs on Juqueen (18 s time step).

Table 1 .
Specifications of the four HPC facilities used for the medium-size scaling tests in Sect. 2.
a Thin nodes only, ignoring fat nodes (larger memory) and hybrid nodes (GPU accelerators) b Using dedicated I/O nodes, typically 128 I/O nodes per 1024 nodes (1 rack)