We examine an alternative approach to heterogeneous cluster-computing in the many-core era for Earth system models, using the European Centre for Medium-Range Weather Forecasts Hamburg (ECHAM)/Modular Earth Submodel System (MESSy) Atmospheric Chemistry (EMAC) model as a pilot application on the Dynamical Exascale Entry Platform (DEEP). A set of autonomous coprocessors interconnected together, called Booster, complements a conventional HPC Cluster and increases its computing performance, offering extra flexibility to expose multiple levels of parallelism and achieve better scalability. The EMAC model atmospheric chemistry code (Module Efficiently Calculating the Chemistry of the Atmosphere (MECCA)) was taskified with an offload mechanism implemented using OmpSs directives. The model was ported to the MareNostrum 3 supercomputer to allow testing with Intel Xeon Phi accelerators on a production-size machine. The changes proposed in this paper are expected to contribute to the eventual adoption of Cluster–Booster division and Many Integrated Core (MIC) accelerated architectures in presently available implementations of Earth system models, towards exploiting the potential of a fully Exascale-capable platform.
The ECHAM/MESSy Atmospheric Chemistry (EMAC) model is a numerical chemistry and climate simulation system that includes sub-models describing tropospheric and middle atmosphere processes and their interaction with oceans, land, and human influences (Jöckel et al., 2010). It uses the second version of the Modular Earth Submodel System (MESSy2) to link multi-institutional computer codes. The core atmospheric model is the 5th generation European Centre for Medium Range Weather Forecasts Hamburg general circulation model (ECHAM5, Roeckner et al., 2003, 2006).
The EMAC model runs on several platforms, but it is currently unsuitable for massively parallel computers, due to its scalability limitations and large memory requirements per core. EMAC employs complex Earth-system simulations, coupling a global circulation model (GCM) with local physical and chemical models. The global dynamical processes are strongly coupled and have high communication demands, while the local physical processes are inherently independent with high computation demands. This heterogeneity between different parts of the EMAC model poses a major challenge when running on homogeneous parallel supercomputers.
We test a new approach for a novel supercomputing architecture as proposed by the DEEP project (Eicker et al., 2013, 2015; Mallon et al., 2012, 2013; Suarez et al., 2011), an innovative European response to the Exascale challenge. Instead of adding accelerator cards to Cluster nodes, the DEEP project proposes to use a set of interconnected coprocessors working autonomously (called Booster), which complements a standard Cluster. Together with a software stack focused on meeting Exascale requirements – comprising adapted programming models, libraries, and performance tools – the DEEP architecture enables unprecedented scalability. The system-level heterogeneity of DEEP, as opposed to the common node-level heterogeneity, allows users to run applications with kernels of high scalability alongside kernels of low scalability concurrently on different sides of the system, avoiding at the same time over- and under-subscription.
The Cluster–Booster architecture is naturally suited to global atmospheric circulation–chemistry models, with global components running on the Cluster nodes exploiting the high-speed Xeon processors and local components running on the highly parallel Xeon Phi coprocessors. By balancing communication vs. computation, the DEEP concept provides a new degree of freedom, allowing us to distribute the different components at their optimal parallelization.
The EMAC model comprises two parts, the dynamical base model ECHAM, using a non-local, spectral algorithm with low scalability, and the modular framework MESSy, linking local physical and chemical processes to the base model, with high scalability. While the number of processors used for the base model is limited by the non-local spectral representation of global dynamical processes, local physical and chemical processes described by framework submodels run independently of their neighbours and present very high scalability.
The ECHAM base model runs in parallel in the distributed-memory paradigm using the Message Passing Interface (MPI, Aoyama and Nakano, 1999) library for communication; the MESSy framework inherits the parallel decomposition defined by the base model. While ECHAM has been shown to be able to exploit the shared-memory paradigm using the Open Multi-Processing (OpenMP, 2015) library (Dagum and Menon, 1998), no such effort had been undertaken for the MESSy model so far.
It is, however, currently not possible to delegate the whole MESSy subsystem to full multi-threaded execution as some physical processes are naturally modelled in a column-based approach, and are strongly dependent on the system states at their vertically adjacent grid points. The implementation of submodels simulating these processes consequently relies on the column structure inherited from the base model, e.g. sunlight intensity at lower grid points depending on the absorption at higher grid points, and precipitation depending on the flux of moisture from vertically adjacent grid cells.
Describing gas-phase chemical kinetics, the MESSy MECCA submodel (Sander et al., 2011) executes independently of its physical neighbours and is not limited by vertical adjacency relations. As more than half of the total run time is spent in MECCA for a typical model scenario, it seems adequate to concentrate on the MECCA kernel with strong algorithmic locality and small communication volume per task. As sketched in Fig. 1 the current implementation of MECCA, developed in the DEEP project, is delegated to the Booster using a task-based approach, while both ECHAM and the remaining MESSy submodels are executed on the Cluster in the distributed-memory paradigm.
Phases of EMAC. Green phases run on the Cluster; blue phases run on the Booster.
Implementing a spectral model of the dynamical state of the atmosphere, the ECHAM phase comprises six transform and six transposition operations in each time step. The data in memory for each time step (data size scales with the square of the model horizontal resolution) are transposed in an all-to-all communication pattern, and this phase is dominated by network bandwidth.
Figure 2 displays one model cycle traced with Extrae/Paraver (Extrae, 2015; Paraver, 2015) starting with the end of the grid-point calculations of the last time step calculated – in which most processors are already idle (orange) due to load imbalance and waiting for process 14 (blue) to finish running. This is followed by the transpositions in the beginning of a new time step, and Fourier and Legendre transformations (magenta), which execute simultaneously, as further analysis showed. After the transpositions a short interval with all processors running (blue) can be identified with the time-step integration in spectral space, followed by the inverse transformations and transpositions and transport calculations in ECHAM.
Paraver trace of major processor usage of one time step. Time is along the horizontal and each bar corresponds to a separate CPU core. Blue colour depicts computation; orange corresponds to idle time due to load imbalance. The grid-space transpositions and Fourier and Legendre transformations are shown in magenta.
While the pattern described so far repeats towards the end of the displayed interval, the major fraction of the time step is spent without communication, running (blue) or waiting (orange) in calculations in MESSy in grid space. The MESSy phase comprises some 30 submodels that are tightly coupled by exchanging the atmospheric observables using global variables. Investigations during the first phase of the project determined the load imbalance visible in Fig. 2 to be caused by chemical processes computed in the MECCA submodel.
The observed load imbalance is one of the main factors determining application scalability. It is caused by an adaptive time-step integrator solving a system of differential equations. As the stiffness of these equations representing photochemical reactions varies by up to 2 orders of magnitude due to changes in the intensity of sunlight, the adaptive integrator demands varying amounts of run time accordingly.
In the MECCA phase the algorithmically complex adaptive time-step differential equation integrator operates on variables representing chemical concentrations of a total data size of the order of a few kilobytes per grid point. Yet, as observed using Scalasca (Scalasca, 2015), this phase consumes the major proportion of the total execution time (76 %), and it is compute-bound and an obvious candidate for offloading to accelerators. It should be noted though that in a conventional architecture, accelerating this highly parallel phase will not eliminate the load imbalance.
To test the model scalability out-of-the-box, the EMAC application has been ported to the JUDGE cluster at the Jülich Supercomputing Centre (JSC), and a representative benchmark with a horizontal resolution of 128 grid points in the longitudinal direction and 64 grid points in the latitudinal direction with 90 vertical levels and a spin-up period of 8 simulated months has been compiled, frozen, and packaged to be used for measurements. Table 1 details the experimental set-up for the results shown in this section. The default E5M2 KPP chemistry batch option, along with model namelist set-up (NML_SETUP) E5M2/02b_qctm, without dynamics nudging and with the diagnostic submodels D14CO, DRADON, S4D, and VISO switched off, is used throughout this work. All disk output channels were turned off for timing purposes.
Experimental set-up for JUDGE scalability test out-of-the-box.
The EMAC strong scaling was benchmarked with different numbers of processors
on JUDGE in order to determine the run-time behaviour of the total
application. As shown in Fig. 3, the application scales up to 384 processes
(16 nodes
Wall time for 1 simulated day vs. the number of nodes on JUDGE.
System set-up details for the analysis done on the JUDGE system.
While the number of processors used for the distributed-memory part of the code is limited by the scalability of the non-local representation of global dynamical processes in ECHAM, the local processes in MESSy running independently of their neighbours scale very well. The MESSy subsystem has not been designed for multi-threaded execution, though, and contains non-local code due to characteristics of the physical processes and algorithmic design decisions. Some physical processes are naturally modelled in a column-based approach, because they are strongly dependent on the system states at vertically adjacent grid points, e.g. sunlight intensity at lower grid points depending on the absorption at higher grid points, and precipitation depending on the flux of moisture from vertically adjacent grid cells. Sub-models simulating these processes consequently rely on the column structure implemented in the current model.
In the existing distributed-memory parallel decomposition, the three-dimensional model grid is split horizontally using two run-time parameters, setting the number of processes in the latitudinal and longitudinal directions. As work is distributed independently for each direction, a rectangular decomposition is obtained. In ECHAM5, two such rectangular sets of grid points, symmetric with respect to the Equator and balancing the load distribution between the Earth's hemispheres, are assigned to one processor. Further, the first transposition in the dynamical core reorders the variables assigned to a processor to rectangular stripes extending along the full latitudes. Assigning the two run-time parameters mentioned above so that the rectangular sets are elongated in the latitudinal direction reduces the inter-processor communication in the first transposition.
The physical load imbalance, caused by photo-chemical processes in the lower stratosphere and natural and anthropogenic emissions (lightning, soil bacteria, fuel combustion–motor vehicles, industrial, and utility), appears in the run time spent for each grid point when examining the benchmark calculations. In Fig. 4 the maximal MECCA kernel execution wall time for one grid point in each column differs by up to a factor of 4. The load imbalance is caused by the adaptive time-step integrator solving the differential equations that describe the chemical equations computed in the MECCA submodel.
Column-maximum MECCA kernel execution wall time in microseconds for a single time step. The adaptive time-step integrator shows a non-uniform run time caused by stratospheric photochemistry and natural and anthropogenic emissions.
At high levels of parallelization, the load imbalance becomes a limiting
factor, and the factors determining scalability in absolute numbers in Fig. 5
are both communication and computation. For the ECHAM phase (blue), when
scaling to beyond eight nodes, the communication demands of the underlying
spectral model involving several all-to-all communication patterns start to
dominate. Computing time for MESSy is still decreasing and the computing time
for MECCA decreases more strongly than MESSy (note the logarithmic scale and
the distinction of MECCA from MESSy). For reference, the MareNostrum 3
heterogeneous compute nodes each have 2
Run time of each phase of EMAC, when running on MareNostrum 3.
In Fig. 6 the point at which communication and computation require equal times around 8 nodes is clearly apparent; this ultimately limits the maximum number of cores that can be used with high efficiency and 16 nodes (equivalent to 256 cores) are commonly used in production runs of the EMAC atmospheric model as a scientific application to balance efficiency and total required wall time.
Percentage of run time of each phase of EMAC, when running on MareNostrum 3.
The EMAC model atmospheric chemistry code (MECCA) was taskified using OmpSs
(Bueno et al., 2011, 2012; Duran et al., 2011; Sainz et al., 2014)
directives. OmpSs allows the user to specify inputs and outputs for blocks of
code or functions, giving enough information to the Nanos
The new integration subroutine performs identical functions to the kpp_integrate subroutine but on unique elements:
The statements COPY_INOUT and COPY_IN refer to the OmpSs specification
and declare memory dependence and declare task dependencies. These
dependencies are used by the Nanos
The new version of EMAC, running ECHAM with MPI processes and MECCA with shared-memory OmpSs tasks, outperforms the old EMAC using pure MPI, and continues to scale beyond the region where the original implementation scaling performance plateaus. This can be seen in Fig. 7, which shows the performance using multi-threading on the DEEP Cluster.
Performance of OmpSs threading in the DEEP Cluster.
In DEEP, OmpSs has been extended to support offloading tasks to remote nodes (Beltran et al., 2015). This mimics the behaviour of other accelerator APIs that move data from the host to the device, compute in the device, and return the results to the host. However, OmpSs adds two very important features: (i) it allows offloading to remote nodes, not just locally available coprocessors/accelerators, which is a key functionality to effectively use the Booster; and (ii) it allows use of the Booster as a pool of coprocessors, so tasks can be offloaded to any Booster node with enough free cores. The latter enables one to eliminate the load imbalance caused by sunlight gradients in MECCA. The second source of imbalance by heterogeneous reactions is also automatically alleviated by the dynamical load balance using the massive parallelization in the Booster. Our proposed solution is agnostic to the specific origin of load imbalance in the chemistry calculation.
In a shared-memory taskification the data are already shared between threads, and no memory copies are necessary. However, in DEEP, to leverage the Booster, these data have to be copied to the Booster nodes. Keeping that in mind, the new task-based MECCA implementation was optimized and the memory and network footprint of the distributed-memory offloading were reduced by 3 orders of magnitude. To minimize the memory footprint for offloaded tasks, the number of computational grid elements issued to MESSy is further split into individual elements for each task, by rearranging the grid-point arrays in each time step to implement data locality at the grid-point level, resulting in a reduction of the total memory footprint from 2.7 MB down to 6.3 KB for each task. This was the result of refactoring both the data and code structures in MECCA.
In order to create multiple threads integrating the chemical equations simultaneously, all data describing the state of a single grid point have been identified and separated into local variables. Using these variables an integrator kernel has been created that can be offloaded onto threads running on the main processor or hardware accelerators. The integrator kernel needs the variables describing the local state of the atmosphere and the integrator parameters defining the solution of the chemical equations. As the chemical mechanism is compiled by the Kinetic PreProcessor (KPP) (Damian et al., 2002), some module variables contain the data of a set of grid points that is accessed using the index K inside the integrator. In order to maintain the code for the integration of different chemical mechanisms created by the KPP compiler, these input variables are passed in their original size together with the column grid-point index K:
The keywords INOUT and IN refer to the OmpSs specification and declare task
dependencies. ONTO sets the offload target – left blank it targets any
available accelerator device. The input arrays have been reduced in the
dimension of NBL (see below), and now carry the memory of a single grid
point. The only two-dimensional array left is conc, and only one slice (k,:)
is passed to and from containing the concentrations of all chemical species.
Prior to passing each element all two-dimensional arrays above are transposed
in the outermost dimension, to have sequential memory regions, as in the
Fortran language specification. The compiler directives are implemented as
pragmas (comments in the source code), automatically controlled with
definitions at compile time (being invoked only if SMP/Mercurium is present);
thus, no additional user input is required. The kernel data sizes depend on
parameters in the EMAC environment
NBL
When offloading tasks, to maximize massive parallelization, the parameter NBL
is reduced at run time to NBL
At a typical parallelization of 256 processes for the less-scaling spectral
parts of the model, each MPI process creates NBL
Additionally, the distributed-memory offloading code was redesigned to exploit shared memory within the Xeon Phi many-core processors by nesting an OmpSs shared-memory region within Cluster-to-Booster tasks encompassing a variable, run-time-defined number of individual grid-point calculations. The run-time parameter can be tweaked to change the total memory required by the number of grid points per task to fit within the device shared memory, to minimize run-time overhead (avoiding many tasks spawned over the network, which has a huge penalty). Thus, the number of tasks to be sent to the Booster can be controlled and optimized for each architecture, and host-specific configuration allows for optimum task size based on bandwidth, reducing task communication overheads.
With this approach, the specifics of the DEEP system architecture, and in particular the hardware present in MIC coprocessors, are exploited by massively parallelizing the chemistry calculations at the grid-point level and offloading to the Booster, exposing a significant amount of thread parallelism. At the same time the load imbalance observed in MECCA is automatically alleviated through OmpSs' dynamic load balancing by selecting a sufficiently fine task size and decoupling the model-domain location of the grid point from the task execution on the physical CPU.
The computational kernel of MECCA conceptually accesses several single elements of the global arrays describing the state of the atmosphere. In order to create data locality, in a first step the module variables used in the current version of MECCA to represent the model state were refactored within the mecca_physc function to be passed explicitly to the kpp_integrator routine that computes the chemical equations of one vector of grid points. Subsequently, to increase memory adjacency, some fields were transposed from the memory order optimized for vectorization along longitudes as given by the ECHAM core model to arrays with adjacent elements representing chemical concentrations and reaction rate constants. In a third step, the computation of the equations of a single grid point were encapsulated in a new kpp_integrator_kernel function that calls the functions update_rconst and integrate, which are created automatically by the KPP compiler. This innermost routine is called in a loop over all grid points in a vector along the virtual longitude defined by the ECHAM infrastructure:
By maintaining the interfaces both to the MESSy infrastructure and the code implementing the KPP integrator, the changes to the EMAC code base were kept minimal, and a transfer of the changes to other MESSy submodels using the KPP integrator should be feasible.
At the time of writing this paper, the DEEP Booster is in the bring-up phase, and is not available to users. In order to project the performance of the full DEEP system, Xeon-based measurements on the DEEP Cluster were combined with Xeon Phi-based measurements on MareNostrum 3. The “Pure MPI” nominal time on the DEEP Cluster for all phases (red line), the time for each phase, and the theoretical performance when offloading to the Booster are shown in Fig. 8. The DEEP Cluster reference data weighted by the relative factors for each phase derived from the metrics measurements exhibit a performance maximum for the base model (ECHAM) and MESSy (excluding MECCA) at eight nodes (light blue line), representing a good estimate for the optimal parallelization of that phase on the Cluster. This estimate of 375 s per simulated day for the low-scaling Cluster phases was used to extrapolate the attainable performance (green line), merging this result at eight nodes, with the Xeon Phi data retrieved from MareNostrum 3, where benchmarks using one node had been run with varying numbers of processing elements within one Xeon Phi processor.
Time per simulated day in DEEP using a pure MPI approach, and a
theoretical performance with offloading to Xeon Phi, based on the metrics
collected on MareNostrum 3. The theoretical MPI
While the number of Booster nodes required to attain similar performance to
the original distributed-memory based implementation corresponds to regular
accelerator architectures with individual boosters directly attached to
cluster nodes, the projected DEEP performance (green line) scales beyond the
optimal performance achieved so far. The EMAC atmospheric chemistry global
climate model seems therefore well suited to exploit an architecture
providing considerably more hardware acceleration than provided by regular
systems. To find the attainable performance time
The ECHAM/MESSy Atmospheric Chemistry (EMAC) global climate model is used to study climate change and air quality scenarios. The EMAC model is constituted by a non-local dynamical part (ECHAM) with low scalability, and local physical (MESSy) and chemical (MECCA) processes with high scalability. The model's structure naturally suits the DEEP architecture using the Cluster nodes for the non-local part and the Booster nodes for the local chemistry processes. Different implementations of the code's memory and workload divisions were developed and benchmarked to test different aspects of the achievable performance on the proposed architecture. The use of the OmpSs API largely frees the programmers from implementing the offloading logic and, given that EMAC is developed and used in a large community working on all aspects of the model, can facilitate adoption of the concept in the MESSy community.
The chemistry mechanism was taskified at the individual grid-point level using OmpSs directives. The chemistry code was refactored to allow for memory adjacency of vector elements. The OmpSs taskification with remote offload allows for massive task parallelization and the implementation of optional two-stage offload to control Cluster–Booster task memory size and optimum bandwidth utilization.
The computational load imbalance arising from a photochemical imbalance is alleviated at moderate parallelization by assigning grid points with differing run times to each process and distributing the load over all processes. Due to the physical distribution of sunlight this load balancing does not require an explicit algorithm at moderate parallelization; instead, the implicit assignment of the model grid in rectangular blocks suffices for this purpose. At higher numbers of processors this implicit load-balancing decreases and the resulting load imbalance has to be solved by active balancing. The dynamic scheduling provided by the OmpSs run-time system balances the computational load without a possible, but expensive, prediction for the current time step.
With these approaches, the specifics of the DEEP system architecture, and in particular the hardware present in MIC coprocessors, can be exploited by massively parallelizing the chemistry calculations at the grid-point level and offloading to the Booster, exposing a significant amount of thread parallelism. At the same time the load imbalance observed in MECCA will be automatically alleviated through dynamic load balancing by minimizing the individual task size to one grid box and decoupling the model-domain location from the task execution on the physical CPU, and transferring it to any available core on the Booster.
Benchmark projections based on available hardware running the DEEP software stack suggest that the EMAC model requires the large numbers of Xeon Phi accelerators available in the DEEP architecture to scale beyond the current optimal performance point and exploit Amdahl's law with the highly scalable grid-point calculations while capitalizing on the high performance and fast communication for the spectral base model on Intel Xeon processors.
In the longer term, when the optimal speedup of the chemical kinetics computation is approached, the overall parallel speedup of EMAC will be limited by the remaining non-accelerated submodels. To achieve further speedup, additional computationally expensive submodels like the aerosol submode GMXe could be addressed. GMXe is column-bound and calls KPP up to four times during one time step (for grid-/subgrid-scale liquid/ice clouds), with load imbalance caused by the distribution of clouds over the model domain. In principle, the MECCA implementation is directly applicable to the case of SCAV. The actual performance gain would be dependent on the exact set-up and remains to be tested.
The changes proposed in this paper are expected to contribute to the eventual adoption of MIC accelerated architectures for production runs, in presently available implementations of Earth system models, towards exploiting the potential of a fully Exascale-capable platform.
The Modular Earth Submodel System (MESSy) is continuously further developed
and applied by a consortium of institutions. The usage of MESSy and access to
the source code are licensed to all affiliates of institutions which are
members of the MESSy Consortium. Institutions can become a member of the
MESSy Consortium by signing the MESSy Memorandum of Understanding. More
information can be found on the MESSy Consortium website
(
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287530. Edited by: D. Roche Reviewed by: two anonymous referees