Towards European-Scale Convection-Resolving Climate Simulations with GPUs : A study with COSMO 4 . 19

The representation of moist convection in climate models represents a major challenge, due to the small scales involved. Using horizontal grid spacings of O(1km), convection-resolving weather and climate models allow to explicitly resolve deep convection. However, due to their extremely demanding computational requirements, they have so far been limited to short simulations and/or small computational domains. Innovations in supercomputing have led to new hybrid node designs, mixing conventional multi-core hardware and accelerators such as graphics processing units (GPUs). One of the 5 first atmospheric models that has been fully ported to these architectures is the COSMO model (Consortium for Small-Scale Modeling). Here we present the convection-resolving COSMO model on continental scales using a version of the model capable of using GPU accelerators. The verification of a week-long simulation containing winter storm Kyrill shows that, for this case, convection-parameterizing simulations and convection-resolving simulations agree well. Furthermore we demonstrate 10 the applicability of the approach to longer simulations by conducting a three-month long simulation of the summer season 2006. Its results corroborate the findings found on smaller domains such as more credible representation of the diurnal cycle of precipitation in convection-resolving models and a tendency to produce more intensive hourly precipitation events. Both simulations also show how the approach allows for the representation of interactions between synoptic-scale and meso-scale atmospheric circulations at scales ranging from 1000 to 10 km. This includes the formation of sharp cold frontal structures, 15 convection embedded in fronts and small eddies, or the formation and organization of propagating cold pools. Finally we assess the performance gain from using heterogeneous hardware equipped with GPUs relative to multi-core hardware. With the COSMO model, we now use a weather and climate model that has all the necessary modules required for real-case convection-resolving regional climate simulations on GPUs.

Abstract.The representation of moist convection in climate models represents a major challenge, due to the small scales involved.Using horizontal grid spacings of O(1km), convection-resolving weather and climate models allows one to explicitly resolve deep convection.However, due to their extremely demanding computational requirements, they have so far been limited to short simulations and/or small computational domains.Innovations in supercomputing have led to new hybrid node designs, mixing conventional multi-core hardware and accelerators such as graphics processing units (GPUs).One of the first atmospheric models that has been fully ported to these architectures is the COSMO (Consortium for Small-scale Modeling) model.
Here we present the convection-resolving COSMO model on continental scales using a version of the model capable of using GPU accelerators.The verification of a week-long simulation containing winter storm Kyrill shows that, for this case, convection-parameterizing simulations and convectionresolving simulations agree well.Furthermore, we demonstrate the applicability of the approach to longer simulations by conducting a 3-month-long simulation of the summer season 2006.Its results corroborate the findings found on smaller domains such as more credible representation of the diurnal cycle of precipitation in convection-resolving models and a tendency to produce more intensive hourly precipitation events.Both simulations also show how the approach allows for the representation of interactions between synopticscale and meso-scale atmospheric circulations at scales ranging from 1000 to 10 km.This includes the formation of sharp cold frontal structures, convection embedded in fronts and small eddies, or the formation and organization of propagat-ing cold pools.Finally, we assess the performance gain from using heterogeneous hardware equipped with GPUs relative to multi-core hardware.With the COSMO model, we now use a weather and climate model that has all the necessary modules required for real-case convection-resolving regional climate simulations on GPUs.

Introduction
The inadequate representation of clouds and moist convection represents a major challenge of state-of-the-art climate models (Stevens and Bony, 2013).An important component of the problem are the scale interactions between small-scale turbulent and convective processes at scales around and below 1 km, and larger-scale/meso-scale weather systems at scales around O(10-1000 km).Within these scale interactions, individual convective cells may organize into mesoscale weather systems such as squall lines or meso-scale convective systems.Current global and regional climate models typically operate at grid spacings on the order of 10-300 km, and are thus unable to explicitly represent many of these interactions.
In conventional models, convective processes need to be treated with subgrid-scale parameterization schemes, which entail major uncertainties in the representation of clouds and precipitation (Randall et al., 2003).These uncertainties not only raise concerns about the model's abilities to represent the associated feedback processes (Hohenegger et al., 2009), but also regarding uncertainties in climate change projections (Bony et al., 2015).
Refining the model resolution to the kilometer scale allows omitting the parameterization of deep convection, since at this resolution the associated processes can be represented explicitly.In the last decades, this approach has successfully been followed in idealized studies (e.g., Weisman et al., 1997) and for numerical weather prediction purposes (e.g., Benoit et al., 2002).Convective processes are then represented much closer to first principles and thus allow for an improved skill in quantitative precipitation forecasting (Mass et al., 2002;Richard et al., 2007), and ultimately for an improved representation of the water cycle.Recent studies have applied this approach to limited-area climate modeling: in their decade-long, regional simulations over England and the Alps, Kendon et al. (2012) and Ban et al. (2014) found significant improvements in the representation of sub-daily precipitation events over land, in particular regarding rainfall intensity, duration, spatial extent, as well as the timing of the diurnal cycle of precipitation, especially for high precipitation percentiles (Ban et al., 2015).Following promising validation, decade-long simulations for climate scenarios have been conducted (Kendon et al., 2014;Ban et al., 2015).
While the convection-resolving approach shows very promising results, turbulent and convective motions are still under-resolved (Wyngaard, 2004).Grid spacings of O(1 km) are comparable to the size of the particularly energetic convective eddies in the planetary boundary layer (Zhou et al., 2014).At this resolution, shallow clouds still need to be parameterized, and deep-convective clouds tend to be too large, too laminar, too vicious and too widely spaced apart (Clark et al., 2016).Using numerical simulations of an idealized squall line, Bryan et al. (2003) showed that a horizontal grid spacing of 250 m and below is needed to accurately predict the details of deep convection.Associated with this limitation is a high sensitivity of condensation processes with respect to grid spacing (Bryan and Morrison, 2012).However, Langhans et al. (2012) found that large-scale bulk properties of atmospheric convection, such as moisture and temperature tendencies, converge at a grid spacing of about 2-4 km.Their findings indicate that for real-case simulations, kilometerscale resolution is often sufficient, provided the focus is on bulk properties and feedbacks rather than the structure of the convective clouds.While this type of convergence addresses the precipitation process within convective systems, Langhans et al. (2012) did not address issues related to radiative cloud feedbacks.
Convection-resolving simulations have proven to be very useful tools for climate simulations and numerical weather prediction (Mass et al., 2002;Lean et al., 2008;Attema et al., 2014).However, the fine grid spacing and small time steps involved represent a major challenge for current supercomputers, in particular for large spatial domains and long time-scales.Therefore, climate simulations with convectionresolving resolution have so far been limited to comparatively small domains (Knote et al., 2010;Kendon et al., 2012;Prein et al., 2013;Ban et al., 2014).On the global scale, this challenge is even more ambitious (Wehner et al., 2008(Wehner et al., , 2011;;Palmer, 2014).Nevertheless, the exponential growth in compute power led to a number of computational breakthroughs for global simulations: Miyamoto et al. (2013) demonstrated a 12 h long global simulation at a grid spacing of 870 m, Miura et al. (2007) performed a week-long simulation with a horizontal grid spacing of 3.5 km, recently Skamarock et al. (2014) performed a 20-day-long simulation with a horizontal grid spacing of 3 km, and Bretherton and Khairoutdinov (2015) simulated several months on a near-global aquaplanet at a grid spacing of 4 km.While these efforts portray the limit of what is achievable today, they also illustrate the benefits of global model formulations that overcome convection parameterization schemes.
Although designed for a wide range of potential applications, high-performance computers are not necessarily optimal for convection-resolving atmospheric models (Donofrio et al., 2009;Bauer et al., 2015).This gap can, to a large degree, be tied to the scaling properties of different types of algorithms: while the arithmetic intensity (ratio of floatingpoint operations to total data movement) increases linearly for many dense-linear-algebra operations, it remains low for the stencil computations typically found in the dynamical cores of atmospheric models (Schulthess, 2015).Consequently, the stencil operations, which are commonly used in the most time-consuming parts of the code (the dynamical cores), are usually limited by the available memory bandwidth rather than the potentially available floating-point performance (Christen et al., 2008;Gysi et al., 2015).
In the last years, electrical energy constraints for supercomputers have led to heterogeneous computer architectures that involve conventional multi-core hardware as well as attached accelerators such as graphics processing units (GPUs).For weather and climate models, GPUs are particularly interesting because their "parallelism is substantial" and because they "prioritize throughput over latency" (Owens et al., 2008).Hence, they have the potential to close the performance gap needed for more extensive convectionresolving simulations.Other proposals of new computer architectures useful for weather and climate modeling are other accelerators such as the Intel Xeon Phi architecture or FPGAaccelerators (Deest et al., 2016), custom chips (Donofrio et al., 2009), or inexact hardware (Düben et al., 2014).
Multiple efforts to port existing weather and climate codes to GPUs have been undertaken: with their pioneering work on the Weather Research and Forecast (WRF) model, Michalakes and Vachharajani (2008) demonstrated the applicability of the approach to weather and climate codes.An effort that has meanwhile been continued by Mielikainen et al. (2012) and others.Similar attempts have been made by Shimokawabe et al. (2010) to accelerate the next version of the ASUCA dynamical core of the Japan Meteorological Agency or Demeshko et al. (2013) that report on a GPU implementation of the nonhydrostatic icosahedral atmospheric model (NICAM) shallow water module.A team at the National Oceanic and Atmospheric Administration (NOAA) have demonstrated promising performance increases for the dynamical core of the non-hydrostatic icosahedral model (NIM) and are now working towards porting the NIM physics package (Henderson et al., 2011;Govett et al., 2014).In the large-eddy simulation domain, Schalkwijk et al. (2015) have fully ported the Dutch Atmospheric Large-Eddy Simulation (DALES) model to GPUs allowing also on-the-fly visualization.Since the introduction of general purpose GPU computing, substantial speedups have been reported for dynamical cores, physics and diagnostics and adapted techniques for inter-node communication have been outlined.However, although some of the models have been used for real-case weather simulations (Schalkwijk et al., 2015), they usually did not include the full suite of parameterizations or were driven by a vertical profile rather than by time-dependent lateral boundary conditions.A proof of concept of a climate simulation using a production quality model, computed on heterogeneous architectures, has not yet been accomplished.
In this study we demonstrate the capabilities of GPUaccelerated simulations in the area of regional climate simulations, addressing week-and month-long simulations on a European-scale computational domain.We use a new version of the COSMO (Consortium for Small-scale Modeling) model enabled for GPUs (Fuhrer et al., 2014).In contrast to other projects discussed above, this model executes all the code required for the time stepping on GPUs (dynamics, physics and diagnostics), including the halo exchange at subdomain boundaries.Execution of the entire time stepping algorithm on the accelerators is essential, in order to minimize expensive data movements between the central processing unit (CPU) and the accelerator.The code developments have recently been integrated into the operational numerical weather prediction suite at MeteoSwiss (operating with a grid spacing of 1 km) and will soon become available to the wider COSMO community.
Using results from week-long and season-long simulations, we assess the applicability of the convection-resolving COSMO model on continental scales.We start by presenting an outline of the methodology (Sect.2).In terms of results, we provide insights into simulated meso-scale features such as the formation of narrow cold frontal rain bands, the evolution of diurnal convection over Europe during the summer season, and the role of propagating cold pools in the initiation of convective cells (Sects.3 and 4).Afterwards we discuss the performance gained from using GPUs for real-case simulations (Sect.5) and finally conclude the study (Sect.6).

Model description
This study utilizes a refactored version of the COSMO weather and climate model (based on version 4.19).The version is capable of running on heterogeneous hardware architectures (Fuhrer et al., 2014).The COSMO model is a non-hydrostatic limited-area model that solves the fully compressible governing equations using finite difference methods on a structured grid (Steppeler et al., 2003;Förstner and Doms, 2004).It employs a split-explicit third-order Runge-Kutta discretization to integrate the variables forward in time (Wicker and Skamarock, 2002) and is discretized on a rotated latitude-longitude grid using terrain-following surfaces.The horizontal advection scheme is a fifth-order upwind scheme, and in the vertical direction an implicit Crank-Nicholson scheme (Baldauf et al., 2011) is used.The multi-dimensional advection of scalar fields is implemented using the onedimensional (1-D) Bott scheme (Bott, 1989) with time splitting (Schneider and Bott, 2014).The resulting model is suitable for weather and climate simulations with spatial resolution ranging from 50 km to the kilometer scale.
The physical parameterizations used in this study include a radiative transfer scheme based on the δ-two-stream approach (Ritter and Geleyn, 1992), a single-moment bulk cloud-microphysics scheme that uses five species (cloud water, cloud ice, rain, snow, and graupel; Reinhardt and Seifert, 2005), the multilayer soil model TERRA_ML (Heise et al., 2006) with eight active soil layers with varying layer thicknesses between 1 cm and 7.48 m and a total soil depth of 15.24 m.Furthermore a turbulent-kinetic-energy-based parameterization is used in the planetary boundary layer (PBL), and for surface transfer (Mellor and Yamada, 1982;Raschendorfer, 2001).In addition and depending upon resolution, sub-grid convection is parameterized using an adapted version of the Tiedtke mass-flux scheme with moistureconvergence closure (Tiedtke, 1989).

Enabling COSMO on heterogeneous architectures
The approach to port COSMO to heterogeneous hardware architectures with GPUs is as follows (Fig. 1): the most compute intensive module (the dynamics) has been rewritten in C++, using the Stencil Loop Language (STELLA; Gysi et al., 2015).STELLA is an embedded domain-specific language, specialized for computing stencils on structured grids.It allows for aggressive low-level architecture-specific performance optimizations and the use of platform-specific programming models, while maintaining a single code syntax at higher levels of the code.During code compilation, the stencil templates are then translated to an implementation specific to the target architecture.
For the physics, diagnostics, and most of the handling of the lateral boundary conditions, a less disruptive approach has been chosen (Lapillonne and Fuhrer, 2014).Here execution and data movement is organized using OpenACC (2011) compiler directives.Directives are instructions specifying additional guidance to the pre-processor or the compiler.OpenACC directives allow a programmer to mark kernels (the body of a loop) that can be off-loaded from a host CPU to an attached accelerator, and also organize their execution as well as data movement between CPU and accelerator.Although directives grant less flexibility to optimize for a specific hardware architecture (for instance changing the loop and storage order), they allow to mostly retain the existing FORTRAN code, and make it possible to port large portions of code quite fast.
In large simulations, the computational domain is usually split into smaller sub-domains (domain decomposition).The data exchange required at the sub-domain boundaries (i.e., halo exchange) is handled using a re-usable communication framework.It guarantees performance portability across different high-performance computing architectures by leveraging the capabilities of the Generic Communication Library (Bianco, 2012).Similar to STELLA, this library abstracts the complicated pathways that move data through heterogeneous machines.With this approach, the time stepping runs entirely on accelerators.This property is fundamental to a fast performance, as the memory footprint of the prognostic variables in the simulations to be presented amounts to 96 bytes per grid point.Moving such a large footprint each time step (between CPU and GPU), while only performing a comparatively small amount of floating-point operations per transfer, would be prohibitively expensive.In other words, the memory transfer between GPU and CPU is simply too slow to make back and forth transfers worthwhile at each time step.The modules written in C++ and FORTRAN are integrated by a C++ interface, which provides FORTRAN bindings.For a detailed outline of the software engineering approach of the COSMO-GPU port please see Fuhrer et al. (2014).

Model setup
The model is used in two configurations (Fig. 2): the first configuration uses parameterized shallow and deep convection at a grid spacing of 12 km and a domain size of 355 × 355 × 60 grid points (CTRL12) with a time step of 90 s.
The second configuration has a convection-resolving horizontal grid spacing of 2.2 km and 1536 × 1536 × 60 grid points (CTRL2) with a time step of 20 s.In this configuration, the deep-convection parameterization is switched off, but the shallow-convection scheme remains active.Here, the parameterized fraction of (shallow) convective clouds is nonprecipitating and has a maximum vertical extent of 250 hPa, while deep convection is treated explicitly.Following the recommendations by Baldauf et al. (2011), in CTRL2 the Mellor-Yamada asymptotic length scale in the PBL parameterization is reduced by a factor of 2.5 to calibrate the triggering of convection.In both models, the vertical direction is discretized using 60 stretched model levels from the surface to the model top at 23.5 km.The respective layer thickness widens from 20 m at the surface to 1.2 km near the model top.Aside from the domain size, we generally follow the setup defined in Ban et al. (2014).
The CTRL12 domain spans about 4300 × 4000 km and thereby covers most of continental Europe, including the Mediterranean.The domain for the CTRL2 simulation is approximately 500 km smaller than the CTRL12 domain (on each side), but still covers most of western and central Europe (Fig. 2).The necessary initial and boundary conditions for the CTRL12 simulation are derived from the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA-Interim reanalysis (Dee and Uppala, 2011) and are updated every 6 h.Using two-step one-way nesting, the results from the CTRL12 simulation are subsequently used to derive boundary conditions for CTRL2 at an hourly interval.The lateral boundary relaxation zone has a width of 9 and 25 grid points for CTRL12 and CTRL2, respectively.
The analysis domain excludes grid columns close to or within the relaxation zone (50 km distance to the CTRL2 boundary) and contains 1536 × 1536 grid points (2900 × 2900 km 2 ).It should therefore be large enough for small-scale processes to fully develop (Leduc and Laprise, 2009;Brisson et al., 2016).Additionally, a simulation with a grid spacing of 50 km and a time step of 300 s has been performed (CTRL50).Apart from the horizontal resolution and the associated time step, it has the same setup as CTRL12.This simulation portrays the current generation of highresolution global climate models.

Numerical experiments
Here we present results for two model integrations: a weeklong winter case with strong synoptic forcing, and a seasonal integration of a summer case that is characterized by a rather weak synoptic forcing.For the winter case, the model chain has been initialized on 16 January 2007, 00:00 UTC and integrated for 7 days until 23 January 2007, 00:00 UTC.For the summer case, the 2 km simulation is initialized on 1 May 2006 and integrated until the end of August.
To provide adequately spun-up soil moisture fields for the summer 2006 simulation, the soil layers in CTRL12 have been initialized on 1 May 2001 based on the soil-moisture fields from the CCLM EURO-CORDEX simulation (Kotlarski et al., 2014), and thereafter integrated for 5 years.Subsequently, CTRL2 has been initialized on 1 May 2006 and integrated for 4 months.The analysis period for this simulation has been defined as 1 June to 31 August 2006 (JJA), leaving 1 month of CTRL2 integration for spin-up.
For the summer 2006 simulation, we also tested a parameter calibration based on the findings of Bellprat et al. (2016).They demonstrated a pronounced reduction of the summer warm bias by introducing objectively calibrated values of eight model parameters.Application of their calibration to the current setup of CTRL2 resulted in a reduction (domain-mean, all land points) of the warm bias by about 0.7 • C (see Supplement Fig. S1) and an increase of the seasonal mean-precipitation by about 0.2 mm day −1 (see Supplement Fig. S2).
For the winter case, the model chain has been initialized on 16 January 2007 and integrated for 7 days.As winter simulations are less sensitive to soil conditions, the initial soil and snow data were directly taken from ERA-Interim.The initial 36 h of the simulation are considered spin-up and are not analyzed in any detail.

Visualization of clouds
To visualize clouds, we combine bulk-diagnostic cloud fractions into a brightness B. During the model integration, the 3-D cloud fields are aggregated onto three 2-D cloud-fraction fields (low f lc , mid f mc and high clouds f hc ).This step dramatically reduces the data volume.Essentially the cloud fraction diagnostic provides a three-variable summary of the cloud-covered fraction of a grid cell.
The conversion of the three cloud fractions into one single brightness is accomplished by using four calibrated parameters m as follows: in a first step B srf is assigned a surface brightness value based on the underlying land-cover (m srf over land and 0 over sea): Next, each cloud layer is assumed to saturate the brightness up to a specified value.For instance, if f mc = 1, then the respective grid point is assigned a mid-level brightness of B m = m mc .Using linear relations, the three cloud layers are then successively stacked on each other from bottom to top, while also taking into account the brightness of the lower layers.This is The final quantity, i.e., B = B h , is meant to mimic a brightness that can qualitatively be compared with satellite images.To this end, the parameters m are calibrated as follows: An alternative method to visualize clouds is to compute synthetic brightness temperatures during model integration through the use of a forward radiative transfer model.In the COSMO model, the RTTOV (Radiative Transfer for TOVS) satellite simulator is being used for this task (Keil and Reinhardt, 2006).Unfortunately, this functionality is not yet available in the GPU version used in this study.However, the visualization employed yields satisfactory results and qualitatively compares reasonably against full RTTOV visualization.A visual comparison of the pseudo-synthetic satellite www.geosci-model-dev.net/9/3393/2016/Geosci.Model Dev., 9, 3393-3412, 2016 images (showing B as defined above) and the synthetic RT-TOV images can be found in the Supplement (Fig. S3).
3 Week-long simulation of winter storm Kyrill In January 2007 the devastating winter-storm Kyrill passed over northern Europe.While often referred to as "Kyrill", the storm actually was a sequence of two extratropical cyclones.Based on a backward trajectory analysis Fink et al. (2009) outlined that the first storm (Kyrill I) emerged from a "cold front located underneath the eastward side of an upperlevel longwave trough over northeastern Arkansas (USA)" on 14 January 2007.Traversing the North Atlantic, the storm underwent rapid cyclogenesis while crossing the jet-stream from the warm to the cold-side.On 18 January at 00:00 UTC a second storm (Kyrill II) formed on the occluded front of Kyrill I northwest of Ireland.Ludwig et al. (2015) describe the dynamical forcing leading to the Kyrill II cyclogenesis as an interaction between frontolytic strain acting on a low-level potential vorticity filament of the occluded front of Kyrill I, and a developing upper-level shortwave trough.In a series of sensitivity experiments, they also determined that the diabatic heating processes between 800 and 500 hPa posed an additional crucial ingredient for the cyclogenesis of Kyrill II.Using a convection-parameterizing 25 km COSMO setup, they show that latent heating accelerated cyclogenesis, and also increased the core pressure drop by 10-15 hPa.These studies show that close interaction between upper-level divergence and low-level baroclinicity, but also that diabatic processes, were key in its development.A useful feature of this episode was the presence of "a remarkable pressure gradient of more than 70 hPa [. . .] between [. . .] the North Sea and [. . .] the Iberian Peninsula" (Fink et al., 2009), imposing a particularly strong synoptic forcing.

Kyrill evolution
The overall surface development of Kyrill II on 18 January is as follows (Fig. 3): the Kyrill II storm develops along a pronounced baroclinic zone northwest of Ireland around 00:00 UTC, rapidly propagates over the UK, intensifies in the North Sea (around 12:00 UTC), before reaching the continent (around 18:00 UTC).In the simulations, the surface pressure and 2 m temperature fields of CTRL2 (and CTRL12) exhibit small-scale wave patterns, in particular in the vicinity of mountainous areas.We interpret these features as gravity-wave signals and small-scale temperature variations associated with the underlying topography.Note that Fig. 3 displays the variables in their native resolution without any applied smoothing and hence some additional artifacts may be present, due to reducing surface-pressure to mean-sea-level pressure.
While the overall solutions of CTRL12 and CTRL2 agree well with ERA-Interim, there are also differences present, which are worth pointing out.Shortly after cyclogenesis, the horizontal temperature gradient along the warm front is steeper in both simulations, although the core pressure in the reanalysis is lower.During cyclogenesis of Kyrill II and while passing the British Isles (at 12:00 UTC), the simulations show an additional small low-pressure system to the west of the Norwegian coast.The simulations expose two separate low-pressure centers, while in ERA-Interim the 974 hPa contour encloses both.Consequently, the spatial extent of Kyrill II appears to be smaller.For a visualization of the simulated storm, with high temporal resolution; see the video by Leutwyler et al. (2015a).
Figure 4 compares the cyclone's surface core pressure of our simulations (CTRL12 and CTRL2) with two simulations of Ludwig et al. (2015) (LW25 and LW7), and with the ERA-Interim reanalysis.It should be noted that the observational reference is rather weak, as this was a rapidly developing small-scale cyclone.In comparison to ERA-Interim, all four simulations exhibit a higher initial surface pressure (lowest local minimum) at the time of cyclogenesis and maintain it until the intensification phase starts around 10:00 UTC (Fig. 4).Then the simulations exhibit a core pressure drop of about 9 to 12 hPa in 7 h, significantly below the ERA-Interim estimate.While this behavior is rather distinct from the evolution in ERA, the four simulations qualitatively agree.However, LW25 and LW7 show a recovery towards the ERA-Interim values when Kyrill II makes landfall, while in our simulations core pressures below 960 hPa prevail until the storm exits the domain.An additional simulation with the CTRL12 configuration, but using the same domain as LW25 (with a smaller computational domain), followed the core pressure recovery of LW25.This could be indicative of ERA assimilating some data from within our computational domain, data which were not reflected in our lateral boundary conditions.
As shown by Ludwig et al. (2015), the case is strongly sensitive with respect to latent heating.Weaker latent heating rates reduce the core pressure drop and delay cyclogenesis.Since we generally observe more precipitation in CTRL2 than in CTRL12 (domain-mean precipitation rate in Fig. 5 (top-middle) CTRL12: 2.7 mm day −1 , (top-right) CTRL2: 3.55 mm day −1 ), we consequently also expect deeper core pressures.

Precipitation along cold fronts
On 18 January 2007, 18:00 UTC, the storm center of Kyrill II is located east of Denmark and its attached, elongated cold front spans over northern Germany and France (Figure 5).As expected from the increased resolution, CTRL12 reveals a higher level of detail than CTRL50, stronger gradients, and smaller precipitating regions of higher intensity.CTRL2 also exhibits a consistent large-scale structure of clouds and pre- cipitation, and particularly heavy (explicitly treated) convective activity along and in the vicinity of the cold front.With increasing resolution, noticeable changes in the meso-scale structure can be found close the storm core (Fig. 5, bottom panels).In CTRL12, the front is split into successive precipitation bands with maximum precipitation rates up to 20 mm h −1 .CTRL2 additionally features small-scale embedded convection located in the vicinity and along the front, and a more coherent organization.The frontal rainbands (precipitation > 5 mm h −1 ) are typically 30-40 km wide in CTRL12, and substantially narrower in CTRL2 (8-10 km).
The narrow cold-frontal rainbands seen in the bottom-right panel of Fig. 5 are of distinctly convective origin.They are associated with precipitation rates > 20 mm h −1 , located on the leading edge of the fronts, and aligned with the cold front in an oblique angle.These systems have been extensively dis-  cussed in the literature (Houze, 2014), and studied using (airborne) radar (e.g., Jorgensen et al., 2003).We expect differences in location and intensity, due to the ability of CTRL2 to explicitly resolve the underlying dynamical processes.A display of another cold-frontal passage can be found in the Supplement (Fig. S4); similar to Fig. 5 (bottom panels), a narrowing and strengthening of the cold-frontal rainbands can be observed.

Representation of a meso-scale vortex
In all three simulations a meso-scale vortex can be inferred from the bend in the 850 hPa geopotential height contour (Fig. 6), which is located behind the cold front to the northeast of the UK on 17 January 2007, 12:00 UTC.Here the increased resolution enables a more coherent organization of convective cells.In particular downstream of the vortex, CTRL12 produces many isolated precipitating grid points, while CTRL2 shows well-developed signs of organization and wrap up (Fig. 6, top row).The vertically integrated distribution of hydrometeor mass (rain, snow and graupel; Fig. 6, middle row) is spatially more confined in CTRL2 and thus testifies the role of significant grid-scale updrafts, while in CTRL12 significant grid-scale hydrometeor loads can only be identified at the precipitation maximum (13 • W, 55 • W).The temperature fields at 850 hPa also reveal a consistent picture (Fig. 6, bottom row).While CTRL2 exhibits a distinct wrap-up structure, an eddy-like pattern can hardly be identified in the lower-resolution simulations.Additionally, the diagrams reveal small-scale superimposed anomalies stemming from diabatic heating.In CTRL2 they are arranged in a circular fashion around the eddy core, while they are much less organized in the convection-parameterizing simulations.
It should be stressed that a thorough validation is here not attempted for several reasons.First, as can be deduced from alternate simulations that were initialized 6, 12, 18, and 24 h hours earlier (not shown), the predictability of this particular small-scale vortex is very low.Second, as the current version of COSMO-GPU lacks a GPU-enabled version of the RT-TOV (Keil and Reinhardt, 2006), a thorough validation with satellite pictures would be dubious.However, the preference of CTRL2 to form small-scale vortex-like features (that wrap up) is very common in the simulation discussed.Consistent with these results McInnes et al. (2011) found that decreasing the grid spacing to the kilometer scale improves their representation of polar lows.Among other things, they found a more realistic wind field and a more realistic distribution of convection.

Seasonal simulation of the summer 2006
Persistent large-scale anticyclonic flow was the dominant circulation pattern in Europe during the summer season 2006.Strong diurnal convection and thunderstorms could be observed, and July 2006 was the hottest month measured in Europe to this date (Rebetez et al., 2009).Not only for that reason this month has been the subject of some previous studies (e.g., Hohenegger et al., 2009).During such anticyclonic episodes, the lateral boundary conditions typically exert less control on the atmospheric circulation, and local drivers become more important.In these situations, regional climate models (RCM) can develop flow patterns that substantially deviate from the driving model.In order to test the model also under these conditions, a 3-months-long simulation of this episode was conducted.

Seasonal statistics
An over prediction of summer temperature is a long standing issue for COSMO-CLM and other RCMs, in convection-parameterizing (Kotlarski et al., 2014) as well as in convection-resolving setups (Ban et al., 2014).Validation of the CTRL2 average summer 2 m temperature (JJA), using the ENSEMBLES E-OBS data set (v. 10) as observational reference (Haylock et al., 2008), shows that this behavior is still an issue (Fig. 7a).After accounting for differences in topography and spatial resolution (height correction with a lapse-rate of 0.65 K/100 m), the resulting domainmean warm bias amounts to about 1.4 • C, with the largest biases in northern Africa and eastern Europe.The large warm biases in northern Africa and eastern Europe are known RCM biases and not only related to model biases, but also to data sparsity in the verification data sets (Kotlarski et al., 2014;Panitz et al., 2014, and references therein).
The spatial distribution of precipitation is well captured (Fig. 7b-c).Simulated precipitation over elevated topography is much larger than the observed precipitation, but this is, at least partly, related to the sparse observational network used to create the E-OBS precipitation data set (Hofstra et al., 2009;Isotta et al., 2015;Prein and Gobiet, 2016).This ob-Geosci.Model Dev., 9, 3393-3412, 2016 www.geosci-model-dev.net/9/3393/2016/servational bias is also attenuated by the biased distribution of rain gauges, which are predominantly located in valleys where precipitation is typically much smaller than at elevated locations.In addition, the systematic rain gauge undercatch has not been corrected in the available data sets.
Overall, there is an evident increase of precipitation with the model resolution.This increase is also reflected in the domain average land precipitation, which is 2.1 mm day −1 in CTRL2, 1.9 mm day −1 in CTRL12, and 1.8 mm day −1 for E-OBS.
While the spatial distribution of precipitation agrees well between CTRL12 and CTRL2, their behavior on the subdaily timescale is fundamentally different (Fig. 8): The different timing of the diurnal cycle (Fig. 8a) is remarkable.While the convection-parameterizing CTRL12 simulation is already at its peak around noon, precipitation in CTRL2 is still building up and peaks only later in the afternoon.Furthermore, daily maximum precipitation is higher in CTRL2 (Fig. 8c) throughout the event spectrum, and also produces larger hourly precipitation maxima (Fig. 8b).It has previously been shown for smaller domains (Kendon et al., 2012;Ban et al., 2014) that the behavior of the convectionresolving model fits observation much better in terms of subdaily precipitation.Our results are qualitatively consistent with these studies, although the differences in daily precipitation statistics are larger for our simulation.Note, however, that Ban et al. (2014) considered the statistics from 10 summers, while here only one summer season is considered.A more detailed validation of precipitation will be conducted www.geosci-model-dev.net/9/3393/2016/Geosci.Model Dev., 9, 3393-3412, 2016   in a subsequent study using a 10-year-long climate simulation.
A snapshot on a typical summer day at noon illustrates how the different precipitation event distributions come about (Fig. 9 and Leutwyler et al., 2015b).In the cloud field of CTRL2, the convective cells are visible as highreaching, initially circular, cloud features.In CIRL12, on the other hand, the convection-parameterization scheme adjusts the vertical stability of the atmosphere before grid-scale convective motions can develop, and consequentially convective cells are absent.In CTRL12 precipitation is characterized by widespread drizzle and light rain (mostly below 2 mm h −1 ).In contrast, CTRL2 shows smaller isolated cells and convective cores with an intensity above 10 mm h −1 .This behavior of CTRL2 leads to higher hourly peak amounts, as noted in Fig. 8.However, peak precipitation in CTRL2 is later during the day.

Propagating cold pools
Cold-pools are formed by cold negatively buoyant air, stemming from evaporation of falling hydrometeors.The associated downdrafts penetrate into the planetary boundary layer and locally enhance the variability of the moisture, temperature, and wind fields.Their role in the initiation and organization of deep convection over land has been studied using radar observations (Lothon et al., 2011;Dione et al., 2014) as well as convection-resolving and large-eddy simulations (Tompkins, 2001;Grabowski et al., 2006;Khairoutdinov and Randall, 2006;Boing et al., 2012;Schlemmer and Hohenegger, 2014).While flat semi-arid regions provide an ideal environment for the formation of large cold pools, they are less pronounced in more heterogeneous areas, such as in continental Europe.
The use of high-resolution models provides a tool to study cold pools in heterogeneous areas.Here we focus on the subdomain indicated by the red box in Fig. 9.At 12:00 UTC a few small precipitating cells are present.An hour later, at 13:00 UTC, the cells have grown in size and a number of them exhibit signatures typical for cold pools (Fig. 10).For instance in the vertical wind field at the 900 hPa level, circular downdrafts, surrounded by a ring of updrafts, appear below precipitating convective cells.They overlap with distinct local temperature minima.In the subsequent snapshots, at 13:30 and 14:00 UTC, the cold pools grow in size and some of the cells start to develop strong dry downdrafts.At the same time, the anvil clouds are expanding.Further details of the convection-resolving simulation are illustrated in a video (Leutwyler et al., 2016).
In order to assess whether new cells are triggered along propagating cold pools, a subjective tracking of cold-pool signatures is applied.To this end, the convective cells, visible in the snapshots taken at 13:30 and 14:00 UTC, which are not present in the previous snapshot, have been marked with a black circle in the left-most panels.Subsequently, the same locations have been marked in the respective vertical wind panel of the previous snapshot.It can be seen that a number of black circles are co-located with the moving edges of the cold pools, but also with convergence lines and topographic features.The co-location of new cells and leading cold-pool edges confirms that propagating cold pools are relevant for the initiation of new convective cells in convection-resolving simulations, also in complex terrain.The results are qualitatively consistent with large-eddy simulation results found in idealized studies (Schlemmer and Hohenegger, 2014).As expected, no corresponding signatures have been found in the CTRL12 simulation (not shown).

Computational requirements
What are the computational requirements to perform a convection-resolving simulation on the European scale?Here we restrict the analysis to two key performance metrics: 1. Strong scaling: the achievable time to solution for a fixed simulation domain, fixed grid spacing, and domain size, while increasing the computational resources.For linear scaling, the time to solution will increase inverse proportionally to the computational resources, allocated to the problem.Here the time step (which is constrained by the grid spacing through the Courant stability criterion) can be kept constant and hence the computational task has a fixed size 2. Weak scaling: the achievable time to solution when the domain size is increased proportionally with the computational resources, while keeping the grid spacing and the time step fixed.For linear scaling, the time to solution would remain the same for all domain sizes.
The weak-scaling approach used here is slightly different for weak-scaling experiments with global simulations, because in these experiments the domain size cannot be varied, except by shrinking and expanding the size of the planet.In some global experiments the grid spacing is varied while keeping the time step constant at the value required by the simulation with the finest grid (e.g., Wehner et al., 2011).
Finally, we assess the performance gain from using GPUs with respect to conducting simulations on multi-core hardware.Here we test a single code version that is able to run on different hardware architectures (with and without GPUs).In contrast to Fuhrer et al. (2014), we use a real-case climate configuration close to what has been described in Sect.2.4, also accounting for disk input and output.
On a distributed memory system, the problem considered here needs to be split into smaller chunks and hence messages have to be communicated across the network.In COSMO, this is achieved by decomposing the simulation domain along the horizontal dimensions.This domain decomposition yields a communication pattern where four messages are transferred to the four neighboring compute nodes: north, south, east and west.When a computation is distributed onto an increasing number of nodes, the ratio between the amount of computation per node and the amount of information exchange with neighboring nodes decreases.In a simple performance model, the speedup from parallelization will saturate towards a theoretical value and is proportional to the square root of the number of sub-domains and a machine constant (Wehner et al., 2008).On most multi-core hardware, this limitation creates a lower bound on the time to solution, which can be achieved for strong scaling.On heterogeneous hardware equipped with GPUs, the end of strong scalability may be reached earlier (Fuhrer et al., 2014).

Setup of scaling experiment
The full strong-scaling experiment corresponds to a 24 h simulation on a domain of 1536 × 1536 × 60 grid points.Input for this simulation consists of the lateral boundary conditions at hourly resolution, amounting to about 120 GB for the whole simulation.Additionally, an output workload consisting of about 6 GB is written to the file system.All performance results have been obtained on a heterogeneous Cray XC30 system, located at the Swiss National Supercomputing Centre (CSCS) in Lugano, Switzerland (Piz Daint).The Piz Daint supercomputer consists of a heterogeneous node architecture with an eight-core Intel Xeon E5-2670 CPU and an NVIDIA Tesla K20X GPU per node (Fig. 11), and Cray's Aries interconnect using a three-level dragonfly topology to connect the compute nodes.To normalize the performance metrics, they are defined as per socket.In the case of our  A socket is the electrical component that provides the connection between the circuit board and the chip sitting on top of it.The advantage of the per-socket metric is its flexibility across architectures, which also allows for comparing with individual sockets on a multi-GPU node (fat node).On a fat node, a socket still hosts only a single GPU chip, even if multiple GPU sockets are installed on a PCI express card or on a node.However, for the node configuration found in Piz Daint, this metric is a bit unfair towards the multi-core systems, since GPUs (today) still need an accompanying CPU hosting the operating system and instructing the GPU.With the socket-based metric, we do not account for that additional CPU.Another metric would be node-to-node comparison, assuming that a node can either consist of one CPU and a GPU, or two CPUs.For such a configuration, the second option would be fairer for the multi-core architecture.In general, node-to-node comparison is useful to compare the various possible node configurations one may find in a supercomputer.However, we believe that for the current study the per-socket performance metric is more useful than nodeto-node comparisons, also because presently fat nodes are commercially available.
The new version of the COSMO code can be executed on both, multi-core CPUs and GPUs, and hence allows comparison between the two architectures.However there is a small difference in the way the domain decomposition is done.Currently, the parallelization strategy, implemented in the COSMO-CPU version, makes use of distributed memory by leveraging the Message Passing Interface (MPI Forum, 2015).Using an entire eight-core CPU socket in Piz Daint therefore requires placing eight MPI tasks on a node.In contrast, the COSMO-GPU version allows for using shared

Results of scaling experiments
In the first experiment (strong scaling), the time to solution for a 24 h simulation on 1536 × 1536 grid columns, distributed among an increasing number of sockets, is measured (Fig. 12, left-hand panel).The time to solution for execution on the multi-core hardware decreases approximately linearly up to 900 sockets.Towards the end of the curve, at 1760 sockets, inter-node communication starts to limit the speedup the additional CPU-sockets provide.Execution on the GPU shows saturation already at 256 sockets, which corresponds to 4096 grid columns per socket.Consequently when using GPUs, a larger number of grid points per socket is needed to efficiently utilize the hardware (Little, 1961;Gysi et al., 2016).A similar behavior was found by Fuhrer et al. (2014) in their experiments using the same model, but with periodic boundary conditions and without I/O.They found a linear scaling behavior for experiments with more than 128 × 128 grid columns per socket, but also early saturation, as the workload per socket decreases.In the current study we reach the upper memory limit of the sockets earlier and therefore are not able to reproduce the linear strong scaling regime they found.We nevertheless find a significant speedup when using GPUs.For our reference setup with 128 × 128 grid columns per socket (as used in Sects.3 and 4), we measured a speedup of about a factor of 3.6.For a similar time to solution with the conventional multi-core architecture, 4.9 times more sockets would be needed.
In the second experiment (weak scaling), the number of grid columns per socket is kept constant (at 128 × 128 when using GPUs), while proportionally increasing the domain size and the number of sockets (Fig. 12, right-hand panel).Execution on the CPU and the GPU both show only a slight upward trend for the time to solution.Since the performance for the physics and dynamics modules as well as data copy (to and from the GPU memory) mostly stays constant, the trend is likely related to the increase in the amount of data written to disk as the domain size increases.In the GPU version disk input/output is done in a synchronous manner, meaning that the model integration is stopped during filesystem access.This limitation has already been addressed in a later version of the COSMO model, and in the future we will use asynchronous I/O.For now the time-compression ratio (simulation period divided by time to solution) for our reference setup (1536 × 1536 grid columns distributed onto 144 nodes) is 1 : 70 with model output and 1 : 90 without.

Assessment
Based on these benchmarks we now assess the feasibility of a large convection-resolving climate modeling experiment, using the same domain as EURO-CORDEX (Jacob et al., 2014).This model inter-comparison and climate projection effort consists of contributions from multiple RCMs.It involves a 20-year evaluation experiment driven by reanalysis, as well as a 55-year control experiment and for each emission scenario considered a 94-year-long transient scenario simulations driven by a global climate model (GCM).For a simulation with 2.2 km grid spacing, the EURO-CORDEX domain consists of roughly 2300 × 2300 × 60 grid points.On this domain, the COSMO GPU version would yield a time-compression ratio of about 1 : 60 when executed on 324 Nvidia K20x sockets and about 1 : 20 when 324 Intel E5-2670 sockets are used.Projecting the time-compression ratios on the EURO-CORDEX experiment yields a time to solution of about 4 months for the 20-year-long evaluation period, 11 months for the 55-year-long control experiment and 1.6 years for each of the 94-years-long transient scenario simulations.
For climate simulations, the required operational timecompression ratios are more relaxed than in operational weather forecasting.While the workflow for weather forecasting typically imposes strict time-to-solution constraints, the required throughput for climate simulations is governed by more practical matters such as the duration of a project.In this regard, imposing a maximum time-to-solution constraint of 3 months would entail a time-compression ratio of 1 : 500 to accomplish a transient convection-resolving climate experiment.For comparison, in their assessment of global convection-resolving models, Wehner et al. (2008) impose a much tougher constraint of 1 : 1000.However, their CMIP5-type experiments (Taylor et al., 2012) also involve a large simulation ensemble and hundreds of years of simulation for ocean spin-up.Given the 1 : 500 constraint, our model would require an additional speedup of about a factor of 8-10 to meet the required time-compression constraints for an extensive CORDEX-type experiment.
The above results indicate that, for the COMSO model, using GPU accelerators, permits to perform multi-year, convection-resolving simulations on large, continental-scale domains.However, for century-long simulations at the current resolution, or for simulations with finer grid spacing (and decreased time step), further performance improvements are needed.We suggest that future work should focus on trying to push the strong scalability further and thus to reduce the time required to update a grid point by one time step, for example by exposing more parallelism (in the vertical, across modules in the code, by asynchronous execution of independent work).Another interesting application in the RCM domain would be to increase the time step (at coarser grid spacing) and downscale a large number of GCM scenario realizations.At the 12.5 km grid spacing, used in the EURO- What does this mean for global simulations?For an idealized setup, Fuhrer et al. (2014) demonstrated perfect linear weak scaling behavior of the COSMO-GPU version up to 2000 nodes.So let us assume linear weak scaling (essentially neglecting input/output from/to a file system) and availability of the entire Piz Daint supercomputer (5272 hybrid Cray XC30 nodes).At a grid spacing of 2.2 km, it should technically be possible to extend the domain size to cover about half of Earth's surface, while still retaining the same time to solution as demonstrated above.At a grid spacing of 2.8 km the whole planet could be covered.The analysis indicates that decadal global convection-resolving atmospheric simulations are feasible today on large dedicated supercomputing systems, provided the code scales similar as the regional COSMO model used in the current study.More specifically, COSMO could in principle be scaled up to a global latitude-longitude configuration and supplemented with the additional code to deal with the poles.
Whether the above assessment can be transferred to GCMs, also depends upon the time-stepping algorithm and its implementation.Many global non-hydrostatic models invoke semi-Lagrangian or spectral approaches (where the total communication costs increases faster than the number of grid points).In such cases, linear weak scaling will be more difficult to achieve than with the current split-explicit scheme.

Conclusions and outlook
The last decades have seen a tremendous increase in the complexity of the memory and compute architecture in high-performance computers.Until about a decade ago, many weather and climate supercomputing centers featured shared-memory, vector, and massively parallel processes machines.Following this there has been a trend towards hierarchical systems with distributed memory (for Piz Daint, this is visualized in Fig. 11).This trend has further been reinforced by the introduction of heterogeneous supercomputers exploiting accelerators.Given the technical and economical drivers behind this process, this trend will likely continue into the future (Schulthess, 2015).It is thus essential, that weather and climate models are enabled to make use of these systems.
In this study, the applicability of the convection-resolving climate simulation approach has been demonstrated on European scales with a new version of the COSMO weather and climate model, capable of running on GPUs.Both convection-parameterizing and convection-resolving simulations have been considered, at resolutions of 12 and 2 km, respectively.Validation of a week-long simulation of the winter storm Kyrill showed a high level of agreement between the two simulations regarding the synoptic and meso-alpha-scale development and their patterns of clouds and precipitation.However, simulations also exhibit significant differences in terms of frontal rainbands and precipitation statistics.
Such differences are even more pronounced in the simulation of the summer season 2006.The simulation revealed a very different character of summer convection for the simulation with resolved convection.The precipitation field of the convection-resolving 2 km simulation, showed high precipitation rates over small areas, while the convectionparameterizing 12 km simulation showed low precipitation rates over larger areas.A comparison of the diurnal cycle of precipitation of the convection-parameterizing and the convection-resolving simulations showed that the results found in previous studies also apply to European-scale domains.That is, convection-parameterizing simulations have a distorted diurnal cycle with a precipitation peak around noon, while the convection in the 2 km simulation peaks only in the late afternoon.The step to resolved convection also dramatically reduces the hourly frequency of light precipitation.
The simulations also demonstrated how the approach allows for the representation of interactions between synopticscale and meso-scale atmospheric circulations at scales ranging from 1000 to 10 km.Three examples of such interactions were discussed: narrow cold frontal rain bands, small vortices over the ocean in winter, and the formation and organization of propagating cold pools in complex terrain.Note that, although we highlighted individual meso-scale systems, we did not verify their realism, structure, and evolution in much detail.These results illustrate some advantages of formulating weather and climate models closer to physical first principles and portrays the benefits of using continental-scale domains for convection-resolving models.
A substantial speedup of the simulations was realized when executing COSMO on GPU accelerators.However, at least for our hardware environment, a minimum of 128 × 128 × 60 grid points per GPU were required to have sufficient work available and efficiently utilize the hardware.With the current code and the current generation of GPUs, century-long convection-resolving simulations (or further increasing the resolution) will still be challenging.For now, the GPU version of COSMO enables us to increase the size of the computational domains for decade-long simulations, or to perform experiments with a large number of ensemble members at lower resolution for centennial simulations.
Our next simulation target is a 10-year-long reanalysisdriven simulation covering the time period 1999-2008, using the same setup as in the current study.This simulation has already been completed and will be analyzed in a subsequent study.It allows for a more robust validation with observational data sets.Together these simulations will serve as a proof of concept and demonstrate that convectionresolving climate simulations are feasible on continental scales.Once established, such simulation capabilities will enable investigations of continental-scale climate feedbacks, sensitive to the treatment of deep convection, or assembling model-climatologies of interactions between convective meso-/small-scale and synoptic-scale systems.

Code and data availability
The particular version of the COSMO model used in this study is only a prototype and will be discontinued soon.However, the code developments are currently in the process of being re-integrated into the mainline COSMO version and will soon be available to the wider research community.COSMO itself may be used for operational and for research applications by the members of the consortium.Moreover, within a license agreement, the COSMO model may be used for operational and research applications by other national (hydro-) meteorological services, universities and research institutes.The model output encompasses 15 TBytes and is available upon request.
The Supplement related to this article is available online at doi:10.5194/gmd-9-3393-2016-supplement.

Figure 1 .
Figure 1.Workflow of the COSMO model on GPUs.Boundary conditions, physics, diagnostics and I/O have been ported using OpenACC (blue).Dynamics and Halo updates have been rewritten in C++ (green).

Figure 2 .
Figure 2. Integration domains and model topography.The outermost black box shows the domain of the convection-parameterizing simulation with grid spacing of 12 km, and the bolder inner box that of the convection-resolving simulation with 2.2 km grid spacing.The sub-domain used in the analysis is indicated.The two smaller black boxes indicate the domains used in two state-of-theart convection-resolving climate simulations over the southern UK and the Greater Alpine Region.

Figure 3 .
Figure 3. Snapshots of the Kyrill II winter storm in ERA-Interim (left column), CTRL12 (middle column) and CTRL2 (right column) in their native resolution.The shading denotes raw 2 m temperature [ • C] and the black contours mean sea-level pressure [hPa].The contour-level spacing is 4 hPa.

Figure 4 .
Figure4.Core pressure evolution of the Kyrill II wind storm from 18 January 2007, 00:00 UTC onwards.The black dots represent the 6-hourly ERA-Interim data, the blue diamonds the CTRL12 and the blue squares CTRL2.The green diamonds show the storm core pressure for the 25 km grid-spacing simulation (LW25) performed inLudwig et al. (2015), and the green squares their simulation with a horizontal grid spacing of 7 km (LW7).

Figure 5 .
Figure 5. Snapshot of Kyrill II on 18 January 2007, 18:00 UTC.The colored shading indicates the rain rate [mm h −1 ], the white shading a cloud cover visualization (section 2.4.1), and the white contours geopotential height at 850 hPa [gpdm] using a line spacing of 4 gpdm; (left) CTRL50 simulation, (middle) CTRL12 simulation, and (right) CTRL2 simulation.The red boxes in the left-hand column denote zoomedareas.An animation of this episode is available on the internet(Leutwyler et al., 2015a).

Figure 6 .
Figure 6.Representation of a meso-scale low with increasing grid spacing; (left column) CTRL50, (middle column) CTRL12 and (right column) CTRL2.(Top) colored shading indicates the rain rate [mm h −1 ], the white shading the cloud cover visualization, and the white contours geopotential height at 850 hPa [gpdm], (middle) vertically integrated sum of snow and graupel hydrometeors [mm m −2 ], and (bottom) temperature at 850 hPa [ • C].For the location of the meso-scale low please consult Fig. S4 in the Supplement.

Figure 7 .
Figure 7. Validation of CTRL2 seasonal means: (a) temperature bias [K], (b) simulated precipitation [mm day −1 ], and (c) observations from E-OBS.To account for differences in topography and spatial resolution the model 2 m temperature has been height corrected assuming a lapse rate of 0.65 K/100 m, before calculating the bias.

Figure 8 .
Figure 8.(a) Average diurnal cycle of precipitation over land, (b) cumulative frequency-intensity distributions of daily-maximum 1 h precipitation, and (c) daily precipitation.

Figure 9 .
Figure 9. Summertime convection over continental Europe.Snapshots on 13 June 2006, 12:00 UTC from (top) CTRL2 and (bottom) CTRL12.The colored shading denotes the 15 min precipitation [mm h −1 ], and the grey shading a cloud visualization.The red box denotes a zoomed area used in Fig. 10.An animation of this display is available on the internet (Leutwyler et al., 2015b).

Figure 10 .
Figure 10.Three consecutive snapshots showing (left) precipitation rate [mm h −1 ] and clouds, (middle) vertical wind at 900 hPa [m s −1 ] and terrain contours [100 m contour], (right) temperature at 900 hPa [ • C].The domain corresponds to the red rectangle in Fig. 9.The black circles in the vertical wind figures denote locations of new convective cells in the next snapshot.In the succeeding snapshot the same convective cells are marked in the left column.

Figure 11 .
Figure 11.Heterogenous node architecture in Piz Daint.Each heterogenous node contains an eight-core Intel Xeon E5-2670 CPU and an Nvidia K20X GPU with 15 Streaming Multiprocessors (SMX).Each SMX contains 192 single precision compute cores and 64 double precision compute cores.

Figure 12 .
Figure 12.Time to solution for a 24 h simulation for execution on CPUs architecture (red) and on the same number of GPUs (blue).(Left) strong scaling for a domain with 1536 × 1536 × 60 grid points.The number of sockets is increased while keeping the problem size fixed.(Right) weak scaling with a per-node size of 128 × 128 × 60 grid points.Increasing problem size while keeping the grid points per socket constant.Contributions form several modules: (orange) dynamics, (dark green) physics, (light green) input/output, and (light blue) data copy to and from accelerator.
D.Leutwyler et al.:Towards European-scale convection-resolving climate simulations www.geosci-model-dev.net/9/3393/2016/Geosci.Model Dev., 9, 3393-3412, 2016 memory (see Sect. 2.2), while inter-node communication is still based on MPI and distributed memory.Hence, executing COSMO on a node with one GPU, requires placing a single MPI task on each node.Accordingly, the number of MPI tasks on a fat node would correspond to the number of GPU sockets it contains.