Coupling of the regional climate model COSMO-CLM using OASIS3-MCT with regional ocean, land surface or global atmosphere models: description and performance

We present the prototype of a regional climate system model based on the COSMO-CLM regional climate model coupled with several model components, analyze the performance of the couplings and present a strategy to find an optimum configuration with respect to computational costs and time to solution. The OASIS3-MCT coupler is used to couple COSMO-CLM with two land surface models (CLM and VEG3D), a regional ocean model for the Mediterranean Sea (NEMO-MED12), two ocean models for the North and Baltic Seas (NEMO-NORDIC and TRIMNP+CICE) and the atmospheric component of an earth system model (MPI-ESM). We present a unified OASIS3-MCT interface which handles all couplings in a similar way and minimizes the model source code modifications, and we describe the physics and numerics of the couplings. Furthermore, we discuss solutions for specific regional coupling problems such as the handling of different domains, the multiple usage of the MCT interpolation library and the efficient exchange of 3D fields. A series of real-case simulations over Europe has been conducted and the computational performance of the couplings has been analyzed. The usage of the LUCIA tool of the OASIS3-MCT coupler enabled a separation of the direct costs of coupling, load imbalance and additional computations.

The resulting limits for time to solution and cost are shown and the potential of further improvement of the computational efficiency is summarized for each coupling. It was found that the OASIS3-MCT coupler keeps the direct coupling cost of communication and horizontal interpolation below 5 % of the extra cost of coupling for all investigated couplings. For the first time this could be demonstrated for an exchange of approximately 450 2D fields per time step, necessary for the atmosphere-atmosphere coupling between COSMO-CLM and MPI-ESM.
A procedure for finding an optimum configuration for each of the couplings was developed considering the time to solution and cost of the simulations. The optimum configurations are presented for sequential, concurrent and mixed (sequential+concurrent) coupling layouts. The procedure applied can be regarded as independent of the specific coupling layout and coupling details.

Introduction
Most of the current Regional Climate Models (RCMs) lack frameworks for the interactivity between the regional atmosphere and the other components of the climate system, such as the ocean, the land surface and the large-scale atmosphere. The neglected meso-scale feedbacks and inconsistencies of the boundary conditions (Laprise et al., 2008; Becker et al., 2015) might well account for a substantial part of the large- and regional-scale biases found in RCM simulations at 10-50 km horizontal resolution (see e.g. Kotlarski et al. (2014) for Europe). This hypothesis gains further evidence from the results of convection-permitting simulations, in which these interactions are not considered either. These simulations provide more regional-scale information and improve e.g. the precipitation distribution in mountainous regions, but they usually do not show a reduction of the large-scale biases (see e.g. Prein et al. (2013)).
Besides various improvements, a significant increase of the climate change signal was found by Somot et al. (2008) in the ARPEGE model with the horizontal grid refined over Europe and two-way coupled with a regional ocean model for the Mediterranean Sea. These results strongly suggest that building Regional Climate System Models (RCSMs) with explicit modeling of the interaction between meso-scales in the atmosphere, ocean and land surface and large scales in the atmosphere (and ocean) is necessary to consistently represent regional climate dynamics and gain further insights into regional climate change. Dennis et al. (2012) showed that a cost reduction by a factor of three or less can be achieved using an optimal layout of model components. Later, Alexeev et al. (2014) presented an algorithm for finding an optimum model coupling layout (concurrent, sequential) and processor distribution between the model components minimizing the load imbalance in CESM.
These results indicate that the optimized computational performance is weakly dependent on the computing architecture or on the individual model components but depends on the coupling method.
Furthermore, the application of an optimization procedure was found beneficial.
In this study we present a detailed analysis of the performance of coupled COSMO-CLM systems on the IBM POWER6 machine Blizzard located at DKRZ, Hamburg. We calculate the speed and costs of the individual model components and of the coupler itself and identify the causes of reduced speed or increased costs for each coupling as well as reasonable processor configurations. We suggest an optimum configuration for different couplings considering costs and speed of the simulation and discuss the current and potential performance of the coupled systems. Particularities of the performance of a coupled RCM are highlighted together with the potential of the coupling software OASIS3-MCT. We suggest an optimization procedure for an RCSM which can be generalized. However, we will show that some relevant optimizations are possible only due to features available with the OASIS3-MCT coupler.
The paper is organized as follows: The coupled model components are described in section 2.
Section 3 focuses on the OASIS3-MCT coupling method and its interfaces for the individual couplings. The coupling method description encompasses the OASIS3-MCT functionality, the method of the coupling optimization and the particularities of coupling a regional climate model system. The model interface description gives a summary of the physics and numerics of the individual couplings. In section 4 the computational efficiency of the individual couplings is presented and discussed.
Finally, the conclusions and an outlook are given in section 5. For improved readability, Tables 1 and 2 provide an overview of the acronyms frequently used throughout the paper and of the investigated couplings.

Description of model components
The further development of the COSMO model in Climate Mode (COSMO-CLM) presented here aims at overcoming the limitations of the regional soil-atmosphere climate model, as discussed in the introduction, by replacing prescribed vegetation, the lower boundary condition over sea surfaces and the lateral and top boundary conditions with interactions between dynamical models.
The models selected for coupling with COSMO-CLM need to fulfill the requirements of the intended range of application, which are (1) simulation at varying scales from convection-resolving up to 50 km grid spacing, (2) local-scale up to continental-scale simulation domains and (3) full capability at least for European model domains. We decided to couple the NEMO ocean model for the Mediterranean Sea (NEMO-MED12) and the North and Baltic Seas (NEMO-NORDIC), alternatively the TRIMNP regional ocean model together with the sea ice model CICE for the North and Baltic Seas (TRIMNP+CICE), the Community Land Model (CLM) of soil and vegetation (replacing the multi-layer soil model TERRA), alternatively the VEG3D soil and vegetation model, and the global Earth System Model MPI-ESM for two-way coupling with the regional atmosphere. Table 2 gives an overview of all coupled-model systems investigated, their components and the institutions at which they are maintained. An overview of the models selected for coupling with COSMO-CLM (CCLM) is given in Table 3 together with the main model developer, configuration details of high relevance for computational performance, the model complexity (see Balaji et al. (2017)) and a reference in which a detailed model description can be found. The model domains are plotted in Figure 1. More information on the availability of the model components can be found in Appendix A.
In the following, the model components used are briefly described with respect to model history, space-time scales of applicability and model physics and dynamics relevant for the coupling.

COSMO-CLM
COSMO-CLM is the COSMO model in climate mode. The COSMO model is a non-hydrostatic limited-area atmosphere-soil model originally developed by the Deutscher Wetterdienst for operational numerical weather prediction (NWP). Additionally, it is used for climate, environmental (Vogel et al., 2009) and idealized studies (Baldauf et al., 2011).
The COSMO physics and dynamics are designed for operational applications at horizontal resolutions of 1 to 50 km for NWP and RCM applications. The basis of this capability is a stable and efficient solution of the non-hydrostatic system of equations for the moist, deep atmosphere on a spherical, rotated, terrain-following, staggered Arakawa C grid with a hybrid z-level coordinate. The model physics and dynamics are described in Doms et al. (2011) and Doms and Baldauf (2015), respectively. The features of the model are discussed in Baldauf et al. (2011). The COSMO model's climate mode (Rockel et al., 2008) is a technical extension for long-term simulations and all related developments are unified with COSMO regularly. The important aspects of the climate mode are the time dependency of the vegetation parameters and of the prescribed SSTs and the usability of the output of several global and regional climate models as initial and boundary conditions. All other aspects related to the climate mode, e.g. the restart option for soil and atmosphere, the NetCDF model in- and output, the online computation of climate quantities, the sea ice module or spectral nudging, can be used in other modes of the COSMO model as well.
The model version cosmo_4.8_clm19 is the recommended version of the CLM-Community (Kotlarski et al., 2014) and it is used for all couplings except CCLM+CLM as well as for stand-alone simulations. CCLM as part of the CCLM+CLM coupled system is used in a slightly different version (cosmo_5.0_clm1). The way this affects the performance results is presented in section 4.4.

MPI-ESM
The global Earth System Model of the Max Planck Institute for Meteorology Hamburg (MPI-ESM; Stevens et al. (2013)) consists of subsystem models for the ocean, atmosphere, cryosphere, pedosphere and biosphere.
The hydrostatic general circulation model ECHAM6 uses the transform method for horizontal computations. The derivatives are computed in spectral space, while the transports and physics tendencies are computed on a regular grid in physical space. A pressure-based sigma coordinate is used for vertical discretization. The ocean model MPIOM is a regular grid model with the option of local grid refinement. The terrestrial bio- and pedosphere component model is JSBACH (Schneck et al., 2013). The marine biogeochemistry model used is HAMOCC5 (Ilyina et al., 2013). A key aspect is the implementation of the biogeochemistry of the carbon cycle, which allows e.g. the investigation of the dynamics of the greenhouse gas concentrations (Giorgetta et al., 2013). The subsystem models are coupled via the OASIS3-MCT coupler, which was implemented recently by I. Fast of DKRZ in the CMIP5 model version. This allows a parallelized and efficient coupling of a large amount of data, which is a requirement of atmosphere-atmosphere coupling.
The reference MPI-ESM configuration uses a spectral resolution of T63, which is equivalent to a spatial resolution of about 320 km for atmospheric dynamics and 200 km for model physics. Vertically the atmosphere is resolved by 47 hybrid sigma-pressure levels with the top level at 0.01 hPa.
The reference MPIOM configuration uses the GR15L40 resolution which corresponds to a bipolar grid with a horizontal resolution of approximately 165 km near the Equator and 40 vertical levels, most of them within the upper 400 m. The North and the South Pole are located over Greenland and Antarctica in order to avoid the "pole problem" and to achieve a higher resolution in the Atlantic region.

NEMO
The Nucleus for European Modelling of the Ocean (NEMO) is based on the primitive equations.
It can be adapted for regional and global applications. The sea ice module (LIM3) or the marine biogeochemistry module with passive tracers (TOP) can be used optionally. NEMO uses staggered variable positions together with a geographic or Mercator horizontal grid and a terrain-following σ-coordinate (curvilinear grid) or a z-coordinate with full or partial bathymetry steps (orthogonal grid). A hybrid vertical coordinate (z-coordinate near the top and σ-coordinate near the bottom boundary) is possible as well (for details see Madec (2011)).
COSMO-CLM is coupled to two different regional versions of the NEMO model, adapted to specific conditions of the region of application. For the North and Baltic Seas, the sea ice module (LIM3) of NEMO is activated and the model is applied with a free surface to enable the tidal forcing.

In the Mediterranean Sea, by contrast, the ocean model runs with a classical rigid-lid formulation in which the sea surface height is simulated via pressure differences. Both model setups are briefly introduced in the following two sub-sections.

Mediterranean Sea

Lebeaupin et al. (2011), Beuvier et al. (2012) and Akhtar et al. (2014) adapted the NEMO ocean model (Madec, 2008) to the regional ocean conditions of the Mediterranean Sea, hereafter called NEMO-MED12. It covers the whole Mediterranean Sea excluding the Black Sea. The NEMO-MED12 grid is a section of the standard irregular ORCA12 grid (Madec, 2008) with an eddy-resolving 1/12° horizontal resolution, stretched in latitudinal direction, equivalent to 6-8 km horizontal resolution. In the vertical, 50 unevenly spaced levels are used with 23 levels in the top layer of 100 m depth. A time step of 12 min is used.
The initial conditions for potential temperature and salinity are taken from the Medatlas (MEDAR-Group, 2002). The fresh-water inflow from rivers is prescribed by a climatology taken from the RivDis database (Vörösmarty et al., 1996) with seasonal variations calibrated for each river by Beuvier et al. (2010) based on Ludwig et al. (2009). In this context, the Black Sea is considered as a river for which climatological monthly values are calculated from a dataset of Stanev and Peneva (2002).
The water exchange with the Atlantic Ocean is parameterized using a buffer zone west of the Strait of Gibraltar with a thermohaline relaxation to the World Ocean Atlas data of Levitus et al. (2005).

North and Baltic Seas
The horizontal resolution of NEMO-NORDIC is 2 nautical miles (about 3.7 km) with 56 stretched vertical levels. The time step used is 5 min. No fresh-water flux correction for the ocean surface is applied. NEMO-NORDIC uses a free top surface to include the tidal forcing in the dynamics. Thus, the tidal potential has to be prescribed at the open boundaries in the North Sea. Here, we use the output of the global tidal model of Egbert and Erofeeva (2002).

The lateral fresh-water inflow from rivers plays a crucial role for the salinity budget of the North and Baltic Seas. It is taken from the daily time series of river runoff from the E-HYPE model output operated at SMHI (Lindström et al., 2010). The World Ocean Atlas data (Levitus et al., 2005) are used for the initial and lateral boundary conditions of potential temperature and salinity.

TRIMNP and CICE
TRIMNP (Tidal, Residual, Intertidal Mudflat Model Nested Parallel Processing) is the regional ocean model of the University of Trento, Italy (Casulli and Cattani, 1994; Casulli and Stelling, 1998). The domain of TRIMNP covers the Baltic Sea, the North Sea and a part of the North East Atlantic Ocean with the north-west corner over Iceland and the south-west corner over Spain at the Bay of Biscay. TRIMNP is designed with a horizontal grid mesh size of 12.8 km and 50 vertical layers.

The thickness of each of the top 20 layers is 1 m; it increases with depth up to 600 m for the remaining layers. The model time step is 240 s. Initial states and boundary conditions of water temperature, salinity, and velocity components for the ocean layers are determined using the monthly ORAS-4 reanalysis data of ECMWF (Balmaseda et al., 2013). The daily Advanced Very High Resolution Radiometer (AVHRR2) data of the National Oceanic and Atmospheric Administration of the USA are used for surface temperature and the World Ocean Atlas data (Levitus and Boyer, 1994) for surface salinity. No tide is taken into account in the current version of TRIMNP. The climatological means of the fresh-water inflow of 33 rivers to the North Sea and the Baltic Sea are collected from Wikipedia. The sea ice model CICE version 5.0 is developed at the Los Alamos National Laboratory, USA (http://oceans11.lanl.gov/trac/CICE/wiki), to represent dynamic and thermodynamic processes of sea ice in global climate models (for more details see Hunke et al. (2013)). In this study CICE is adapted to the region of the Baltic Sea and Kattegat, a part of the North Sea, on a 12.8 km grid with five ice categories. Initial conditions of CICE are determined using the AVHRR2 SST.

VEG3D
VEG3D is a multi-layer soil-vegetation-atmosphere transfer model (Schädler, 1990) designed for regional climate applications and maintained by the Institute of Meteorology and Climate Research at the Karlsruhe Institute of Technology. VEG3D considers radiation interactions with vegetation and soil, calculates the turbulent heat fluxes between the soil, the vegetation and the atmosphere, as well as the thermal transport and hydrological processes in soil, snow and canopy.
The radiation interaction, the moisture and the turbulent fluxes between the soil surface and the atmosphere are regulated by a massless vegetation layer located between the lowest atmospheric level and the soil surface, having its own canopy temperature, specific humidity and energy balance. The multi-layer soil model solves the heat conduction equation for temperature and the Richards equation for soil water content. Thereby, vertically differing soil types can be considered within one soil column, comprising 10 stretched layers with the bottom at a depth of 15.34 m. The heat conductivity depends on the soil type and the water content. In case of soil freezing the ice phase is taken into account. The soil texture has 17 classes. Three classes are reserved for water, rock and ice. The remaining 14 classes are taken from the USDA Textural Soil Classification (Staff, 1999).
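To illustrate the type of equation such a multi-layer soil module integrates, the following generic Python sketch advances the 1D heat conduction equation on a stretched soil column with an explicit scheme; it is not VEG3D's actual numerics, and the layer thicknesses, diffusivity and boundary treatment are invented.

```python
import numpy as np

def step_soil_temperature(T, dz, kappa, dt, t_surface):
    """One explicit time step of the 1D heat conduction equation
    dT/dt = d/dz(kappa * dT/dz) on a layered soil column.
    T: layer temperatures (K), dz: layer thicknesses (m),
    kappa: thermal diffusivity per layer (m2 s-1)."""
    # Prescribed temperature at the surface, zero-flux condition at the bottom.
    T_ext = np.concatenate(([t_surface], T, [T[-1]]))
    dz_ext = np.concatenate(([dz[0]], dz, [dz[-1]]))
    T_new = T.copy()
    for i in range(len(T)):
        flux_up = kappa[i] * (T_ext[i] - T_ext[i + 1]) / (0.5 * (dz_ext[i] + dz_ext[i + 1]))
        flux_dn = kappa[i] * (T_ext[i + 1] - T_ext[i + 2]) / (0.5 * (dz_ext[i + 1] + dz_ext[i + 2]))
        T_new[i] += dt * (flux_up - flux_dn) / dz[i]
    return T_new

# Ten stretched layers down to 15.34 m as in the text; the thicknesses are invented.
dz = np.array([0.05, 0.1, 0.2, 0.4, 0.8, 1.5, 2.5, 3.0, 3.0, 3.79])
T = np.full(10, 283.0)          # initial soil temperature (K)
kappa = np.full(10, 7.0e-7)     # illustrative thermal diffusivity (m2 s-1)
for _ in range(96):             # one day with a 15 min time step
    T = step_soil_temperature(T, dz, kappa, dt=900.0, t_surface=278.0)
print(T.round(2))
```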
Ten different landuse classes are considered: water, bare soil, urban area and seven vegetation types. Vegetation parameters like the leaf area index or the plant cover follow a prescribed annual 265 cycle.
Up to two additional snow layers on top are created, if the snow cover is higher than 0.01 m.
The physical properties of the snow depend on its age, its metamorphosis, melting and freezing. A snow layer on a vegetated grid cell changes the vegetation albedo, emissivity and turbulent transfer coefficients for heat as well.

An evaluation of VEG3D in comparison with TERRA in West Africa is presented by Köhler et al. (2012).

Community Land Model
The Community Land Model (CLM) is a state-of-the-art land surface model designed for climate applications. Biogeophysical processes represented by CLM include radiation interactions with vegetation and soil, the fluxes of momentum, sensible and latent heat from vegetation and soil and the heat transfer in soil and snow. Snow and canopy hydrology, stomatal physiology and photosynthesis are modeled as well.
Subgrid-scale surface heterogeneity is represented using a tile approach allowing five different land units (vegetated, urban, lake, glacier, wetland). The vegetated land unit is itself subdivided into 17 different plant-functional types (or more when the crop module is active). Temperature, energy and water fluxes are determined separately for the canopy layer and the soil. This allows a more realistic representation of canopy effects than by bulk schemes, which have a single surface temperature and energy balance. The soil column has 15 layers, the deepest layer reaching 42 meters depth.
Thermal calculations explicitly account for the effect of soil texture (vertically varying), soil liquid water, soil ice and freezing/melting. CLM includes a prognostic water table depth and a groundwater reservoir allowing for a dynamic bottom boundary condition for hydrological calculations rather than a free drainage condition. A snow model with up to five layers enables the representation of snow accumulation and compaction, melt/freeze cycles in the snow pack and the effect of snow aging on surface albedo.

CLM also includes processes such as carbon and nitrogen dynamics, biogenic emissions, crop dynamics, transient land cover change and ecosystem dynamics. These processes are activated optionally and are not considered in the present study. A full description of the model equations and input datasets is provided in Oleson et al. (2010) (for CLM4.0) and Oleson et al. (2013) (for CLM4.5).
An offline evaluation of CLM4.0 surface fluxes and hydrology at the global scale is provided by Lawrence et al. (2011).
CLM is developed as part of the Community Earth System Model (CESM) (Collins et al., 2006; Dickinson et al., 2006) but it has also been coupled to other global (NorESM) or regional (Steiner et al., 2005, 2009; Kumar et al., 2008) climate models. In particular, an earlier version of CLM (CLM3.5) has been coupled to COSMO (Davin et al., 2011; Davin and Seneviratne, 2012) using a "sub-routine" approach for the coupling. Here we use a more recent version of CLM (CLM4.0 as part of the CESM1_2.0 package) coupled to COSMO via OASIS3-MCT rather than through a sub-routine call. Note that CLM4.5 is also included in CESM1_2.0 and can also be coupled to COSMO using the same framework.
3 Description and optimization of COSMO-CLM couplings via OASIS3-MCT

The computational performance, usability and maintainability of a complex model system depend on the coupling method used, the ability of the coupler to run efficiently in the computing architecture, and on the flexibility of the coupler to deal with different requirements on the coupling depending on model physics and numerics.
In the following, the physics and numerics of the coupling of COSMO-CLM with the different model components are described.

OASIS3-MCT coupling method and performance optimization
Lateral-, top- and/or bottom-boundary conditions for regional geophysical models are traditionally read from files and updated regularly at runtime. We call this approach offline (one-way) coupling.

For various reasons, one could decide to calculate these boundary conditions with another geophysical model - at runtime - in an online (one-way) coupling. If this additional model in return receives information from the first model, modifying the boundary conditions provided by the first to the second, an online two-way coupling is established. In any of these cases, model exchanges must be synchronized. This could be done by (1) reading data from file, (2) calling one model as a subroutine of the other or (3) by using a coupler, which is a software that enables online data exchanges between models.
Communicating information from model to model boundaries via reading from and writing to a file is known to be quite simple to implement but computationally inefficient, particularly in the case of non-parallelized I/O and high frequencies of disc access. In contrast, calling component models as COSMO-CLM subroutines exhibits much better performance because the information is exchanged directly in memory. Nevertheless, the inclusion of an additional model in a "subroutine style" requires comprehensive modifications of the source code. Furthermore, the modifications need to be updated for every new source code version. Since the early 90s, software solutions have been developed which allow coupling between geophysical models in a non-intrusive, flexible and computationally efficient way.
One of the software solutions for coupling of geophysical models is the OASIS coupler, which is widely used in the climate modeling community (see for example Valcke (2013) and Maisonnave et al. (2013)). Its latest fully parallelized version, OASIS3-MCT version 2.0, proved its efficiency for high-resolution quasi-global models on top-end supercomputers (Masson et al., 2012).
In the OASIS coupling paradigm, each model is a component of a coupled system. Up to OASIS3-MCT version 2.0, each component has to be included as a separate executable; with version 3.0 this is not a constraint anymore. At runtime, all components are launched together in a single MPI context. The parameters defining the properties of a coupled system are provided to OASIS via an ASCII file called namcouple. This ensures the modularity and interoperability of any OASIS-coupled system.
As previously mentioned, OASIS3-MCT includes the MCT library, based on MPI, for direct parallel communications between components. To ensure that calculations are delayed only by the receiving of coupling fields or the interpolation of these fields, MPI non-blocking sending is used by OASIS3-MCT so that sending coupling fields is a quasi-instantaneous operation. The SCRIP library (Jones, 1997) included in OASIS3-MCT provides a set of standard operations (for example bilinear and bicubic interpolation, Gaussian-weighted N-nearest-neighbor averages) to calculate, for each source grid point, an interpolation weight that is used to derive an interpolated value at each (non-masked) target grid point. OASIS3-MCT can also (re-)use interpolation weights calculated offline. Intensively tested for demanding configurations, the MCT library performs the definition of the parallel communication pattern needed to optimize the exchanges of coupling fields between each component's MPI subdomains. It is important to note that, unlike in the "subroutine coupling", each component coupled via OASIS3-MCT can keep its parallel decomposition so that each of them can be used at its optimum scalability. In some cases, this optimum can be adjusted to ensure a good load balance between components. The two optimization aims that strongly matter for computational performance are discussed in the next section.
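To make the weight-based remapping concrete, the following minimal Python sketch (not the OASIS3-MCT implementation, which is Fortran code parallelized over MPI subdomains) applies precomputed SCRIP-style weight triplets (target index, source index, weight) to a source field as a sparse matrix-vector product; all sizes and values are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_remap_matrix(dst_idx, src_idx, weights, n_dst, n_src):
    """Assemble a sparse remapping matrix from SCRIP-style weight triplets."""
    return csr_matrix((weights, (dst_idx, src_idx)), shape=(n_dst, n_src))

def remap(field_src, remap_matrix):
    """Interpolate a flattened source field to the target grid."""
    return remap_matrix @ field_src

# Toy example: 4 source points, 2 target points, bilinear-like weights.
dst_idx = np.array([0, 0, 1, 1])
src_idx = np.array([0, 1, 2, 3])
weights = np.array([0.7, 0.3, 0.5, 0.5])
M = build_remap_matrix(dst_idx, src_idx, weights, n_dst=2, n_src=4)
sst_src = np.array([284.1, 285.0, 286.2, 287.3])   # e.g. SST in K
print(remap(sst_src, M))
```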

The coupled-system synchronization and optimization
A coupled model component receiving information from one or several other components has to wait for the information before it can perform its own calculations. In the case of a two-way coupling this component also provides information needed by the other coupled-system component(s). As mentioned earlier, the information exchange is performed quasi-instantaneously if the time needed to perform the interpolations can be neglected, which is the case even for 3D-field couplings (as discussed in section 4.6). Therefore, the total duration of a coupled-system simulation can be separated, for each component, into two parts: the time needed for its own calculations (computing time) and the waiting time. In a coupled system the computing time can be shorter than in the uncoupled mode, since the reading of boundary conditions from file (in stand-alone mode) is partially or entirely replaced by the coupling. It is also important to note that components can perform their calculations sequentially or concurrently.
The coupled-system's total time to solution can be expected to be equal to the sum of the components' computing times in a sequential layout, whereas in a concurrent layout it is determined by the slowest component and the remaining waiting times. Thus, the strategy of synchronization of the model components depends on the layout of the coupling (sequential or concurrent) in order to reduce the waiting time as much as possible. It is important to note that huge differences in computational performance can be found for different coupling layouts due to the different scalability of the modular model components.
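As a schematic illustration of the two layouts, the following Python sketch estimates the time to solution per coupling step for a sequential and a concurrent layout from hypothetical per-component compute times (the numbers are invented and not measurements from this study):

```python
def sequential_time(compute_times):
    """Sequential layout: components run one after the other each coupling step."""
    return sum(compute_times)

def concurrent_time(compute_times):
    """Concurrent layout: the slowest component sets the pace; the others wait."""
    return max(compute_times)

def waiting_times(compute_times):
    """Waiting time per component in a concurrent layout."""
    slowest = max(compute_times)
    return [slowest - t for t in compute_times]

# Hypothetical per-coupling-step compute times (s) for an atmosphere and an ocean.
times = {"CCLM": 12.0, "ocean": 8.0}
print("sequential:", sequential_time(times.values()))
print("concurrent:", concurrent_time(times.values()),
      "waiting:", waiting_times(list(times.values())))
```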

Since computational efficiency is one of the key aspects of any coupled system, the various aspects affecting it are discussed. These are the performances of the model components, of the coupling library and of the coupled system. Hereby the design of the interface and the OASIS3-MCT coupling parameters, which enable an optimization of the efficiency, are described.
The model component performance depends on the component's scalability. The optimum partitioning has to be set for each parallel component by means of a strong scaling analysis (discussed in section 4.1). This analysis, which results in finding the scalability limit (the maximum speed) or the scalability optimum (the acceptable level of parallel efficiency), can be difficult to obtain for each component in a multi-component context. In this article, we propose to simply consider the previously defined concept of the computing time (excluding the waiting time from the total time to solution). In section 4 we will describe our strategy to separate the measurement of computing and waiting times for each component and how to deduce the optimum MPI partitioning from the scaling analysis.
The optimization of OASIS3-MCT coupling library performance is relevant for the efficiency of the data exchange between components discretized on different grids. The parallelized interpolations are performed by the OASIS3-MCT library routines called by the source or by the target component.
An interpolation will be faster if performed (1) by the model with the larger number of MPI processes available (up to the OASIS3-MCT interpolation scalability limit) and/or (2) by the fastest model (until the OASIS3-MCT interpolation together with the fastest model's calculations last longer than the calculations of the slowest model).

A significant improvement of the interpolation and communication performance can be achieved by coupling multiple variables that share the same coupling characteristics via a single communication, that is, by using the technique called pseudo-3D coupling. Via this option, a single interpolation and a single send/receive instruction are executed for a whole group of coupling fields, for example all levels and variables in an atmosphere-atmosphere coupling at one time instead of all coupling fields and levels separately. The option groups several small MPI messages into a big one and thus reduces communications. Furthermore, the number of matrix multiplications is reduced because the interpolation is performed on big arrays. This functionality can easily be set via the 'namcouple' parameter file (see section B.2.4 in Valcke et al. (2013)). The impact on the performance of the COSMO-CLM atmosphere-atmosphere coupling is discussed in section 4.6. See also Maisonnave et al. (2013).
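The following Python sketch illustrates the idea behind pseudo-3D coupling with a stand-in remapping matrix: stacking all 2D fields that share the same grid and coupling characteristics allows one matrix operation and one message instead of one per field. Grid sizes, the field count and the random weights are purely illustrative.

```python
import numpy as np
from scipy.sparse import random as sparse_random

n_src, n_dst = 6000, 3000        # illustrative source and target grid sizes
n_fields = 60                    # e.g. levels x variables of a 3D exchange
M = sparse_random(n_dst, n_src, density=1e-3, format="csr")  # stand-in remap matrix

# Per-field ("2D") coupling: one interpolation and one message per field.
fields_2d = [np.random.rand(n_src) for _ in range(n_fields)]
results_2d = [M @ f for f in fields_2d]          # n_fields sparse mat-vec products
messages_2d = n_fields

# Pseudo-3D coupling: all fields stacked, one interpolation and one message.
stacked = np.column_stack(fields_2d)             # shape (n_src, n_fields)
results_3d = M @ stacked                         # a single sparse mat-mat product
messages_3d = 1

assert np.allclose(np.column_stack(results_2d), results_3d)
print(f"messages per exchange: {messages_2d} -> {messages_3d}")
```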

The optimization of the performance of a coupled system relies on the allocation of an optimum number of computing resources to each model. If the components' calculations are performed concurrently, the waiting time needs to be minimized. This can be achieved by balancing the load of the two (or more) components between the available computing resources: the slower component is granted more resources, leading to an increase in its parallelism and a decrease in its computing time.

The opposite is done for the fastest component until an equilibrium is reached. Section 4 gives examples of this operation and describes the strategy to find a compromise between each component's optimum scalability and the load balance between all components.
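A minimal sketch of such a load-balancing step, assuming (unrealistically) ideal scaling of both components and using invented per-step costs, could look as follows; the real procedure described in section 4 works with measured computing and waiting times instead.

```python
def balance_cores(total_cores, core_seconds_per_step):
    """Split cores between concurrent components proportionally to their load,
    assuming ideal scaling (time ~ core_seconds / cores)."""
    total_load = sum(core_seconds_per_step.values())
    alloc = {name: max(1, round(total_cores * load / total_load))
             for name, load in core_seconds_per_step.items()}
    # Fix rounding so the allocation sums exactly to total_cores.
    names = sorted(alloc, key=alloc.get, reverse=True)
    i = 0
    while sum(alloc.values()) != total_cores:
        alloc[names[i % len(names)]] += 1 if sum(alloc.values()) < total_cores else -1
        i += 1
    return alloc

def imbalance(alloc, core_seconds_per_step):
    """Relative load imbalance: wasted (waiting) cost over total cost."""
    times = {n: core_seconds_per_step[n] / alloc[n] for n in alloc}
    slowest = max(times.values())
    wasted = sum((slowest - t) * alloc[n] for n, t in times.items())
    return wasted / (slowest * sum(alloc.values()))

# Hypothetical single-core cost per coupling step (core-seconds).
load = {"CCLM": 900.0, "NEMO": 500.0}
alloc = balance_cores(128, load)
print(alloc, f"imbalance: {imbalance(alloc, load):.1%}")
```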
On all high-performance operating systems it is possible to run one process of a parallel application on one core in the so-called single-threading (ST) mode (Fig. 2a). Should the cores and the operating system feature the so-called simultaneous multi-threading (SMT) mode, two (or more) processes/threads of the same (in a non-alternating processes distribution (Fig. 2b)) or of different (in an alternating processes distribution (Fig. 2c)) applications can be executed simultaneously on the same core. Applying the SMT mode is more efficient for well-scaling parallel applications, leading to an increase in speed on the order of 10 % compared to the ST mode. Usually it is possible to specify which process is executed on which core (see Fig. 2). In these cases the SMT mode with an alternating distribution of model component processes can be used, and the waiting time of sequentially coupled components can be avoided. Starting each model component on each core is then usually the optimum configuration, since the reduction of the cores' waiting time outweighs the increase of the time to solution compared with the ST mode, in which at each time only one process is executed on each core. In the case of concurrent couplings, however, it is possible to use the SMT mode with a non-alternating processes distribution.
The optimization procedure applied is described in more detail in section 4.3 for the couplings considered. The results are discussed in section 4.6.
3.1.3 Regional climate model coupling particularities

In addition to the standard OASIS functionalities, some adaptations of the OASIS3-MCT API routines were necessary to fit the special requirements of the regional-to-regional and regional-to-global couplings presented in this article.
A regional model covers only a portion of the earth's sphere and requires boundary conditions at its domain boundaries. This has two immediate consequences for coupling: first, two regional models do not necessarily cover exactly the same part of the earth's sphere. This implies that the geographic boundaries of the models' computational domains and of the coupled variables may not be the same in the source and target components of a coupled system. Second, a regional model can be coupled with a global model or another limited-area model, and some of the variables which need to be exchanged are three-dimensional, as in the case of atmosphere-to-atmosphere or ocean-to-ocean coupling.

A major part of the OASIS community uses global models. Therefore, OASIS standard features fit global model coupling requirements. Consequently, the coupling library must be adapted or used in an unconventional way, described in the following, to be able to cope with the extra demands mentioned.
Limited-area field exchange has to deal with a mismatch of the domains of the coupled model components.

Interpolation of 3D fields is necessary in an atmosphere-to-atmosphere coupling. The OASIS3-MCT library is used to provide 3D boundary conditions to the regional model and a 3D feedback to the global coarse-grid model. OASIS is not able to interpolate the 3D fields vertically; the vertical interpolation is therefore performed within the model interfaces (see section 3.2). An exchange of 3D fields, which occurs in the CCLM+MPI-ESM coupling, requires a more intensive usage of the OASIS3-MCT library functionalities than observed so far in the climate modeling community. The 3D regional-to-global coupling is even more computationally demanding than its global-to-regional counterpart: all grid points of the COSMO-CLM domain have to be interpolated instead of just those grid points of the global domain that are covered by the regional domain.
The amount of data exchanged is rarely reached by any other coupled system of the community due to (1) the high number of exchanged 2D fields, (2) the high number of exchanged grid points (the full COSMO-CLM domain) and (3) the high exchange frequency of every ECHAM time step. In addition, as will be explained in section 3.2, the coupling between COSMO-CLM and MPI-ESM needs to be sequential and, thus, the exchange speed has a direct impact on the simulation's total time to solution.

The interpolation methods used in OASIS3-MCT are the SCRIP standard interpolations: bilinear, bicubic, first- and second-order conservative. However, the interpolation accuracy might not be sufficient and/or the method may be inappropriate for certain applications. This is for example the case for the atmosphere-to-atmosphere coupling CCLM+MPI-ESM. The linear methods turned out to be of low accuracy and the second-order conservative method requires the availability of the spatial derivatives on the source grid. Up to now, the latter cannot be calculated efficiently in ECHAM (see section 3.2 for details). Other higher-order interpolation methods can be applied by providing weights of the source grid points at the target grid points. This method was successfully applied in the CCLM+MPI-ESM coupling by application of a bicubic interpolation using a 16-point stencil. In sections 3.2 to 3.5 the interpolation methods recommended for the individual couplings are given.

CCLM+MPI-ESM

In ECHAM the computations of each time step are performed partly in grid point and partly in spectral space. Since the simulation results of COSMO-CLM need to become effective in ECHAM dynamics, the two-way coupling is implemented in ECHAM after the transformation from spectral to grid point space and before the computation of advection (see Fig. 8 and DKRZ (1993) for details).
ECHAM provides the boundary conditions for COSMO-CLM at the time level $t = t_n$ of the three time levels $t_n - (\Delta t)_\mathrm{E}$, $t_n$ and $t_n + (\Delta t)_\mathrm{E}$ of ECHAM's leap-frog time integration scheme. However, the second part of the Asselin time filtering in ECHAM for this time level has to be executed after the advection calculation in dyn (see Fig. 8), in which the tendency due to the two-way coupling needs to be included. Thus, the fields sent to COSMO-CLM as boundary conditions do not undergo the second part of the Asselin time filtering. COSMO-CLM is integrated over $j$ time steps between the ECHAM time levels $t_{n-1}$ and $t_n$. However, the coupling time may also be a multiple of an ECHAM time step.
A complete list of variables exchanged between ECHAM and COSMO-CLM is given in Table 4.
The data sent by ECHAM are the 3D variables of the COSMO-CLM temperature, u- and v-components of the wind velocity, specific humidity, cloud liquid and ice water content and the two-dimensional fields surface pressure, surface temperature and surface snow amount. At the initial time the surface geopotential is sent to COSMO-CLM for the calculation of the orography differences between the model grids. After horizontal interpolation to the COSMO-CLM grid via the bilinear SCRIP interpolation, the 3D variables are vertically interpolated to the COSMO-CLM grid keeping the height of the 300 hPa level constant and using the hydrostatic approximation. Afterwards, the horizontal wind vector velocity components of ECHAM are rotated from the geographical (lon, lat) ECHAM to the rotated (rlon, rlat) COSMO-CLM coordinate system. Here send_fld ends and the interpolated data are used to initialize the boundary lines at the next COSMO-CLM time levels $t_m = t_{n-1} + k\,(\Delta t)_\mathrm{C} \le t_n$, with $k \le j = (\Delta t)_\mathrm{E}/(\Delta t)_\mathrm{C}$. However, the final time of the COSMO-CLM integration $t_{m+j} = t_m + j\,(\Delta t)_\mathrm{C} = t_n$ is equal to the time $t_n$ of the ECHAM data received.

After integrating between $t_n - i\,(\Delta t)_\mathrm{E}$ and $t_n$, the 3D fields of temperature, u- and v-velocity components, specific humidity and cloud liquid and ice water content of COSMO-CLM are vertically interpolated to the ECHAM vertical grid following the same procedure as in the COSMO-CLM receive-interface and keeping the height of the 300 hPa level of the COSMO-CLM pressure constant.
The wind velocity vector components are rotated back to the geographical directions of the ECHAM grid. The 3D fields and the hydrostatically approximated surface pressure are sent to ECHAM, horizontally interpolated to the ECHAM grid by OASIS3-MCT and received in ECHAM grid space.
In ECHAM the COSMO-CLM solution is relaxed at the lateral and top boundaries of the COSMO-CLM domain by means of a cosine weight function over a range of five to ten ECHAM grid boxes, using a weight between zero at the outer boundary and one in the central part of the COSMO-CLM domain. The two-way coupled system CCLM+MPI-ESM with prescribed COSMO-CLM solution within the COSMO-CLM domain (weight = 1) provides a stable solution over climatological time scales. A strong initialization perturbation is avoided by slowly increasing the maximum coupling weight to 1 with time, following the function $\mathrm{weight} = \mathrm{weight}_{\max} \cdot \sin((t/t_\mathrm{end}) \cdot \pi/2)$, with $t_\mathrm{end}$ equal to 1 month.
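A minimal sketch of this ramp-up of the coupling weight is given below; the functional form follows the text, while the function and parameter names are hypothetical.

```python
import math

def coupling_weight(t_days, weight_max=1.0, t_end_days=30.0):
    """Ramp the two-way coupling weight from 0 to weight_max over t_end,
    following weight = weight_max * sin((t / t_end) * pi / 2)."""
    t = min(max(t_days, 0.0), t_end_days)      # clamp to [0, t_end]
    return weight_max * math.sin((t / t_end_days) * math.pi / 2.0)

for day in (0, 7, 15, 30, 60):
    print(day, round(coupling_weight(day), 3))
```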

CCLM+NEMO-MED12 and CCLM+NEMO-NORDIC
COSMO-CLM and the NEMO ocean model are coupled concurrently for the Mediterranean Sea (NEMO-MED12) and for the North and Baltic Seas (NEMO-NORDIC). Table 5 gives an overview of the variables exchanged. Bicubic interpolation between the horizontal grids is used for all variables.
At the beginning of the NEMO time integration (see Fig. 7) COSMO-CLM receives the sea surface temperature (SST) and, only in the case of coupling with the North and Baltic Seas, also the sea ice fraction from the ocean model. At the end of each NEMO time step COSMO-CLM sends averaged water, heat and momentum fluxes to OASIS3-MCT. In the NEMO-NORDIC setup COSMO-CLM additionally sends the averaged sea level pressure (SLP) needed in NEMO to link the exchange of water between the North and Baltic Seas directly to the atmospheric pressure. The sea ice fraction affects the radiative and turbulent fluxes due to the different albedo and roughness length of ice. In both coupling setups the SST is the lower boundary condition for COSMO-CLM and it is used to calculate the heat budget in the lowest atmospheric layer. The averaged wind stress is a direct momentum flux for NEMO to calculate the water motion. Solar and non-solar radiation are needed by NEMO to calculate the heat fluxes. $E-P$ ("Evaporation minus Precipitation") is the net gain ($E-P>0$) or loss ($E-P<0$) of fresh water at the water surface. This water flux adjusts the salinity of the uppermost ocean layer.

In all COSMO-CLM grid cells where there is no active ocean model underneath, the lower boundary condition (SST) is taken from ERA-Interim re-analyses. The sea ice fraction in the Atlantic Ocean is derived from the ERA-Interim SST where SST < −1.7 °C, which is a salinity-dependent freezing temperature.
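A schematic numpy sketch of this fallback outside the coupled ocean domain is given below; the array names, the freezing threshold handling and the binary ice fraction are illustrative and not the actual COSMO-CLM code.

```python
import numpy as np

T_FREEZE = 273.15 - 1.7   # salinity-dependent freezing temperature (K)

def lower_boundary(sst_coupled, mask_coupled, sst_era):
    """Use the coupled ocean SST where available, ERA-Interim SST elsewhere,
    and diagnose sea ice where the ERA-Interim SST is below freezing."""
    sst = np.where(mask_coupled, sst_coupled, sst_era)
    seaice_frac = np.where(~mask_coupled & (sst_era < T_FREEZE), 1.0, 0.0)
    return sst, seaice_frac

sst_cpl = np.array([275.0, 276.0, np.nan, np.nan])   # NaN: outside ocean domain
mask    = np.array([True, True, False, False])
sst_era = np.array([280.0, 281.0, 270.0, 274.0])
print(lower_boundary(sst_cpl, mask, sst_era))
```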
On the NEMO side, the coupling interface is included similarly to COSMO-CLM.

CCLM+TRIMNP+CICE
In the CCLM+TRIMNP+CICE coupled system (denoted as COSTRICE; Ho-Hagemann et al. (2013)), all fields are exchanged every hour between the three models COSMO-CLM, TRIMNP and CICE running concurrently. An overview of the variables exchanged among the three models is given in Table 5. The "surface temperature over sea/ocean" is sent to CCLM instead of the "SST" to avoid a potential inconsistency in case of sea ice existence. As shown in Fig. 7, COSMO-CLM receives the skin temperature ($T_\mathrm{Skin}$) at the beginning of each COSMO-CLM time step over the coupling areas, the North and Baltic Seas. The skin temperature $T_\mathrm{Skin}$ is a weighted average of sea ice and sea surface temperature. It is not a linear combination of the skin temperatures over water and over ice weighted by the sea ice fraction. Instead, the skin temperature over ice $T_\mathrm{Ice}$ and the sea ice fraction $A_\mathrm{Ice}$ of CICE are sent to TRIMNP where they are used to compute the heat flux $HFL$, that is, the net outgoing long-wave radiation. $HFL$ is used to compute the skin temperature of each grid cell via the Stefan-Boltzmann law.
At the end of the time step, after the physics and dynamics computations and output writing, COSMO-CLM sends the variables listed in Table 5 to TRIMNP and CICE for calculation of wind stress, fresh water, momentum and heat flux. TRIMNP can either directly use the sensible and latent heat fluxes from COSMO-CLM (considered as flux coupling method; see e.g. Döscher et al. (2002)) or compute the turbulent fluxes using the temperature and humidity density differences between air and sea as well as the wind speed (considered as the coupling method via state variables; see e.g. Rummukainen et al. (2001)). The method used is specified in the subroutine heat_flux of TRIMNP.

In addition to the fields received from COSMO-CLM, the sea ice model CICE requires from TRIMNP the SST, salinity, water velocity components, ocean surface slope, and freezing/melting potential energy. CICE sends to TRIMNP the water and ice temperature, sea ice fraction, fresh-water flux, ice-to-ocean heat flux, short-wave flux through ice to ocean and ice stress components. The horizontal interpolation method applied in CCLM+TRIMNP+CICE is the SCRIP nearest-neighbour inverse-distance-weighting fourth-order interpolation (DISTWGT).
Note that the coupling method differs between CCLM+TRIMNP+CICE and CCLM+NEMO-NORDIC (see section 3.3). In the latter, the SSTs and the sea ice fraction from NEMO are sent to CCLM so that the sea ice fraction from NEMO affects the radiative and turbulent fluxes of CCLM due to the different albedo and roughness length of ice. In CCLM+TRIMNP+CICE, however, only SSTs are passed to CCLM. Although these SSTs implicitly contain information on the sea ice fraction, which is sent from CICE to TRIMNP, the albedo of sea ice in CCLM is not taken from CICE but calculated in the atmospheric model independently. The reason for this inconsistent calculation of the albedo between these two coupled systems originates from the fact that a tile approach has not been applied in the CCLM version used in the present study. Here, partial covers within a grid box are not accounted for; hence, partial fluxes, i.e. those over the partial sea ice cover, snow on sea ice and water on sea ice, are not considered. In a water grid box of this CCLM version, the albedo parameterisation switches from ocean to sea ice if the surface temperature is below a freezing temperature threshold of −1.7 °C. Coupled to NEMO-NORDIC, CCLM obtains the sea ice fraction, and the albedo and roughness length of a grid box in CCLM are calculated as a weighted average of the water and sea ice portions, which is a parameter aggregation approach.
Moreover, even if the sea ice fraction from CICE were sent to CCLM, as is done for NEMO-NORDIC, the latent and sensible heat fluxes in CCLM would still differ from those in CICE due to the different turbulence schemes of the two models. This different calculation of the heat fluxes in the two models leads to another inconsistency in the current setup, which can only be removed if all model components of the coupled system use the same radiation and non-radiation energy fluxes. These fluxes should preferably be calculated in one of the models at the highest resolution, for example in the CICE model for the fluxes over sea ice. Such a strategy shall be applied in future studies but is beyond the scope of this study for the CCLM version used.

CCLM+VEG3D and CCLM+CLM
The two-way coupling between COSMO-CLM and the land surface models VEG3D or CLM is implemented similarly for both models in several respects. First, the call to the LSM (OASIS send and receive; see Fig. 7) is placed at the same location in the code as the call to COSMO-CLM's native land surface scheme, TERRA_ML, which is switched off when either VEG3D or CLM is used. This ensures that the sequence of calls in COSMO-CLM remains the same regardless of whether TERRA_ML, VEG3D or CLM is used. In the default configuration used here, COSMO-CLM and CLM (or VEG3D) are executed sequentially, thus mimicking the "subroutine"-type of coupling used with TERRA_ML.
Note that it is also possible to run COSMO-CLM and the LSM concurrently but this is not discussed here. Details of the time step organization of VEG3D and CLM are described in the appendix and shown in Figs. 12 and 13.

VEG3D runs at the same time step and on the same horizontal rotated grid (0.44° here) as COSMO-CLM with no need for any horizontal interpolations. CLM uses a regular lat-lon grid and the coupling fields are interpolated using bilinear interpolation (atm to LSM) and distance-weighted interpolation (LSM to atm). The time step of CLM is synchronized with the time step of the COSMO-CLM radiative transfer scheme (one hour in this application) with the idea that the frequency of the radiation update determines the radiative forcing at the surface.
The LSMs need to receive the following atmospheric forcing fields (see also Table 6): the total amount of precipitation, the short- and long-wave downward radiation, the surface pressure, the wind speed, the temperature and the specific humidity of the lowest atmospheric model layer. One specificity of the coupling concerns the turbulent fluxes of latent and sensible heat. In its turbulence scheme, COSMO-CLM does not directly use surface fluxes. It uses surface states (surface temperature and humidity) together with turbulent diffusion coefficients of heat, moisture and momentum. Therefore, the diffusion coefficients need to be calculated from the surface fluxes received by COSMO-CLM. This is done by deriving, in a first step, the coefficient for heat (assumed to be the same as the one for moisture in COSMO-CLM) based on the sensible heat flux. In a second step an effective surface humidity is calculated using the latent heat flux and the derived diffusion coefficient for heat.
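A conceptual sketch of this two-step procedure using standard bulk-flux formulas is given below; the actual COSMO-CLM formulation of the transfer coefficients differs in detail, and all constants, names and input values are illustrative.

```python
RHO = 1.2        # air density (kg m-3), illustrative
CP  = 1004.0     # specific heat of air (J kg-1 K-1)
LV  = 2.5e6      # latent heat of vaporization (J kg-1)

def coeff_heat_from_shfl(shfl, wind, t_sfc, t_air):
    """Step 1: bulk transfer coefficient for heat from the sensible heat flux,
    assuming SHFL = rho * cp * C_h * |U| * (T_sfc - T_air)."""
    return shfl / (RHO * CP * wind * (t_sfc - t_air))

def effective_sfc_humidity(lhfl, wind, q_air, c_h):
    """Step 2: effective surface humidity from the latent heat flux, reusing C_h,
    assuming LHFL = rho * Lv * C_h * |U| * (q_sfc - q_air)."""
    return q_air + lhfl / (RHO * LV * c_h * wind)

# Illustrative fluxes (W m-2), wind (m s-1), temperatures (K) and humidity (kg kg-1).
c_h = coeff_heat_from_shfl(shfl=50.0, wind=5.0, t_sfc=290.0, t_air=288.0)
q_s = effective_sfc_humidity(lhfl=80.0, wind=5.0, q_air=0.008, c_h=c_h)
print(c_h, q_s)
```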

Computational efficiency
Computational efficiency is an important property for a numerical model's usability and applicability and has many aspects. A particular coupled model system can be very inefficient even if each component model has a high computational efficiency in stand-alone mode and in other couplings. Thus, optimizing the computational performance of a coupled model system can save a substantial amount of resources in terms of simulation time and cost. We focus here on aspects of computational efficiency related directly to the coupling of different component models, each of which has been tested in other applications, and use a real-case model configuration.
We use a three-step approach.

A parallel program's runtime $T(n,R)$ mainly depends on two variables: the problem size $n$ and the number of cores $R$, that is, the resources. In scaling theory, a weak scaling is performed with the notion of solving an increasing problem size in the same time, whereas in a strong scaling a fixed problem size is solved more quickly with an increasing amount of resources. Due to resource limits on the common high-performance computer we chose to conduct a strong-scaling analysis with a common model setup, allowing for an easier comparability of the results. By means of the scalability study we identified an optimum configuration for each coupling which served as a basis to address two central questions: (1) How much does it cost to add one (or more) component(s) to COSMO-CLM?
(2) How big are the costs of the different components and of OASIS3-MCT to transform the information between the components' grids? The first question can only be answered by a comparison to a reference configuration (see Table 7 for details).
Usually, $HPSY_1$ is the time to solution of a model component executed serially, that is, using one process ($R=1$), and $HPSY_2$ is the time to solution if executed using $R_2 > R_1$ parallel processes.
Some model components, like ECHAM, cannot be executed serially. This is why the reference number of threads is $R_1 \ge 2$ for all coupled-system components.
If the resources of a perfectly scaling parallel application are doubled, the speed would be doubled and therefore the cost would remain constant, the parallel efficiency would be 100 % and the speed-up factor would be two.
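The following Python helpers restate these metrics, assuming that HPSY denotes the time to solution in (wall-clock) hours per simulated year and CHPSY the cost in core-hours per simulated year (CHPSY = HPSY × number of cores); the example numbers are invented.

```python
def chpsy(hpsy, cores):
    """Cost in core-hours per simulated year."""
    return hpsy * cores

def speedup(hpsy_ref, hpsy):
    """Speed-up of a run relative to the reference run."""
    return hpsy_ref / hpsy

def parallel_efficiency(hpsy_ref, cores_ref, hpsy, cores):
    """Parallel efficiency relative to the reference: ratio of costs.
    100 % means the cost (CHPSY) did not increase with the core count."""
    return chpsy(hpsy_ref, cores_ref) / chpsy(hpsy, cores)

# Invented numbers: reference on 32 cores, candidate configuration on 128 cores.
ref = dict(hpsy=12.0, cores=32)
run = dict(hpsy=4.0, cores=128)
print("speed-up:", speedup(ref["hpsy"], run["hpsy"]))
print("cost:", chpsy(run["hpsy"], run["cores"]), "CHPSY")
print("parallel efficiency:",
      f"{parallel_efficiency(ref['hpsy'], ref['cores'], run['hpsy'], run['cores']):.0%}")
```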

Strategy for finding an optimum configuration
The optimization strategy that we pursue is empirical rather than strictly mathematical, which is why we understand "optimum" more as "near-optimum". Due to the heterogeneity of our coupled systems, a single algorithm cannot be proposed (as in Balaprakash et al. (2014)). Nonetheless, our results show that these empirical methods are sufficient, regarding the complexity of the couplings investigated here, and lead to satisfying results.
Obviously, the "optimum" has to be a compromise between cost and time to solution. In order to find a unique configuration we require the optimum to have a parallel efficiency higher than 50 % relative to the reference configuration; up to this limit the increasing cost can be regarded as still acceptable. In the case of scalability of all components and no substantial cost of necessary additional calculations, this guarantees that the coupled-system's time to solution is only slightly larger than that of the component with the highest cost. However, such an "optimum" configuration depends on the reference configuration. In this study the one-node configuration is regarded as having 100 % parallel efficiency for all couplings.
An additional constraint is sometimes given by the CPU accounting policy of the computing centre, if consumption is measured "per node" and not "per core". This restricts the "optimum" configuration $(r_1, r_2, \ldots, r_n)$ of cores $r_i$ for each model component of the coupled system to those for which the total number of cores $R = \sum_i r_i$ is a multiple of the number of cores per node.
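A compact sketch of this selection rule, combining the 50 % parallel-efficiency criterion with the node-granularity constraint, is given below; the node size, the scaling numbers and the dictionary-based interface are invented, and the actual procedure in section 4 additionally checks whether the time to solution still decreases.

```python
NODE_SIZE = 32          # cores per node (illustrative)
EFF_LIMIT = 0.5         # minimum acceptable parallel efficiency

def optimum_configuration(scaling, ref_cores):
    """scaling: dict {total cores: time to solution in HPSY}.
    Returns the largest node-aligned configuration whose cost stays within
    1/EFF_LIMIT of the reference cost (i.e. parallel efficiency >= 50 %)."""
    ref_cost = ref_cores * scaling[ref_cores]
    admissible = [
        cores for cores, hpsy in scaling.items()
        if cores % NODE_SIZE == 0 and ref_cost / (cores * hpsy) >= EFF_LIMIT
    ]
    return max(admissible)

# Invented strong-scaling measurements (cores: HPSY).
scaling = {32: 12.0, 64: 6.5, 96: 4.8, 128: 3.9, 160: 3.6}
print("optimum:", optimum_configuration(scaling, ref_cores=32), "cores")
```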

The optimum configurations
We applied the strategy for finding an optimum configuration described in section 4.3 to the CCLM couplings with a regional ocean (TRIMNP+CICE or NEMO-MED12), an alternative land surface scheme (CLM or VEG3D) or the atmosphere of a global earth system model (MPI-ESM). The optimum configurations found for CCLM$_\mathrm{sa}$ and all coupled systems are shown in Fig. 6 and in more detail in Table 8. The parallel efficiency used as the criterion for finding the optimum configuration is shown in Fig. 5.
The minimum number of cores that should be used is 32 (one node). For sequential coupling an alternating distribution of processes is used and thus one CCLM and one coupled component model (VEG3D, CLM) process are started on each core. For CCLM+VEG3D and CCLM+CLM the CCLM is more expensive and thus the scalability limit of CCLM determines the optimum configuration.
In this case the fair reference for CCLM is CCLM stand-alone (CCLM$_\mathrm{sa}$) on 32 cores in single-threading (ST) mode. As shown in Fig. 5, the parallel efficiency of 50 % for COSMO stand-alone in ST mode is reached at 128 cores or 4 nodes and thus the 128-core configuration is selected as the optimum.

For concurrent coupling the SMT mode with a non-alternating distribution of processes is used, which is more efficient than the alternating SMT and the ST modes. The cores are shared between CCLM and the coupled component models (NEMO-MED12 and TRIMNP+CICE). For these couplings CCLM is the most expensive component as well and thus the reference for CCLM is CCLM$_\mathrm{sa}$ on 16 cores (0.5 node) in SMT mode. As shown in Fig. 5, the parallel efficiency of 50 % for COSMO stand-alone in SMT mode, using 16 cores as reference, is reached at approximately 100 cores. For the CCLM+NEMO-MED12 coupling a two-node configuration with 78 cores for CCLM and 50 cores for NEMO-MED12 resulted in an overall decrease of the load imbalance to an acceptable 3.1 % of the total cost. Increasing the number of cores beyond 80 for COSMO-CLM did not change the time to solution much, because COSMO-CLM already approaches the parallel-efficiency limit when using 78 cores. This prevented finding an optimum configuration using three nodes. The corresponding NEMO-MED12 measurements at 50 cores are a bit out of scaling as well. This is probably caused by the I/O, which increased for unknown reasons on the machine between the time when the first series of simulations was conducted and the time of the optimized simulations.
For CCLM+TRIMNP+CICE no scalability is found for CICE. As shown in Fig. 5, a parallel efficiency smaller than 50 % is found for CICE at approximately 15 cores. As shown in Fig. 3, the time to solution for all core numbers investigated is higher for CICE than for CCLM in SMT mode. Thus, a load imbalance smaller than 5 % can hardly be achieved using one node. The optimum configuration found is thus a one-node configuration using the CCLM reference configuration (16 cores).
The CCLM+MPI-ESM coupling is a combination of sequential coupling between CCLM and ECHAM and concurrent coupling between ECHAM and the ocean model MPIOM. As shown in Fig. 4, MPIOM is much cheaper than ECHAM and thus the coupling is dominated by the sequential coupling between CCLM and ECHAM. As shown in Fig. 3, ECHAM is the most expensive component and it exhibits no decrease of the time to solution when increasing the number of cores from 28 to 56, i.e. it exhibits a very low scalability. Thus, as described in the strategy for finding the optimum configuration, even if a parallel efficiency higher than 50 % is found for up to 64 cores (see Fig. 5), the optimum configuration is the 32-core (one-node) configuration, since no significant reduction of the time to solution can be achieved by further increasing the number of cores.
An analysis of the additional costs of coupling requires the definition of a reference. We use the cost of CCLM stand-alone at the optimum configuration (CCLM$_\mathrm{sa,OC}$). We found the SMT mode with a non-alternating distribution of processes and 64 cores to be the optimum configuration for CCLM, resulting in a time to solution of 3.6 HPSY and a cost of 230.4 CHPSY. As shown in section 4.2, the SMT mode with a non-alternating processes distribution is the most efficient and the scalability limit is reached at approximately 80 cores in SMT mode due to the limited number of grid points used. Doubling the 64 cores would be beyond the scalability limit of this particular model grid.
The optimum configuration of the coupling with TRIMNP+CICE for the North and Baltic Sea (CCLM+TRIMNP+CICE) has a time to solution of 18 HPSY and a cost of 576 CHPSY. This is 3.5 times longer than CCLM sa,OC, due to the lack of scalability of the sea ice model CICE, and 1.5 times more expensive than CCLM sa,OC (lines 2.3 and 3.3 of Table 8).

The cost of CCLM stand-alone using the same mapping (CCLM sa,sc) as for CCLM coupled to MPI-ESM is 4.3 % higher than the cost of CCLM sa,OC (line 3.3.4 in Table 8). Interestingly, the cost of the OASIS horizontal interpolations is only 3.3 %. This achievement is discussed in more detail in the next section. Finally, the extra cost of CCLM in coupled mode of CCLM+ECHAM+MPIOM is 77.4 %, the highest of all couplings. Additional internal measurements identified additional computations in the CCLM coupling interface as responsible for a substantial part of this cost: the vertical spline interpolation of the 3D fields exchanged between the models was found to consume 51.8 % of the CCLM sa,OC cost, which is about two thirds of the extra cost of CCLM OC.
Interestingly, a direct comparison of the complexity and the number of grid points G (see definition in Balaji et al. (2017)) given in Table 3 with the extra cost of coupling given in Table 8 shows that the couplings with short time to solution and lowest extra cost are those of low complexity. On the other hand, the most expensive coupling with the longest time to solution is that of the highest complexity and the largest number of grid points.

Coupling cost reduction
The CCLM+MPI-ESM coupling is one of the most intensive couplings realized with OASIS3(-MCT) up to now in terms of the number of coupling fields and coupling time steps: 450 2D fields are exchanged every ECHAM coupling time step, that is, every ten simulated minutes (see section 3.2). Most of these 2D fields are levels of 3D atmospheric fields. We show in this section that a conscious choice of coupling software and computing platform features can have a significant impact on time to solution and cost.
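To put this coupling intensity into numbers, one can combine the figures given above: with a coupling time step of ten simulated minutes there are 6 · 24 = 144 coupling steps per simulated day, so that

450 fields/step · 144 steps/day = 64 800 2D-field exchanges per simulated day

have to be handled by the coupling interface.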

To make the CCLM+MPI-ESM coupling more efficient, all levels of a 3D variable are sent and received in a single MPI message using the concept of pseudo-3D coupling, as described in section 3.1.2, thus reducing the number of sent and received fields (see Table 4). The change from 2D to pseudo-3D coupling led to a decrease of the cost of the coupled system running on 32 cores by 3.7 %, which corresponds to 25 % of the CCLM sa,OC cost. The combined effect of the 3D-field exchange and of an alternating processes distribution led to an overall reduction of the total time to solution and cost of the coupled system CCLM+MPI-ESM by 39 %, which corresponds to 261 % of the CCLM sa,OC cost.
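The gain comes mainly from paying the per-message overhead once per 3D variable instead of once per level. The following generic MPI sketch contrasts the two exchange patterns; it is an illustration only, not the actual UOI/OASIS3-MCT implementation, and the program name, grid size and number of levels are illustrative assumptions.

! Illustration of level-wise (2D) versus bundled (pseudo-3D) exchange.
! Not the actual coupling interface; run with at least two MPI processes,
! e.g. "mpirun -np 2 ./pseudo3d_exchange".
program pseudo3d_exchange
  use mpi
  implicit none
  integer, parameter :: nx = 132, ny = 129, nlev = 40   ! illustrative sizes
  double precision   :: field3d(nx, ny, nlev)
  integer :: ierr, rank, k, tag, status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  field3d = dble(rank)

  if (rank == 0) then
     ! 2D coupling: one message per level, i.e. nlev latencies per 3D variable
     do k = 1, nlev
        tag = 100 + k
        call MPI_Send(field3d(:, :, k), nx*ny, MPI_DOUBLE_PRECISION, 1, tag, &
                      MPI_COMM_WORLD, ierr)
     end do
     ! pseudo-3D coupling: all levels in a single message, one latency
     call MPI_Send(field3d, nx*ny*nlev, MPI_DOUBLE_PRECISION, 1, 999, &
                   MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     do k = 1, nlev
        tag = 100 + k
        call MPI_Recv(field3d(:, :, k), nx*ny, MPI_DOUBLE_PRECISION, 0, tag, &
                      MPI_COMM_WORLD, status, ierr)
     end do
     call MPI_Recv(field3d, nx*ny*nlev, MPI_DOUBLE_PRECISION, 0, 999, &
                   MPI_COMM_WORLD, status, ierr)
  end if

  call MPI_Finalize(ierr)
end program pseudo3d_exchange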

Conclusions
We present couplings between the regional land-atmosphere climate model COSMO-CLM and several other model components, realized via a unified OASIS3-MCT coupling interface (UOI; see the flow diagrams in Fig. 7 to 13). The next step is the development of the UOI for multiple couplings, which will allow regional climate system modelling over Europe.
A series of simulations has been conducted with the aim to analyse the computational performance of the couplings. The CORDEX-EU grid configuration of COSMO-CLM was used for all couplings on a common computing platform. The scaling of COSMO-CLM was found to be very similar in stand-alone and in coupled mode.

The weaker scaling, which occurred in some configurations, was found to originate from additional computations which do not scale but are necessary for coupling. In some cases the model physics or the I/O routines exhibited a weaker scaling, most probably due to limited memory.
The results confirm that the parallel efficiency decreases substantially if the number of grid points per core is below 80. For the configuration used (132x129 grid points), this limits the number of cores that can be used efficiently to 80 in SMT mode and 160 in ST mode.
For the first time a sequential coupling exchanging approximately 450 2D fields using the parallelized coupler OASIS3-MCT was investigated. It was shown that the direct cost of coupling by OASIS3-MCT (interpolation and communication) is negligible in comparison with the cost of the coupled atmosphere-atmosphere model system. We showed that the exchange of one (pseudo-)3D field instead of many 2D fields reduces the cost of communication drastically. Furthermore, the idling of cores due to sequential coupling could be avoided by a dedicated launching of one process of each of the two sequentially running models on each core, making use of the multi-threading mode available on the machine Blizzard used here and on several other machines.
A strategy for finding an optimum configuration was developed. Optimum configurations were identified for all investigated couplings considering all three aspects of climate modelling performance: time to solution, cost and parallel efficiency. For a coupled system that involves a component which does not scale well with the available resources, the optimum configuration is suggested to be the one with minimum cost, if the time to solution cannot be decreased significantly. This is the case for the CCLM+MPI-ESM and CCLM+TRIMNP+CICE couplings. An exception is the CCLM+VEG3D coupling.

The optimum configurations of the land surface scheme couplings exhibit the same speed as COSMO-CLM stand-alone and a doubling of cost. They were found to be close to their absolute optimum, since 60 % to 75 % of the extra cost of coupling are unavoidable. These are the extra cost of (1) keeping the speed of the coupled system high, resulting in an unavoidable increase of cost with core number, and (2) the need of using the less efficient single-threading mode to avoid 50 % idling.

The optimum configuration of the regional ocean coupling for the Mediterranean, CCLM+NEMO-MED12, exhibits the same speed and a doubling of cost as well. In this case the cost of the ocean model is much higher and the extra cost of mapping is much smaller, which is due to the usage of concurrent coupling.
The optimum configuration of the regional ocean coupling for the North and Baltic Sea, CCLM+TRIMNP+CICE, has a 3.5 times longer time to solution, owing to the lack of scalability of the sea ice model CICE.

The procedure of finding an optimum configuration was found to be applicable to each coupling layout investigated and thus could be applicable to other coupled model systems as well.
The analysis of the extra cost of coupling was found to be a useful step in the development of a Regional Climate System Model which couples several model components. Bottlenecks of the coupling have been identified in the CCLM+TRIMNP+CICE and the CCLM+MPI-ESM couplings. The results for time to solution, cost and parallel efficiency of the different couplings can serve as a starting point for finding an optimum coupling layout and configuration for multiple couplings.
Appendix A: Source code availability

COSMO-CLM is an atmosphere model coupled to the soil-vegetation model TERRA. Other regional processes in the climate system, like ocean and ice sheet dynamics, plant responses, aerosol-cloud interaction, and the feedback to the GCM driving the RCM, are made available by coupling COSMO-CLM via OASIS3-MCT with other models.
The COSMO-CLM model source code is freely available for scientific usage by members of the CLM-Community (www.clm-community.eu). The CLM-Community is a network of scientists who accept the CLM-Community agreement. For details on how to become a member, please check the CLM webpage.
The current recommended version of COSMO-CLM is COSMO_131108_5.0_clm9. It comes together with recommended configurations for the European domain.
The development of a fully coupled COSMO-CLM is an ongoing research project within the CLM-Community. The unified coupling interface via OASIS3-MCT is available by contacting one of the authors and will be part of a future official COSMO-CLM version. All other components, including their OASIS interfaces, are available by contacting the authors. The OASIS3-MCT coupling library can be downloaded at https://verc.enes.org/oasis/ .
The two-way coupled system CCLM+MPI-ESM was developed at BTU Cottbus and FU Berlin.

In the following, the time step organisation within the coupled models is described. This aims at providing a basis for understanding the coupling between the models.

The solution at the next time level t_m + (Δt)_c is relaxed towards the solution prescribed at the boundaries, using an exponential function for the lateral boundary relaxation and a cosine function for the top-boundary Rayleigh damping (Doms and Baldauf, 2015). At the lower boundary a slip boundary condition is used together with a boundary layer parameterisation scheme (Doms et al., 2011).

The time loop (stepon) has three main parts. It begins with the computations in spectral space, followed by grid-space and further spectral-space computations. In scan1 the spatial derivatives (sym2, ewd, fft1) are computed for time level t_n in Fourier space, followed by the transformation into grid-space variables on the lon/lat grid. Then the computations needed for the two-way coupling with COSMO-CLM (twc) are done for the time level t_n variables, followed by the advection (dyn, ldo_advection) at t_n, the second part of the time filtering of the variables at time t_n (tf2), and the calculation of the advection tendencies and the update of the fields for t_{n+1} (ldo_advection). Then the first part of the time filtering of time level t_{n+1} (tf1) is done, followed by the computation of the physical tendencies at t_n (physc). The remaining spectral-space computations in scan1 begin with the reverse Fourier transformation (fftd).
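As an illustration of the exponentially weighted lateral boundary relaxation of COSMO-CLM mentioned above, the following minimal, self-contained sketch nudges a prognostic field towards the driving boundary data. The program name, grid size, relaxation-zone width, field values and the exact weight function are illustrative assumptions and do not reproduce the actual COSMO-CLM formulation (Doms and Baldauf, 2015).

! Generic sketch of lateral boundary relaxation with an exponentially
! decaying weight (illustration only, not the COSMO-CLM implementation).
program boundary_relaxation_sketch
  implicit none
  integer, parameter :: nx = 132        ! illustrative number of grid points
  integer, parameter :: nbound = 8      ! illustrative relaxation-zone width
  real :: psi(nx), psi_bd(nx), alpha
  integer :: i

  psi    = 280.0                        ! interior model solution (illustrative)
  psi_bd = 285.0                        ! driving (boundary) solution (illustrative)

  do i = 1, nx
     ! weight decays exponentially with distance from the nearest lateral boundary
     alpha  = exp(-real(min(i, nx - i + 1) - 1) / real(nbound))
     psi(i) = (1.0 - alpha) * psi(i) + alpha * psi_bd(i)
  end do

  print *, 'relaxed field at boundary and in interior:', psi(1), psi(nx/2)
end program boundary_relaxation_sketch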

B3 NEMO-MED12
In Fig. 9 the flow diagram of NEMO 3.3 is shown. At the beginning the mpp communication is initialized by cpl_prism_init. This is followed by the general initialisation of the NEMO model.

Table (fragment): exchanged coupling fields and units: specific humidity (kg kg-1, 3D), specific cloud liquid water content (kg kg-1, 3D), specific cloud ice content (kg kg-1, 3D), surface pressure (Pa), sea surface temperature SST (K), surface snow amount (m), surface geopotential (m2 s-2). SST = (sea_ice_area_fraction · T_sea_ice) + (SST · (1 − sea_ice_area_fraction)).

Table caption (fragment), see Table 8 for details and Fig. 6: seq refers to sequential and con to concurrent couplings. Thread mode is either the ST or the SMT mode (see Fig. 2). APD indicates whether an alternating processes distribution was used or not. Levels in CCLM gives the simulated number of levels and CCLM version is the COSMO-CLM model version used for coupling. Relative time to solution (%) and cost (%) are calculated with respect to the reference, which is the CCLM stand-alone configuration CCLM sa using 64 cores and non-alternating SMT mode. The time to solution includes the time needed for OASIS interpolations. All relative quantities in lines 2.2-2.3 and 3.2-3.3.5 are given in percent of this reference.

Figure 7: Simplified flow diagram of the main program of the regional climate model COSMO-CLM, version 4.8_clm19_uoi. The red highlighted parts indicate the locations at which the additional computations necessary for coupling are executed and the calls to the OASIS interface take place. Where applicable, the component models to which the respective calls apply are given.