Regional climate modeling on European scales: a joint standard evaluation of the EURO-CORDEX RCM ensemble


Abstract. EURO-CORDEX is an international climate downscaling initiative that aims to provide high-resolution climate scenarios for Europe. Here an evaluation of the ERA-Interim-driven EURO-CORDEX regional climate model (RCM) ensemble is presented. The study documents the performance of the individual models in representing the basic spatiotemporal patterns of the European climate for the period 1989-2008. Model evaluation focuses on near-surface air temperature and precipitation, and uses the E-OBS data set as observational reference. The ensemble consists of 17 simulations carried out by seven different models at grid resolutions of 12 km (nine experiments) and 50 km (eight experiments). Several performance metrics computed from monthly and seasonal mean values are used to assess model performance over eight subdomains of the European continent. Results are compared to those for the ERA40-driven ENSEMBLES simulations.
The analysis confirms the ability of RCMs to capture the basic features of the European climate, including its variability in space and time. But it also identifies nonnegligible deficiencies of the simulations for selected metrics, regions and seasons. Seasonally and regionally averaged temperature biases are mostly smaller than 1.5 °C, while precipitation biases are typically located in the ±40 % range. Some bias characteristics, such as a predominant cold and wet bias in most seasons and over most parts of Europe and a warm and dry summer bias over southern and southeastern Europe, reflect common model biases. For seasonal mean quantities averaged over large European subdomains, no clear benefit of an increased spatial resolution (12 vs. 50 km) can be identified. The bias ranges of the EURO-CORDEX ensemble mostly correspond to those of the ENSEMBLES simulations, but some improvements in model performance can be identified (e.g., a less pronounced southern European warm summer bias). The temperature bias spread across different configurations of one individual model can be of a similar magnitude as the spread across different models, demonstrating a strong influence of the specific choices in physical parameterizations and experimental setup on model performance. Based on a number of simply reproducible metrics, the present study quantifies the currently achievable accuracy of RCMs used for regional climate simulations over Europe and provides a quality standard for future model developments.

Published by Copernicus Publications on behalf of the European Geosciences Union.

Introduction
Assessing the impacts of expected 21st century climate change and developing response strategies requires local- to regional-scale information on the nature of these changes, including a sound assessment of inherent projection uncertainties. Driven by a suite of IPCC (Intergovernmental Panel on Climate Change) assessment reports and accompanied by increasing public awareness of ongoing climate change, the past decades have seen a rapid development in the corresponding methods for climate scenario generation. Part of this evolution has been the development and the refinement of climate-downscaling techniques, which aim at translating coarse-resolution information as obtained from global climate models (GCMs) into regional- and local-scale conditions (e.g., Hewitson and Crane, 1996; Wilby and Fowler, 2011). While statistical downscaling methods attempt to bridge the scale gap by applying empirically derived transfer functions between the coarse-resolution climate model output and local weather conditions (e.g., Benestad et al., 2008; Fowler et al., 2007; Maraun et al., 2010; Themeßl et al., 2012; Widmann et al., 2003), dynamical downscaling employs high-resolution regional climate models (RCMs) nested into global model output (e.g., Giorgi, 2006; Laprise, 2008; McGregor, 1997; Wang et al., 2004). This technique allows for a considerably higher spatial resolution over the domain of interest and, hence, for a more realistic representation of important surface heterogeneities (such as topography, coastlines, and land surface characteristics) and of mesoscale atmospheric processes. Dynamical downscaling was originally developed for the purpose of numerical weather prediction and was first applied in a climate context in the late 1980s and early 1990s (Dickinson et al., 1989; Giorgi, 1990). Since then, considerable efforts have been put into further methodological and technical developments, and ever-increasing computational resources have facilitated simulations of multidecadal length.
Large collaborative research projects such as MERCURE (e.g., Hagemann et al., 2004), PRUDENCE, NARCCAP (Mearns et al., 2009), and ENSEMBLES (van der Linden and Mitchell, 2009) constituted major milestones in both regional model development and the usage of regional climate scenarios by the climate impact, adaptation and vulnerability community. Dynamical downscaling of GCM output can today be considered as a well-established standard technique for the generation of regional climate change scenarios. Recent climate scenario products tailored for use in climate impact assessment, such as (1) the CH2011 Swiss climate change scenarios (CH2011, 2011), (2) the German climate impacts and adaptation initiative (Jacob et al., 2008), (3) the German "consortium runs" (Hollweg et al., 2008), (4) the Styrian STMK12 (Klimaszenarien für die Steiermark) scenarios in the eastern Alps (Gobiet et al., 2012), (5) the French high-resolution climate scenarios (Lemond et al., 2011; Vautard et al., 2013a), or (6) the climate change scenarios for the Netherlands (van den  are in large part based on the analysis of RCM ensembles. Concerning the interplay between dynamical and statistical downscaling, recent climate impact applications suggest that a combination of the two approaches is optimal (e.g., Bosshard et al., 2013; Paeth, 2011). Apart from their role in climate scenario development, RCMs also became important tools to advance the understanding of regional-scale climate processes and associated feedbacks (e.g., Fischer and Schär, 2009; Hohenegger et al., 2009; Langhans et al., 2013; Seneviratne et al., 2006).
An integral part of regional model development is the evaluation and quantification of model performance by comparison against observation-based reference data. For this purpose, the standard procedure is to carry out evaluation experiments for the recent decades in a perfect boundary setting, i.e., applying reanalysis products as lateral boundary forcing for the regional model. Although atmospheric reanalyses themselves are based on imperfect models and considerable differences can exist between different reanalysis products with corresponding impacts on downscaling results (Brands et al., 2012), this technique allows isolating model biases introduced by the nesting procedure and/or the RCM formulation from biases introduced by a potentially erroneous large-scale forcing. Model evaluation in a perfect boundary context is an important component of RCM development. It highlights areas of model deficiencies, though without necessarily uncovering the physical reasons for the found biases. It is furthermore the basis for model calibration efforts (e.g., Bellprat et al., 2012b) and can be used for weighting individual RCMs in multimodel ensembles , and further studies in that Climate Research special issue) or for excluding models with identifiable severe shortcomings. A proper and physically consistent representation of the present-day climate by RCMs is generally considered a prerequisite for their ability to capture the response of regional climates to enhanced greenhouse gas conditions. As such, model evaluation results are an important piece of information provided to end users of regional climate projections.
A large number of previous studies have been concerned with RCM evaluation. Both perfect-boundary settings and GCM-driven setups, in which RCMs potentially inherit biases from the large-scale boundary forcing, were considered. Over Europe, comprehensive evaluations were carried out in the frame of large research projects such as PRUDENCE and ENSEMBLES. Similar but typically less comprehensive evaluation efforts have been conducted outside of Europe (e.g., Evans and McCabe, 2010; Kim et al., 2013; Nikulin et al., 2012; Paeth et al., 2005). Various aspects of model performance were covered, including long-term mean climatological distributions of temperature and precipitation (the two main parameters required by climate impact modelers; e.g., Bergant et al., 2007; Holtanova et al., 2012; Jacob et al., 2007, 2012; Jaeger et al., 2008; Kotlarski et al., 2005), but also explicitly addressing mesoscale structures  and frequency distributions of these two parameters (Déqué and Somot, 2010; Kjellström et al., 2010; as well as temperature trends (Lorenz and Jacob, 2010) and temperature variability (Fischer et al., 2012; Vidale et al., 2007). Elevation dependencies of near-surface air temperature and precipitation were evaluated by Kotlarski et al. (2012). Given the high impact potential, further studies were concerned with the evaluation of extreme precipitation (Frei et al., 2006; Hanel and Buishand, 2012; Herrera et al., 2010; Lenderink, 2010; Maraun et al., 2012; Rajczak et al., 2013; Wehner, 2013) and temperature (Fischer et al., 2007; Vautard et al., 2013b) as well as extreme wind speeds and related loss potentials (Donat et al., 2010; Kunz et al., 2010). Menut et al. (2013) proposed an evaluation of the key climate parameters driving the onset of air pollution episodes.
In order to enhance process understanding and to reveal potential reasons for biases in atmospheric quantities, also surface energy fluxes (Hagemann et al., 2004; Lenderink et al., 2007; Markovic et al., 2008) and nonatmospheric state parameters such as terrestrial water storage (Greve et al., 2013; and snow cover (Räisänen and Eklund, 2012; Salzmann and Mearns, 2012; Steger et al., 2013) have been evaluated. In Europe, several studies explicitly focused on RCM evaluation over the Alps, a region subject to a complex topography and a strong spatial variability of near-surface climates (Frei et al., 2003; Haslinger et al., 2013; Kotlarski et al., 2010; Prömmel et al., 2010; Smiatek et al., 2009; Suklitsch et al., 2008, 2011). In summary, the mentioned studies show that current RCMs are able to reproduce the most important climatic features at regional scales, particularly if driven by perfect-boundary conditions, but that important biases remain. Some of these deficiencies are specific to individual models. Others seem to be a common and more systematic feature across different RCMs, such as a dry and warm summer bias in southeastern Europe (Hagemann et al., 2004) and an overestimation of interannual summer temperature variability in central Europe (Fischer et al., 2012; Jacob et al., 2007;. Model biases typically depend on the region analyzed (Jacob et al., 2007, 2012; Rockel and Geyer, 2008), are partly related to parametric uncertainty and choices in model configuration (e.g., Awan et al., 2011; Bellprat et al., 2012a; de Elía et al., 2008; Evans et al., 2012) and can be affected by internal variability Roesch et al., 2008) as well as by uncertainties of the observational reference data themselves (Bellprat et al., 2012a; Kotlarski et al., 2005; Kyselý and Plavcová, 2010). For certain quantities and seasons a higher grid resolution seems to be associated with reduced biases (Déqué and Somot, 2008; Herrmann et al., 2011; Rauscher et al., 2010;.
Concerning the use of RCM projections for climate impact assessment, recent studies suggest a nonstationarity of model biases (Bellprat et al., 2013; Boberg and Christensen, 2012; Buser et al., 2009; Christensen et al., 2008; Ehret et al., 2012; Maraun, 2012), questioning the widely used constant-bias assumption when interpreting simulated climate change signals and challenging bias correction techniques.
While RCM projections from projects such as PRUDENCE and ENSEMBLES are widely used by the climate impact community and are considered as state-of-the-art, the next generation of regional climate projections is already under way in the frame of the CORDEX (Coordinated Regional Climate Downscaling Experiment) initiative (Giorgi et al., 2009). CORDEX aims to provide an internationally coordinated framework to compare, improve and standardize regional climate downscaling methods, covering both dynamical and empirical-statistical approaches. As part of this effort, model evaluation activities in the individual modeling centers are harmonized and a new generation of regional climate projections for land regions worldwide based on new CMIP5 (Coupled Model Intercomparison Project) GCM projections will be produced. First joint evaluations of CORDEX RCM experiments have recently been published by Nikulin et al. (2012) and Vautard et al. (2013b). EURO-CORDEX, the European branch of CORDEX, provides regional climate projections for Europe at grid resolutions of about 12 and 50 km, applying an ensemble of RCMs in their most recent versions, driven by the latest GCM projections, thereby complementing the already available PRUDENCE and ENSEMBLES data with unprecedented high-resolution experiments. In its initial phase EURO-CORDEX focuses on model evaluation for present-day climate in a perfect boundary setting. Several aspects of model performance are analyzed by project partners in a series of ongoing studies. The present work is primarily concerned with evaluating the "standard" variables near-surface air temperature (simply referred to as temperature hereafter) and precipitation on European scales and based on monthly and seasonal mean values. These two quantities are typically evaluated by the individual modeling centers in the course of model development and tuning, and European-scale observational reference data exist.
Furthermore, temperature and precipitation change signals are used by many climate impact assessments, and the ability of RCMs to reproduce these quantities is useful information for a wide range of end users. In order to include dynamical aspects, we additionally evaluate the representation of the large-scale mean sea-level pressure. Although simulations carried out at grid resolutions of both 12 and 50 km are analyzed, we do not specifically aim to investigate the added value of a higher resolution. This would require reliable observation-based data sets at the European scale with equivalent resolution, which are not available. Added value assessments are therefore allocated to a suite of accompanying studies evaluating aspects such as extreme precipitation characteristics over subdomains of the European continent where corresponding reference data exist (see Sect. 5.1 for further details). The primary aims of the present study are (1) to document the skill of the EURO-CORDEX RCM ensemble in reproducing the present-day European temperature and precipitation climate when driven by realistic boundary conditions, (2) to quantify modeling uncertainties originating from model formulation, (3) to assess a possible progress with respect to the precursor project ENSEMBLES, and (4) to highlight areas of necessary model improvements. For this purpose, we will apply several evaluation metrics covering a range of aspects of model performance. Our study provides a general overview of model performance and is of rather descriptive nature; it does not aim to ultimately explain biases of individual models. We leave these more-detailed investigations to a range of follow-up studies that will address specific aspects of model performance.
www.geosci-model-dev.net/7/1297/2014/ Geosci. Model Dev., 7, 1297-1333, 2014
The study is organized in the following way: after introducing the RCM ensembles and the observational reference data in Sect. 2, Sect. 3 outlines the evaluation methods applied and introduces the individual performance metrics. Section 4 then presents the evaluation results for the EURO-CORDEX ensemble and relates them to the previous ENSEMBLES experiments. The results are further discussed in Sect. 5, highlighting the basic model capabilities identified as well as remaining deficiencies in the simulation of the European climate. Section 6 finally concludes the study and provides an outlook on future evaluation activities in the EURO-CORDEX framework.

RCM data
We evaluate a set of 17 RCM simulations carried out in the frame of EURO-CORDEX. In total, six different RCMs plus the global ARPEGE model were applied by nine different institutions at grid resolutions of about 12 km (0.11° on a rotated grid) and 50 km (0.44° on a rotated grid). Eight out of the nine 0.11° experiments have a corresponding partner at 0.44° grid spacing, carried out with the identical model version and the identical choice of parameterizations (with the exception of REMO, where rain advection is used for the 0.11° experiments but not for 0.44°). All simulations cover the period 1989-2008 and are driven by the ERA-Interim reanalysis (Dee et al., 2011), providing the required atmospheric lateral boundary conditions and sea surface temperatures and sea ice cover over ocean surfaces. The ERA-Interim boundary conditions can be considered to be of very high quality (Dee et al., 2011), particularly in the Northern Hemisphere extratropics where reanalysis uncertainty is negligible (Brands et al., 2013). The prescribed surface forcing over land (e.g., topography, vegetation characteristics, soil texture) is model-specific and can differ between the experiments. For instance, three out of the nine RCM setups analyzed (CLMCOM, KNMI, SMHI) apply a considerable smoothing to surface orography in order to avoid steep orographic grid-cell-to-grid-cell gradients. The ensemble includes three different configurations of the WRF model that differ mainly in the choice of physical parameterization schemes for radiation transport, microphysics and convection (see Table 1). The individual regional model domains can slightly differ from each other, but all models fully cover the focus domain required for EURO-CORDEX experiments (Fig. 1) and apply an additional lateral sponge zone of individual width for boundary relaxation. A special case is CNRM's ARPEGE model, which is a global spectral model with a stretched horizontal grid.
ARPEGE was applied here in a special regional setup in which the model is strongly relaxed towards ERA-Interim outside of the common EURO-CORDEX domain (Fig. 1). In the interior domain, the model runs at resolutions of about 12 and 50 km, respectively, and is slightly nudged towards the driving reanalysis. To some extent, the EURO-CORDEX ARPEGE experiments can therefore be considered as RCM simulations with a global sponge zone. An overview of all models and all experiments is provided by Table 1. The set of analyzed experiments corresponds to the currently available ERA-Interim-driven EURO-CORDEX ensemble, which might be subject to future extensions. Throughout this paper, the individual simulations will be identified by the acronym of the institution plus the horizontal grid resolution (11 for 0.11° and 44 for 0.44°). For instance, the CCLM experiment carried out at 0.11° by the CLM Community will be referred to as CLMCOM-11. Experiments that were not carried out on the standard 0.11° and 0.44° rotated grids but with comparable grid spacings (e.g., CNRM-11 and CNRM-44) were mapped onto the standard grids applying the nearest-neighbor interpolation method.
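The nearest-neighbor mapping mentioned above can be illustrated with a minimal sketch. It is not the remapping tool used for the study: the function names `to_unit_vectors` and `nearest_neighbor_remap` are hypothetical, geographic (unrotated) latitude/longitude coordinates are assumed for both grids, and a brute-force search is used, which is only practical for small grids.

```python
import numpy as np


def to_unit_vectors(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to 3-D unit vectors on the sphere."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.stack(
        [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)],
        axis=-1,
    )


def nearest_neighbor_remap(src_lat, src_lon, src_values, dst_lat, dst_lon):
    """Assign to each target point the value of the closest source point.

    Hypothetical helper: brute force, memory grows as n_target * n_source.
    """
    src_xyz = to_unit_vectors(np.ravel(src_lat), np.ravel(src_lon))
    dst_xyz = to_unit_vectors(np.ravel(dst_lat), np.ravel(dst_lon))
    # The largest dot product corresponds to the smallest great-circle distance.
    nearest = np.argmax(dst_xyz @ src_xyz.T, axis=1)
    return np.ravel(src_values)[nearest].reshape(np.shape(dst_lat))


# Four source points with values 1..4, two target points close to the
# first and the last source point, respectively.
src_lat = np.array([0.0, 0.0, 10.0, 10.0])
src_lon = np.array([0.0, 10.0, 0.0, 10.0])
src_val = np.array([1.0, 2.0, 3.0, 4.0])
remapped = nearest_neighbor_remap(
    src_lat, src_lon, src_val, np.array([1.0, 9.0]), np.array([1.0, 9.0])
)
print(remapped)
```

Nearest-neighbor mapping leaves the simulated values themselves unchanged, which is a plausible reason to prefer it when the source and target grid spacings are comparable; for real model grids a spatial index (e.g., a k-d tree) would replace the brute-force search.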
For comparing the performance of the EURO-CORDEX ensembles to that of the precursor project ENSEMBLES we additionally consider 16 RCM experiments carried out within the frame of ENSEMBLES with a horizontal grid resolution of about 25 km (0.22° on a rotated grid). These experiments cover a similar domain and were driven by the ERA40 reanalysis (Uppala et al., 2005) for the period 1961-2000. In the present study only the 20-year period 1981-2000 is considered, including the 12 years 1989-2000 that overlap with the EURO-CORDEX ensembles. The application of different large-scale driving fields in ENSEMBLES (ERA40) and EURO-CORDEX (ERA-Interim) can be expected to introduce slight inconsistencies in the intercomparison. The overall effect, however, is presumably small (see  for an example over North America). Following the naming convention institution-model according to http://ensemblesrt3.dmi.dk/extended_table.html, the 16 ENSEMBLES experiments considered are C4I-RCA3, CHMI-Aladin, CNRM-Aladin, DMI-HIRHAM, EC-GEMLAM, ETHZ-CLM, HC-HadRM3Q0, HC-HadRM3Q3, HC-HadRM3Q16, ICTP-RegCM, KNMI-RACMO, METNO-HIRHAM, MPI-REMO, OURANOS-CRCM, SMHI-RCA and UCLM-PROMES. This ensemble will be referred to as ENS-22 in the following.

Observations
As observational reference for evaluating simulated temperature and precipitation we use version 7 of the daily gridded E-OBS data set (Haylock et al., 2008). E-OBS covers the entire European land surface and is based on the ECA&D (European Climate Assessment and Dataset) station data set plus more than 2000 further stations from different archives. It is available at four different resolutions; we here use the rotated 0.22° version, which applies the same grid rotation as most of the EURO-CORDEX and ENSEMBLES experiments. The E-OBS 0.22° grid corresponds to a horizontal resolution of about 25 km and exactly matches the grid of the 0.22° ENSEMBLES simulations. Each E-OBS 0.22° grid cell contains four cells of the rotated 0.11° EURO-CORDEX grid, and four E-OBS 0.22° cells exactly match one rotated 0.44° EURO-CORDEX cell. Several previous studies have questioned the quality of E-OBS in regions of sparse station density and particularly regarding daily extremes (Bellprat et al., 2012a; Herrera et al., 2012; Hofstra et al., 2009, 2010; Kyselý and Plavcová, 2010; Maraun et al., 2012; Rajczak et al., 2013) and its effective spatial resolution (e.g., Hanel and Buishand, 2011; Kyselý and Plavcová, 2010). Since the density of the station network is rather low over a considerable part of Europe, the gridding procedure tends to smooth the spatial variability of both temperature and precipitation, and over many regions the effective resolution of E-OBS is presumably lower than the nominal 0.22° grid spacing. For individual subregions of the European continent more accurate data sets that are based on a larger number of observation stations might exist. The clear advantage of E-OBS is its spatial (entire European land surface) and temporal coverage, which makes it ideal for an approximate evaluation of RCM-simulated temperature and precipitation characteristics over Europe.
As observational uncertainties are not explicitly considered here, potential inaccuracies of E-OBS should however be kept in mind when interpreting the evaluation results. In addition to the issues mentioned above, this applies also to E-OBS precipitation sums, which do not reflect the systematic undercatch of rain gauge measurements (which on average can be of the order of 4-50 % depending on the season and region; e.g., Frei et al., 2003; Rubel and Hantel, 2001; Sevruk, 1986) and very likely underestimate true precipitation. To account for this inaccuracy of the observational reference, we deliberately highlight precipitation biases between 0 and +25 % in some of the analyses. Wet biases in this range could be explained by a mean systematic rain gauge undercatch of up to 20 % of true precipitation (i.e., neglecting any seasonal and site-specific variation of the measurement error). Furthermore, note that E-OBS is only available at a maximum spatial resolution of 0.22°. The 0.11° EURO-CORDEX experiments can therefore only be evaluated on the coarser E-OBS grid and an in-depth added-value analysis of the 0.11° experiments compared to the 0.44° simulations is not possible within this framework. For the evaluation of the spatial pattern of the simulated mean sea-level pressure, the driving reanalysis ERA-Interim itself is used as reference, i.e., the analysis reveals to what extent the individual RCMs distort the large-scale flow imposed by the boundary conditions.
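The link between a 20 % undercatch and the highlighted 0 to +25 % bias range follows from simple arithmetic: if the gauges record only a share 1 − u of true precipitation, a simulation that matched the truth would appear too wet by u / (1 − u) relative to the observations. The short sketch below illustrates this; the function name `apparent_wet_bias` is hypothetical, and a spatially and seasonally constant undercatch is assumed, as in the text.

```python
def apparent_wet_bias(undercatch_fraction: float) -> float:
    """Relative bias (in %) that a model matching true precipitation would
    show against gauge data missing `undercatch_fraction` of the truth."""
    if not 0.0 <= undercatch_fraction < 1.0:
        raise ValueError("undercatch_fraction must be in [0, 1)")
    observed_share = 1.0 - undercatch_fraction  # gauge catch / true precipitation
    return 100.0 * (1.0 / observed_share - 1.0)


# Apparent wet bias for a few assumed undercatch levels
for u in (0.04, 0.10, 0.20):
    print(f"undercatch {u:.0%} -> apparent wet bias {apparent_wet_bias(u):+.1f} %")
```

Under this assumption an undercatch of 20 % maps onto the upper end of the highlighted range (+25 %), while larger undercatch values would imply even larger apparent wet biases.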

Regional analysis
In order to capture the spatial variability of model performance over Europe, the individual evaluation metrics (see below) were applied to eight different subdomains of the European continent: the Alps (AL), the British Isles (BI), Eastern Europe (EA), France (FR), the Iberian Peninsula (IP), the Mediterranean (MD), Mid-Europe (ME), and Scandinavia (SC). These domains have been specified in the frame of the PRUDENCE project and have since then been widely used for RCM evaluation and analysis of climate change signals (e.g., Bellprat et al., 2012b; Christensen et al., 2008; Kotlarski et al., 2012; Lenderink, 2010; Lorenz and Jacob, 2010). They represent comparatively homogeneous climatic conditions, although pronounced climatic gradients can exist within individual subdomains. The Alpine domain AL, for instance, covers both high-elevation regions along the Alpine ridge and the low-lying Po Valley in northern Italy. Still, the decomposition of the EURO-CORDEX domain into these eight subdomains allows representing important large-scale climatic gradients (e.g., the transition from maritime climates in the west to continental climates in the east). In the main part of this study the results for only four subdomains are shown, sampling a wide range of climatic settings (EA, IP, ME, SC). For completeness, figures for the remaining subdomains (AL, BI, FR, MD) are presented in Appendix B.

Evaluation metrics
Besides the analysis of seasonal mean biases at grid-point scale for the EUR-11 ensemble and the entire EURO-CORDEX domain, we apply several evaluation metrics to monthly, seasonal (winter: DJF, spring: MAM, summer: JJA, autumn: SON) and annual mean values of temperature and precipitation for all experiments of the EUR-11, EUR-44 and ENS-22 ensembles. These metrics are well-established distance measures that assess the quality of (regional) climate simulations by comparison against a gridded observational reference. They represent spatial and temporal bias characteristics and demonstrate the unavoidable spread of model performances in the reproduction of present-day regional climate. As our aim is not to produce an overall skill score that could be used for model weighting but to document different aspects of model performance, the metrics are presented individually and are not combined into some final performance score. The short evaluation period, leading to a sample size of only 20 seasonal/annual means, also hampers a sound analysis of statistical robustness. We therefore explicitly refrain from assessing the statistical significance of the detected model biases and also do not address any trends of climate parameters. The following metrics are used (exact mathematical formulations are provided in Appendix A; the term "climatological" refers to mean values over the 20-year period 1989-2008):
BIAS: the difference (model − reference) of spatially averaged climatological annual or seasonal mean values for a selected subregion (relative difference for precipitation).
95 %-P: the 95th percentile of all absolute grid cell differences (model − reference) across a selected subregion based on climatological annual or seasonal mean values (relative difference for precipitation).
PACO: the spatial pattern correlation between climatological annual or seasonal mean values of model and reference data across all grid points of a selected subregion.
RSV: ratio (model over reference) of spatial standard deviations across all grid points of a selected subregion of climatological annual or seasonal mean values.
TCOIAV: temporal correlation of interannual variability between model and reference time series of spatially averaged annual or seasonal mean values of a selected subregion.
RIAV: ratio (model over reference) of temporal standard deviations of interannual time series of spatially averaged annual or seasonal mean values of a selected subregion.
CRCO: Spearman rank correlation between spatially averaged monthly values of model and reference data of the climatological mean annual cycle of a selected subregion.
ROYA: ratio (model over reference) of yearly amplitudes (differences between maximum and minimum) of spatially averaged monthly values of the climatological mean annual cycle of a selected subregion.
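To make the verbal definitions above concrete, the following NumPy sketch computes all eight metrics for a single subregion. It is an illustration only and not the Appendix A formulation: the function names are hypothetical, spatial averages are unweighted (no grid-cell area weights), sample standard deviations (`ddof=1`) are assumed, relative precipitation differences are taken with respect to the reference and require nonzero reference values, and the Spearman correlation is computed as the Pearson correlation of ranks without tie handling.

```python
import numpy as np


def _pearson(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


def _ranks(x):
    """Ranks 0..n-1 of a 1-D array (ties are not averaged in this sketch)."""
    ranks = np.empty(len(x))
    ranks[np.argsort(x)] = np.arange(len(x))
    return ranks


def spatial_metrics(model_clim, ref_clim, relative=False):
    """BIAS, 95%-P, PACO and RSV from climatological means, one value per
    grid cell of the subregion. Set relative=True for precipitation."""
    m, r = np.asarray(model_clim, float), np.asarray(ref_clim, float)
    bias = m.mean() - r.mean()
    cell_diff = m - r
    if relative:  # express differences in % of the reference
        bias = 100.0 * bias / r.mean()
        cell_diff = 100.0 * cell_diff / r
    return {
        "BIAS": float(bias),
        "95%-P": float(np.percentile(np.abs(cell_diff), 95)),
        "PACO": _pearson(m, r),
        "RSV": float(m.std(ddof=1) / r.std(ddof=1)),
    }


def interannual_metrics(model_series, ref_series):
    """TCOIAV and RIAV from regional-mean time series, one value per year."""
    m, r = np.asarray(model_series, float), np.asarray(ref_series, float)
    return {"TCOIAV": _pearson(m, r), "RIAV": float(m.std(ddof=1) / r.std(ddof=1))}


def annual_cycle_metrics(model_cycle, ref_cycle):
    """CRCO and ROYA from 12 regional-mean climatological monthly values."""
    m, r = np.asarray(model_cycle, float), np.asarray(ref_cycle, float)
    return {
        "CRCO": _pearson(_ranks(m), _ranks(r)),
        "ROYA": float((m.max() - m.min()) / (r.max() - r.min())),
    }


# Synthetic example: a "model" that is 0.8 degrees too cold plus noise
rng = np.random.default_rng(0)
ref_t = 10.0 + rng.normal(0.0, 2.0, size=500)
mod_t = ref_t - 0.8 + rng.normal(0.0, 0.5, size=500)
print(spatial_metrics(mod_t, ref_t))
```

Keeping the three groups (spatial, interannual, annual cycle) in separate functions mirrors the fact that they operate on different inputs: climatological maps, yearly regional means, and the 12-month climatological cycle, respectively.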

Regridding
Several evaluation metrics require a grid-cell-by-grid-cell comparison between models and observations. Consequently, a remapping of either the EURO-CORDEX RCM output or of E-OBS to a common reference grid was necessary prior to the analysis. In order to ensure a fair evaluation, our strategy was to always use the coarser grid as reference, except for mean sea-level pressure (see below). This means that (1) the evaluation of the EUR-11 ensemble was carried out on the coarser 0.22° E-OBS grid, and that (2) the evaluation of the EUR-44 ensemble was carried out on the coarser 0.44° EURO-CORDEX grid. Because mean sea-level pressure has a large-scale structure and no quantitative grid cell metrics were calculated for this variable, the comparison between the EUR-11 simulations and the ERA-Interim reference data has also been carried out on the 0.22° E-OBS grid. For visualizing the spatial pattern of temporal mean biases, the coarser ERA-Interim geographic grid was therefore projected onto the finer (rotated) E-OBS grid.
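Because each 0.22° E-OBS cell contains four 0.11° cells, and four 0.22° cells make up one 0.44° cell, moving a field to the next coarser grid can be pictured as averaging 2 × 2 blocks. The sketch below illustrates this idea only; the exact remapping procedure of the study is not reproduced here. The function name `block_mean` is hypothetical, and the fine grid is assumed to be perfectly aligned with the coarse grid and free of missing values.

```python
import numpy as np


def block_mean(field, factor=2):
    """Average non-overlapping factor x factor blocks of a 2-D field."""
    field = np.asarray(field, float)
    ny, nx = field.shape
    if ny % factor or nx % factor:
        raise ValueError("grid dimensions must be multiples of the factor")
    # Split each axis into (coarse index, position within block) and
    # average over the two within-block axes.
    blocks = field.reshape(ny // factor, factor, nx // factor, factor)
    return blocks.mean(axis=(1, 3))


# A 4 x 4 fine-grid field aggregated to a 2 x 2 coarse grid
fine = np.arange(16.0).reshape(4, 4)
coarse = block_mean(fine)
print(coarse)
```

With real data, cells that are missing in the reference (e.g., ocean points in a land-only data set) would need explicit handling, for instance by masking before averaging.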

Spatial bias pattern
Figures 2-4 provide an overview of the spatial distribution of the 20-year mean winter and summer model biases of the EUR-11 ensemble for temperature, precipitation and mean sea-level pressure. For temperature, and in agreement with previous studies (see Sect. 1), this evaluation indicates a good reproduction of the spatial temperature variability by the RCMs, including the north-south temperature gradient and elevation effects (Fig. 2). Still, important biases can occur in individual experiments. In wintertime, temperatures are typically underestimated over large parts of the domain. The largest negative biases exceeding −3 °C are found in northeastern Europe (IPSL-INERIS, CRP-GL, CSC), in Norway (CNRM, KNMI) and along the Alpine ridge (IPSL-INERIS, CRP-GL, CNRM, CSC, SMHI, KNMI). Only two models show a strong warm bias of more than +3 °C over parts of Scandinavia (UHOH) and northeastern Europe (CNRM). CSC and IPSL-INERIS overestimate winter temperatures in the southeast. For a number of RCMs the cold temperature bias, which is widespread in winter, is also found in summer (SMHI, KNMI, DMI). These cold biases, however, are generally less pronounced than in winter and most models have a tendency to overestimate summer temperature in the southeast. CLMCOM and CSC show a pronounced warm summer bias over most parts of southern Europe. A notable feature of the temperature evaluation is the fact that the bias range spanned by the three WRF experiments alone (IPSL-INERIS, UHOH, CRP-GL) nearly corresponds to the bias range of the entire EUR-11 ensemble. This is especially true in wintertime, but does not apply to the southern European warm summer biases, which are largest in CLMCOM, CSC and DMI. A further conspicuous feature of Fig. 2 is the pronounced small-scale spatial variability of temperature biases in CNRM, which is apparently related to orographic patterns.
Concerning mean seasonal precipitation, the evaluation indicates a wet wintertime bias of most models over most parts of Europe (Fig. 3). Biases of more than 50 % are obtained over the central and eastern regions. In contrast, winter precipitation amounts over parts of southern Europe (Portugal, northern Italy) are underestimated in most cases. CNRM shows a dry wintertime bias over large parts of the study area. In summer, most experiments overestimate precipitation sums in northern and northeastern Europe, while three models show a pronounced dry bias in the Mediterranean region (CNRM, CLMCOM, DMI). Again, CNRM considerably underestimates precipitation over most of Europe and, as for temperature, the precipitation bias shows a pronounced variability in space. In contrast to temperature, the three WRF experiments mostly agree in their precipitation bias pattern in winter with a widespread overestimation. In summer, UHOH underestimates precipitation over parts of northern Europe and, hence, shows a slightly different behavior than CRP-GL and IPSL-INERIS, which overestimate summer precipitation over the whole analysis domain. Southern European summer precipitation is considerably overestimated by all WRF experiments. A possible reason for the different behavior of UHOH compared to CRP-GL and IPSL-INERIS with respect to summer precipitation over parts of northern Europe is the choice of different microphysics schemes (two-moment scheme in UHOH, one-moment scheme in IPSL-INERIS and CRP-GL). All models, except CNRM, show a pronounced wet bias along the eastern boundary, which may indicate problems with the lateral boundary conditions of the limited area models (e.g., inconsistent velocity and humidity gradients between the RCMs' regional solutions and the ERA-Interim boundary forcing in the lateral sponge zone).
In contrast, CNRM uses a global grid and, by definition, a very large sponge zone with a comparatively weak relaxation, which likely provides a smoother transition of the prescribed outer boundary conditions into the inner model domain and avoids spurious boundary effects. No difference in the spatial variability of precipitation biases between RCMs that apply a strong smoothing of surface orography (CLMCOM, KNMI, SMHI) and those applying a nonfiltered orography (all others) can be identified. This might partly be related to the averaging of simulated precipitation at 0.11° to the 0.22° E-OBS grid prior to the analysis.
To complete the overview of the spatial pattern of model biases and to provide a better handle on dynamical aspects of bias characteristics, Fig. 4 presents an evaluation of mean winter and summer mean sea-level pressure. In both seasons the RCMs reproduce the large-scale pattern of mean sea-level pressure fairly well and biases typically do not exceed 3 hPa. The bias pattern is generally smooth and has a large-scale structure in most cases. Exceptions are (a) the SMHI model, which shows a small-scale but strong overestimation in the northwestern corner of the analysis domain and an underestimation over continental Europe in winter, leading to a reduced meridional pressure gradient, and (b) the WRF experiments (IPSL-INERIS, UHOH and CRP-GL), which underestimate mean sea-level pressure over continental Europe in both seasons and, in the case of UHOH, also in the northwestern corner in summer. A particular feature of the WRF experiments is their agreement on a pronounced negative bias over mountainous terrain in winter (Scandinavian Alps, European Alps, Carpathians, Balkan Mountains) and the small-scale structure of the bias pattern, which is not found in the other models (except for positive summer biases over mountainous regions in CNRM and KNMI). This indicates a contribution of the model-specific method to reduce simulated surface pressure to mean sea level, and the pronounced biases in the mentioned regions should not be overinterpreted. Still, the underestimation of mean sea-level pressure by several hectopascals over large parts of continental Europe, particularly in wintertime, seems to be a robust feature of the WRF experiments and is also described by Mooney et al. (2013) in a sensitivity study of WRF in Europe.
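To illustrate why the reduction method matters over high terrain, the sketch below applies a generic textbook hypsometric reduction with a fictitious air column below the surface. This is an assumption made for illustration only and is not the reduction used by WRF or by any other model of the ensemble; the constants, the function name `reduce_to_msl` and the sample values are likewise hypothetical.

```python
import math

G = 9.80665      # gravitational acceleration, m s-2
R_D = 287.05     # gas constant of dry air, J kg-1 K-1
LAPSE = 0.0065   # assumed lapse rate in the fictitious column, K m-1


def reduce_to_msl(p_surface_hpa, t_surface_k, height_m):
    """Reduce surface pressure to sea level using a mean column temperature."""
    t_mean = t_surface_k + 0.5 * LAPSE * height_m
    return p_surface_hpa * math.exp(G * height_m / (R_D * t_mean))


# Identical surface pressure at 2000 m, but surface temperatures 5 K apart.
p_cold = reduce_to_msl(800.0, 270.0, 2000.0)
p_warm = reduce_to_msl(800.0, 275.0, 2000.0)
difference = p_cold - p_warm  # several hPa for this hypothetical case
```

In this hypothetical case a 5 K difference in the temperature entering the reduction changes the reduced pressure by roughly 4 to 5 hPa, i.e., by more than the typical bias magnitude quoted above, while at zero elevation the reduction has no effect.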

Temporal and spatial means
The regionally averaged biases in mean seasonal and annual temperature and precipitation of both the EUR-11 and the EUR-44 ensemble are summarized in Figs. 5 and 6 (and Figs. B1, B2). For temperature the analysis reveals a cold bias of up to −2 °C for most models, most seasons and most subdomains. Exceptions are the CSC simulations that mostly show a slight warm bias as well as the tendency of both ensembles to overestimate summer temperatures over southern and southeastern Europe (subdomains EA, IP and MD). While CNRM, KNMI and SMHI are mostly located at the cold end of the model range, temperatures in CLMCOM and CSC are in many cases higher than in the rest of the ensemble. No obvious benefit of the higher resolution (EUR-11 vs. EUR-44) is apparent. The 0.11° experiment of a given model performs worse or better than the corresponding 0.44° experiment depending on season and subdomain. A systematic difference between both resolutions can be detected only for SMHI and KNMI where the higher resolution tends to produce lower temperatures in all seasons and regions compared to the coarse-resolution setup. A slightly different result is obtained for regionally averaged precipitation biases, which are positive in most cases and, for many models, tend to be larger in the 0.11° experiments due to higher precipitation sums compared to the 0.44° versions. This is especially true for the SMHI model, which shows a much stronger overestimation of precipitation at 0.11° grid resolution compared to 0.44° across all seasons and subdomains. Special cases are the British Isles (BI) with a dry bias in many experiments in winter, summer and autumn (especially of the 0.44° versions) as well as subdomains AL, EA, FR and IP with a dry summer bias in many experiments. The precipitation biases of the three WRF experiments (CRP-GL, IPSL-INERIS, UHOH) are in many cases close to each other and do not sample the full range of model uncertainty.
In general, the precipitation bias ranges from −40 to +80 %. Only the UHOH model shows exceptionally high deviations larger than +140 % in summer for regions IP and MD. Again, CNRM shows a special behavior and is often found at the dry end of the model range.
For individual seasons and subdomains, wet model biases are mostly smaller than 25 % and could, in principle, be explained by an observational undercatch of up to 20 % of true precipitation.
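The link between an undercatch of 20 % and a bias threshold of 25 % follows from simple arithmetic, sketched below. If gauges record only a fraction (1 − u) of true precipitation, a model that simulates the true amount exactly appears too wet by u / (1 − u) relative to the observations. The function name `apparent_wet_bias` is hypothetical.

```python
def apparent_wet_bias(undercatch_fraction):
    """Relative bias (%) of a perfect model against gauges that miss part of the true precipitation."""
    observed_share = 1.0 - undercatch_fraction  # observed amount per unit of true precipitation
    return 100.0 * undercatch_fraction / observed_share


bias_20 = apparent_wet_bias(0.20)  # apparent wet bias for a 20 % undercatch
```

For u = 0.2 the apparent bias is 0.2 / 0.8, i.e., 25 % of the observed amount.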
As the BIAS metric represents model biases averaged over a given subregion, compensating effects might arise; i.e., a small BIAS value might be the result of large negative and large positive biases over different parts of a given subdomain compensating each other. To identify such effects, the 95 %-P metric explores the 95th percentile of absolute biases at grid-point scale within each subdomain. For temperature (Figs. 7, B3) this metric mostly lies within the 1-3 °C range. Larger values are obtained for the topographically more structured subdomains SC and AL, which might partly be a result of the simplifying assumption of a temporally and spatially constant lapse rate used for elevation correction (see Sect. 3.3). The 95 %-P metric does not strongly modify the ranking of the models/experiments; i.e., models/experiments that show a small (large) BIAS typically also show a small (large) 95 %-P. Hence, the spatially averaged BIAS metric already provides a fairly good impression of model performance and is not too much affected by compensating effects.
Geosci. Model Dev., 7, 1297-1333
Again, an exception to this is CNRM-11, which typically shows a noticeable behavior with large 95 %-P values, while the BIAS metric for this experiment is not as special (though it typically also shows the largest biases). No systematic improvement of the 0.11° experiments with respect to their 0.44° counterparts can be identified for 95 %-P. In case of SMHI and CNRM the higher resolution models (representing stronger variations of topography) produce even larger peak deviations in subdomains SC and AL than their coarser resolved counterparts. For precipitation (Figs. 8, B4), 95 %-P mostly lies in the 50-100 % range but can be considerably larger (up to 400 %) for the southern European subdomains IP and MD. The latter can be explained by the relative definition of 95 %-P and the small precipitation sums in these regions especially during summer (cf. Fig. 3). This can lead to a large relative overestimation of precipitation by a particular model, although the absolute biases are small. Large 95 %-P values are also obtained for the European Alps (AL) especially for DMI and SMHI, which is the result of a pronounced overestimation of precipitation along the Alpine ridge in combination with a strong dry bias over the low-lying Po Valley south of the Alps (cf. Fig. 3). Especially for DMI these compensating effects of diverging precipitation biases within subdomain AL are not apparent from the BIAS metric (Fig. B2) but only from 95 %-P (Fig. B4). For all subdomains, 95 %-P values are typically larger than 25 % and, in case these values correspond to wet model biases, cannot be explained by an observational undercatch of up to 20 % of true precipitation.
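The compensating effect described above can be made concrete with a small sketch that computes both metrics for a temperature field. It assumes unweighted grid cells and a percentile with linear interpolation between order statistics; the exact percentile definition of the study is not restated here, and the function names and sample values are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def bias_and_p95(model, obs):
    """Subdomain-mean bias and 95th percentile of absolute grid-point biases."""
    diffs = [m - o for m, o in zip(model, obs)]
    mean_bias = sum(diffs) / len(diffs)
    return mean_bias, percentile([abs(d) for d in diffs], 95)


# Hypothetical temperatures (deg C): +2 over one half of the subdomain, -2 over the other.
obs = [10.0] * 10
model = [12.0] * 5 + [8.0] * 5
mean_bias, p95 = bias_and_p95(model, obs)
```

Here the mean bias vanishes although every grid cell is off by 2 °C, which only the percentile of absolute biases reveals.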

Spatial variability
The performance of the EUR-11 and EUR-44 ensembles with respect to the spatial variability of mean winter and mean summer temperature and precipitation within individual subdomains (i.e., at grid-box scale) is explored by the Taylor diagrams of Figs. 9 and 10 (and Figs. B5 and B6 for further subdomains). For temperature, the ratio of spatial variability (RSV) typically indicates an overestimation, particularly in summertime and by up to 50 %. RSVs larger than 1.5 are obtained for CNRM and SMHI in a few cases. Wintertime RSVs are typically smaller and the spatial variability is often underestimated (RSV < 1). The systematic difference between summer and winter RSVs over many subdomains leads to a clustering of the respective markers for summer (triangles) and winter (circles) in these regions (EA, IP, SC, FR). The pronounced overestimation of spatial temperature variability by CNRM-11 over most parts of Europe is very likely related to the large spatial variability of the mean seasonal model bias (cf. Sect. 4.1). For most experiments and most subdomains the centered root-mean-square (rms) difference between simulation and observational reference amounts to less than 50 % of the observed spatial standard deviation. Overall, systematic differences in model skill between the 0.11° and the 0.44° versions (filled markers compared to nonfilled markers) are not found. Similarly to temperature, the spatial variability of mean winter and mean summer precipitation is typically overestimated by the experiments (Figs. 10, B6); RSVs are mostly located between 1 and 2. A stronger overestimation is found for the Mediterranean (MD) subdomain and in particular for the DMI model with RSVs of up to 4. Compared to temperature, the spatial pattern correlation of mean seasonal precipitation is much lower and PACO typically amounts to between 0.4 and 0.9 only. Whether a better performance is obtained for winter or summer (circles compared to triangles) considerably depends on the subdomain.
There is no apparent systematic difference in model skill between the high- and the low-resolution versions (filled compared to nonfilled markers). The centered root-mean-square difference between models and observations, expressed in units of the observed standard deviation, is typically found in the range between 50 and 200 % (RSVs between 0.5 and 2).
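The three quantities combined in a spatial Taylor diagram can be sketched as follows. The example assumes unweighted grid cells and population standard deviations; the function name `taylor_stats` and the sample values are hypothetical. By the geometry of the diagram, the normalized centered rms difference E satisfies E² = 1 + RSV² − 2 · RSV · PACO.

```python
import math


def taylor_stats(model, obs):
    """Return (pattern correlation, std ratio, normalized centered rms difference)."""
    n = len(obs)
    m_mean, o_mean = sum(model) / n, sum(obs) / n
    m_anom = [m - m_mean for m in model]
    o_anom = [o - o_mean for o in obs]
    m_std = math.sqrt(sum(a * a for a in m_anom) / n)
    o_std = math.sqrt(sum(a * a for a in o_anom) / n)
    corr = sum(a * b for a, b in zip(m_anom, o_anom)) / (n * m_std * o_std)
    # Centered: the mean bias has been removed from both fields.
    crmsd = math.sqrt(sum((a - b) ** 2 for a, b in zip(m_anom, o_anom)) / n)
    return corr, m_std / o_std, crmsd / o_std


# Hypothetical grid-cell values: the model doubles the observed spatial contrast.
obs = [1.0, 2.0, 3.0, 4.0]
model = [3.0, 5.0, 7.0, 9.0]
paco, rsv, e_norm = taylor_stats(model, obs)
```

In this example the pattern is perfectly correlated, yet the overestimated spatial variability (RSV of 2) alone produces a normalized rms distance of 1, and the constant offset between both fields does not enter any of the three numbers.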

Interannual variability
The Taylor diagrams of Figs. 11 and 12 (and Figs. B7 and B8 for further subdomains) combine the parameters TCOIAV and RIAV, which assess the model performance with respect to the temporal (interannual) variability of mean winter and mean summer temperature and precipitation, based on regional averages over each subdomain. For winter temperature, temporal correlations are mostly larger than 0.9 while the results are worse for the summer season (Figs. 11, B7). Summer TCOIAVs are typically larger than 0.6, but values down to 0.3 are obtained for the 0.11° WRF experiments (CRP-GL-11, IPSL-INERIS-11, UHOH-11) in several subdomains. Although we do not have a definite explanation, this could be linked to the high sensitivity of simulated summer temperatures to the selection of the convection scheme (Vautard et al., 2013b). CLMCOM and CNRM, however, show a very good performance in all seasons and all subdomains (TCOIAVs mostly larger than 0.9). For CNRM, this particularity could again be related to the special setup of this global model. The large relaxation zone and the continuous nudging of the model's solution towards ERA-Interim could help to maintain a correct chronology of synoptic events (i.e., of events that might partly be lost by limited area models due to their confined relaxation zone and an update of the boundary forcing at typically 6-hourly intervals only). Regarding the RIAV metric, both ensembles tend to overestimate the magnitude of interannual temperature variability, in particular during summertime. Except for Scandinavia (SC), where summer RIAVs are mostly smaller than 1, all subdomains are affected and summer temperature variability is in some cases overestimated by more than 50 % (RIAV larger than 1.5). For most cases, the centered root-mean-square difference between simulated and observed mean seasonal temperatures is smaller than the observed temporal standard deviation (normalized rms distance smaller than 1).
No systematic improvement of an increased resolution (EUR-11 versus EUR-44 ensemble) is apparent; in some cases the switch from 0.44° to 0.11° can even deteriorate the model performance (compare nonfilled and filled symbols of the same color and the same marker type).
Similar to mean seasonal temperature, temporal correlations for precipitation are large in wintertime (mostly above 0.8) but systematically smaller in summer (Figs. 12, B8). Again, a number of 0.11° WRF experiments show very low correlations in summertime. TCOIAVs are partly smaller than 0.3, suggesting inaccuracies in the representation of convective processes and their triggering mechanisms in this model. Concerning the interannual variability of precipitation, model performance shows a large spread. RIAV values are centered around 1 for subdomains IP, SC and FR, but both ensembles typically overestimate the interannual precipitation variability in both seasons (AL, EA, MD, SC) or in summer only (ME). Only subdomain BI shows a general underestimation of interannual precipitation variability (by up to 50 %). As for temperature, the centered root-mean-square difference for mean seasonal precipitation typically does not exceed the standard deviation of the observations (except AL) and, again, no obvious benefit of an increased grid resolution can be identified.
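A minimal sketch of the interannual metrics is given below: seasonal means are first averaged over the subdomain for each year, and the temporal correlation and the ratio of standard deviations are then computed from the two yearly series. Unweighted grid cells are assumed; the function names and the four-year sample values are hypothetical (the study itself covers 1989–2008).

```python
import math


def regional_mean_series(fields):
    """Average each year's grid-cell values into one regional mean per year."""
    return [sum(cells) / len(cells) for cells in fields]


def interannual_metrics(model_series, obs_series):
    """Temporal correlation and ratio of standard deviations of two yearly series."""
    n = len(obs_series)
    m_mean, o_mean = sum(model_series) / n, sum(obs_series) / n
    cov = sum((m - m_mean) * (o - o_mean) for m, o in zip(model_series, obs_series))
    m_var = sum((m - m_mean) ** 2 for m in model_series)
    o_var = sum((o - o_mean) ** 2 for o in obs_series)
    return cov / math.sqrt(m_var * o_var), math.sqrt(m_var / o_var)


# Hypothetical summer-mean temperatures (deg C): four years, three grid cells each.
obs_fields = [[17.0, 18.0, 19.0], [18.0, 19.0, 20.0],
              [16.0, 17.0, 18.0], [19.0, 20.0, 21.0]]
model_fields = [[16.0, 17.0, 18.0], [18.0, 19.0, 20.0],
                [14.0, 15.0, 16.0], [20.0, 21.0, 22.0]]
tcoiav, riav = interannual_metrics(regional_mean_series(model_fields),
                                   regional_mean_series(obs_fields))
```

The hypothetical model reproduces the sequence of warm and cool summers (correlation of 1) but doubles their amplitude (ratio of 2); its mean cold bias of 0.5 °C affects neither metric.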

Mean annual cycle
The parameters CRCO and ROYA assess the model performance with respect to the mean annual cycle at monthly resolution, averaged over each subdomain. Not astonishingly, the rank correlation for temperature (Fig. 13, left panel) is high in all experiments (CRCOs larger than 0.95) reflecting a proper representation of the temperature variation throughout the year by the RCMs, mainly driven by the annual cycle of air temperature and SST in the imposed large-scale forcing and of top-of-the-atmosphere incoming solar radiation. Concerning the ratio of amplitudes (Fig. 13, right panel), most experiments overestimate the amplitude of the mean annual temperature cycle (ROYA larger than 1). Exceptions are the British Isles where a majority of experiments underestimates the mean annual amplitude as well as the WRF experiments (CRP-GL, IPSL-INERIS, UHOH), which systematically underestimate the annual amplitude over most parts of Europe. These results are closely related to the seasonal variability of the temperature bias in Figs. 5 and B1. In most cases temperature biases are positive in summer and negative in winter (or less negative in summer than in winter), leading to an overpronounced annual cycle. For SC, cold winter and cold summer biases are typically close to each other. This causes a negative shift of the annual cycle with only a minor influence on the annual variation.
For subdomain BI, in contrast, many simulations tend to underestimate summer temperatures more than winter temperatures, resulting in a flattening of the annual cycle. This is also the case for most regions in the WRF simulations, especially for IPSL-INERIS and UHOH. For ROYA, most outliers are members of the EUR-44 ensemble, i.e., an increased model resolution seems to be associated with a slightly better performance. For individual models and subdomains this might, however, not be true.
Figure 9. Spatial Taylor diagrams exploring the model performance with respect to the spatial variability of mean winter (circles) and mean summer (triangles) temperature within subdomains EA, IP, ME and SC (see Fig. B5 for subdomains AL, BI, FR and MD). Filled markers: EUR-11 ensemble, nonfilled markers: EUR-44 ensemble, gray markers: ENS-22 ensemble. The diagrams combine the spatial pattern correlation (PACO, cos(azimuth angle)) and the ratio of spatial variability (RSV, radius). The distance from the 1-1 location corresponds to the normalized and centered root-mean-square difference (which does not take into account the mean model bias), expressed as multiples of the observed standard deviation. Note the different number of underlying grid cells per subdomain in the individual ensembles.
Regarding the mean annual cycle of precipitation, the model performance is generally worse than for temperature (Fig. 14). While most experiments show a rank correlation CRCO larger than 0.7 in subdomains BI, IP, SC and MD, correlations are typically much lower in FR, ME, AL and EA. In ME and EA rank correlations close to zero or even negative are obtained, indicating a deficient representation of the mean annual cycle of precipitation. In these regions, the spread of the individual experiments is, however, very large and most simulations actually have correlations larger than 0.5. Whether the annual amplitude of area-averaged precipitation is over- or underestimated (ROYA metric) strongly depends on the region and the experiment. While the annual amplitude is generally too small over the BI region, the majority of models overestimates the annual amplitude over FR, AL and MD. No systematic difference in model skill between the EUR-11 and the EUR-44 ensemble can be identified. For SMHI, IPSL-INERIS and CRP-GL the ROYA values of the 0.11° simulations are generally larger than in the 0.44° case, but only better in three out of eight subdomains.
The high-resolution experiments of KNMI and DMI show better ROYA values than their low-resolution counterparts in seven regions whereas the CSC and CNRM simulations perform better in six regions at 0.44° grid spacing. With respect to CRCO, six models perform better with the higher resolution in at least six regions. Again, CSC and CNRM produce a better skill in six regions with the coarser resolution.
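The two annual-cycle metrics can be sketched for a single subdomain as follows. The example assumes that CRCO is a Spearman rank correlation of the twelve climatological monthly means and that ROYA is the ratio of simulated to observed amplitude (maximum minus minimum monthly mean); ties are not handled, and all function names and sample values are hypothetical.

```python
def ranks(values):
    """Ranks 1..n of the values (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def crco(model_cycle, obs_cycle):
    """Spearman rank correlation of two monthly climatologies (no ties)."""
    n = len(obs_cycle)
    d2 = sum((rm - ro) ** 2
             for rm, ro in zip(ranks(model_cycle), ranks(obs_cycle)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def roya(model_cycle, obs_cycle):
    """Ratio of simulated to observed annual amplitude (max minus min)."""
    return (max(model_cycle) - min(model_cycle)) / (max(obs_cycle) - min(obs_cycle))


# Hypothetical monthly mean temperatures (deg C), January to December.
obs_cycle = [0.0, 1.0, 4.0, 8.0, 13.0, 16.0, 18.0, 17.5, 14.0, 9.0, 4.5, 1.5]
model_cycle = [-2.0, -0.5, 3.0, 8.5, 14.0, 17.5, 20.0, 19.0, 14.5, 8.0, 3.5, 0.0]
crco_value = crco(model_cycle, obs_cycle)
roya_value = roya(model_cycle, obs_cycle)
```

The hypothetical model is too cold in winter and too warm in summer, so the rank correlation stays close to 1 (only April and October swap ranks) while the amplitude ratio exceeds 1, mirroring the overpronounced annual temperature cycle discussed above.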

EURO-CORDEX versus ENSEMBLES
The gray bars and markers in the figures discussed above show the corresponding results for the ENS-22 ensemble of ENSEMBLES simulations, which has a larger ensemble size (16 instead of 9 and 8 experiments for EUR-11 and EUR-44, respectively) and includes models that are not part of EUR-11 and EUR-44. For temperature, a comparison of the BIAS ranges (Figs. 5, B1) indicates an improvement in EUR-11 and EUR-44 concerning the strong overestimation of summer temperatures over the southern and southeastern parts of Europe (EA, IP, FR, MD), but also over central Europe (ME, AL). Regionally averaged summer temperature biases in EUR-11 and EUR-44 are typically smaller than 1.5 °C compared to strong warm biases of some ENSEMBLES experiments. However, the cold biases of SMHI, KNMI and CNRM do partly exceed those of the ENSEMBLES models by some tenths of a degree (AL, BI, MD). Considering the larger ensemble size of ENS-22, the overall bias range seems to be comparable. As for the temperature 95 %-P (Figs. 7, B3), both EUR-11 and EUR-44 mostly improve on ENS-22 except for CNRM and partly SMHI, KNMI and CRP-GL, which can be subject to strong biases on the grid-cell scale in subdomains EA, IP, SC, BI, MD and especially in AL.
Figure 11. Temporal Taylor diagrams exploring the model performance with respect to the interannual temporal variability of mean winter (circles) and mean summer (triangles) temperature as averages over subdomains EA, IP, ME and SC (see Fig. B7 for subdomains AL, BI, FR and MD). Filled markers: EUR-11 ensemble, nonfilled markers: EUR-44 ensemble, gray markers: ENS-22 ensemble. The diagrams combine the temporal correlation of interannual variability (TCOIAV, cos(azimuth angle)) and ratio of interannual variability (RIAV, radius). The distance from the 1-1 location corresponds to the normalized and centered root-mean-square difference (which does not take into account the mean model bias), expressed as multiples of the observed standard deviation.
Due to some wet and dry outliers of the EUR-11 and EUR-44 ensembles in individual subdomains and seasons, the range and the magnitude of the precipitation BIAS of the EURO-CORDEX simulations are partly larger than in ENS-22. This particularly concerns subdomains EA, ME, BI and FR. The same is true for the precipitation 95 %-P (Figs. 8, B4). On the one hand some improvements with predominantly smaller values are apparent for subdomain IP while, on the other hand, several EUR-11 and EUR-44 simulations show larger biases in subdomains ME, BI and FR compared to ENSEMBLES. Regarding the reproduction of the spatial variability of temperature (Figs. 9, B5) and precipitation (Figs. 10, B6), EUR-11 and EUR-44 often slightly improve on ENS-22 (markers closer to the 1-1 location). Again, exceptions are CNRM and to some extent also SMHI, which partly show a pronounced overestimation of the spatial standard deviation of temperature beyond the ENS-22 range. Some features like the higher spatial correlation of winter precipitation (Fig. 10) and the smaller spatial temperature variability (Fig. 9) in SC are concordantly reproduced by all three ensembles. The temporal variability of temperature (Figs. 11, B7) is slightly improved with respect to ENSEMBLES in summertime, mainly due to a less pronounced overestimation of interannual variability (RIAVs closer to one in many subdomains). No clear difference between EUR-11 and EUR-44 on one hand and ENS-22 on the other hand is obvious for metrics describing the interannual variability of precipitation (Figs. 12, B8). Again, the seasonal separation/clustering (as for temperature in EA and FR and for precipitation in EA, IP, ME, SC and FR) is similar in all ensembles.
The rank correlations of the mean annual cycle of temperature averaged over the individual subdomains are large in all three ensembles (Fig. 13, left panel). For precipitation (Fig. 14, left panel), the performance of the EUR-11 and EUR-44 ensembles is comparable to ENS-22 except for some poor-performing outliers in subdomains BI (CNRM), FR (IPSL-INERIS) and ME (CNRM). It is worth mentioning that the regions with the largest range of CRCO in ENS-22 (ME and EA) also present the largest ranges in EUR-11 and EUR-44. The ranges in subdomains IP, SC and MD are, however, considerably reduced in the EUR-11 and EUR-44 ensembles. Regarding the ROYA metric, i.e., the ratio of amplitudes of the mean annual cycle (Figs. 13, 14, right panels), EUR-11 and EUR-44 show a skill similar to that of ENS-22, but with a tendency towards an underestimation of the amplitude of the annual cycle by some experiments in selected subdomains (IPSL-INERIS and UHOH for temperature; UHOH, CLMCOM, CSC, and SMHI for precipitation).

The overall picture
The evaluation of the EURO-CORDEX ensembles largely confirms RCM bias characteristics identified by previous studies based on the ENSEMBLES data. This concerns both the general magnitude and the sign of model biases. Improvements with respect to ENSEMBLES are a reduced overestimation of southern and southeastern European summer temperatures, a less pronounced overestimation of interannual summer temperature variability as well as a slightly better representation of the spatial climatic variability within the subdomains. In some cases, however, individual EURO-CORDEX experiments are subject to bias magnitudes beyond the range found for ENSEMBLES. This especially concerns the CNRM model, which shows a strong spatial variability of model biases on the grid cell level and a pronounced cold and dry bias over many parts of Europe. CNRM's summer dry bias, however, is not due to shortcomings in the physical parameterizations, but is a consequence of the specific design of the CNRM experiments. Further simulations in which the relaxation outside Europe is weaker (6 h instead of 10 min e-folding time) do not show it. The reason might be an overdrying of the atmosphere in the relaxation area (the rest of the globe) where a permanent spin-up of temperature and moisture relating to the mismatch between ERA-Interim and ARPEGE physics is imposed on the model. CNRM's cold bias over high mountains is to some extent related to the model's snow scheme and a too persistent snow cover (Vautard et al., 2013b).
The availability of different configurations of WRF allows comparing the bias spread obtained for this particular model to the spread across different models. The fact that the temperature bias range of the three WRF-11 experiments often corresponds to the bias range of the entire ensemble illustrates the uncertainty introduced by the choice of parameterizations and parameter settings (e.g., Bellprat et al., 2012a; Mooney et al., 2013). This, however, is not apparent for precipitation biases where the different WRF setups approximately agree on sign and magnitude of their bias. In wintertime, the wet bias of WRF seems to be closely related to the distinct negative bias of mean sea-level pressure (compare Figs. 3 and 4), indicating a too-high intensity of low-pressure systems passing the continent. Circulation types and storm tracks, however, have not been analyzed in detail in the present study and possible relations between precipitation and mean sea-level pressure biases remain speculative.
Mostly independent of the season and the subdomain under consideration, the relative ranking of models with respect to seasonal mean temperature is stable, with CNRM, KNMI and SMHI showing the coldest temperatures as opposed to warmer conditions in CLMCOM and CSC. For seasonally and regionally averaged precipitation sums the relative ranking is less fixed, although the high-resolution versions of SMHI, CRP-GL and IPSL-INERIS are often found at the wet end while CNRM typically belongs to the driest models.
For subdomain mean values at seasonal resolution, no apparent benefit of a finer grid resolution is identified. For temperature and depending on subdomain and season, the 0.11° experiments can be warmer or colder than their 0.44° counterparts and no systematic bias reduction in the high-resolution experiments is found. This also holds for the 95th percentile of absolute temperature biases (95 %-P). In case of precipitation, seasonal mean biases are typically larger in the EUR-11 ensemble as precipitation sums are generally overestimated by both ensembles and the increase of resolution is mostly associated with a further increase of precipitation. The latter might be related to stronger orographic gradients in the high-resolution experiments due to a better resolved topography. Our analysis also highlights the potential of error compensation when restricting the analysis to mean values for relatively large subdomains. Especially for precipitation a metric such as 95 %-P can provide further insight into model biases on grid-cell level in addition to the metrics PACO and RSV, which measure the accuracy of horizontal distribution and spatial variation over a selected subdomain.
The absence of obvious benefits of a finer grid resolution in our analysis does not rule out such an added value in general. The 0.22° resolution of the gridded observations, coarser than that of the 0.11° RCM simulations, allows us to draw conclusions concerning a lack of large-scale bias improvements by the 0.11° experiments, but hinders identification of benefits at a smaller scale. In orographically structured terrain, we expect an added value of an increased spatial resolution for parameters such as mesoscale circulations, the precipitation intensity distribution at daily resolution or snow cover dynamics. These aspects have not been addressed in the current work, partly since this would require observational reference data with a better reliability than E-OBS at high temporal and spatial scales. Such data are currently not available at a European level but only for smaller subregions (mostly individual countries), such as the REGNIE (Regionalization of Precipitation Totals) or HYRAS precipitation data for Germany (Rauthe et al., 2013) or the SAFRAN (Système d'Analyse Fournissant des Renseignements Atmosphériques à la Neige) reanalysis over France (Quintana-Segui et al., 2008). A detailed investigation of the added value of high-resolution experiments based on such data will be the subject of upcoming studies, possibly applying dedicated added value metrics (e.g., Kanamitsu and DeHaan, 2011). Indeed, recent studies by Bauer et al. (2011), Prein et al. (2013a and  indicate that an increase of RCM resolution (in their case to convection-permitting scales) bears added value, but this added value can cancel out by spatial and temporal averaging.
Further cautionary notes concern the influence of (1) internal model variability, (2) uncertainties in the observational reference data, and (3) deficiencies of the driving reanalysis on the computed skill metrics. Internal model variability (1) can influence the simulated mean climatology even in decadal and multidecadal RCM experiments that are subject to an identical boundary forcing (e.g., Bellprat et al., 2012a; Lucas-Picher et al., 2008; Roesch et al., 2008) in particular over large model domains as in our simulations. As the EUR-11 and EUR-44 ensembles consist of only one experiment for each setup, a quantification of the effect of internal variability on the model evaluation is not possible. Instead, slight nuances of bias characteristics should not be overinterpreted as they could, to some degree, result from internal random variability. A similar reasoning is true for uncertainties in the E-OBS observational reference. Finally, model evaluation has been carried out in a perfect boundary context and basically assumes a bias-free representation of the lateral atmospheric boundary forcing and of sea surface temperatures by the driving ERA-Interim reanalysis. Although the recent studies by Brands et al. (2012, 2013) suggest a negligible reanalysis uncertainty for the Northern Hemisphere extratropics, a certain influence of a biased boundary forcing on the evaluation results cannot be ruled out.

RCM deficiencies and capabilities
One of the most prominent deficiencies across members of both the EUR-11 and EUR-44 ensembles is the predominant cold bias in most seasons and for most subdomains. The spatially averaged bias often ranges from −1 to −2 °C but can be larger in individual cases. For some regions such as Norway and the Alpine ridge, this cold bias might partly be related to the pronounced topography of the respective region, associated with large elevation differences between the individual RCMs and the E-OBS reference at grid point level. This, in turn, potentially amplifies inaccuracies of the assumption of a spatially and temporally uniform lapse rate for elevation correction (see Sect. 3.3). Exceptions to the general picture of a predominant cold model bias are the CSC model that mostly shows too high temperatures as well as the summer season in southern and southeastern Europe where most models have a tendency to overestimate temperatures. This result is consistent with previous findings (e.g., Hagemann et al., 2004, for PRUDENCE, and Christensen et al., 2008, for ENSEMBLES) and is probably related to an underestimation of summertime precipitation (compare Figs. 3 and 4) and soil moisture-temperature coupling: in soil moisture-controlled evaporative regimes, low soil moisture contents (e.g., resulting from preceding precipitation deficits) limit the amount of energy used for the latent heat flux and increase the sensible heat flux, ultimately leading to an increase of air temperature (e.g., Seneviratne et al., 2010). This feedback is sensitive to all processes that interfere with the regional balances of water and energy, and this includes land-surface, boundary layer, convective and radiative processes. Related to this is the overestimation of interannual temperature variability in the summer season by both ensembles (RIAVs larger than 1).
This widespread and systematic model bias has previously also been reported for the PRUDENCE and ENSEMBLES experiments (e.g., Fischer et al., 2012; Lenderink et al., 2007; Vidale et al., 2007). The warm summer biases do not coincide with pronounced positive mean sea-level pressure biases (compare Figs. 2 and 4), which indicates the dominant role of regional-scale land surface-atmosphere interactions and only a minor contribution of large-scale circulation biases (e.g., too persistent blocking regimes). The former were also identified as driving factors for the correct representation of summer heat waves in the EURO-CORDEX ensemble (Vautard et al., 2013b). Regarding regionally averaged precipitation biases, the most striking feature is a pronounced wet bias of both ensembles over most subdomains and for most seasons (except for CNRM and for the dry biases in southern and southeastern Europe). As a consequence of a general tendency to higher precipitation sums with increased model resolution, this wet bias is typically more pronounced in the 0.11° experiments. Based on the restricted detail of our analysis, a full explanation of this bias is not possible at this point. Note that the E-OBS reference has not been corrected for the systematic undercatch of rain gauges (cf. Sect. 2.2). If one assumes a mean systematic undercatch of 20 % of true precipitation, wet model biases can in some cases be explained by this shortcoming of the observational reference.
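As a rough illustration of the undercatch argument, the following sketch converts a model's bias relative to true precipitation into the bias that would be measured against an uncorrected gauge reference. The function name and the spatially uniform 20 % undercatch are illustrative assumptions; real undercatch varies with gauge type, wind speed and precipitation phase.

```python
def apparent_relative_bias(true_relative_bias, undercatch_fraction):
    """Relative model bias as seen against a gauge reference affected by undercatch.

    true_relative_bias:  (model - truth) / truth, as a fraction
    undercatch_fraction: share of true precipitation missed by the gauges
    """
    observed_over_true = 1.0 - undercatch_fraction   # reference = truth * (1 - undercatch)
    model_over_true = 1.0 + true_relative_bias
    return model_over_true / observed_over_true - 1.0


# Hypothetical cases: a model without any true bias, and a model that is 10 % too dry.
unbiased_model = apparent_relative_bias(0.0, 0.20)
dry_model = apparent_relative_bias(-0.10, 0.20)
print(f"unbiased model appears {unbiased_model:+.1%}")
print(f"10 % too dry model appears {dry_model:+.1%}")
```

Under this assumption, a model that matches true precipitation exactly would appear about 25 % too wet, and even a model that is 10 % too dry would still appear wet relative to the reference.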
Another important deficiency of simulated precipitation is the low rank correlation (CRCO metric) between simulated and observed climatological monthly means in subdomains FR, ME, AL and EA. Here, CRCOs are typically lower than 0.7 and partly close to zero or even negative, indicating a reversal of the observed annual cycle by the RCMs. When analyzing CRCO it has to be noted, though, that low or negative correlations are more likely in regions with a weak annual cycle of precipitation. A similar reasoning applies to model biases of the mean annual amplitude of precipitation (ROYA metric). As the numbers in the left panel of Fig. 14 (CRCO) indicate, the standard deviation of the mean annual cycle is smallest (only 13-18 % of the annual mean monthly precipitation) in subdomains FR, ME and AL. The difference between maximal and minimal mean monthly precipitation is also smallest for these three subdomains (right panel of Fig. 14, ROYA). It amounts to only 42-53 % of the annual mean monthly precipitation. For subdomains IP and MD this normalized difference is more than twice as large (131 and 113 %, respectively), indicating a pronounced annual variation of precipitation in these regions. This is confirmed by the high values of the normalized standard deviation (44 and 31 %; Fig. 14, left panel) in these subdomains. Hence, the poor model performance with respect to CRCO in FR, ME and AL and the considerable overestimation of ROYA in FR do not necessarily indicate a severe model bias but rather show that the respective model cannot reproduce small monthly deviations from a rather uniform annual distribution of precipitation. This is, however, not the case for the partly weak mean annual correlation in EA. In particular, CCLM and CNRM seem to have serious problems correctly reproducing the annual cycle of precipitation in this eastern part of the model domain.
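The sensitivity of rank correlations to a weak annual cycle can be illustrated with a small synthetic example. The sketch below assumes that CRCO is the Spearman rank correlation of the 12 climatological monthly means and that the annual amplitude is the max-min difference normalized by the annual mean. All numbers are invented for illustration and are not taken from the ensemble; the same monthly model errors are applied to a weak and to a strong annual cycle.

```python
import numpy as np


def ranks(values):
    """Ordinal ranks (1 = smallest); assumes there are no tied values."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result


def rank_correlation(model_monthly, ref_monthly):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(model_monthly), ranks(ref_monthly))[0, 1])


def normalized_amplitude(monthly):
    """Max-min difference of the monthly means relative to their annual mean."""
    monthly = np.asarray(monthly, dtype=float)
    return float((monthly.max() - monthly.min()) / monthly.mean())


# Hypothetical monthly precipitation climatology (mm per month, Jan..Dec), weak cycle.
ref_weak = np.array([60, 57, 56, 58, 62, 66, 64, 65, 59, 61, 63, 55], dtype=float)
# Same shape of the annual cycle, but with six times larger monthly anomalies.
ref_strong = ref_weak.mean() + 6.0 * (ref_weak - ref_weak.mean())
# Hypothetical monthly model errors (mm per month), identical in both cases.
errors = np.array([-5.2, 4.1, 6.3, -3.4, -6.6, -4.7, 3.8, -7.9, 5.5, 2.0, -2.1, 7.2])

crco_weak = rank_correlation(ref_weak + errors, ref_weak)
crco_strong = rank_correlation(ref_strong + errors, ref_strong)
print(f"weak cycle   (amplitude {normalized_amplitude(ref_weak):.2f}): CRCO = {crco_weak:.2f}")
print(f"strong cycle (amplitude {normalized_amplitude(ref_strong):.2f}): CRCO = {crco_strong:.2f}")
```

With identical absolute errors, the rank correlation collapses for the weak cycle but stays high for the strong one, which is the effect described for FR, ME and AL versus IP and MD.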
Concerning the general overestimation of spatial temperature and precipitation variability within subdomains (RSV metric), this deficiency very likely reveals not only true model biases but also deficiencies of the E-OBS reference relating to the spatial smoothing and an effective resolution lower than 0.22° and 0.44°, respectively, in regions of a low network density (see Sect. 2.2). This effect would lead to an apparent overestimation of RSV by the model experiments, although the true spatial variability might actually be well represented. Subdomains like ME with little orographic variability and, furthermore, a rather dense station network (cf. Haylock et al., 2009) would be less affected by this artifact and, indeed, show a better model performance with respect to RSV (Fig. 9). Unfortunately, no single data set currently exists that provides homogenized climate data for the entire European continent with an effective spatial resolution equal to or higher than the actual resolution of modern RCMs used for long-term climate simulations. Hence, more detailed investigations of small-scale climatological features can be carried out only for specific subregions where appropriate high-resolution reference data exist.
When analyzing the temporal correlation between the simulated and observed seasonal mean values over the 20-year-long evaluation period (metric TCOIAV), an obvious feature is the much better correlation for winter (mostly larger than 0.9 for both temperature and precipitation) compared to summer (often smaller than 0.6). The better performance for winter reflects the fact that European summer climate is much more controlled by local- to regional-scale processes, giving the RCMs a higher degree of freedom to alter the conditions imposed by the boundary forcing (e.g., Déqué et al., 2005). In contrast, winter climate in midlatitudes is more affected by the synoptic-scale transport of warm or cold and moist or dry air masses, which couples the internal solution of the RCMs more closely to the temporal evolution of the lateral boundary values. Comparing TCOIAV for temperature and precipitation, smaller correlations are typically obtained for precipitation, reflecting a weaker control of the large-scale boundary conditions on subdomain mean precipitation compared to subdomain mean temperature.
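A toy example may help to see how a weaker control by the boundary forcing lowers TCOIAV. In the sketch below, which is not part of the evaluation itself, the reference series is a 20-year sinusoid and the "model" series adds an uncorrelated internal component of adjustable amplitude. The series, the amplitudes 0.3 and 1.5, and the function name are assumptions chosen for illustration, and TCOIAV is taken to be the Pearson correlation of the yearly seasonal means.

```python
import numpy as np


def tcoiav(model_seasonal, ref_seasonal):
    """Pearson correlation between two series of seasonal means (one value per year)."""
    return float(np.corrcoef(model_seasonal, ref_seasonal)[0, 1])


years = np.arange(20)                              # 20-year evaluation period
forced = np.sin(2.0 * np.pi * 3 * years / 20)      # variability imposed by the boundary forcing
internal = np.cos(2.0 * np.pi * 7 * years / 20)    # RCM-internal part, orthogonal to `forced`

reference = forced
model_strong_control = forced + 0.3 * internal     # winter-like: small internal component
model_weak_control = forced + 1.5 * internal       # summer-like: large internal component

r_strong = tcoiav(model_strong_control, reference)
r_weak = tcoiav(model_weak_control, reference)
print(f"strong boundary control: TCOIAV = {r_strong:.2f}")
print(f"weak boundary control:   TCOIAV = {r_weak:.2f}")
```

Because the two components are orthogonal and have equal variance, the correlation equals 1/sqrt(1 + a^2) for an internal amplitude a, i.e., about 0.96 and 0.55 in this example.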
Despite the mentioned shortcomings in the representation of specific climatic features over the European continent, the evaluation indicates a considerable skill of the EUR-11 and EUR-44 ensembles to reproduce larger-scale horizontal variability of climatological seasonal mean values (expressed, for instance, as differences of mean values between the individual subdomains). In most subdomains, especially for temperature, also the shape and the amplitude of regionally averaged mean annual cycles are reproduced to a large extent (ROYA and CRCO metrics). The climatological fields of mean sea-level pressure as represented by the driving ERA-Interim reanalysis are mostly captured well and are only slightly distorted in some cases.
For temperature the spatial variability within the individual subdomains is fairly well captured (PACO mostly > 0.9). This good performance is, however, to some extent a simple result of the systematic elevation dependency of air temperature. As continental-scale gradients and biases thereof are not sampled by the subdomains, high-elevation regions will typically have lower temperatures than their low-elevation counterparts in a given subdomain, both in the observations and in the models. As grid-scale topography can be assumed to be realistically represented by the models and as, additionally, an elevation correction is carried out for temperature, this will lead to high values of PACO. This effect will generally be less pronounced in subdomains without strong orographic gradients (such as ME). In the case of precipitation, the spatial variability within subdomains is simulated less accurately (PACO typically between 0.4 and 0.9). This partly reflects the fact that seasonal precipitation sums are also affected by topography, but on regional scales far less systematically than temperature. Moreover, RCMs can suffer from considerable systematic biases of the spatial precipitation field in orographic terrain such as the windward/lee effect (overestimation of precipitation on the windward side, underestimation on the lee side).

Conclusions and outlook
The present work evaluates the ERA-Interim-driven RCM ensembles of the EURO-CORDEX initiative on a European scale. Our analysis mainly considers the standard parameters of 2 m temperature and precipitation and is based on monthly and seasonal mean values. Several simple and reproducible metrics covering a range of aspects of model performance are used to compare simulation results to the E-OBS observational reference. This enables a quantitative assessment of the ability of the newest generation of RCMs to simulate European climate conditions and a direct comparison with results of the previous ENSEMBLES simulations. The validation exercise serves as a quality standard for further simulations and future model developments. The added value of the high-resolution experiments (EUR-11) compared to their coarser resolution counterparts (EUR-44) is not specifically addressed in this study.
The model evaluation highlights the general ability of today's regional climate models to represent the basic spatiotemporal patterns of the European climate, but also indicates considerable deficiencies for selected metrics, regions and seasons. Some of these deficiencies, such as a predominant cold and wet bias in most seasons and over most of Europe, are found in the majority of experiments and reflect common model biases. Furthermore, many experiments are subject to a warm and dry summer bias over southern and southeastern Europe. The latter had previously been identified for the ENSEMBLES experiments, but for this specific case the bias appears to be reduced in the EURO-CORDEX ensembles. However, neglecting the influence of slightly incompatible setups (different driving reanalysis, different simulation and, hence, evaluation period), no general improvements of the EURO-CORDEX simulations with respect to ENSEMBLES could be identified for the temporal and spatial scales considered in the present work. In addition to common model deficiencies found across the range of different RCMs, a number of model-dependent biases could be identified. Except for a few consistent outliers, these biases typically depend on the region and season under consideration.
Identifying possible reasons for both common and model-specific bias characteristics and formulating specific recommendations for model development will require a deeper and dedicated analysis, including additional metrics and variables and explicitly taking into account uncertainties in the observational reference and the effect of RCM-internal climate variability. These aspects will be the subject of upcoming studies within the EURO-CORDEX community. The same is true for studies explicitly addressing the added value of an increased grid resolution. In terms of regionally and seasonally averaged quantities the present work could not identify such an added value. This does not, however, rule out benefits of an increased resolution, and we would expect such benefits for quantities such as daily precipitation intensities, small-scale spatial climate variability in topographically structured terrain or snow cover dynamics. These aspects still need to be investigated in more detail. Further analyses will consider (1) the relation between present-day model biases and simulated climate change signals, (2) the question of whether model biases are temporally stable and bias correction methods are feasible and can be reliably applied, (3) intercomparisons of the performance of different types of downscaling methodologies, as well as (4) the assessment of trends of simulated climatic parameters within the observed period. For the latter aspect the current 20-year-long EURO-CORDEX evaluation experiments are not well suited, but extended simulations covering the full ERA-Interim period (1979-present) are already under way and will be available for such analyses.
Furthermore, applying the same quantitative metrics used in the present study to the EURO-CORDEX GCM-driven experiments would allow separating the contribution of the driving global climate model from the intrinsic RCM contribution to the overall bias structure.
The corresponding means for the reference data R are defined accordingly. Annual means are calculated by an unweighted average over 12 monthly means beginning with January. Seasonal means of year i are calculated by an unweighted average over three consecutive monthly means beginning with December of year i − 1 for the winter season (DJF) and ending with November of year i for the fall season (SON).
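The averaging conventions stated above can be sketched as follows. This is a minimal illustration, assuming that the monthly means are stored as one consecutive series starting in January of the first year; the function names and the mapping of season labels to start months (including MAM and JJA, which follow from the three-month convention) are hypothetical helpers.

```python
import numpy as np

# First calendar month of each season; DJF starts in December of the previous year.
SEASON_START_MONTH = {"DJF": 12, "MAM": 3, "JJA": 6, "SON": 9}


def seasonal_mean(monthly, first_year, year, season):
    """Unweighted mean of three consecutive monthly means for `season` of `year`.

    monthly[0] must be January of `first_year`.
    """
    start_year = year - 1 if season == "DJF" else year
    start = (start_year - first_year) * 12 + (SEASON_START_MONTH[season] - 1)
    if start < 0 or start + 3 > len(monthly):
        raise ValueError("season is not fully covered by the monthly series")
    return float(np.mean(monthly[start:start + 3]))


def annual_mean(monthly, first_year, year):
    """Unweighted mean of the 12 monthly means of `year`, beginning with January."""
    start = (year - first_year) * 12
    if start < 0 or start + 12 > len(monthly):
        raise ValueError("year is not fully covered by the monthly series")
    return float(np.mean(monthly[start:start + 12]))


# Hypothetical two-year series: value k is the k-th month counted from January 1989.
monthly = np.arange(24, dtype=float)
print(seasonal_mean(monthly, 1989, 1990, "DJF"))  # Dec 1989, Jan 1990, Feb 1990
print(seasonal_mean(monthly, 1989, 1990, "SON"))  # Sep, Oct, Nov 1990
print(annual_mean(monthly, 1989, 1990))
```

Note that with this convention the winter season of the first simulated year cannot be formed, because the preceding December is not available.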
Using these definitions, the applied evaluation metrics are calculated as follows.
A2 95 % percentile of the absolute value of grid point differences (95 %-P)
For precipitation, relative differences with respect to the reference data are used.
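A possible implementation of this metric is sketched below, assuming that model and reference fields are available on the same grid as arrays of climatological means. The function name, the `relative` switch and the use of NumPy's default (linearly interpolated) percentile are assumptions; the handling of missing values and of reference grid cells with zero precipitation is left out.

```python
import numpy as np


def p95_abs_difference(model, ref, relative=False):
    """95th percentile of |model - ref| over all grid points.

    With relative=True the differences are divided by the reference,
    as described for precipitation (the reference must then be non-zero).
    """
    model = np.asarray(model, dtype=float)
    ref = np.asarray(ref, dtype=float)
    diff = model - ref
    if relative:
        diff = diff / ref
    return float(np.percentile(np.abs(diff), 95))


# Hypothetical temperature fields (degC): differences spread evenly from 0 to 10.
ref_t = np.full(101, 10.0)
model_t = ref_t + np.linspace(0.0, 10.0, 101)
p95_t = p95_abs_difference(model_t, ref_t)

# Hypothetical precipitation fields (mm): relative differences from 0 to 100 %.
ref_p = np.full(101, 50.0)
model_p = ref_p * (1.0 + np.linspace(0.0, 1.0, 101))
p95_p = p95_abs_difference(model_p, ref_p, relative=True)
print(p95_t, p95_p)
```

Compared with a spatial mean bias, this metric is not affected by the cancellation of positive and negative grid point differences.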

A3 Pattern correlation (PACO)
with the spatial variances

A4 Ratio of spatial variability (RSV)

A5 Temporal correlation of interannual variability (TCOIAV)
with the temporal variances

Figure B5. Spatial Taylor diagrams exploring the model performance with respect to the spatial variability of mean winter (circles) and mean summer (triangles) temperature within subdomains AL, BI, FR and MD (see Fig. 9 for subdomains EA, IP, ME and SC). Filled markers: EUR-11 ensemble, nonfilled markers: EUR-44 ensemble, gray markers: ENS-22 ensemble. The diagrams combine the spatial pattern correlation (PACO, cos(azimuth angle)) and the ratio of spatial variability (RSV, radius). The distance from the 1-1 location corresponds to the normalized and centered root-mean-square difference (which does not take into account the mean model bias), expressed as multiples of the observed standard deviation. Note the different number of underlying grid cells per subdomain in the individual ensembles.
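The geometric relation behind the spatial Taylor diagrams can be checked numerically. The sketch below assumes that PACO is the Pearson correlation of the two spatial fields and that RSV is the ratio of their spatial standard deviations, without area weighting; the function name and the small example fields are hypothetical. The normalized and centered root-mean-square difference then follows from the law of cosines.

```python
import numpy as np


def taylor_statistics(model_field, ref_field):
    """Return (PACO, RSV, normalized centered RMS difference) for two fields."""
    m = np.asarray(model_field, dtype=float).ravel()
    r = np.asarray(ref_field, dtype=float).ravel()
    m_anom = m - m.mean()                      # remove the spatial means, so the
    r_anom = r - r.mean()                      # mean model bias does not enter
    paco = float(np.sum(m_anom * r_anom)
                 / np.sqrt(np.sum(m_anom ** 2) * np.sum(r_anom ** 2)))
    rsv = float(m.std() / r.std())
    crmsd = float(np.sqrt(np.mean((m_anom - r_anom) ** 2)) / r.std())
    return paco, rsv, crmsd


# Hypothetical 2 x 3 grid of seasonal mean temperatures (degC).
ref = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
model = np.array([[2.5, 2.0, 4.0], [5.0, 7.0, 6.5]])

paco, rsv, crmsd = taylor_statistics(model, ref)
# Distance from the reference point (radius 1, correlation 1) in the diagram.
law_of_cosines = float(np.sqrt(1.0 + rsv ** 2 - 2.0 * rsv * paco))
print(paco, rsv, crmsd, law_of_cosines)
```

Adding a constant offset to the model field leaves all three statistics unchanged, which is why the mean bias has to be assessed separately.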

A7 Climatological rank correlation (CRCO)
Figure B7. Temporal Taylor diagrams exploring the model performance with respect to the interannual temporal variability of mean winter (circles) and mean summer (triangles) temperature as averages over subdomains AL, BI, FR and MD (see Fig. 11 for subdomains EA, IP, ME and SC). Filled markers: EUR-11 ensemble, nonfilled markers: EUR-44 ensemble, gray markers: ENS-22 ensemble. The diagrams combine the temporal correlation of interannual variability (TCOIAV, cos(azimuth angle)) and ratio of interannual variability (RIAV, radius). The distance from the 1-1 location corresponds to the normalized and centered root-mean-square difference (which does not take into account the mean model bias), expressed as multiples of the observed standard deviation.

Figure B8. As Fig. B7 but for mean winter (circles) and mean summer (triangles) precipitation. See Fig. 12 for subdomains EA, IP, ME and SC.

Data access
We acknowledge the E-OBS dataset from the EU-FP6 project ENSEMBLES (http://ensembles-eu.metoffice.com) and the data providers in the ECA&D project (http://eca.knmi.nl).
Edited by: J. C. Hargreaves