weather@home 2: validation of an improved global–regional climate modelling system. Geoscientific Model

Extreme weather events can have large impacts on society and, in many regions, are expected to change in frequency and intensity with climate change. Owing to the relatively short observational record, climate models are useful tools as they allow one to generate a larger sample of extreme events, to attribute recent events to anthropogenic climate change, and to project changes in such events into the future. The modelling system known as weather@home, consisting of a global climate model (GCM) with a nested regional climate model (RCM) and driven by sea surface temperatures, allows one to generate a very large ensemble with the help of volunteer distributed computing. This is a key tool for understanding many aspects of extreme events. Here, a new version of the weather@home system (weather@home 2) with a higher-resolution RCM over Europe is documented and a broad validation of the climate is performed. The new model includes a more recent land-surface scheme in both GCM and RCM, where subgrid-scale land-surface heterogeneity is newly represented using tiles, and an increase in RCM resolution from 50 to 25 km. The GCM performs similarly to the previous version, with some improvements in the representation of mean climate. The European RCM temperature biases are overall reduced, in particular the warm bias over eastern Europe, but large biases remain. Precipitation is improved over the Alps in summer, with mixed changes in other regions and seasons. The model is shown to represent the main classes of regional extreme events reasonably well and shows a good sensitivity to its drivers. In particular, given the improvements in this version of the weather@home system, it is likely that more reliable statements can be made with regard to impacts, especially at more localized scales.


Introduction
Anthropogenic climate change due to increased greenhouse gas concentrations in the atmosphere poses numerous threats to society (IPCC, 2013). In particular, the frequency, intensity, and duration of extreme events such as heat waves, droughts, and flooding may have already changed due to climate change (Frich et al., 2002; Fischer and Knutti, 2015), a trend that is expected to continue in the future. The growing field of extreme event attribution attempts to answer the question whether and to what extent anthropogenic climate change altered the frequency and intensity of observed extreme events. Answering this question is now becoming possible for many events (National Academies of Sciences, Engineering, and Medicine, 2016), and is done by quantifying the role of anthropogenic climate change versus natural climate variability for events that have occurred in the past (e.g. Otto et al., 2012; Stott et al., 2016). Another field of research investigates how extreme events may change in the future, thereby concentrating on future climate projections (e.g. Mitchell et al., 2016c).
Owing to their rarity, extreme weather events and their characteristics can be difficult to assess. Indeed, only a few such events may be available in observational records. Therefore, model-based approaches consisting of large ensembles that allow for the statistics of rare events to be analysed are an essential complement to observational products. In particular, large ensembles of global climate models (GCMs) allow derivation of multiple sequences of weather patterns and a substantial number of associated extreme events. Dynamical downscaling of these GCM simulations by regional climate models (RCMs, Giorgi, 2006) can provide more spatially detailed information, which can be very valuable for the investigation of localized impacts of extreme weather events. One such modelling system is weather@home (Massey et al., 2015). Consisting of a GCM with prescribed sea surface temperatures (SSTs) and sea ice and a nested RCM over a region of interest, it leverages the computing power of volunteers around the world to generate very large ensembles of GCM-RCM simulations. This is particularly useful for the investigation of extreme weather events, and weather@home has been used successfully for the attribution of many extreme weather events (e.g. Pall et al., 2011; Otto et al., 2012) as well as their impacts, such as flooding-related property damages and heat-related mortality (Mitchell et al., 2016b).
Model performance is, however, a common limitation inherent to modelling approaches. Like any model, weather@home exhibits biases in certain variables (Massey et al., 2015). In particular, a substantial warm and dry bias was found in summer over eastern Europe, similar to many RCMs (Jacob et al., 2007). The increase in capabilities of home computers, on which weather@home simulations are being run, makes it possible to increase the model resolution and include newer model developments, with the aim of reducing these biases.
Although identifying the causes of GCM and RCM biases is not straightforward, previous studies suggest that the land surface may play an important role (e.g. Davin et al., 2016), in particular for summer climate. Several studies have identified the tendency for RCMs to display an excessive summer drying over Europe (Kotlarski et al., 2014). The resulting dry summer soil moisture bias in turn feeds back onto the atmosphere through the underestimation of evapotranspiration (or latent heat flux) and the simultaneous overestimation of the sensible heat flux at the surface (e.g. Seneviratne et al., 2010). These fluxes may directly affect temperature (via the sensible heat flux) and precipitation (via moisture input to the atmosphere, e.g. Eltahir and Bras, 1996). In addition, they can lead to indirect effects modulated by the boundary layer, thereby affecting cloud cover (e.g. Ek and Holtslag, 2004) and precipitation (e.g. Findell and Eltahir, 2003; Taylor et al., 2011; Guillod et al., 2015). Although weather@home biases could be due to an atmospherically driven lack of precipitation, improvements to land-surface schemes in RCMs and GCMs have been shown to substantially improve the simulated surface climate (e.g. Davin et al., 2011, 2016), suggesting that at least part of the biases may be attributable to deficiencies in the representation of the land surface. Besides model formulation, other aspects of the land surface such as soil parameters may significantly impact surface climate (e.g. Guillod et al., 2013).
A new version of weather@home (called weather@home 2) was therefore developed by including a more recent version of the MOSES land-surface model (see Sect. 2.2). In this paper, we describe and validate the GCM globally and the RCM over the European domain, with a focus on the simulation of mean climate, daily extremes, and the reliability of the model response to forcings.
The paper is structured as follows: in Sect. 2, we describe weather@home and the new developments that lead to its second version, as well as the modelling simulations and observational data used in this paper. The GCM (HadAM3P) is validated in Sect. 3, with a focus on mean biases in temperature, precipitation, and atmospheric circulation. Section 4 provides a detailed validation of the RCM (HadRM3P) over Europe, including analyses of the model biases in mean and extremes as well as its reliability. Section 5 draws some conclusions on the suitability of the modelling system to investigate extreme weather events.

weather@home
The climate modelling system known as weather@home (Massey et al., 2015) is part of the http://climateprediction.net climate modelling project (Allen, 1999). It consists of an atmospheric GCM, HadAM3P, that is downscaled to a higher resolution over a limited domain by its RCM equivalent, HadRM3P. The downscaling is only coupled one-way, so that the RCM cannot affect the GCM. Both models share essentially the same physics and are based on the atmospheric component of the coupled climate model of the UK Met Office Hadley Centre, HadCM3 (Gordon et al., 2000), with a number of improvements described in Massey et al. (2015). These include increasing the GCM horizontal resolution to 1.875° × 1.25° (in longitude and latitude, respectively) and introducing better representations of large-scale and convective clouds. The formulation of the RCM, HadRM3P, differs from HadAM3P only in terms of horizontal resolution, time step (reduced from 15 to 5 min), and resolution-dependent physical parameters. In general, HadRM3P is run on a rotated grid, allowing it to simulate the area of interest over an equatorial domain (in the rotated coordinate system) at quasi-uniform horizontal resolutions of 0.44° or 0.22°. It has been run over many regions worldwide, including all of those defined by the CORDEX (Coordinated Regional Climate Downscaling Experiment) initiative (Giorgi et al., 2009), although any domain can be specified. HadRM3P is run alongside a given HadAM3P simulation, the latter providing the lateral boundary conditions at the regional domain edges at 6-hourly intervals.
Both models are forced with sea surface temperature (SST) and sea ice, atmospheric composition, SO2 emissions (including volcanoes), and solar forcing, as well as initial conditions for all model variables. The HadAM3P GCM has been shown to represent the atmospheric dynamics well compared with many state-of-the-art GCMs (Mitchell et al., 2016a).
The strength of weather@home resides in its ability to run very large ensembles of simulations, of the order of thousands to tens of thousands. To achieve this, volunteer distributed computing via the Berkeley Open Infrastructure for Network Computing (BOINC, Anderson, 2004) is used. Individual simulations are sent to volunteers around the world, who run the HadAM3P-HadRM3P simulations and upload the results onto a server. A large number of simulations can thereby be performed in parallel. This is particularly relevant when examining extreme events, which are rare by definition and require large numbers of simulated years to define their statistics robustly. The weather@home project has led to many high-impact analyses, notably in the field of extreme event attribution, where sets of simulations with observed or corresponding "natural" conditions (without anthropogenic climate change) can be compared to assess the role of human influences on extreme events (e.g. Schaller et al., 2016; Mitchell et al., 2016b; Haustein et al., 2016).
While the weather@home project initially focused on a European region (e.g. Massey et al., 2015) and a North American region (the Pacific North-west, Li et al., 2015; Mote et al., 2016), it has also been successfully used in Australia and New Zealand (Black et al., 2016) and Africa (Marthews et al., 2015), and is currently being deployed over a number of additional regions.

Model developments for version 2 of weather@home
A few modifications have been incorporated in version 2 of weather@home (hereafter w@h2) relative to the original weather@home (hereafter w@h1, described in detail by Massey et al., 2015). More specifically, a more recent land-surface scheme was introduced in both HadAM3P and HadRM3P, and the standard horizontal resolution of HadRM3P was increased.
In both model versions, HadRM3P is run over the European CORDEX domain (Fig. 1). The main development included in w@h2 is an improved representation of the land surface. In w@h2, the land-surface model (LSM) MOSES 1 used in w@h1 (Cox et al., 1999) was replaced by a more sophisticated version, MOSES 2. MOSES is a third-generation LSM, incorporating the direct physiological effect of CO2 on photosynthesis and stomatal conductance (Sellers et al., 1997). The total land evapotranspiration includes interception evaporation from the canopy, plant transpiration, bare soil evaporation, and snow sublimation. Five vegetation types and four non-vegetated surface types are considered. The soil is represented by four layers spanning a total depth of 3 m, with the hydrology following Richards' equation (see Cox et al., 1999, for further details).
The main difference between the two LSM versions is the explicit consideration of land-surface heterogeneities within each grid cell via the introduction of a tiling scheme in MOSES 2. Indeed, in MOSES 1 only one surface type is considered in each grid cell. The introduction of tiles in MOSES 2 allows consideration of each of the nine surface types mentioned above, and computation of surface fluxes for each surface type, of which the area-weighted average is returned to the atmospheric component of the model.
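The aggregation step of a tiling scheme can be illustrated with a minimal sketch. This is not the MOSES 2 code: the function name, the three-tile example, and the flux values are hypothetical, and the sketch only shows how per-tile fluxes could be combined into a single grid-cell value by area weighting.

```python
def grid_cell_flux(tile_fractions, tile_fluxes):
    """Area-weighted mean of per-tile surface fluxes for one grid cell.

    tile_fractions: fractional area of each surface type in the cell
    tile_fluxes:    flux computed separately for each surface type (W m-2)
    """
    if len(tile_fractions) != len(tile_fluxes):
        raise ValueError("one flux value is needed per tile")
    total = sum(tile_fractions)
    if total <= 0:
        raise ValueError("tile fractions must sum to a positive value")
    # Normalizing by the total guards against fractions that do not sum
    # exactly to one because of rounding.
    weighted = sum(f * q for f, q in zip(tile_fractions, tile_fluxes))
    return weighted / total


# Hypothetical cell with three of the nine surface types present.
fractions = [0.6, 0.3, 0.1]
latent_heat = [80.0, 120.0, 20.0]  # made-up latent heat fluxes in W m-2
cell_flux = grid_cell_flux(fractions, latent_heat)  # about 86 W m-2
```

In a single-surface scheme such as MOSES 1, the equivalent calculation would use one flux value for the whole cell, so the contrast between, say, forest and bare soil within a cell is lost.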
Another improved representation of the land surface introduced into w@h2 is the TRIFFID dynamic vegetation model (Top-Down Representation of Interactive Foliage and Flora Including Dynamics, Cox, 2001). The vegetation distribution (i.e. fraction of surface types within each grid cell) in MOSES 2 can be either prescribed to observed values or computed interactively by TRIFFID. In w@h2, TRIFFID has been implemented in the regional but not global model. Although for most applications TRIFFID is switched off and both models are similar in that respect, a side-effect is that the prognostic snow albedo cannot be turned on in the global model, while it is turned on by default in the regional model.
In addition to the tiling scheme, a number of smaller improvements have been implemented in MOSES 2, notably in the representation of snow processes (Essery and Clark, 2003).
Finally, the definition of the region over which the RCM is run is more flexible in w@h2 than in w@h1. While in w@h1 one application was built and deployed for each region separately, w@h2 consists of a single executable that can be used for any region, the latter being defined via input parameters. This simplifies the extension of weather@home to many regions, although the creation of an initial condition file remains necessary for any newly created region.

Modelling experiments
A large ensemble of w@h2 consisting of more than 100 simulations per year from 1900 to 2006 is analysed. First, a restart file from a century-long HadAM3P simulation with MOSES 1 has been reconfigured for MOSES 2. This initial condition file is then used in a spin-up ensemble consisting of 12-month simulations (from December to November, with multiple simulations for each year), providing spun-up initial conditions on 1 December each year. The simulations analysed in this paper are then initialized on 1 December each year from the end state of the spin-up ensemble and are run for 13 months. The effect of the relatively short spin-up for soil variables on simulated temperature and precipitation is discussed in Sect. 3.1 for HadAM3P and Sect. 4.1 for HadRM3P. The correspondence between simulated and real years comes from using observed sea surface temperature and sea ice as the lower boundary condition and observed concentrations of greenhouse gases, SO2 emissions, and influence of volcanoes and solar radiation. The sea surface temperature and sea ice are prescribed from observed estimates in the HadISST dataset (Rayner et al., 2003) version 2.1.0.0 (see Titchner and Rayner, 2014, for sea ice), a pre-release version directly provided by the UK Met Office Hadley Centre. The other input variables of greenhouse gas concentrations (CO2, CH4, N2O, O3, and halocarbon gases), SO2 emissions, volcanic activity, and solar forcing are prescribed to historical values as in Massey et al. (2015) with the data also provided by the Met Office Hadley Centre.
To assess whether the model developments described in Sect. 2.2 lead to an improved representation of climate in w@h2 compared to w@h1, we also use the w@h1 ensemble from Massey et al. (2015), consisting of about 20 members per year from 1961 to 1990. It should be noted that the difference between these two model ensembles may not only result from the models themselves, but also from (i) differences in the prescribed SSTs and sea ice, the analysed w@h1 ensemble being based on version 1 of the HadISST dataset (Rayner et al., 2003), and (ii) horizontal resolution, this latter point applying only to the RCM.

Observational data
We use gridded observation-based climate products for the model validation. Global temperature and precipitation over land (excluding Antarctica) are taken from version 3.23 of the Climate Research Unit time series dataset (CRU-TS; Harris et al., 2014), covering 1901-2014, which we interpolate to the model grid using a first-order conservative scheme. Global atmospheric fields (geopotential height) are taken from the Japanese 55-year Reanalysis (JRA-55) project carried out by the Japan Meteorological Agency (Kobayashi et al., 2015) and are bilinearly interpolated to the model grid. For the validation of HadRM3P, we use the E-OBS dataset (Haylock et al., 2008) version 12.0, which provides daily temperature and precipitation data on the model grid from 1950 to the present. To validate the land-surface fluxes in HadRM3P, we use two datasets available over the common time period 1984-2006: the satellite-based Surface Radiation Balance (SRB) version 3.1 dataset (Stackhouse et al., 2004; Zhang et al., 2015) is used for surface radiation fluxes, and the FLUXNET-MTE product (Jung et al., 2009, 2011) is used for surface sensible and latent heat fluxes. These two datasets are bilinearly interpolated to the rotated RCM grid.
In this section, we investigate the performance of HadAM3P in w@h2. First, seasonal mean biases in surface air temperature, precipitation, and geopotential height at 500 hPa (as a proxy for the background state of atmospheric flow) are shown and compared to those in w@h1 over a 30-year period from 1961 to 1990 (Sect. 3.1; w@h2 biases look very similar when the whole time period, from 1900 to 2006, is considered) and are complemented by biases in variability. Then, time series of global land temperature and precipitation are shown and discussed in Sect. 3.2.
Figure 2 shows the ensemble mean seasonal biases in surface air temperature in w@h2 (left; a, d, g, j) and in w@h1 (centre; b, e, h, k) relative to CRU-TS, as well as the difference between the absolute bias values (right; c, f, i, l; these are expressed so that negative values, in green, indicate an improvement in w@h2 compared to w@h1). Overall, the bias patterns are similar in both model versions, with the largest biases found in the Northern Hemisphere winter (December to February, DJF) and summer (June to August, JJA). The difference between the biases in the two models is most prominent in JJA, with significant improvements over Africa, the southern US, and parts of central Russia. Conversely, the biases in that season are higher in w@h2 in the north of North America, eastern Russia, and western Russia and Europe. The improved land-surface scheme in HadAM3P therefore does not improve the representation of summer temperature averages over Europe (Fig. 2i). In DJF, the difference between the two models is smaller, with w@h2 performing slightly better than w@h1 in the whole Southern Hemisphere but slightly poorer over eastern North America, northern Africa and India. In the Northern Hemisphere spring (March to May, MAM), biases are larger in w@h2 over the eastern US, Canada, and parts of Asia, but reduced over Europe, western and northern Russia, Alaska, and India. The difference between the two models is small in September to November (SON), with improvements in the Southern Hemisphere and mixed differences in the Northern Hemisphere. Table 1 summarizes the biases globally, expressed as area-weighted root mean squared biases. Globally, the performance is very similar in both models, with a small improvement for all seasons in w@h2 compared to w@h1. 
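As an illustration of the summary statistic used in Table 1, the sketch below computes an area-weighted root mean squared bias on a regular latitude-longitude grid. The weighting by the cosine of latitude is an assumption made here for illustration (the text does not state how the area weights were obtained), and the function name and the tiny example field are hypothetical.

```python
import math


def area_weighted_rmsb(bias, lats_deg):
    """Area-weighted root mean squared bias on a regular lat-lon grid.

    bias:     bias[i][j] is the model-minus-observation bias at latitude
              row i and longitude column j; None marks masked cells
              (e.g. ocean points when only land is validated)
    lats_deg: latitude of each row in degrees
    """
    num = 0.0
    den = 0.0
    for row, lat in zip(bias, lats_deg):
        # Assumption: cell area is proportional to cos(latitude).
        weight = math.cos(math.radians(lat))
        for value in row:
            if value is None:
                continue
            num += weight * value * value
            den += weight
    if den == 0.0:
        raise ValueError("no valid grid cells")
    return math.sqrt(num / den)


# Hypothetical 2 x 2 bias field (K) with one masked cell.
example_bias = [[2.0, 2.0],
                [0.0, None]]
example_rmsb = area_weighted_rmsb(example_bias, [0.0, 60.0])
```

Because the squared bias is averaged before taking the root, regions with large local biases dominate this metric, which is one reason why it is shown alongside the bias maps rather than instead of them.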
For most regions, the performance of HadAM3P is similar to state-of-the-art coupled climate models from CMIP5 (Flato et al., 2013), although a fair comparison is difficult given that in w@h2 the ocean state is prescribed to observations, while in CMIP5 models it is computed interactively by an ocean model coupled to the atmospheric model.

Seasonal mean biases
Since variability is very relevant for attribution, we also compute biases in the standard deviation of monthly averaged temperature (Supplement Fig. S1). While biases in temperature variability are similar in both model versions, w@h2 tends to improve the representation of summer and autumn monthly variability at mid-latitudes. The precipitation biases, shown in Fig. 3, highlight some improvements in w@h2 relative to w@h1. In particular, biases are reduced in the rainy season over the Amazon (DJF and MAM) and Africa. These improvements are confirmed by Table 1, with constant or improved biases at the global scale in all seasons. Nonetheless, these improvements are rather small in amplitude and the main biases in w@h1 are still present in w@h2. Quite striking are the large dry biases over and around Indonesia in all seasons. Since absolute precipitation biases are dominated by regions with large amounts of rainfall, we also show these biases in relative terms in Supplement Fig. S2. Apart from the dry areas, which by definition tend to show large relative changes, Fig. S2 highlights the summer dry bias over Eurasia. Differences between w@h1 and w@h2 (Fig. S2c, f, i, l) highlight substantial improvement in w@h2 over East Asia in DJF, as well as over northern Africa in most seasons. As with temperature, the model performs similarly to typical CMIP5 models (Flato et al., 2013, but note that, as for temperature, this comparison may not be fair given the prescribed SSTs in w@h2 as opposed to the interactive ocean in CMIP5). Biases in variability (Supplement Fig. S3) exhibit similar patterns to biases in the mean.
Critical for many extreme events is the state of the atmospheric circulation, features of which are known to be poorly reproduced in current-generation climate models (Anstey et al., 2013; Harvey et al., 2014). For instance, strong anticyclonic air advecting from low latitudes can cause persistent, stable systems over western Europe during summer, leading to extremely hot and dry conditions (e.g. Pfahl and Wernli, 2012). Here, we use seasonal-mean geopotential height at 500 hPa as a proxy for the background atmospheric wave activity (Fig. 4). For a more detailed analysis of the dynamics in w@h1, see Mitchell et al. (2016a, b). Figure 4 shows that the largest anomalies in the Northern Hemisphere with respect to the reanalysis are during winter. The bias patterns are similar in both models, w@h1 and w@h2. This is unsurprising, because capturing mid-latitude jet variability is linked with model resolution (Berckmans et al., 2013), and while the regional model of w@h2 has increased horizontal resolution compared with w@h1, there is no two-way feedback with the global model, so the increase in regional model resolution cannot improve the global atmospheric dynamics. Consequently, no improvement in capturing geopotential height is seen in the Northern Hemisphere. The only major difference between the two model versions is seen in the Southern Hemisphere, in particular over the JJA and SON seasons. However, this is most likely not due to the model version, but rather to the use of different SST datasets. Indeed, HadISST2 (used in w@h2) exhibits lower SSTs in the Southern Hemisphere compared to HadISST1 used in w@h1 (not shown). The underestimation of winter geopotential height variability, as well as summer variability over Europe, is improved in w@h2 (Supplement Fig. S4), but the improvements are small overall, likely also because both models use the same GCM resolution.
Finally, to assess whether the 1-year spin-up was sufficient to allow the soil variables to be spun up, Supplement Fig. S5 shows the difference between ensemble mean soil moisture (for each soil layer) in December between the 1st month and the 13th month of the analysed simulations (i.e. 13th and 25th month of simulation, respectively), scaled by the standard deviation of the latter. Apart from northern Africa, the differences are confined to the third (central Asia) and fourth (many regions) layers. This suggests that a longer spin-up may be required in future experiments with w@h2. Fortunately, however, the upper 1 m of the soil, corresponding to the root zone in most regions and therefore most critical for evapotranspiration, appears relatively well spun-up over Europe. Nonetheless, the soil moisture state in deeper layers may in some cases impact soil moisture dynamics in the root zone and, thereby, affect land-atmosphere exchange and surface climate. It is not possible to assess whether an additional year would lead to further changes, as such simulations are not available, and soil temperature is not examined here as this variable has not been saved in our simulations. The impact on temperature biases is shown in Supplement Fig. S6. The largest impact is found in DJF, but it is unlikely to be due to soil moisture as it spans all latitudes. The most striking difference is a reduction of the bias over south-eastern Europe and the central US, which may be driven by increased soil moisture in these regions with soil moisture-limited evapotranspiration regimes (Seneviratne et al., 2010) and possibly by effects of soil temperature. An impact is also found in MAM. This suggests that a longer spin-up might further reduce the warm summer temperature bias of the model. For precipitation (Supplement Fig. S7), the impact is small globally, except in DJF and, in the other seasons, over the Sahara (note that % biases are very sensitive to small changes in this region). DJF impacts are found throughout latitudes and are thus unlikely to be a soil moisture spin-up issue, but may result from changes in circulation induced by temperature changes. These results highlight that a longer spin-up may be required, which will be implemented for future w@h2 experiments.
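The spin-up diagnostic described above can be sketched for a single grid cell and soil layer as follows. The sign convention (later December minus earlier December), the use of the sample standard deviation, the function name, and the soil moisture values are all assumptions made for illustration and are not taken from the analysis itself.

```python
import statistics


def scaled_spinup_difference(first_december, second_december):
    """Difference in ensemble mean soil moisture between two Decembers,
    scaled by the ensemble standard deviation of the later one.

    first_december:  ensemble values in the 1st month of the analysed runs
    second_december: ensemble values in the 13th month of the analysed runs
    """
    mean_first = statistics.mean(first_december)
    mean_second = statistics.mean(second_december)
    # Assumption: sample standard deviation across ensemble members.
    spread_second = statistics.stdev(second_december)
    if spread_second == 0:
        raise ValueError("ensemble spread is zero; cannot scale difference")
    return (mean_second - mean_first) / spread_second


# Made-up volumetric soil moisture for three ensemble members.
december_start = [0.20, 0.22, 0.24]
december_end = [0.26, 0.28, 0.30]
drift = scaled_spinup_difference(december_start, december_end)  # about 3
```

Values much larger than one in magnitude would indicate a drift that exceeds the ensemble spread, i.e. a layer that is unlikely to be in equilibrium after the 1-year spin-up.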

Global land time series
Given the use of the model for attribution, another interesting question is whether the model is able to simulate the response to external forcings, such as CO2, aerosols, and volcanoes. In this section, we focus on the global mean response over land and show time series of global land yearly averages in temperature and precipitation (Fig. 5, relative to 1961-1990; see Fig. S8 for raw values). The interquartile (25-75 %) and 5-95 % ranges of the w@h2 ensemble members for each year provide an estimate of the unpredictable (chaotic) component of atmospheric variability, while variations between years depict the response to the model forcings (including SSTs and sea ice). For temperature, years with strong positive or negative anomalies often match between the observations and the model, and CRU-TS mostly lies within the 90 % confidence interval of the w@h2 ensemble (71 % of the years, suggesting that variability at the global scale might be slightly underestimated). The global trend also seems well captured, such as the faster warming since the 1980s. Although this may not be surprising since others have found that prescribing SSTs may strongly force trends over land (e.g. Shin and Sardeshmukh, 2011), we note that regional trends computed from various ensemble members suggest a large range of trends despite the prescription of SSTs (see Sect. 4.4). The actual temperature values (Fig. S8a) are very similar to the anomalies (Fig. 5a). For precipitation (Fig. 5b), some discrepancy is found between about 1915 and 1945, when the model simulates too much rainfall, but observational error is also likely larger in this period. Although CRU-TS appears to lie more often outside the w@h2 ensemble for precipitation than for temperature (observed values are within the 5-95 % range from w@h2 in only 58 % of the years), some of the spikes (e.g. mid 1950s, early 1970s, late 1990s) and troughs (e.g. mid 1960s, early 1990s) are found in both model and observations, suggesting that HadAM3P is able to reproduce some of the sensitivity of precipitation to drivers such as SSTs. It should be noted, however, that unlike for temperature, the long-term precipitation average is substantially lower in the model than in observations (Fig. S8b), indicating larger biases at the global scale. Similar time series plots for the 26 SREX regions are shown in Figs. S9-S12. Overall, variability from year to year is well captured by the model, suggesting a good model sensitivity to SSTs, greenhouse gases, and other drivers. Some regions show a strong dependence of temperature and precipitation on the underlying SST patterns, especially over the tropics (most regions in South America, Africa, and South and Southeast Asia), as opposed to other regions where most of the model spread appears to be due to internal variability within the atmosphere only. These time series suggest that the model's response to external factors is reliable in most regions of the globe.
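The coverage statistics quoted above (the share of years in which the observed value falls within the 5-95 % ensemble range) could be computed along the following lines. This is a minimal sketch under stated assumptions: the dictionary-based data layout, the function name, and the inclusive percentile method are choices made here for illustration, not a description of the actual analysis code.

```python
import statistics


def fraction_within_range(ensemble_by_year, obs_by_year):
    """Fraction of years in which the observation lies within the
    5-95 % range of the ensemble members for that year.

    ensemble_by_year: {year: [member values]}
    obs_by_year:      {year: observed value}
    """
    years = sorted(set(ensemble_by_year) & set(obs_by_year))
    if not years:
        raise ValueError("no overlapping years")
    inside = 0
    for year in years:
        # n=20 yields cut points every 5 %, so the first and last cut
        # points are the 5th and 95th percentiles.
        cuts = statistics.quantiles(
            ensemble_by_year[year], n=20, method="inclusive"
        )
        low, high = cuts[0], cuts[-1]
        if low <= obs_by_year[year] <= high:
            inside += 1
    return inside / len(years)


# Made-up example: identical 21-member ensembles for three years.
members = [float(v) for v in range(21)]
ensembles = {1961: members, 1962: members, 1963: members}
observations = {1961: 10.0, 1962: 19.5, 1963: 1.0}
coverage = fraction_within_range(ensembles, observations)
```

If the ensemble spread matched the real year-to-year uncertainty, the observations would be expected to fall within this range in roughly 90 % of years, which is why coverage values of 71 % and 58 % point towards underestimated variability, model bias, or observational error.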

Regional model validation
We now move to the validation of the HadRM3P regional climate model within w@h2. As for the validation of HadAM3P in the previous section, we analyse seasonal mean biases in surface air temperature and precipitation and compare these to those in w@h1 over a 30-year period from 1961 to 1990 (Sect. 4.1). These biases are analysed in detail for the sub-regions shown in Fig. 1, with a focus on the mean biases for regional averages and the geographical distribution of temperature and precipitation within each sub-region. The origin of the mean biases is also investigated in Sect. 4.2. We then look at the ability of the model to represent extremes by means of quantile-quantile plots in Sect. 4.3. The sensitivity of the model to forcings for sub-regions within the European domain is then investigated using reliability diagrams (Sect. 4.4).

Mean biases
HadRM3P mean biases in temperature (Fig. 6, with respect to the E-OBS dataset) are similar to those of HadAM3P, including the warm bias in summer. This particular bias, however, is substantially reduced in w@h2 relative to w@h1, over most of central and south-eastern Europe in HadRM3P (by 1-2 °C, Fig. 6i). This contrasts with results from the HadAM3P GCM, for which this bias worsens in this region and season (Sect. 3.1 and Fig. 2i). We note that in w@h1 the summer temperature bias was larger in HadRM3P than in HadAM3P (Fig. 10 in Massey et al., 2015), while in w@h2 the biases are more consistent between the global and regional models. Hence, the improvement in HadRM3P in w@h2 compared to w@h1 comes from not increasing the global model bias. This improvement could be a result of the higher horizontal resolution in w@h2 (0.22°, versus 0.44° in w@h1), which could explain why this bias is reduced in HadRM3P but not in HadAM3P. The improved representation of the land surface with the introduction of MOSES 2 may also contribute to this improvement, consistent with other studies (e.g. Davin et al., 2016). Feedbacks between the land surface and the atmosphere have indeed been shown to be key to summer temperature in these regions, in particular for hot extremes (e.g. Quesada et al., 2012). The origin of the biases is investigated in greater detail in Sect. 4.2. Probably as a side-effect of this bias reduction, the warm bias extends further north in w@h2, inducing a slight degradation of model performance over Scandinavia and western Russia. Other changes with the introduction of w@h2 include the vanishing of a small warm bias over central and eastern Europe in SON but the appearance of a new small warm bias over eastern Europe (Ukraine, Belarus) in DJF and MAM. Table 2 shows the biases in regional averages for the eight regions from the PRUDENCE project shown in Fig. 1. As a complement, Fig. 7 summarizes the temperature biases at the grid cell level for the sub-regions expressed as the spatial root mean squared biases (RMSBs) in each region. Given that the two regional models are run at different resolutions and that the E-OBS dataset is available on both model grids, RMSB is computed at both resolutions for each model in order to allow for a fair comparison, by bilinearly interpolating w@h1 data to 0.22° and aggregating w@h2 data to 0.44°. The improvement in JJA is found at both resolutions in all regions except Scandinavia (SC), while in other seasons the differences between the two models are found to be rather small at the scale of the analysed regions.
We now examine the biases in precipitation. Figure 8 shows the seasonal mean biases in both model versions and their difference (see Fig. S13 for relative precipitation biases). The biases are very similar between both models. In particular, the dry bias over eastern Europe in JJA is not reduced in w@h2, which sheds some light on the mechanisms leading to the reduced temperature bias in this region and season. The introduction of the more sophisticated MOSES 2 land-surface scheme may impact climate in two main ways: first, MOSES 2 may better simulate evapotranspiration (e.g. by better distributing water across storage components or improved stomatal resistance parameterization), thereby leading to an improved partitioning of the energy available at the land surface into sensible and latent heat fluxes. Improved surface fluxes, in particular sensible heat flux, directly lead to an improved simulated temperature. Second, altered surface fluxes may additionally impact precipitation (e.g. Guillod et al., 2014, 2015), feeding back on the biases. For instance, precipitation may increase as a response to increased evapotranspiration, which may further reduce the biases by providing more water for further evapotranspiration, thereby leading to cooler and wetter conditions.
The absence of an improvement in simulated precipitation over eastern Europe suggests that this second pathway does not dominate the response. Instead, it is either the direct improvement in simulated evapotranspiration in MOSES 2 or other factors unrelated to the land-surface scheme, such as increased horizontal resolution, which reduce temperature biases.
Figure 9 provides an overview of the precipitation biases at the grid cell scale within each sub-region by showing the precipitation RMSB (as in Fig. 7 for temperature), complemented by Table 2 for the bias of regionally averaged precipitation. Unlike for temperature, model performance for precipitation is highly dependent on horizontal resolution and the interpretation is less straightforward. The region with the largest precipitation biases at the grid cell scale is the Alps (AL). There, the biases are largest for each model at its own resolution, but smaller when interpolated or aggregated to the other resolution. This is expected for w@h2, as aggregating the data to a coarser grid allows biases of opposing signs in neighbouring grid cells to compensate each other. As a result, w@h2 clearly outperforms w@h1 over the Alps at 0.44° resolution. However, the improvement of w@h1 performance after bilinear interpolation to the higher resolution may seem surprising. It suggests that the locations of the peaks in precipitation are shifted relative to the observations, leading to large local biases of both signs within the region, a feature that can indeed be observed in Fig. 8. The geographical distribution of precipitation, quantified by the spatial correlation between seasonally averaged precipitation in model and observations (Fig. S14), highlights that, in most cases, the spatial correlation increases with interpolation or aggregation, while no significant difference between the models is found at each model's respective resolution.
The better resolution of topography therefore does not particularly improve the simulation of spatial patterns within the regions, even over the Alps. The smoothing of the field that results from bilinearly interpolating to 0.22° thereby artificially reduces the overall bias. This result is consistent with earlier findings showing that the model exhibits a somewhat exaggerated rain-shadow effect (Buonomo et al., 2007), also seen here with a dry bias south of the Alps. This effect also likely plays a role in the better performance of w@h2 at 0.44°, which should therefore be treated with caution (see e.g. the apparent improvement in w@h2 found in Fig. 9, where the bias difference is shown at 0.44°). Nonetheless, it should be noted that, for example in JJA, the precipitation bias is halved when considering regional averages over the Alps (Table 2), while no such difference is found at the grid cell scale (Fig. 9), highlighting again the scale dependency of the biases. This improvement found in JJA at the regional scale, however, does not hold in other seasons. Overall, these results suggest that, in regions with complex topography, the analysis of regionally aggregated data may be more appropriate than analysis at the grid cell scale.
Finally, the impact of the short spin-up is evaluated as was done in Sect. 3.1 for HadAM3P. Figure S15 shows the difference in soil moisture as in Fig. S5 (see Sect. 3.1). Over Europe, only Finland and north-western Russia display large differences in the upper 1 m of the soil. In the deepest layer, soil moisture is larger in the analysed year than in the previous year over south-eastern Europe and in some other regions, but this deep layer may be less critical to evapotranspiration and therefore to surface climate. Analyses of temperature and precipitation biases (Figs. S16 and S17) show that the hot MAM and JJA biases over south-eastern Europe are reduced with progressing spin-up, as expected from the increasing soil moisture, suggesting that a longer spin-up may further reduce this bias. Temperature biases in DJF and precipitation biases in all seasons are not related to soil moisture changes in a straightforward manner, and hence could be due to soil temperature, a variable not saved as an output in our simulations and therefore not analysed here.
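The scale dependency of the grid-cell RMSB discussed above can be illustrated with a minimal sketch. It is not the analysis code used for Figs. 7 and 9: the function names are hypothetical, the bias values are invented, grid cells are given equal weight (no area weighting), and aggregation is a plain average over 2x2 blocks standing in for the 0.22° to 0.44° step.

```python
import math


def spatial_rmsb(model, obs):
    """Root mean squared bias over the grid cells of one region.

    model, obs: equal-length sequences of time-mean values, one per grid cell.
    Cells are weighted equally (an assumption; no area weighting).
    """
    if len(model) != len(obs) or not model:
        raise ValueError("model and obs must be non-empty and of equal length")
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(model))


def aggregate_2x2(field):
    """Average non-overlapping 2x2 blocks of a 2-D list of values.

    A trailing odd row or column is dropped.
    """
    rows, cols = len(field), len(field[0])
    return [
        [
            (field[i][j] + field[i][j + 1] + field[i + 1][j] + field[i + 1][j + 1]) / 4.0
            for j in range(0, cols - 1, 2)
        ]
        for i in range(0, rows - 1, 2)
    ]


# Invented example: precipitation biases of opposing signs in neighbouring cells.
model_fine = [[2.0, -2.0], [-2.0, 2.0]]
obs_fine = [[0.0, 0.0], [0.0, 0.0]]

flat = lambda grid: [v for row in grid for v in row]
rmsb_fine = spatial_rmsb(flat(model_fine), flat(obs_fine))
rmsb_coarse = spatial_rmsb(flat(aggregate_2x2(model_fine)), flat(aggregate_2x2(obs_fine)))
print(rmsb_fine, rmsb_coarse)
```

In this toy case the fine-grid RMSB is large while the aggregated RMSB vanishes, which is the compensation effect invoked for w@h2 at 0.44°.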

Origin of the biases
To investigate the causes of the biases, and in particular the role of the land surface in these, we analyse surface radiative and turbulent fluxes. Figure 10 shows the seasonal cycle of HadRM3P biases for each region and a number of variables. This analysis was conducted over years 1984-2006 instead of the 1961-1990 period analysed in previous sections due to the availability of observations of land-surface variables such as radiation (SRB dataset) and surface turbulent fluxes (FLUXNET-MTE dataset). As a consequence, only w@h2 simulations are analysed, since the w@h1 ensemble does not span this period. The warm and dry summer biases appear clearly in Fig. 10a, b, in particular over eastern Europe (EA) and the Mediterranean (MD) regions. Positive biases in net shortwave radiation at the surface (Fig. 10d) are found in most regions from April/May to September, and are mostly driven by an underestimation of cloud cover (not shown; see Massey et al., 2015, for cloud cover biases in w@h1-HadAM3P). The overestimation of incoming energy is most pronounced in June and July in EA, and may explain part of the warm biases.
The turbulent heat fluxes provide further insights into the RCM biases: sensible heat flux (H, Fig. 10e), latent heat flux (λE, Fig. 10f), and the partitioning of the energy available at the land surface into these two fluxes as expressed by the evaporative fraction, EF = λE / (λE + H), i.e. the fraction of the turbulent fluxes that is used for evapotranspiration. EF (Fig. 10g) is overestimated in spring but underestimated in summer, a decrease (relative to observations) that is a sign of excessive summer soil moisture depletion. In fact, the overestimation of λE in spring may itself contribute to excessive soil moisture depletion, although precipitation minus evaporation (Fig. 10c) does not exhibit particularly negative biases (note, however, that observed precipitation might be underestimated since the E-OBS dataset does not correct for the systematic undercatch of rain gauge measurements). The result of this drying observed as a bias in EF is (i) an overestimation of H, particularly in July and August, and (ii) a concurrent underestimation of λE. The overestimation of H likely contributes to the positive temperature bias in these months. In fact, the MD region appears to be strongly affected by the biases in turbulent fluxes, which may explain its large warm bias despite a radiation bias smaller than other regions such as EA. The underestimation of λE, on the other hand, implies an overly dry boundary layer, which in turn may lead to an underestimation of cloud cover and precipitation.
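As a worked illustration of the evaporative fraction defined above, the following sketch computes EF from a pair of fluxes. The function name and the flux values are hypothetical and chosen only to show how a shift from latent to sensible heat lowers EF; they are not taken from Fig. 10.

```python
def evaporative_fraction(latent, sensible):
    """EF = latent / (latent + sensible), with both fluxes in the same units (e.g. W m-2)."""
    total = latent + sensible
    if total == 0:
        raise ValueError("the sum of the turbulent fluxes must be non-zero")
    return latent / total


# Invented summer values: the "model" case has less latent and more sensible heat.
ef_reference = evaporative_fraction(latent=90.0, sensible=30.0)
ef_model = evaporative_fraction(latent=60.0, sensible=60.0)
print(ef_reference, ef_model, ef_model - ef_reference)
```

A negative EF difference of this kind corresponds to the summer situation described in the text, with H overestimated and λE underestimated.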
These results show that despite the improvements found in w@h2 following the use of a more sophisticated land-surface scheme, some deficiencies remain. Part of the biases in temperature and precipitation can be explained by the land surface. The origin of these land-surface biases could lie in atmospheric parameterizations (e.g. of cloud and precipitation formation), which provide too little precipitation and too much incoming shortwave radiation. Alternatively, deficiencies in the land surface could be the driver of the fast drying of the soils, which in turn feeds back onto the atmosphere, leading to the observed cloud, radiation, and precipitation biases. A combination of both the atmosphere and the land surface likely leads to the observed biases, but identifying the driver of these biases is outside the scope of this paper.

Extreme events
The ability of the weather@home ensemble modelling system to generate a large number of simulations makes it particularly attractive for the study of extreme weather events and their attribution to anthropogenic climate change. Various extreme events have been investigated using weather@home, such as floods and heat waves (Otto et al., 2012; Mitchell et al., 2016b). In this section, we analyse the performance of the model for the following extreme events: hot summer days, cold winter nights, and heavy precipitation days in both seasons. Figures 11 and 12 show quantile-quantile plots for the eight regions for different variables and seasons, using all overlapping years between E-OBS and our w@h2 ensemble. The dots and crosses show the values at specific quantiles for the whole ensemble, with filled dots for deciles, empty dots for the values at percentiles 1 to 5 and 95 to 99, and crosses for the 0.5 and 99.5 percentile values. The envelopes indicate the spread among ensemble members, to assess both uncertainty and internal variability of the model, as follows: 1000 bootstrap samples are constructed, each with one ensemble member per year, thereby containing the same total number of days as the observations. The envelope displays the 95 % range of the quantile values computed from each bootstrap sample.
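The construction of the bootstrap envelope can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used for Figs. 11 and 12: the data layout (a dictionary mapping each year to a list of members, each member being a list of daily values), the function names, and the linear-interpolation quantile are all hypothetical choices.

```python
import random


def quantile(sorted_vals, q):
    """Quantile (0 <= q <= 1) of an already sorted list, by linear interpolation."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_envelope(ensemble, q, n_boot=1000, seed=0):
    """95 % range of the q-quantile across bootstrap samples.

    ensemble: dict mapping year -> list of members; each member is a list of
    daily values for the season of interest (assumed layout).
    Each bootstrap sample draws one member per year, so it contains as many
    days as a single observed record covering the same years.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        days = []
        for members in ensemble.values():
            days.extend(rng.choice(members))
        days.sort()
        estimates.append(quantile(days, q))
    estimates.sort()
    return quantile(estimates, 0.025), quantile(estimates, 0.975)


# Tiny invented ensemble: two years, two members each, three "days" per member.
toy = {
    2000: [[21.0, 24.0, 30.0], [20.0, 25.0, 33.0]],
    2001: [[19.0, 23.0, 29.0], [22.0, 26.0, 35.0]],
}
low, high = bootstrap_envelope(toy, q=0.9, n_boot=200)
print(low, high)
```

A narrow envelope relative to the model-observation difference indicates, as in the text, that a bias is not an artefact of sampling a single realization.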
We first investigate model performance for hot summer extremes, quantified by the daily maximum temperature (in red in Fig. 11). High daily maximum temperature values are overestimated in all regions. Interestingly, in most regions, the quantiles match the observations very well in the colder half of the data, but not in the warmer tail, highlighting that the warm biases in hot extremes in these regions are responsible for the warm bias in mean temperature. In MD and EA, however, even the cold tail of daily maximum temperature is overestimated. Interestingly, these two regions can be expected to be in a regime where soil moisture is a major limiting factor to evapotranspiration, thereby strongly controlling summer temperature (e.g. Mueller and Seneviratne, 2012). The dry summer precipitation bias in these two regions (e.g. Fig. 8) can thus be expected to indeed induce a warming over a wide range of temperature quantiles. A possible reason for the bias to be restricted to warm extremes in the other regions may be that the model on some occasions produces an overly strong summer drying in these regions, inducing a shift into a soil moisture-controlled regime and thereby an amplification of temperature anomalies on hot days. Note that the spread from the bootstrap samples is small in most regions, highlighting that these biases do not result from internal variability, but are exhibited in any subsample of the same size as the observational data.
For cold winter temperatures (daily minimum temperature in DJF, in blue in Fig. 11), the model performs rather well. Apart from the regions MD, SC, and, to a lesser extent, ME and AL, where nighttime temperatures are underestimated or overestimated, observed cold quantile values are mostly within the range of the modelled values. Extreme cold nights in BI and FR, however, are also underestimated by the model (i.e. extreme cold nights are not cold enough). Overall, w@h2 appears to be suitable for the investigation of cold winter nights over Europe.
Figure 11. Quantile-quantile plots of the distribution of JJA daily maximum temperature (tasmax, red) and DJF daily minimum temperature (blue).
Figure 12. Same as Fig. 11 but for daily precipitation: quantile-quantile plots for JJA (red) and DJF (blue) comparing w@h2 to E-OBS over all overlapping years. Here, the same axes are used for both seasons.
For daily precipitation (Fig. 12 with JJA in red and DJF in blue), the spread between bootstrap samples is larger. In summer, heavy precipitation days are very well represented in all regions apart from BI and EA, where the quantile values are underestimated by w@h2. These regions also exhibit relatively large negative mean precipitation biases (e.g. Table 2). Nonetheless, it appears that overall w@h2 does a reasonable job at simulating summer heavy precipitation extremes in most European regions. Daily winter heavy precipitation (in blue in Fig. 12), on the other hand, is overestimated in most regions (especially in MD, SC, AL, and EA), but well simulated in BI and IP, with intermediate performances in FR and ME. We note that, unlike for temperature, most precipitation quantile-quantile plots display a rather linear shape, suggesting that for applications where bias correction is necessary, applying a linear method may be appropriate.
These results provide some confidence in the ability of w@h2 to simulate extreme events over Europe. A few exceptions include summer hot extremes, which are overestimated over all regions. A range of bias-correction methodologies are available to take such biases into account, ranging from a simple additive ("delta method", for temperature) or multiplicative ("linear scaling", for precipitation, e.g. Lafon et al., 2013) adjustment based on the mean, to sophisticated methods that attempt to correct for changes in the shape of the distribution, such as quantile-quantile mapping (e.g. Wood et al., 2004). The shapes of the quantile-quantile plots for summer daily maximum temperature (Fig. 11) suggest that the application of a simple additive bias correction may not be suitable for correcting extremes. A multiplicative factor applied to precipitation, on the other hand, seems appropriate in most regions. However, these bias-correction techniques may not preserve the physical consistency between variables that is provided by the model, which may be an issue in the case of impact studies. In the case of large ensembles such as those from weather@home, a new bias-correction methodology (Sippel et al., 2016b), based on the resampling of ensemble members conditional on the distribution of, e.g. summer averaged temperature over a region of interest, has been shown to not only improve seasonal averages, but also the representation of extremes. This new methodology is promising for a wide range of applications with weather@home model output.
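The three families of bias correction mentioned above can be contrasted in a short sketch. The functions below are simplified, hypothetical variants written for illustration (in particular, the quantile mapping is a nearest-rank empirical version); they are not the implementations of Lafon et al. (2013), Wood et al. (2004), or Sippel et al. (2016b), and all numbers are invented.

```python
from bisect import bisect_right
from statistics import mean


def delta_correct(model, obs_ref, model_ref):
    """Additive correction: shift by the difference in reference-period means."""
    shift = mean(obs_ref) - mean(model_ref)
    return [x + shift for x in model]


def linear_scaling(model, obs_ref, model_ref):
    """Multiplicative correction: scale by the ratio of reference-period means."""
    factor = mean(obs_ref) / mean(model_ref)
    return [x * factor for x in model]


def quantile_map(model, obs_ref, model_ref):
    """Nearest-rank empirical quantile mapping.

    Each model value is replaced by the observed reference value that has the
    same empirical non-exceedance probability in its own distribution.
    """
    obs_sorted = sorted(obs_ref)
    mod_sorted = sorted(model_ref)
    n_mod, n_obs = len(mod_sorted), len(obs_sorted)
    corrected = []
    for x in model:
        p = bisect_right(mod_sorted, x) / n_mod  # fraction of model values <= x
        idx = min(max(int(round(p * n_obs)) - 1, 0), n_obs - 1)
        corrected.append(obs_sorted[idx])
    return corrected


# Invented temperatures with a warm bias that grows towards the hot tail.
obs_ref = [11.0, 19.0, 27.0, 33.0]
model_ref = [10.0, 20.0, 30.0, 40.0]
print(delta_correct([40.0], obs_ref, model_ref))  # shifts every value by the same amount
print(quantile_map([40.0], obs_ref, model_ref))   # maps the model maximum to the observed maximum
```

When the bias depends on the quantile, as for summer daily maximum temperature, the additive shift leaves the hot tail too warm, whereas a distribution-based mapping adjusts it; neither approach guarantees physical consistency between variables.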

Reliability and trends
A common use of climate models, including weather@home, is the study of the response of climate to forcing agents. In particular, weather@home is regularly used for the attribution of extreme weather events to anthropogenic climate change. An obvious question is then the following: is the model reliable, i.e. does it simulate well the response to potential drivers such as sea surface temperature and greenhouse gases? In this section, we investigate the reliability of w@h2 for simulating seasonally averaged events: warm summers, cold winters, dry summers, and wet winters. While seasonal averages are not directly related to extreme weather events, the drivers of both are likely similar (e.g. higher CO2 leads to increased mean and extreme temperature), and the occurrence of a few extreme events may strongly impact the seasonal average.
Figures 13 to 16 show reliability diagrams (Weisheimer and Palmer, 2014) for these four types of seasonal events and the eight analysed regions, using w@h2 and CRU-TS data from 1901 to 2006. For each type of event (e.g. high summer temperature, defined as JJA averaged temperature in the upper tercile), the probability of the event is computed for each year from regionally averaged w@h2 model output ("forecast probability"). The 106 forecasts (one per year) are then grouped into bins of size 0.1, and the corresponding observed frequency ("observed relative frequency") is computed from the observations in the corresponding years, with uncertainties derived from bootstrapping (Wilks, 2011; NCAR - Research Applications Laboratory, 2015). The forecast and observed values for each bin are then plotted with the size of the dot proportional to the sample size (i.e. number of years). Results for bins containing at least five data points (i.e. years) are shown in red, while for other bins, shown in black, values are not very robust and should be interpreted with caution. The grey background in each plot shows the skill region, i.e. where data contribute positively to the Brier skill score. Here, we follow a commonly used method (e.g. Weisheimer and Palmer, 2014) whereby the tercile definition is based on the observed and modelled distributions, respectively, i.e. a model's forecast of a warm summer is when the temperature is in the upper tercile (i.e. upper third) of its own distribution.
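The binning step behind a reliability diagram can be sketched as follows. This is a minimal illustration, not the verification code used for Figs. 13 to 16: the function names are hypothetical, the tercile threshold is assumed to have been computed beforehand from the model's own distribution, and the bootstrap uncertainties and skill region are omitted.

```python
def forecast_probability(members, threshold):
    """Fraction of ensemble members above the threshold in one year."""
    return sum(1 for value in members if value > threshold) / len(members)


def reliability_points(forecast_probs, observed_events, bin_width=0.1):
    """Group yearly forecasts into probability bins.

    forecast_probs: one forecast probability per year.
    observed_events: one boolean per year (did the event occur in the observations?).
    Returns, for every non-empty bin, a tuple
    (mean forecast probability, observed relative frequency, number of years).
    """
    n_bins = int(round(1.0 / bin_width))
    bins = [[] for _ in range(n_bins)]
    for prob, event in zip(forecast_probs, observed_events):
        k = min(int(prob * n_bins), n_bins - 1)  # a probability of 1.0 goes into the last bin
        bins[k].append((prob, event))
    points = []
    for entries in bins:
        if entries:
            n = len(entries)
            mean_prob = sum(p for p, _ in entries) / n
            obs_freq = sum(1 for _, e in entries if e) / n
            points.append((mean_prob, obs_freq, n))
    return points


# Invented example with four years of forecasts.
probs = [0.05, 0.05, 0.85, 0.85]
events = [False, True, True, True]
for mean_prob, obs_freq, n_years in reliability_points(probs, events):
    print(mean_prob, obs_freq, n_years)
```

Points close to the 1 : 1 line indicate that forecast probabilities match observed frequencies; bins with few years, like those in this toy example, are the ones flagged as not robust in the figures.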
In order to facilitate interpretation, reliability is further classified into five categories using the definition proposed by Weisheimer and Palmer (2014). To do so, 1000 bootstrap samples with replacement were constructed from the full set of w@h2 data. A reliability diagram was constructed for each of them, and a weighted linear regression was applied to its points, using the number of forecasts in each bin as weights. The 75 % confidence interval (uncertainty range) of the regression slopes is used to categorize forecasts into five classes, from 1 (dangerously useless) to 5 (perfect forecast) (Weisheimer and Palmer, 2014). Table 3 provides some detail on the definition of the five categories, and the category is indicated in the upper left of each panel in Figs. 13-16.
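The slope-based step of this classification can be sketched as below. Several choices are assumptions made for illustration: the bootstrap here resamples years with replacement (the text only states that samples are drawn from the full set of w@h2 data), the function names are hypothetical, and the mapping from the slope interval to the five categories is left out because it depends on the definitions in Table 3.

```python
import random


def bin_forecasts(probs, events, n_bins=10):
    """Return (mean forecast probability, observed frequency, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, e in zip(probs, events):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, e))
    return [
        (sum(p for p, _ in b) / len(b), sum(1 for _, e in b if e) / len(b), len(b))
        for b in bins
        if b
    ]


def weighted_slope(points):
    """Weighted least-squares slope of y on x for points given as (x, y, weight)."""
    w_sum = sum(w for _, _, w in points)
    x_bar = sum(w * x for x, _, w in points) / w_sum
    y_bar = sum(w * y for _, y, w in points) / w_sum
    s_xy = sum(w * (x - x_bar) * (y - y_bar) for x, y, w in points)
    s_xx = sum(w * (x - x_bar) ** 2 for x, _, w in points)
    if s_xx == 0:
        raise ValueError("at least two distinct forecast probabilities are required")
    return s_xy / s_xx


def bootstrap_slope_interval(probs, events, n_boot=1000, level=0.75, seed=0):
    """Central `level` interval of the reliability slope over bootstrap samples of years."""
    rng = random.Random(seed)
    n = len(probs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        points = bin_forecasts([probs[i] for i in idx], [events[i] for i in idx])
        if len(points) >= 2:  # a slope needs at least two occupied bins
            slopes.append(weighted_slope(points))
    slopes.sort()
    lower = slopes[int((1 - level) / 2 * (len(slopes) - 1))]
    upper = slopes[int((1 + level) / 2 * (len(slopes) - 1))]
    return lower, upper


# Invented, perfectly separated forecasts: low probabilities never verify, high ones always do.
probs = [0.05] * 10 + [0.95] * 10
events = [False] * 10 + [True] * 10
print(bootstrap_slope_interval(probs, events, n_boot=200))
```

A slope interval close to 1 corresponds to a reliable forecast, while an interval including zero or negative values corresponds to the least useful categories discussed below.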
Reliability diagrams for warm summers (JJA temperature in the upper tercile, Fig. 13) show that the model is very reliable at simulating the dependency of this quantity on forcings and displays good resolution, albeit with a small underconfidence, i.e. the model tends to over-forecast low probability events but under-forecast high probability events (see Wilks, 2011, for details on the interpretation of reliability diagrams). Such forecasts can typically perform very well after calibration. Interestingly, this underestimation of the sensitivity of hot temperatures to forcings is consistent with the tendency of RCMs to underestimate trends in heat waves over Europe (Min et al., 2013; Sippel et al., 2016a). All regions display skill that is "still very useful for decision making" or "perfect" (categories 4 and 5, respectively). Note that for a few bins (e.g. for forecast values above 0.7 in IP), observations are in the upper tercile for all years with such forecast (modelled) probabilities, preventing the bootstrapping method from computing uncertainty ranges for individual bins (note that the uncertainty of the linear fit used to categorize the performance can still be computed). In most cases, we also find that data points that lie far from the 1 : 1 line (e.g. for forecast probabilities greater than 0.4 in FR) include very few years and should therefore be interpreted cautiously (black dots, including fewer than five years or "forecasts"). A similarly good performance is found for the occurrence of low winter temperature (DJF temperature in the lower tercile, Fig. 14). Thus, the overall high reliability of w@h2 for simulating warm summers and cold winters provides some confidence in weather@home-based attribution statements for temperature over Europe.
Reliability diagram captions (fragment): Grey shading indicates where data points contribute positively to skill (Wilks, 2011). Performance category is indicated in the upper left of each plot, on a scale from 1 (dangerously useless) to 5 (perfect) (see Table 3) as in Weisheimer and Palmer (2014).
The reliability of the model for seasonal averages of precipitation is found to be lower. For low summer precipitation (Fig. 15), the reliability is found to be marginally useful for IP and EA, and not useful for FR. The reliability in other regions is even lower ("dangerously useless"), as the slope of the linear fit is slightly negative. A more positive picture is found for high winter precipitation: perfect forecasts are identified for ME and SC (Fig. 16), and still marginally useful performance for IP and BI. The reliability is classified as "dangerously useless" for MD, FR, and EA (Weisheimer and Palmer, 2014). The relatively low skill for precipitation should however be expected and it is consistent with low seasonal predictability in Europe found in other studies (e.g. Weisheimer and Palmer, 2014). It should be noted that, as Figs. 13 to 16 are based on 1901-2006, they include the influence of all temporally varying factors, including greenhouse gases, sea surface temperature and sea ice, aerosols, and volcanoes. Therefore, these results are dominated by the long-term trend arising from increased greenhouse gas concentrations, rather than by year-to-year sea surface temperature variability, for example.
Trends in regional averages of temperature and precipitation, quantified using the Theil-Sen slope with Mann-Kendall significance testing (e.g. Yue et al., 2002), are shown in Fig. 17 for summer and winter. For w@h2, we constructed 1000 106-year time series by randomly sampling one simulation per year, from which trends and p values are derived. Boxplots summarize these 1000 trend values and are overlaid by white dots depicting the observed trend from CRU-TS. The value at the bottom of each boxplot indicates the percentage of w@h2 time series with a significant trend, with an asterisk if the observed trend is significant.
Overall, observed temperature trends are well within the interquartile range of modelled trends, although they are underestimated by the model in IP, FR, and AL. Thus, w@h2 follows the tendency of RCMs to underestimate temperature trends over Europe (Min et al., 2013). For precipitation, on the other hand, trends are noisy and clustered around 0, and observed trends often lie at the tail of the w@h2 trend distributions. This could explain the overall poor reliability in seasonal averages of precipitation found in Figs. 15 and 16. Attempts to isolate the response to the oceans (SSTs and sea ice) by using anomalies from a 31-year running average (not shown) do not provide more insights, as the forecasts from individual years are all close to the climatological forecast of 1/3. This result is consistent with the time series shown in panels (e, k, o) in Figs. S9-S12, which show that for European regions the inter-member spread is substantially larger than the variability in the ensemble mean from year to year (long-term trend excepted). Therefore, in w@h2 most of the inter-annual variability in Europe is due to (unpredictable) internal variability in the atmosphere, rather than to specific SST or sea ice patterns, consistent with the relatively low seasonal predictability often found over Europe (e.g. Weisheimer and Palmer, 2014). Further work will investigate this more specifically and will aim at determining whether this finding is a model feature or can be confirmed by observations. Here, we simply note that Figs. S9-S12 suggest a different behaviour in some regions known to be strongly influenced by SST patterns such as the El Niño-Southern Oscillation.
Figure 17 (caption fragment): White dots show the observed regional trend estimated from CRU-TS. Theil-Sen linear trend slopes are computed using regional averages and significance is tested using a Mann-Kendall test. The numbers below the boxes indicate the percentage of w@h2 time series with a statistically significant trend (at the 5 % level), with an asterisk if the observed trend is significant.
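The trend analysis summarized in Fig. 17 can be sketched in a few lines. This is a simplified illustration under stated assumptions: the Mann-Kendall test below uses the normal approximation without tie correction or prewhitening, time steps are assumed equally spaced, the function names and the data layout are hypothetical, and the synthetic ensemble is invented.

```python
import math
import random
from statistics import median


def theil_sen_slope(series):
    """Median of all pairwise slopes, for equally spaced time steps (units per step)."""
    n = len(series)
    return median(
        (series[j] - series[i]) / (j - i) for i in range(n) for j in range(i + 1, n)
    )


def mann_kendall_p(series):
    """Two-sided p value of the Mann-Kendall trend test (normal approximation, no tie correction)."""
    n = len(series)
    s = sum(
        (series[j] > series[i]) - (series[j] < series[i])
        for i in range(n)
        for j in range(i + 1, n)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return math.erfc(abs(z) / math.sqrt(2))


def resampled_trends(ensemble_by_year, n_series=1000, seed=0):
    """Draw one member per year n_series times; return (slope, p value) for each series.

    ensemble_by_year: list ordered by year; each entry lists the seasonal means
    of all members available for that year (assumed layout).
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_series):
        series = [rng.choice(members) for members in ensemble_by_year]
        results.append((theil_sen_slope(series), mann_kendall_p(series)))
    return results


# Invented ensemble: 30 years, 5 members per year, weak warming plus noise.
noise = random.Random(1)
ensemble = [[0.02 * year + noise.gauss(0.0, 0.5) for _ in range(5)] for year in range(30)]
trends = resampled_trends(ensemble, n_series=100)
share_significant = sum(1 for _, p in trends if p < 0.05) / len(trends)
print(median(slope for slope, _ in trends), share_significant)
```

The share of resampled series with a significant trend plays the role of the percentage printed below each boxplot, against which the single observed trend can be compared.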

Conclusions
The new version of weather@home presented and validated in this paper is a powerful tool for the study of extreme weather events. The modelling set-up consists of the HadAM3P GCM driven by sea surface temperature, sea ice, and other forcings, which is downscaled over a sub-region by its RCM counterpart, HadRM3P. Using a distributed computing infrastructure (Massey et al., 2006), very large ensembles of climate model simulations can be generated, allowing one to examine rare extreme events with high statistical confidence.
Improvements in w@h2 include the use of a more recent land-surface scheme, MOSES 2, which uses tiles to represent land-surface-type heterogeneity within each grid cell, as well as a 2-fold increase in horizontal resolution in HadRM3P with the use of the 0.22° European CORDEX region. A large ensemble with about 100 members per year for years 1901-2006 has been generated, and is compared to a w@h1 ensemble over 1961-1990 (Massey et al., 2015). Overall, w@h2 shows reduced biases compared to w@h1, although the general bias patterns persist. Biases in HadAM3P are reduced in the Southern Hemisphere, while mixed results are found in the Northern Hemisphere. The model is found to be reliable in most regions and in terms of year-to-year variability in global temperature over land. In HadRM3P, the most striking bias reduction is found over eastern Europe, where a warm summer bias is reduced (but remains significant). Precipitation biases in HadRM3P, on the other hand, do not exhibit substantial improvements overall. Hot extremes are overestimated for all European regions, but cold extremes are well represented. The model is shown to perform particularly well for extreme daily precipitation.
A limitation of w@h2 as presented in this study is the relatively short spin-up (1 year). We find that a longer spin-up may further improve w@h2, in particular with respect to the representation of summer temperatures over south-eastern Europe. Future w@h2 experiments will therefore include a longer spin-up of 5-10 years, in order to allow for a full stabilization of soil moisture and soil temperature and to thereby take full advantage of the capability of the model.
One of the main uses of weather@home relates to the attribution of extreme weather events to anthropogenic climate change. The ability of the model to respond to forcing agents such as greenhouse gases and sea surface temperature was therefore examined over Europe. The model is reliable for seasonal averages of temperature although slightly underconfident, i.e. it might underestimate the impact of the forcing. The model's reliability is less satisfactory for seasonally averaged precipitation, although in most regions and seasons comparison with observations lies within uncertainties.
Another common use of weather@home output is for the generation of datasets of synthetic extreme events, to be used by the impact modelling community. For example, the ongoing MaRIUS project (Managing the Risks, Impacts and Uncertainties of droughts and water Scarcity) uses drought events in the UK for present and future conditions generated by weather@home to assess the risks associated with droughts. Using the weather@home modelling system allows for thousands of drought events to be generated and fed into various hydrological and impact models, thereby enabling a risk assessment framework to be applied to types of events with rather few observed occurrences.
For some applications, bias correction might be necessary. The availability of a large number of simulations allows for new methodologies to be applied, for example by re-sampling from the ensemble (Sippel et al., 2016b) or by giving weights to ensemble members in order to obtain distributions close to observations.
In this paper, we focused on the European region, but w@h2 is being developed over a range of regions. Collaborators around the world have already used weather@home, where HadRM3P is run over their region of interest, and the project is expected to continue establishing new regions with w@h2 in the future.
In conclusion, the improved physical representation of the land surface in w@h2 increases our confidence in the model's ability to simulate weather extremes, in particular hot extremes which can be highly related to land-surface-atmosphere interactions (e.g. Miralles et al., 2014), although some biases persist. Overall, weather@home may be a useful tool for the investigation of extreme weather events if proper bias corrections and other caveats are taken into account.
Code availability. HadRM3P is available from the UK Met Office as part of the Providing REgional Climates for Impacts Studies (PRECIS) program. Access to standard versions of the software is dependent on attendance at a PRECIS training workshop after which all source code, including that relevant to configuring HadAM3P, and other materials are made available (http://www.metoffice.gov.uk/research/applied/international-development/precis/obtain). These workshops are either held at the Met Office, for which a small fee is charged to cover the costs of the workshop delivery, or as part of a project, often in a region where PRECIS is to be applied. The code to manage and embed these models within the weather@home project is specific to their utilization within the BOINC environment and we consider it not within the scope of this publication.
The full set of model output data for the experiment used in this study will be freely available at the Centre for Environmental Data Analysis (http://www.ceda.ac.uk) in the next few months. Until the point of publication within the CEDA archive, please email cpdn@oerc.ox.ac.uk, who will work with you to access the relevant data.
The Supplement related to this article is available online at doi:10.5194/gmd-10-1-2017-supplement.

the NASA Langley Research Center Atmospheric Science Data
Center and the FLUXNET-MTE dataset was downloaded from http://www.bgc-jena.mpg.de/geodb/BGI/Home. We are grateful to CEDA (Centre for Environmental Data Analysis, NERC) and their Jasmin analysis platform (Lawrence et al., 2013) on which data analysis has been done. Finally, we thank the two anonymous reviewers, whose comments have helped to improve the manuscript.
Edited by: W. Hazeleger
Reviewed by: two anonymous referees
