ESMValTool (v1.0) – a community diagnostic and performance metrics tool for routine evaluation of Earth system models in CMIP

Abstract. A community diagnostics and performance metrics tool for the evaluation of Earth system models (ESMs) has been developed that allows for routine comparison of single or multiple models, either against predecessor versions or against observations. The priority of the effort so far has been to target specific scientific themes focusing on selected essential climate variables (ECVs); a range of known systematic biases common to ESMs, such as coupled tropical climate variability, monsoons, Southern Ocean processes, continental dry biases, and soil hydrology-climate interactions; as well as atmospheric CO2 budgets, tropospheric and stratospheric ozone, and tropospheric aerosols. The tool is being developed in such a way that additional analyses can easily be added. A set of standard namelists for each scientific topic reproduces specific sets of diagnostics or performance metrics that have demonstrated their importance in ESM evaluation in the peer-reviewed literature. The Earth System Model Evaluation Tool (ESMValTool) is a community effort open to both users and developers, encouraging open exchange of diagnostic source code and evaluation results from the Coupled Model Intercomparison Project (CMIP) ensemble. This will facilitate and improve ESM evaluation beyond the current state of the art and aims at supporting such activities within CMIP and at individual modelling centres. Ultimately, we envisage running the ESMValTool alongside the Earth System Grid Federation (ESGF) as part of a more routine evaluation of CMIP model simulations, while utilizing observations available in standard formats (obs4MIPs) or provided by the user.


Introduction
Earth system model (ESM) evaluation with observations or reanalyses is performed both to understand the performance of a given model and to gauge the quality of a new model, either against predecessor versions or a wider set of models. Over the past decades, the benefits of multi-model intercomparison projects such as the Coupled Model Intercomparison Project (CMIP) have been demonstrated. Since the beginning of CMIP in 1995, participating models have been further developed, with more complex and higher-resolution models joining in CMIP5 (Taylor et al., 2012), which supported the Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment Report (AR5) (IPCC, 2013). The main purpose of these internationally coordinated model experiments is to address outstanding scientific questions, to improve the understanding of climate, and to provide estimates of future climate change. Standardization of model output in a format that follows the Network Common Data Form (netCDF) Climate and Forecast (CF) Metadata Convention (http://cfconventions.org/) and collection of the model output on the Earth System Grid Federation (ESGF, http://esgf.llnl.gov/) have facilitated multi-model analyses. However, CMIP has historically lacked a common analysis tool that could operate directly on submitted model data and deliver a standard evaluation of models against observations.
An important new aspect in the next phase of CMIP (i.e. CMIP6; Eyring et al., 2015) is a more distributed organization under the oversight of the CMIP Panel, where a set of standard model experiments common across CMIP cycles (the Diagnostic, Evaluation and Characterization of Klima (DECK) experiments and the CMIP6 historical simulations) will be used to broadly characterize model performance and sensitivity to standard external forcing. Standardization, coordination, common infrastructure, and documentation functions that make the simulation results and their main characteristics available to the broader community are envisaged to be a central part of CMIP6. The Earth System Model Evaluation Tool (ESMValTool) presented here is a community development that can be used as one of the documentation functions in CMIP to help diagnose and understand the origin and consequences of model biases and inter-model spread. Our goal is to develop an evaluation tool that users can run to produce well-established analyses of the CMIP models once the output becomes available on the ESGF. This is realized through text files that we refer to as standard namelists, each calling a certain set of diagnostics and performance metrics to reproduce analyses that have been demonstrated to be of importance in ESM evaluation in previous peer-reviewed papers or assessment reports. Through this approach, routine and systematic evaluation of model results can be made more efficient. The framework enables scientists to focus on developing more innovative analysis methods rather than constantly having to "re-invent the wheel". An additional purpose of the ESMValTool is to facilitate model evaluation at individual modelling centres, in particular to rapidly assess the performance of a new model against predecessor versions. Righi et al. (2015) and Jöckel et al. (2016) have applied a subset of the namelists presented here to evaluate a set of simulations using different configurations of the global ECHAM/MESSy Atmospheric Chemistry model (EMAC). In this paper we also highlight the integration of the ESMValTool into modelling workflows, including models developed at NOAA's Geophysical Fluid Dynamics Laboratory (GFDL), the EMAC model, and the NEMO ocean model, through the use of the ESMValTool's reformatting routines.
In addition to standardized model output, the ESGF hosts observations for Model Intercomparison Projects (obs4MIPs; Ferraro et al., 2015; Teixeira et al., 2014) and reanalysis data (ana4MIPs, https://www.earthsystemcog.org/projects/ana4mips). The obs4MIPs and ana4MIPs projects provide the community with access to CMIP-like data sets (in terms of variables, temporal and spatial frequencies, and time periods) of satellite data and reanalyses, together with the corresponding technical documentation. The ESMValTool makes use of these observations as well as observations available from other sources to evaluate the models. In several of the diagnostics and metrics, more than one observational data set or meteorological reanalysis is used to account for uncertainties in observations. This is crucial for assessing model performance in a more robust and scientifically valid way.
For the model evaluation we apply diagnostics and, in several cases, also performance metrics. Diagnostics (e.g. the calculation of zonal means or derived variables in comparison to observations) provide a qualitative comparison of the models with observations. Performance metrics are defined as quantitative measures of agreement between a simulated and observed quantity which can be used to assess the performance of individual models or generations of models. Quantitative performance metrics are routinely calculated for numerical weather forecast models, but have been increasingly applied to atmosphere-ocean general circulation models (AOGCMs) or ESMs. Performance metrics used in these studies have mainly focused on climatological mean values of selected ECVs (Connolley and Bracegirdle, 2007; Pincus et al., 2008; Reichler and Kim, 2008), and only a few studies have developed process-based performance metrics (SPARC-CCMVal, 2010; Waugh and Eyring, 2008; Williams and Webb, 2009). The implementation of performance metrics in the ESMValTool enables a quantitative assessment of model improvements, both for different versions of individual ESMs and for different generations of model ensembles used in international assessments (e.g. CMIP5 versus CMIP6). Application of performance metrics to multiple models helps in highlighting when and where one or more models represent a particular process well. While quantitative metrics provide a valuable summary of overall model performance, they usually do not give information on how particular aspects of a model's simulation interact to determine the overall fidelity. For example, a model could simulate a mean state (and trend) in global mean surface temperature that agrees well with observations, but this could be due to compensating errors. To learn more about the sources of errors and uncertainties in models and thereby highlight specific areas requiring improvement, evaluation of the underlying processes and phenomena is necessary.
A range of diagnostics and performance metrics focusing on a number of key processes is also included in the ESMValTool. This paper describes ESMValTool version 1.0 (v1.0), which is the first release of the tool to the wider community for application and further development as open-source software. It demonstrates the use of the tool by showing example figures for each namelist for either all or a subset of CMIP5 models. Section 2 describes the technical aspects of the tool, and Sect. 3 the type of modelling and observational data currently supported by the ESMValTool (v1.0). In Sect. 4 an overview of the namelists of the ESMValTool (v1.0) is given along with their diagnostics and performance metrics and the variables and observations used. Section 5 describes the use of the ESMValTool in a typical model development cycle and evaluation workflow, and Sect. 6 closes with a summary and an outlook.

Brief overview of the ESMValTool
In this section we give a brief overview of the ESMValTool (v1.0) which is schematically depicted in Fig. 1. A detailed user's guide is provided in the Supplement.
The ESMValTool consists of a workflow manager and a number of diagnostic and graphical output scripts. It builds on a previously published diagnostic tool for chemistry-climate model evaluation (CCMVal-Diag Tool; Gettelman et al., 2012), but is different in its focus. In particular, it extends to ESMs by including diagnostics and performance metrics relevant for the coupled Earth system, and it focuses on evaluating models with a common set of diagnostics rather than on maximum flexibility, as the CCMVal-Diag tool does. In addition, several technical and structural changes have been made that facilitate development by multiple users. The workflow manager is written in Python, while multi-language support is provided in the diagnostic and graphic routines. The current version supports Python (www.python.org), the NCAR Command Language (NCL, 2016), and R (Ihaka and Gentleman, 1996), but it can be extended to other open-source languages. The ESMValTool is executed by invoking the main.py script, which takes a namelist as a single input argument. The namelists are text files written using the XML (eXtensible Markup Language) syntax and define the data to be read (models and observations), the variables to be analysed, and the diagnostics to be applied. The XML syntax has been chosen in order to allow users to express the relationship between these three elements (data, variables, and diagnostics) in a structured, easy-to-use way.
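As an illustration of the namelist concept, the sketch below builds and parses a minimal namelist with Python's standard xml.etree.ElementTree. The tag names and the model-line format are illustrative simplifications of the v1.0 schema, not an exact reproduction of it.

```python
# Minimal, illustrative namelist in the spirit of the ESMValTool XML format:
# a MODELS block listing data to read and a DIAGNOSTICS block pairing
# variables with diagnostic scripts.  Tag/attribute names are assumptions.
import xml.etree.ElementTree as ET

NAMELIST = """<namelist>
  <MODELS>
    <model>CMIP5 MPI-ESM-LR historical r1i1p1 1980 2005</model>
    <model>CMIP5 NorESM1-M historical r1i1p1 1980 2005</model>
  </MODELS>
  <DIAGNOSTICS>
    <diag>
      <variable>tas</variable>
      <diag_script cfg="cfg_perfmetrics.ncl">perfmetrics_main.ncl</diag_script>
    </diag>
  </DIAGNOSTICS>
</namelist>"""

root = ET.fromstring(NAMELIST)
models = [m.text for m in root.iter("model")]          # data to be read
variables = [v.text for v in root.iter("variable")]    # variables to analyse
scripts = [s.text for s in root.iter("diag_script")]   # diagnostics to apply
```

A workflow manager only needs this kind of traversal to know which data to fetch, which variables to derive, and which diagnostic scripts to dispatch.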
Within the workflow, the input data are checked for compliance with the CF and Climate Model Output Rewriter (CMOR, http://pcmdi.github.io/cmor-site/tables.html) standards required by the tool (see Sect. 3) via a set of dedicated reformatting routines, which are also able to fix the most common errors in the input data (e.g. wrong coordinates, undefined or missing values, non-compliant units). It is additionally possible to define new variables using variable-specific scripts, for example to calculate the total column ozone from a 3-D ozone field (tro3), temperature (ta), and surface pressure (ps). The diagnostic and graphic routines are written in a modular and flexible way so that they can be customized by the user via diagnostic-specific settings in the configuration file (cfg-file) and variable-specific settings (in the directory variable_defs/) without changing the source code. These routines are complemented by a set of libraries, providing general-purpose code for the most common operations (statistical analyses, regridding tools, graphic styles, etc.). The output of the tool can be both netCDF and graphics files in various formats. In addition, a log file is written containing all the information of a specific call of the main script: time and date of the call, version number, analysed data (models and observations), applied diagnostics and variables, and corresponding references. This helps to increase the traceability and reproducibility of the results.

Figure 1. Schematic overview of the ESMValTool (v1.0) structure. The primary input to the workflow manager is a user-configurable text namelist file (orange). Standardized libraries/utilities (purple) available to all diagnostics scripts are handled through common interface scripts (blue). The workflow manager runs diagnostic scripts (red) that can be written in several freely available scripting languages. The output of the ESMValTool (grey) includes figures, binary files (netCDF), and a log file with a list of relevant references and processed input files for each diagnostic.
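The derived-variable mechanism, for example computing total column ozone from the 3-D ozone field and surface pressure, can be illustrated with a simplified hydrostatic column integration. This is a sketch under stated assumptions, not the tool's actual variable script (which also uses ta and operates on full 3-D CMOR fields).

```python
# Simplified sketch of deriving total column ozone (Dobson units) from an
# ozone mole-fraction profile, assuming hydrostatic balance and a dry-air
# molar mass.  The two-layer example profile is purely illustrative.
G = 9.81          # gravity, m s-2
M_AIR = 28.97e-3  # molar mass of dry air, kg mol-1
N_A = 6.022e23    # Avogadro's number, molecules mol-1
DU = 2.687e20     # molecules m-2 per Dobson unit

def total_column_ozone(vmr, p_bounds):
    """vmr: ozone mole fraction per layer; p_bounds: layer pressure bounds
    (Pa), length len(vmr) + 1, ordered surface to top."""
    column = 0.0
    for i, x in enumerate(vmr):
        dp = p_bounds[i] - p_bounds[i + 1]      # layer thickness in Pa
        column += x * dp / (G * M_AIR) * N_A    # molecules m-2 in this layer
    return column / DU                          # convert to Dobson units

# A crude two-layer profile: ~2 ppm "stratosphere" above ~50 ppb below,
# which yields a column of a realistic order of magnitude (a few hundred DU).
toz = total_column_ozone([5e-8, 2e-6], [101325.0, 20000.0, 1000.0])
```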
To facilitate the development of new namelists and diagnostics by multiple developers from various institutions while preserving code quality and reliability, an automated testing framework is included in the package. This allows the developers to verify that modifications and new code are compatible with the existing code and do not change the results of existing diagnostics. Automated testing within the ESMValTool is implemented on two complementary levels:

- Unit tests are used to verify that small code units (e.g. functions/subroutines) provide the expected results.

- Integration testing is used to verify that a diagnostic integrates well into the ESMValTool framework and provides the expected results. This is verified by comparison of the results against a set of reference data generated during the implementation of the diagnostic.

Each diagnostic is expected to produce a set of well-defined results, i.e. files of various formats and types (e.g. graphics, data files, ASCII files). To test the results of a diagnostic, a special namelist file is executed by the ESMValTool which runs the diagnostic on a limited set of test data only, minimizing execution time for testing while ensuring that the diagnostic produces the correct results. The tests implemented include:

- File availability: a check that all required output data have been successfully generated by the diagnostic. A missing file is always an indicator of a failure of the program.

- File checksum: currently the MD5 checksum is used to verify that the contents of a file are unchanged.

- Graphics check: for graphics files an additional test is implemented which verifies that two graphical outputs are identical. This is particularly useful to verify that the outputs of a diagnostic remain the same after code changes.
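The first two generic checks (file availability and MD5 checksum against a stored reference) can be sketched in a few lines of standard-library Python; the file name and helper names here are illustrative, not the actual test harness.

```python
# Sketch of the file-availability and MD5-checksum checks described above.
# A reference checksum is recorded once, when the diagnostic is implemented,
# and later runs are compared against it.
import hashlib
import os
import tempfile

def md5sum(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def check_output(path, reference_md5):
    if not os.path.exists(path):        # missing file => diagnostic failed
        return False
    return md5sum(path) == reference_md5

# Usage sketch, with a throwaway file standing in for a diagnostic's output:
with tempfile.TemporaryDirectory() as d:
    out = os.path.join(d, "diag_output.nc")
    with open(out, "wb") as f:
        f.write(b"fake netCDF payload")
    ref = md5sum(out)                   # recorded at implementation time
    ok = check_output(out, ref)         # later run: content unchanged?
```

Bitwise comparison is strict: any change in metadata or compression settings also changes the checksum, which is exactly the property wanted for regression testing.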
Unit tests are implemented for each diagnostic independently using nose (https://nose.readthedocs.org/en/latest/). Test files are searched recursively and executed, and statistics on successes and failures are provided at the end of the execution. In order to run integration tests for each diagnostic, a small script needs to be written once. As for the unit tests, a summary of successes and failures is provided as output (see the Supplement for details). For the documentation of the code, Sphinx (http://sphinx-doc.org/) is used to organize and format the ESMValTool documentation, including text extracted from the source code. Sphinx can create documentation in a variety of formats, including HTML, LaTeX (and hence printable PDF), manual pages, and plain text. Sphinx was originally developed for documenting Python code, and one of its features is the capability, using the so-called autodoc extension, to extract documentation strings from Python source files and use them in the documentation it generates. This feature does not exist for NCL source files (such as those used in the ESMValTool), but it has been mimicked here via a Python script, which walks through a subset of the ESMValTool NCL scripts, extracts function names, argument lists, and descriptions (from the comments immediately following the function definition), and assembles them in a subdirectory for use with Sphinx. The documentation includes a listing of the functions, procedures, and plotting routines in order to encourage the reuse of existing code in multiple namelists.
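The autodoc-mimicking extraction for NCL can be sketched as below. The regular expression and the comment convention (";;" lines immediately after the function header) are assumptions made for illustration, not the actual ESMValTool script.

```python
# Sketch of extracting function names, argument lists, and the comment block
# following each definition from NCL source, for later rendering with Sphinx.
import re

NCL_SOURCE = """
undef("area_mean")
function area_mean(field, lat, wgt)
;; Computes the area-weighted mean of field.
;; Arguments: field, lat, wgt as described in the header.
begin
  return(dim_avg_wgt(field, wgt, 1))
end
"""

def extract_docs(src):
    docs = []
    # 'function name(args)' followed by zero or more ';;' comment lines
    pattern = re.compile(r"function\s+(\w+)\s*\(([^)]*)\)\s*\n((?:;;.*\n)*)")
    for name, args, comments in pattern.findall(src):
        doc = " ".join(line.lstrip("; ").strip()
                       for line in comments.splitlines())
        docs.append({"name": name, "args": args, "doc": doc})
    return docs

docs = extract_docs(NCL_SOURCE)
```

The resulting dictionaries can be dumped as reStructuredText stubs, which is all Sphinx needs to build the function listing.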

Models and observations
The open-source release of ESMValTool (v1.0) that accompanies this paper is intended to work with CMIP5 model output, but the tool is compatible with any arbitrary model output, provided that it is in CF-compliant netCDF format and that the variables and metadata follow the CMOR tables and definitions. The namelists are designed such that it is straightforward to execute the same diagnostics with either CMIP DECK or CMIP6 model output rather than CMIP5 output, and these will be provided when the new simulations are available. As mentioned in the previous section, routines are provided for checking CF/CMOR compliance and fixing the most common minor flaws in the model output submitted to CMIP5. More substantial deviations from the required standards in the model output may be corrected via project- and model-specific procedures defined by the user and automatically applied within the workflow. The current reformatting routines are, however, not able to convert arbitrary model output to the full CF/CMOR standard. In this case, it is the responsibility of the individual modelling groups to perform that conversion. Currently, model-specific reformatting routines are provided for EMAC (Jöckel et al., 2010, 2016), the GFDL CM3 and ESM models (Donner et al., 2011; Dunne et al., 2012, 2013), and for NEMO (Madec, 2008), which is the ocean model used, for example, in EC-Earth (Hazeleger et al., 2012). Users can develop similar reformatting routines specific to their model using the existing reformatting routines for other models as a template. This allows the tool to run directly on the original model output rather than requiring the model output to be reformatted to CF/CMOR beforehand.
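A model-specific reformatting routine typically renames native variables to their CMOR names and converts units. The sketch below shows the idea for precipitation (mm day-1 to the CMOR unit kg m-2 s-1); the native variable name APRC and the dict-based stand-in for a netCDF file are hypothetical.

```python
# Sketch of a model-specific "fix" in the spirit of the reformatting routines:
# rename a (hypothetical) native variable to its CMOR name and convert units.
# 1 mm of water over 1 m2 weighs 1 kg, so mm day-1 / 86400 = kg m-2 s-1.
SECONDS_PER_DAY = 86400.0

def fix_precipitation(dataset):
    native = dataset.pop("APRC")        # hypothetical native variable name
    dataset["pr"] = {                   # CMOR name for precipitation flux
        "units": "kg m-2 s-1",
        "values": [v / SECONDS_PER_DAY for v in native["values"]],
    }
    return dataset

ds = fix_precipitation({"APRC": {"units": "mm day-1",
                                 "values": [8.64, 86.4]}})
```

The real routines additionally repair coordinates and metadata, but the rename-and-rescale pattern shown here is the core of most fixes.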
The observations are organized in tiers. Where available, observations from the obs4MIPs and reanalyses from the ana4MIPs archives at the ESGF are used in the ESMValTool. These data sets form "Tier 1". Tier 1 data are freely available for download and can be used directly by the tool, since they are formatted following the CF/CMOR standard and do not need any additional processing. For other observational data sets, the user has to retrieve the data from their respective source and reformat them into the CF/CMOR standard. To facilitate this task, we provide specific reformatting routines for a large number of such data sets together with detailed information on the data source, as well as download and processing instructions (see Table 1). "Tier 2" includes other freely available data sets and "Tier 3" includes restricted data sets (e.g. requiring the user to accept a license agreement issued by the data owner). For Tier 2 and 3 data, links and help scripts are provided, so that these observations can be easily retrieved from their respective sources and processed by the user. A collection of all observational data used in ESMValTool (v1.0) is hosted at DLR and the ESGF nodes at BADC and DKRZ, but depending on the license terms of the observations these might not be publicly available.

Overview of namelists included in the ESMValTool (v1.0)
A number of namelists have been included in the ESMValTool (v1.0) that group a set of performance metrics and diagnostics for a given scientific topic. Namelists that focus on the evaluation of a physical climate process for, respectively, the atmosphere, ocean, and land surface are presented in Sects. 4.1, 4.2, and 4.3. These can be applied to simulations with prescribed sea surface temperatures (SSTs) (i.e. AMIP runs) or the CMIP5 historical simulations (simulations for 1850 to the present day conducted with the best estimates of natural and anthropogenic climate forcing) that are run by either coupled AOGCMs or ESMs. Another set of namelists has been developed to evaluate biogeochemical biases present in ESMs when additional components of the Earth system such as the carbon cycle, atmospheric chemistry, or aerosols are simulated interactively (Sects. 4.4 and 4.5 for carbon cycle and aerosols/chemistry, respectively).
In each subsection, we first scientifically motivate the inclusion of the namelist by reviewing the main systematic biases in current ESMs and their importance and implications. We then give an overview of the namelists that can be used to evaluate such biases, along with the diagnostics and performance metrics included and the required variables and corresponding observations used in ESMValTool (v1.0). For each namelist we provide one or two example figures produced for either all or a subset of the CMIP5 models. An assessment of CMIP5 models is, however, not the focus of this paper. Rather, we attempt to illustrate how the namelists contained within ESMValTool (v1.0) can facilitate the development and evaluation of climate model performance in the targeted areas. Therefore, the results of each figure are only briefly described in each figure caption. Table 1 provides a summary of all namelists included in ESMValTool (v1.0) along with information on the quantities and ESMValTool variable names for which the namelist is tested, the corresponding observations or reanalyses, the section and example figure in this paper, and references for the namelist. Table 2 then provides an overview of the diagnostics included for each namelist along with specific calculations, the plot type, settings in the configuration file (cfg-file), and comments.

Table 1. Overview of standard namelists implemented in ESMValTool (v1.0) along with the quantity and ESMValTool variable name for which the namelist is tested, the corresponding observations or reanalyses, the section and example figure in this paper, and references for the namelist. When a namelist is named after a specific paper (naming convention: namelist_SurnameYearJournalabbreviation.xml), it can be used to reproduce in general all, or in some cases only a subset, of the figures published in that paper. Otherwise the namelists group a set of diagnostics and performance metrics for a specific scientific topic (e.g. namelist_aerosol_CMIP5.xml). Observations and reanalyses are listed together with their tier, type (e.g. reanalysis, satellite or in situ observations), the time period used, and a reference. Tier 1 includes observations from obs4MIPs or reanalyses from ana4MIPs. Tier 2 and Tier 3 indicate freely available and restricted data sets, respectively. For these observations, reformatting routines are provided to bring the original data into the CF/CMOR standard format so that they can be used directly in the ESMValTool. [The table body, listing each namelist with its variables, observational data sets, and references, is not reproduced here.]

Quantitative performance metrics for atmospheric ECVs
A starting point for the calculation of performance metrics is to assess the representation of simulated climatological mean states and the seasonal cycle for essential climate variables (ECVs; GCOS, 2010). This is supported by a large observational effort to deliver long-term, high-quality observations from different platforms and instruments (e.g. obs4MIPs and the ESA Climate Change Initiative (CCI, http://cci.esa.int/)) and ongoing efforts to improve global reanalysis products (e.g. ana4MIPs). Similar to Fig. 9.7 of Flato et al. (2013), a namelist has been implemented in the ESMValTool that produces a "portrait diagram" by calculating the relative space-time root-mean-square error (RMSE) from the climatological mean seasonal cycle of historical simulations for selected variables [namelist_perfmetrics_CMIP5.xml]. In Fig. 2 the relative space-time RMSE for the CMIP5 historical simulations against a reference observation and, where available, an alternative observational data set is shown. The overall mean bias can additionally be calculated, and adding other statistical metrics is straightforward. Different normalizations (mean, median, centred median) can be chosen, and the multi-model mean/median can also be added. In order to calculate the RMSE, the data are regridded to a common grid using a bilinear interpolation method. The user can select which grid to use as a target grid. The results shown in this section have been obtained after regridding the data to the grid of the reference data set. With this namelist it is also possible to perform more in-depth analyses of the ECVs by calculating seasonal cycles, Taylor diagrams (Taylor, 2001), zonally averaged vertical profiles, and latitude-longitude maps.
In the latter two cases, it is also possible to produce difference plots between a given model and a reference (usually the observational data set) or between two versions of the same model, and to apply a statistical test to highlight significant differences. As an example, Fig. 3 (left panel) shows the zonal profile of seasonal mean temperature differences between the MPI-ESM-LR model (Giorgetta et al., 2013) and ERA-Interim reanalysis (Dee et al., 2011), and Fig. 3 (right panel) a Taylor diagram for temperature at 850 hPa for CMIP5 models compared to ERA-Interim. A similar analysis can be performed with namelist_righi15gmd_ECVs.xml, which reproduces the ECV plots of Righi et al. (2015) for a set of EMAC simulations.
Tested variables in ESMValTool (v1.0) that are shown in Fig. 2 are selected levels of temperature (ta), eastward (ua) and northward wind (va), geopotential height (zg), and specific humidity (hus), as well as near-surface air temperature (tas), precipitation (pr), all-sky longwave (rlut) and shortwave (rsut) radiation, longwave (LW_CRE) and shortwave (SW_CRE) cloud radiative effects, and aerosol optical depth (AOD) at 550 nm (od550aer). The models are evaluated against a wide range of observations and reanalysis data: ERA-Interim and NCEP (Kistler et al., 2001) for temperature, winds, and geopotential height, AIRS (Aumann et al., 2003) for specific humidity, CERES-EBAF for radiation (Wielicki et al., 1996), the Global Precipitation Climatology Project (GPCP, Adler et al., 2003) for precipitation, and the Moderate Resolution Imaging Spectroradiometer (MODIS, Shi et al., 2011) and the ESA CCI aerosol data (Kinne et al., 2015) for AOD. Additional observations or reanalyses for these variables can be provided by the user and easily added. The tool can also be applied to additional variables if the required observations are made available in an ESMValTool-compatible format (see Sect. 2 and the Supplement).
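The relative error shown in a portrait diagram can be sketched as follows, here using the median normalization mentioned above. The toy one-dimensional fields and the unweighted RMSE are illustrative; the actual routine operates on area-weighted, regridded space-time climatologies.

```python
# Sketch of the relative space-time RMSE behind a "portrait diagram": each
# model's RMSE against observations, normalized by the median RMSE over all
# models, so that negative values mean better-than-median performance.
import math
import statistics

def rmse(model, obs):
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))

def relative_errors(models, obs):
    errors = {name: rmse(field, obs) for name, field in models.items()}
    median = statistics.median(errors.values())
    return {name: (e - median) / median for name, e in errors.items()}

obs = [0.0, 1.0, 2.0, 3.0]
rel = relative_errors({
    "model_a": [0.1, 1.1, 2.1, 3.1],   # uniformly off by 0.1
    "model_b": [0.3, 1.3, 2.3, 3.3],   # uniformly off by 0.3
    "model_c": [0.2, 1.2, 2.2, 3.2],   # uniformly off by 0.2 (the median)
}, obs)
```

Each cell of the portrait diagram is one such relative error, computed per variable and per model.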

Multi-model mean bias for temperature and precipitation
Near-surface air temperature (tas) and precipitation (pr) are the two variables most commonly requested by users of ESM simulations. Often, diagnostics for tas and pr are shown for the multi-model mean of an ensemble. Both of these variables are the end result of numerous interacting processes in the models, making it challenging to understand and improve biases in these quantities. For example, near-surface air temperature biases depend on the models' representation of radiation, convection, clouds, land characteristics, surface fluxes, as well as atmospheric circulation and turbulent transport (Flato et al., 2013), each with their own potential biases that may either augment or oppose one another. The namelist_flato13ipcc.xml reproduces a subset of the figures from the climate model evaluation chapter of IPCC AR5 (Chapter 9, Flato et al., 2013). This namelist will be further developed and a more complete version included in future releases. The diagnostic that calculates the multi-model mean bias compared to a reference data set is part of this namelist and reproduces Figs. 9.2 and 9.4 of Flato et al. (2013). Figure 4 shows the CMIP5 multi-model average as absolute values and as biases relative to ERA-Interim and the GPCP data for the annual mean surface air temperature and precipitation, respectively. Model output is regridded using bilinear interpolation to the reanalysis or observational grid by default, but alternative options that can be set in the cfg-file include regridding of the data to the lowest- or highest-resolution grid in the entire input data set. Such figures can also be produced for individual seasons as well as for a single model simulation or other 2-D variables if suitable observations are provided.
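The core of the multi-model mean bias diagnostic can be sketched as a point-by-point average of the models followed by subtraction of the reference field. Flat lists stand in for the regridded latitude-longitude fields; the real routine additionally handles masks and area weighting.

```python
# Sketch of the multi-model mean bias: average the (already regridded) model
# fields point by point, then subtract the reference (e.g. a reanalysis).
def multi_model_mean_bias(model_fields, reference):
    n_models = len(model_fields)
    mean = [sum(field[i] for field in model_fields) / n_models
            for i in range(len(reference))]
    return [m - r for m, r in zip(mean, reference)]

# Two toy models and a reference on a two-point "grid", temperatures in K:
bias = multi_model_mean_bias([[288.0, 290.0], [290.0, 294.0]],
                             [288.0, 291.0])
```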

Monsoon
Monsoon systems represent the dominant seasonal climate variation in the tropics, with profound socio-economic impacts. Current ESMs still struggle to capture the major features of both the South Asian summer monsoon (SASM, Sect. "South Asian summer monsoon (SASM)") and the West African monsoon (WAM, Sect. "West African Monsoon Diagnostics"). Sperber et al. (2013) and Roehrig et al. (2013) provide comprehensive assessments of the ability of CMIP5 models to represent these two monsoon systems. By implementing diagnostics from these two studies into ESMValTool (v1.0), we aim to facilitate continuous monitoring of progress in simulating the SASM and WAM systems in ESMs.

South Asian summer monsoon (SASM)
While individual models vary in their simulations of the SASM, there are known biases in ESMs that span a range of temporal and spatial scales. The namelists in the ESMValTool are targeted toward analysing these biases in a systematic way. Climatological mean biases include excess precipitation over the equatorial Indian Ocean, too little precipitation over the Indian subcontinent, and excess precipitation over orography such as the southern slopes of the Himalayas (Annamalai et al., 2007; Bollasina and Nigam, 2009; Sperber et al., 2013); see also Fig. 4. The monsoon onset is typically too late in the models, and the boreal summer intraseasonal oscillation (BSISO), which has a particularly large socio-economic impact in South Asia, is often weak or not present (Sabeerali et al., 2013). Monsoon low-pressure systems, which generate many of the most intense rain events during the monsoon (Krishnamurthy and Misra, 2011), are often too infrequent and weak (Stowasser et al., 2009). In coupled models, biases in SSTs, evaporation, precipitation, and air-sea coupling are common (Bollasina and Nigam, 2009) and have been shown to affect both present-day simulations and future projections (Levine et al., 2013). Interannual teleconnections with El Niño-Southern Oscillation (ENSO, Lin et al., 2008) and the Indian Ocean Dipole (Ashok et al., 2004; Cherchi and Navarra, 2013) are also not well captured (Turner et al., 2005).
Three SASM namelists for the basic climatology, seasonal cycle, intraseasonal and interannual variability, and key teleconnections have been implemented in the ESMValTool, focusing on SASM rainfall and horizontal winds in June-September (JJAS) [namelist_SAMonsoon.xml, namelist_SAMonsoon_AMIP.xml, namelist_SAMonsoon_daily.xml]. Rainfall and wind climatologies, including their pattern correlations and RMSE against observations, are similar to the metrics proposed by the Climate Variability and Predictability (CLIVAR) Asian-Australian Monsoon Panel (AAMP) Diagnostics Task Team and used by Sperber et al. (2013). Diagnostics for determining global monsoon domains and intensity follow the definition of Wang et al. (2012), where the global precipitation intensity is calculated from the difference between the hemispheric summer (May-September in the Northern Hemisphere, November-March in the Southern Hemisphere) and winter (vice versa) mean values, and the global monsoon domain is defined by those areas where the precipitation intensity exceeds 2.0 mm day−1 and the summer precipitation is > 0.55 × the annual precipitation (Fig. 5). Seasonal cycle diagnostics include monthly rainfall over the Indian region (5-30° N, 65-95° E) and dynamical indices based on wind shear (Goswami et al., 1999; Wang and Fan, 1999; Webster and Yang, 1992). Figure 6 shows examples of the seasonal cycle of area-averaged Indian rainfall from selected CMIP5 models and their AMIP counterparts. The namelists include diagnostics to calculate maps of interannual standard deviation of JJAS rainfall and horizontal winds at 850 and 200 hPa, and maps of teleconnection diagnostics between Niño3.4 SSTs (defined by the region 190-240° E, 5° S-5° N) and JJAS precipitation across the monsoon region (30° S-30° N, 40-300° E) following Sperber et al. (2013). To generate difference maps, data are first regridded using an area-conservative binning and using the lowest-resolution grid as a target. For atmosphere-only models, we also evaluate their ability to represent year-to-year monsoon variability directly against time-equivalent observations to check whether models, given correct interannual SST forcing, can reproduce observed year-to-year variations and significant events occurring in particular years.

Zonally averaged temperature profile difference between MPI-ESM-LR and the ERA-Interim reanalysis data with masked non-significant values. MPI-ESM-LR has generally small biases in the troposphere (< 1-2 K), but a cold bias in the tropopause region that is particularly strong in the extratropical lower stratosphere. This is a systematic bias present in many of the CMIP3 and CCMVal models (IPCC, 2007; SPARC-CCMVal, 2010), related to an overestimation of the water vapour concentrations in that region. Right panel: Taylor diagram for temperature at 850 hPa from CMIP5 models compared with ERA-Interim (reference observation-based data set) and NCEP (alternate observation-based data set), showing a very high correlation of R > 0.98 with the reanalyses, demonstrating very good performance in this quantity. Both figures produced with namelist_perfmetrics_CMIP5.xml.
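The Wang et al. (2012) intensity and domain definition described above can be sketched as follows. This is an illustrative reading of the text, not the ESMValTool NCL implementation; the function name, the simple hemispheric split, and the use of monthly means (rather than totals) for the 0.55 criterion are our own assumptions.

```python
import numpy as np

def monsoon_intensity_and_domain(pr, lat):
    """Global monsoon intensity and domain following Wang et al. (2012).

    pr  : (12, nlat, nlon) climatological monthly-mean precipitation
          in mm/day, axis 0 ordered Jan..Dec.
    lat : (nlat,) latitudes in degrees north.
    """
    mjjas = pr[[4, 5, 6, 7, 8]].mean(axis=0)    # May-Sep mean
    ndjfm = pr[[10, 11, 0, 1, 2]].mean(axis=0)  # Nov-Mar mean
    annual = pr.mean(axis=0)
    nh = (lat >= 0.0)[:, None]                  # broadcast over longitude
    local_summer = np.where(nh, mjjas, ndjfm)   # hemispheric summer mean
    local_winter = np.where(nh, ndjfm, mjjas)   # hemispheric winter mean
    intensity = local_summer - local_winter     # monsoon precipitation intensity
    # Domain: intensity > 2.0 mm/day and summer rain > 0.55 x annual rain
    domain = (intensity > 2.0) & (local_summer > 0.55 * annual)
    return intensity, domain
```

Note that this sketch applies no area weighting; the actual diagnostic operates on the regridded model fields.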
This evaluation is done by plotting the time series across specified years of standardized anomalies (normalized by climatology) of JJAS-averaged dynamical indices and area-averaged JJAS precipitation over the Indian region (defined above) for both the models and observations. Namelists for intraseasonal variability include maps of standard deviation of 30-50-day filtered daily rainfall, with area-averaged values for key regions including the Bay of Bengal (10-20° N, 80-100° E) and the eastern equatorial Indian Ocean (10° S-10° N, 80-100° E) given in the plot titles. To illustrate the northward and eastward propagation of the BSISO, Hovmöller lag-longitude and lag-latitude diagrams show the anomalies of 30-80-day filtered daily rainfall, either latitude-averaged (10° S-10° N) and plotted for 60-160° E, or longitude-averaged (80-100° E) and plotted for 10° S-30° N, correlated against intraseasonal precipitation at the Indian Ocean reference point (75-100° E, 10° S-5° N). These use a slightly modified (for season, region, and filtering band) version of the existing Madden-Julian Oscillation (MJO) NCL scripts, available at https://www.ncl.ucar.edu/Applications/mjoclivar.shtml, that are based on the recommendations from the US CLIVAR MJO Working Group and are similar to those shown in Lin et al. (2008) and used in Sect. "Madden-Julian Oscillation (MJO)" for the MJO.

Figure 4. Annual-mean surface air temperature (upper row) and precipitation rate (mm day−1, lower row) for the period 1980-2005. The left panels show the multi-model mean and the right panels the bias as the difference between the CMIP5 multi-model mean and the climatology from ERA-Interim (Dee et al., 2011) and the Global Precipitation Climatology Project (Adler et al., 2003) for surface air temperature and precipitation rate, respectively. The multi-model mean near-surface temperature agrees with ERA-Interim mostly within ±2 °C. Larger biases can be seen in regions with sharp gradients in temperature, for example in areas with high topography such as the Himalaya, the sea ice edge in the North Atlantic, and over the coastal upwelling regions in the subtropical oceans. Biases in the simulated multi-model mean precipitation include too low precipitation along the Equator in the western Pacific and too high precipitation amounts in the tropics south of the Equator. Similar to Figs. 9.2 and 9.4 of Flato et al. (2013) and produced with namelist_flato13ipcc.xml.
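The lag-correlation computation underlying the Hovmöller propagation diagrams can be sketched as below. This is a simplified stand-in for the NCL mjoclivar scripts; the bandpass filtering and spatial averaging are assumed to have been applied already, and the function name is our own.

```python
import numpy as np

def lag_correlation(ref, field, max_lag):
    """Correlate a reference time series ref (ntime,) with each column of
    a filtered anomaly field (ntime, nspace) at lags -max_lag..+max_lag.
    Positive lag means the field lags the reference; row i of the result
    corresponds to lag i - max_lag, as plotted in a Hovmoeller diagram."""
    ntime, nspace = field.shape
    out = np.empty((2 * max_lag + 1, nspace))
    for i, lag in enumerate(range(-max_lag, max_lag + 1)):
        if lag < 0:
            a, b = ref[-lag:], field[:ntime + lag]
        else:
            a, b = ref[:ntime - lag], field[lag:]
        a = a - a.mean()
        b = b - b.mean(axis=0)
        # Pearson correlation per spatial column at this lag
        out[i] = (a[:, None] * b).sum(axis=0) / (len(a) * a.std() * b.std(axis=0))
    return out
```

A field that trails the reference by 5 days then shows its correlation maximum in the row for lag +5, which is how the eastward/northward propagation appears as a tilted band in the diagrams.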
Tested variables in ESMValTool (v1.0), some of which are illustrated in Figs. 5 and 6, include precipitation (pr), eastward (ua) and northward wind (va) at various levels, and skin temperature (ts). The primary reference data sets are ERA-Interim for horizontal winds, Tropical Rainfall Measuring Mission 3B43 version 7 (TRMM-3B43-v7; Huffman et al., 2007) for rainfall, and HadISST (Rayner et al., 2003) for SST, although the models are evaluated against a wide range of other observational precipitation data sets (see Table 1) and an alternate reanalysis data set: the Modern-Era Retrospective Analysis for Research and Applications (MERRA; Rienecker et al., 2011).

West African monsoon diagnostics
West Africa and the Sahel are highly dependent on seasonal rainfall associated with the WAM. Rainfall in the region exhibits strong inter-decadal variability (Nicholson et al., 2000), with major socio-economic impacts (Held et al., 2005). Projecting the future response of the WAM to increasing concentrations of greenhouse gases (GHG) is therefore of critical importance, as is the ability to make dependable forecasts of the WAM evolution on monthly to seasonal timescales. Current ESMs exhibit biases in their representation of both the mean state (Cook and Vizy, 2006; Roehrig et al., 2013) and temporal variability (Biasutti, 2013) of the WAM. Such biases can affect the skill of monthly to seasonal predictions of the WAM as well as long-term future projections. CMIP5 coupled models often exhibit warm SST biases in the equatorial Atlantic, which induce a southward shift of the WAM in summer (Richter et al., 2014). Because of the zonal symmetry, the 10° W-10° E meridional transect of any geophysical variable (see below) is particularly informative with respect to the main features of the WAM and their representation in climate models (Redelsperger et al., 2006). For instance, the JJAS-averaged Sahel rainfall has a large inter-model spread, with biases of up to ±50 % of the observed value (Cook and Vizy, 2006; Roehrig et al., 2013). Differences in simulated surface air temperatures are large over the Sahel and Sahara, with deficiencies in the Saharan heat low inducing feedback errors on the WAM structure. Here, a correct simulation of the surface energy balance is critical, with biases related to the representation of clouds, aerosols, and surface albedo (Roehrig et al., 2013). The seasonal cycle also shows large inter-model spread, pointing to deficiencies in the representation of key processes important for the seasonal dynamics of the WAM. Daily precipitation is highly intermittent over the Sahel, mainly caused by a few intense mesoscale convective systems during the monsoon season (Mathon et al., 2002). Intense mesoscale convective systems over Africa as well as the diurnal cycle of the WAM are still a challenge for most climate models (Roehrig et al., 2013). Improving the simulation of the WAM in climate models is therefore urgently needed.

Figure 5. Monsoon precipitation intensity (upper panels) and monsoon precipitation domain (lower panels) for TRMM and an example of deviations from observations for three CMIP5 models (EC-Earth, HadGEM2-ES, and GFDL-ESM2M). The models have difficulties representing the eastward extent of the monsoon domain over the South China Sea and western Pacific, and several models (e.g. HadGEM2-ES) underestimate the latitudinal extent of most of the monsoon regions. The monsoon precipitation intensity tends to be underestimated in the South Asian, East Asian and Australian monsoon regions, while in the African and American monsoon regions the sign of the intensity bias varies between models. Similar to Fig. 9.32 of Flato et al. (2013) and produced with namelist_SAMonsoon.xml.
To evaluate key aspects of the WAM, two namelists have been implemented in the ESMValTool (v1.0): namelist_WAMonsoon.xml and namelist_WAMonsoon_daily.xml. These include maps and meridional transects (averages over 10° W to 10° E) that provide a climatological picture of the summer (JJAS) WAM structure: (i) precipitation (pr) for the mean position of the WAM, (ii) near-surface air temperature (tas) for biases in the Atlantic cold tongue and the Saharan heat low, (iii) horizontal winds (ua, va) for the mean position and intensity of the monsoon flow at 925 hPa and of the mid- (700 hPa) and upper-level (200 hPa) jets. The surface and top of the atmosphere (TOA) radiation budgets provide a picture of the radiative fluxes associated with the WAM. Figure 7 shows the meridional transect of summer-averaged precipitation over West Africa for a range of CMIP5 models as an example of this namelist. The diagnostic for the mean seasonal cycle of precipitation is also provided to evaluate the WAM onset and withdrawal.

Figure 6. […] (models: 1980-2004; observations: 1998-2010). The grey area in each panel indicates the standard deviation from the model mean, to indicate the spread between models (observations/reanalyses are not included in this spread). These illustrate the range of rainfall simulated particularly in AMIP experiments where there is no feedback between precipitation and SST biases that might moderate the rainfall biases (Bollasina and Ming, 2013; Levine et al., 2013). Some of the CMIP5 coupled models (e.g. HadGEM2-ES, IPSL-CM5A-MR) show a delayed monsoon onset that is not apparent in their AMIP configurations. This is related to cold SST biases in the Arabian Sea which develop during boreal winter and spring (Levine et al., 2013). Produced with namelist_SAMonsoon.xml.
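The 10° W-10° E meridional transects used throughout this section amount to a masked longitude average; a minimal sketch (our own function, assuming a regular latitude-longitude grid so that no area weighting is needed):

```python
import numpy as np

def meridional_transect(field, lon, lon_min=-10.0, lon_max=10.0):
    """Average a (nlat, nlon) field over 10 W-10 E to obtain the WAM
    meridional transect. Longitudes in degrees east, range -180..180."""
    mask = (lon >= lon_min) & (lon <= lon_max)
    return field[:, mask].mean(axis=1)
```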
Finally, a set of diagnostics for the WAM intraseasonal variability evaluates the ability of models to capture variability of precipitation on timescales associated with African easterly waves (3-10 days), the MJO (25-90 days) and, more broadly, the WAM intraseasonal variability (1-90 days). The strong day-to-day intermittency of precipitation is also diagnosed using maps of 1-day autocorrelation of intraseasonal precipitation anomalies (Roehrig et al., 2013). To perform the autocorrelation analysis, data are first regridded to a common 1° × 1° grid using a bilinear interpolation method, whereas for generating difference maps the same regridding method as for the SASM diagnostics is used (see Sect. "South Asian summer monsoon (SASM)"). Observations for evaluation are based on the following data sets: GPCP version 2.2 and Tropical Rainfall Measuring Mission 3B43 version 7 (TRMM-3B43-v7; Huffman et al., 2007) precipitation retrievals, Clouds and the Earth's Radiant Energy System (CERES) Energy Balanced and Filled (EBAF) edition 2.6 radiation estimates (Loeb et al., 2009), NOAA daily TOA outgoing longwave radiation (Liebmann and Smith, 1996), and the ERA-Interim reanalysis for the dynamics.
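The 1-day autocorrelation intermittency diagnostic can be sketched per grid point as below (a minimal illustration of the Roehrig et al. (2013) quantity; the anomalies are assumed to be already filtered and regridded, and the function name is our own):

```python
import numpy as np

def lag1_autocorrelation(anom):
    """Lag-1 (1-day) autocorrelation of daily precipitation anomalies,
    anom shaped (ntime, nlat, nlon), computed independently at each grid
    point; low values indicate strong day-to-day intermittency."""
    x = anom - anom.mean(axis=0)
    return (x[:-1] * x[1:]).sum(axis=0) / (x ** 2).sum(axis=0)
```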

Natural modes of climate variability

NCAR climate variability diagnostics package
Modes of natural climate variability on interannual to multi-decadal timescales are important as they have large impacts on the regional and even global climate, with attendant socio-economic impacts. Characterization of internal (i.e. unforced) climate variability is also important for the detection and attribution of externally forced climate change signals (Deser et al., 2012). Internally generated modes of variability also complicate model evaluation and intercomparison. As these modes are spontaneously generated, they need not exhibit the same chronological sequence in models as in nature. However, their statistical properties (e.g. timescale, autocorrelation, spectral characteristics, and spatial patterns) are captured with varying degrees of skill among climate models. Despite their importance, systematic evaluation of these modes remains a daunting task given the wide time range to consider, the length of the data record needed to adequately characterize them, the importance of sub-surface oceanic processes, and uncertainties in the observational records (Deser et al., 2010).
In order to assess natural modes of climate variability in models, the NCAR Climate Variability Diagnostics Package (CVDP, Phillips et al., 2014) has been implemented into the ESMValTool. The CVDP has been developed as a standalone tool. To allow for easy updating of the CVDP once a new version is released, the structure of the CVDP is kept in its original form and a single namelist [namelist_CVDP.xml] has been written to enable the CVDP to be run directly within the ESMValTool. The CVDP facilitates evaluation of the major modes of climate variability, including ENSO (Deser et al., 2010), the Pacific Decadal Oscillation (PDO, Deser et al., 2010; Mantua et al., 1997), the Atlantic Multi-decadal Oscillation (AMO, Trenberth and Shea, 2006), the Atlantic Meridional Overturning Circulation (AMOC, Danabasoglu et al., 2012), and atmospheric teleconnection patterns such as the Northern and Southern Annular Modes (NAM, Hurrell and Deser, 2009; Thompson and Wallace, 2000, and SAM, Thompson and Wallace, 2000, respectively), the North Atlantic Oscillation (NAO, Hurrell and Deser, 2009), and the Pacific North and South American (PNA and PSA, respectively; Thompson and Wallace, 2000) patterns.

Figure 7. […] illustrate the inter-model spread in the mean position and intensity of the WAM among the CMIP5 models. The spread is slightly reduced in AMIP simulations, as the warm SST bias in the equatorial Atlantic is removed. The WAM mean structure, however, is not captured by many models. Produced with namelist_WAMonsoon.xml.
For details on the actual calculation of these modes in the CVDP we refer to the original CVDP package and the explanations available at http://www2.cesm.ucar.edu/working-groups/cvcwg/cvdp.
Depending on the climate mode analysed, the CVDP package uses the following variables: precipitation (pr), sea level pressure (psl), near-surface air temperature (tas), skin temperature (ts), snow depth (snd), and the basin-average ocean meridional overturning mass stream function (msftmyz). The models are evaluated against a wide range of observations and reanalysis data, for example NCEP for near-surface air temperature, HadISST for skin temperature, and the NOAA-CIRES Twentieth Century Reanalysis Project (Compo et al., 2011) for sea level pressure. Additional observations or reanalyses can be added by the user for these variables. The ESMValTool (v1.0) namelist runs on all CMIP5 models. As an example, Fig. 8 shows the representation of the PDO as simulated by 41 CMIP5 models and observations (HadISST), and Fig. 9 shows the mean AMOC from 13 CMIP5 models.

Madden-Julian Oscillation (MJO)
The MJO is the dominant mode of tropical intraseasonal variability (30-80 days) and has wide impacts on numerous regional climate and weather phenomena (Madden and Julian, 1971). Associated with enhanced convection in the tropics, the MJO exerts a significant influence on monsoon precipitation, e.g. on the South Asian monsoon (Pai et al., 2011) and on the West African monsoon (Alaka and Maloney, 2012). The eastward propagation of the MJO into the West Pacific can trigger the onset of some El Niño events (Feng et al., 2015; Hoell et al., 2014). The MJO also influences tropical cyclogenesis in various ocean basins (Klotzbach, 2014). Increased vertical resolution in the atmosphere and better representation of stratospheric processes have led to an improvement in MJO fidelity in CMIP5 compared to CMIP3 (Lin et al., 2006). However, current-generation models still struggle to adequately capture the eastward propagation of the MJO (Hung et al., 2013), and its variance is typically too weak. Identifying and reducing such biases will be important for ESMs to accurately represent important climate phenomena, such as regional precipitation variability in the tropics arising through the differing impact of MJO phases on ENSO and ENSO-forced regional climate anomalies (Hoell et al., 2014).
To assess the main MJO features in ESMs, a namelist with a number of diagnostics developed by the US CLIVAR MJO Working Group (Waliser et al., 2009) has been implemented in the ESMValTool (v1.0) [namelist_mjo_mean_state.xml, namelist_mjo_daily.xml]. These diagnostics are calculated using precipitation (pr), outgoing longwave radiation (OLR) (rlut), and eastward (ua) and northward wind (va) at 850 hPa (u850) and 200 hPa (u200) against various observations and reanalysis data sets for boreal summer (May-October) and winter (November-April).
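Several of these diagnostics rest on a 20-100-day bandpass filter of daily data. A crude FFT-based sketch of that band selection is shown below; the operational scripts use windowed (e.g. Lanczos-type) filters, so this is illustrative only, and the function name is our own.

```python
import numpy as np

def bandpass_fft(x, low_period=20.0, high_period=100.0):
    """Retain 20-100-day variability in a daily time series x by zeroing
    all Fourier coefficients outside the band; cruder than the tapered
    filters used in practice, but it shows the band selection."""
    n = len(x)
    freq = np.fft.rfftfreq(n, d=1.0)  # cycles per day
    coeffs = np.fft.rfft(x)
    keep = (freq >= 1.0 / high_period) & (freq <= 1.0 / low_period)
    coeffs[~keep] = 0.0               # zero everything outside 20-100 days
    return np.fft.irfft(coeffs, n)
```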
Observation and reanalysis data sets include GPCP-1DD for precipitation, ERA-Interim and NCEP-DOE reanalysis 2 (Kanamitsu et al., 2002) for wind components, and NOAA polar-orbiting satellite data for OLR (Liebmann and Smith, 1996). The majority of the scripts are based on example scripts at http://ncl.ucar.edu/Applications/mjoclivar.shtml. Daily data are required for most of the scripts. The basic diagnostics include the mean seasonal state and 20-100-day bandpass-filtered variance for precipitation and u850 in summer and winter. To better assess and understand model biases in the MJO, a number of more sophisticated diagnostics have also been implemented. These include univariate empirical orthogonal function (EOF) analysis for 20-100-day bandpass-filtered daily anomalies of precipitation, OLR, u850 and u200. To illustrate the northward and eastward propagation of the MJO, lag-longitude and lag-latitude diagrams show either the equatorial (latitude-) averaged (10° S-10° N) or zonal (longitude-) averaged (80-100° E) intraseasonal precipitation anomalies and u850 anomalies correlated against intraseasonal precipitation at the Indian Ocean reference point (75-100° E, 10° S-5° N). Similar figures can also be produced for other key variables and regions following the definitions of Waliser et al. (2009). To further explore the MJO intraseasonal variability, the wavenumber-frequency spectra for each season are calculated for individual variables. In addition, we also produce cross-spectral plots to quantify the coherence and phase relationships between precipitation and u850. Figure 10 shows examples of boreal summer (May-October) wavenumber-frequency spectra of 10° S-10° N averaged daily precipitation from GPCP-1DD, HadGEM2-ES, MPI-ESM-LR and EC-Earth. Finally, we also calculate the multivariate combined EOF (CEOF) modes using equatorially averaged (15° S-15° N) daily anomalies of u850, u200 and OLR. This analysis demonstrates the relationship between lower- and upper-tropospheric wind anomalies and convection. To further illustrate the spatial-temporal structure of the MJO, the first two leading CEOFs are used to derive a composite MJO life cycle which highlights intraseasonal variability and northward/eastward propagation of the MJO. The data used in these diagnostics are regridded to a common 0.5° × 0.5° grid using an area-conservative method.

Figure 9. Long-term annual mean Atlantic Meridional Overturning Streamfunction (AMOC; Sv) as simulated by 13 CMIP5 models (individual panels labelled by model name) for the historical period 1900-2005. AMOC annual averages are formed, weighted by the cosine of the latitude and by the depth of the vertical layer, and the data are then masked by setting to missing all areas where the variance is less than 1 × 10−6. The figure shows that there is a wide spread among the CMIP5 models, with maximal AMOC strength ranging from ∼ 13 Sv (CanESM2) to over ∼ 28 Sv (NorESM1), while the models agree generally well on the position of maximal AMOC strength.
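The wavenumber-frequency (space-time) spectra mentioned above can be illustrated with a bare 2-D FFT. This sketch omits the segmenting, tapering, and background smoothing of the CLIVAR diagnostics, and the function name is our own; a single propagating wave produces a conjugate pair of peaks in the returned power array.

```python
import numpy as np

def wavenumber_frequency_spectrum(x):
    """Raw power spectrum of an equatorially averaged field x(time, lon).
    Returns (power, freq, wavenumber): freq in cycles per time step,
    wavenumber as integer zonal wavenumber."""
    ntime, nlon = x.shape
    coeffs = np.fft.fft2(x - x.mean()) / (ntime * nlon)  # normalized 2-D FFT
    power = np.abs(coeffs) ** 2
    freq = np.fft.fftfreq(ntime)             # temporal frequency axis
    wavenumber = np.fft.fftfreq(nlon) * nlon  # zonal wavenumber axis
    return power, freq, wavenumber
```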

Diurnal cycle
In addition to the previously discussed biases in precipitation, many ESMs that rely on parameterized convection exhibit biases related to the diurnal cycle and timing of precipitation. Over land, ESMs tend to simulate a diurnal cycle of continental convective precipitation in phase with insolation, while observed precipitation peaks in the early evening. This constitutes one of the endemic biases of ESMs, in which convective precipitation intensity is often related to atmospheric instability. This bias can have important implications for the simulated climate, as the timing of precipitation influences subsequent surface evaporation, and convective clouds affect radiation differently around noon than in the late afternoon. The biases in the diurnal cycle are most pronounced over land areas, and the diurnal cycles of convection and clouds during the day contribute to the continental warm bias (Cheruy et al., 2014). Similarly, biases in the diurnal cycle also exist over the ocean (Jiang et al., 2015). Another motivation for looking at the diurnal cycle in models is that its representation is more closely linked to the parameterizations of surface fluxes, boundary-layer, convection and cloud processes than any other diagnostic. The phase of precipitation and radiative fluxes during the day is the consequence of surface warming, boundary-layer turbulent mixing and cumulus cloud moistening, as well as of the triggering criteria used to activate deep convection and the closure used to compute convective intensity. The evaluation of the diurnal cycle thus provides direct insight into the representation of physical processes in a model. Recent efforts to improve the representation of the diurnal cycle of precipitation in models include modifying the convective entrainment rate, revisiting the quasi-equilibrium hypothesis for shallow and deep convection, and adding a representation of key missing processes such as boundary-layer thermals or cold pools.
We envisage that ESMValTool will help to quantify the impact of those improvements in the next generation of ESMs.
To help document progress made in the representation of the diurnal cycle of precipitation (pr) in models, a set of diagnostics has been implemented in the ESMValTool. After regridding all data onto a common 2.5° × 2.5° grid using bilinear interpolation, the mean diurnal cycle computed every 3 h is approximated at each grid point by a sum of sine and cosine functions (first harmonic analysis), allowing one to derive global maps of the amplitude and phase of maximum rainfall over the day. The mean diurnal cycle of precipitation is also provided over specific regions in the tropics. Over land, we contrast semi-arid (Sahel) and humid (Amazonia) regions as well as West Africa and India. Over the ocean, we focus on the Gulf of Guinea, the Indian Ocean and the eastern and western equatorial Pacific. We use TRMM 3B42 V7 as a reference (http://mirador.gsfc.nasa.gov/collections/TRMM_3B42_daily__007.shtml). The ESMValTool also includes diagnostics for the evaluation of the diurnal cycle of radiative fluxes at the top of the atmosphere and at the surface, and their decomposition into LW and SW, total and clear-sky components; however, not all are available for all models from the CMIP5 archive. As a reference, we use 3-hourly SYN1deg CERES products (Wielicki et al., 1996), derived from measurements at the top of the atmosphere and computed using a radiative transfer model at the surface (http://ceres.larc.nasa.gov/products.php?product=SYN1deg). These diagnostics provide a first insight into the representation of the diurnal cycle, but further analysis is required to understand the links between the model's parameterizations and the representation of the diurnal cycle, as well as the impact of errors in the diurnal cycle on other, slower timescale climate processes. Figure 11 shows the evaluation against TRMM observations of the mean diurnal cycle averaged over specific regions in the tropics for five summers (2004-2008) simulated by four CMIP5 ESMs.
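The first-harmonic analysis described above can be sketched as follows (our own minimal version, applied here to the 3-hourly mean cycle; the actual diagnostic works on the regridded climatology at every grid point):

```python
import numpy as np

def diurnal_first_harmonic(pr_3h):
    """Amplitude and local hour of maximum of the diurnal cycle from the
    mean 3-hourly cycle pr_3h, shape (8, ...) with axis 0 covering
    00:00-21:00. Fits pr = mean + a*cos(wt) + b*sin(wt) and returns the
    amplitude and phase (as hour of maximum) of that first harmonic."""
    nstep = pr_3h.shape[0]
    hours = np.arange(nstep) * 24.0 / nstep
    w = 2.0 * np.pi / 24.0
    shape = (nstep,) + (1,) * (pr_3h.ndim - 1)
    a = 2.0 * (pr_3h * np.cos(w * hours).reshape(shape)).mean(axis=0)
    b = 2.0 * (pr_3h * np.sin(w * hours).reshape(shape)).mean(axis=0)
    amplitude = np.hypot(a, b)
    hour_of_max = (np.arctan2(b, a) / w) % 24.0  # maximum where w*t = phase
    return amplitude, hour_of_max
```

Mapping the returned amplitude and hour of maximum over all grid points yields the global amplitude and phase maps of the diagnostic.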

Clouds and radiation
Clouds are a key component of the climate system because of their large impact on the radiation budget as well as their crucial role in the hydrological cycle. The simulation of clouds in climate models has been challenging because of the many non-linear processes involved (Boucher et al., 2013). Simulations of long-term mean cloud properties from the CMIP3 and CMIP5 models show large biases compared to observations (Klein et al., 2013; Lauer and Hamilton, 2013). Such biases have a range of implications as they affect the application of these models to investigate chemistry-climate and aerosol-cloud interactions, while also having an impact on the climate sensitivity of the model.
The namelist namelist_lauer13jclim.xml computes the climatology and interannual variability of climate-relevant cloud variables such as cloud radiative forcing, liquid and ice water path, and cloud cover, and reproduces the evaluation results of Lauer and Hamilton (2013). The standard namelist includes a comparison of the geographical distribution of multi-year average cloud parameters from individual models and the multi-model mean with satellite observations. Taylor diagrams are generated that show the multi-year annual or seasonal average performance of individual models and the multi-model mean in reproducing satellite observations. The diagnostic routine also facilitates the assessment of the bias of the multi-model mean and zonal averages of individual models compared with satellite observations. Interannual variability is estimated as the relative temporal standard deviation from multi-year time series of data, with the temporal standard deviations calculated from monthly anomalies after subtracting the climatological mean seasonal cycle. Data regridding is applied using a bilinear interpolation method, choosing the grid of the reference data set as a target. As an example, Fig. 12 shows the bias of the 20-year average annual mean cloud radiative effects from CMIP5 models (multi-model mean) against the CERES EBAF satellite climatology (2001-2012) (Loeb et al., 2009, 2012), similar to Flato et al. (2013; their Fig. 9.5).
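The interannual variability estimate described above can be sketched per time series as follows (our own minimal reading of the Lauer and Hamilton (2013) procedure; the diagnostic applies this at each grid point of the regridded fields):

```python
import numpy as np

def relative_interannual_variability(x):
    """Relative temporal standard deviation of a monthly series x with
    shape (ntime, ...), ntime a multiple of 12: monthly anomalies are
    formed by subtracting the mean seasonal cycle, and their standard
    deviation is normalized by the climatological mean."""
    nyears = x.shape[0] // 12
    clim = x.reshape((nyears, 12) + x.shape[1:]).mean(axis=0)  # seasonal cycle
    anom = x - np.tile(clim, (nyears,) + (1,) * (x.ndim - 1))  # monthly anomalies
    return anom.std(axis=0) / x.mean(axis=0)
```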
The cloud namelist focuses on precipitation (pr) and four cloud parameters that largely determine the impact of clouds on the radiation budget and thus climate in the model simulations: total cloud amount (clt), liquid water path (lwp), ice water path (iwp), and the TOA cloud radiative effect (CRE), consisting of the longwave CRE and shortwave CRE, which can also be separately evaluated with the performance metrics namelist (see Sect. 4.1.1). Precipitation is evaluated with GPCP data, total cloud amount with MODIS, liquid water path with passive-microwave satellite observations from the University of Wisconsin (O'Dell et al., 2008), and the ice water path with MODIS Cloud Feedback Model Intercomparison Project (MODIS-CFMIP; Pincus et al., 2012; King et al., 2003) data.

Figure 11. Mean diurnal cycle of precipitation (mm h−1) averaged over five summers (2004-2008) over specific regions in the tropics (Sahel, West Africa, Gulf of Guinea, India, Indian Ocean, Amazonia, eastern equatorial Pacific, and western equatorial Pacific) as observed by TRMM 3B42 V7 and as simulated by four CMIP5 models: CNRM-CM5, EC-Earth, HadGEM2-A, and IPSL-CM5A-LR. ESMs produce a too strong peak of rainfall around noon over land, while the observed precipitation maximum is weaker and delayed to 18:00. At the same time, most models underestimate nocturnal precipitation. Over the ocean, the diurnal cycle of precipitation is flatter, but the rainfall maximum usually occurs a few hours earlier than in observations during the night, and the amplitude of oceanic precipitation shows large variations among models. Produced with namelist_DiurnalCycle_box_pr.xml.

Figure 12. Climatological (1985-2005) annual-mean cloud radiative effects from the CMIP5 models against CERES EBAF (2001-2012) in W m−2. The top row shows the shortwave effect, the middle row the longwave effect, and the bottom row the net effect. Multi-model-mean biases against CERES EBAF 2.7 are shown on the left, whereas the right panels show zonal averages from CERES EBAF 2.7 (black), the individual CMIP5 models (thin grey lines), and the multi-model mean (red). The multi-model mean longwave CRE is overestimated in models, particularly in the Pacific and Atlantic south of the inter-tropical convergence zone (ITCZ) and in the South Pacific convergence zone (SPCZ). The longwave CRE is underestimated over Central and South America as well as parts of Central Africa and southern Asia. The most striking biases in the multi-model mean shortwave CRE are found in the stratocumulus regions off the west coasts of North and South America, southern Africa, and Australia. Despite biases in component cloud properties, the simulated CRE is in quite good agreement with observations. Reproducing Fig. 9.5 of Flato et al. (2013) and produced with namelist_flato13ipcc.xml.

Quantitative performance assessment of cloud regimes
The cloud-climate radiative feedback process remains one of the largest sources of uncertainty in determining the climate sensitivity of models (Boucher et al., 2013). Traditionally, clouds have been evaluated in terms of their impact on the mean top-of-atmosphere fluxes. However, it is possible to achieve good performance on these quantities through compensating errors; for example, boundary layer clouds may be too reflective but have insufficient horizontal coverage (Nam et al., 2012). Williams and Webb (2009) proposed a Cloud Regime Error Metric (CREM) which critically tests the ability of a model to simulate both the relative frequency of occurrence and the radiative properties correctly for a set of cloud regimes determined by the daily mean cloud top pressure, in-cloud albedo and fractional coverage at each grid box. The regimes having previously been identified by clustering joint cloud-top pressure-optical depth histograms from the International Satellite Cloud Climatology Project (ISCCP, Rossow and Schiffer, 1999) as per Williams and Webb (2009), each daily model grid box is assigned to the regime cluster centroid with the closest cloud top pressure, in-cloud albedo and fractional coverage as determined by the three-element Euclidean distance. The fraction of grid points assigned to each of the regimes and the mean radiative properties of those grid points are then compared to the observed values. This routine also uses a bilinear regridding method with a 2.5° × 2.5° target grid. This metric is now implemented in the ESMValTool (v1.0), with references in the code to tables in the Williams and Webb (2009) study defining the cluster centroids [namelist_williams09climdyn_CREM.xml].
Required daily input data are the ISCCP mean cloud albedo (albisccp), ISCCP mean cloud top pressure (pctisccp), ISCCP total cloud fraction (cltisccp), TOA outgoing shortwave and longwave radiation (rsut, rlut), TOA outgoing shortwave and longwave clear-sky radiation (rsutcs, rlutcs), surface snow area fraction (snc) or surface snow amount (snw), and sea ice area fraction (sic). The metric has been applied over the period January 1985 to December 1987 to those CMIP5 models with the required diagnostics (daily data) available for their AMIP simulation (see caption of Fig. 13). A perfect score with respect to ISCCP would be zero. Williams and Webb (2009) also compared data from MODIS and the Earth Radiation Budget Experiment (ERBE, Barkstrom, 1984) to ISCCP in order to provide an estimate of observational uncertainty. This observational regime characteristic was found to be 0.96, as marked in Fig. 13, when calculated over the period March 1985 to February 1990. Hence a model with a score similar to this value can be considered to be within observational uncertainty, although this does not necessarily mean that the model lies within the observations for each regime. Error bars are not plotted since experience has shown that the metric has little sensitivity to interannual variability, and models that are visibly different in Fig. 13 are likely to be significantly so. A minimum of 2 years, and ideally 5 years or more, of daily data are required for the scientific analysis.
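The regime-assignment step of the CREM can be sketched as follows. The centroid values below are placeholders, not the actual tables of Williams and Webb (2009), and the normalization by column maxima is an illustrative choice to keep the three elements of the Euclidean distance comparable in magnitude:

```python
import numpy as np

# Hypothetical regime centroids (cloud-top pressure [hPa], in-cloud albedo,
# total cloud fraction); the real values are tabulated in Williams and
# Webb (2009) and referenced in the diagnostic code.
centroids = np.array([
    [250.0, 0.60, 0.90],   # e.g. frontal/deep convective
    [550.0, 0.35, 0.70],   # e.g. mid-level
    [850.0, 0.40, 0.80],   # e.g. stratocumulus
    [900.0, 0.30, 0.30],   # e.g. shallow cumulus
])

def assign_regimes(pct, alb, clt):
    """Assign each daily grid box to the nearest regime centroid.

    pct, alb, clt: 1-D arrays of daily-mean cloud top pressure, in-cloud
    albedo, and total cloud fraction (one entry per grid box and day).
    Returns an integer array of regime indices.
    """
    points = np.column_stack([pct, alb, clt])
    # Scale each element so that pressure (numerically much larger) does
    # not dominate the three-element Euclidean distance (assumed scaling).
    scale = centroids.max(axis=0)
    d = np.linalg.norm(points[:, None, :] / scale
                       - centroids[None, :, :] / scale, axis=2)
    return d.argmin(axis=1)

# Two example grid boxes: one deep-convective-like, one stratocumulus-like
regimes = assign_regimes(np.array([260.0, 880.0]),
                         np.array([0.55, 0.42]),
                         np.array([0.85, 0.75]))
```

The relative frequency of occurrence of each regime then follows from counting the assigned indices, and the regime radiative properties from averaging the fluxes of the grid boxes in each regime.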

Handling of ocean grids
Analysis of ocean model data from ESMs poses several unique challenges. First, in order to avoid numerical singularities in their calculations, ocean models often use irregular grids in which the poles have been rotated or moved to lie over land areas. For example, the global configuration of the Nucleus for European Modelling of the Ocean (NEMO) framework uses a tripolar grid (Madec, 2008), with the three poles located over Siberia, Canada, and Antarctica. Second, transports of scalar quantities (e.g. overturning stream functions and heat transports) can only be calculated accurately on the original model grids, as interpolation to other grids introduces errors. This means that, e.g. for the calculation of water transport through a strait, both the horizontal and vertical extent of the grids on which the u and v currents are defined are required. Therefore, this type of diagnostic can only be used for models for which all native grid information is available. State variables like SSTs, sea ice, and salinity are regridded using grid information (i.e. coordinates, bounds, and cell areas) available in the ocean input files of the CMIP5 models. To create difference plots against observations or other models, all data are regridded to a common regular grid (e.g. 1° × 1°) using the regridding functionality of the Earth System Modeling Framework (ESMF, https://www.ncl.ucar.edu/Applications/ESMF.shtml).

Figure 13. Cloud Regime Error Metric (CREM) from Williams and Webb (2009) applied to those CMIP5 AMIP simulations with the required data in the archive. The results show that MIROC5 is the best-performing model on this metric; the other models score slightly worse. The red dashed line shows the observational uncertainty estimated by applying this metric to independent data from MODIS. An advantage of the metric is that its components can be decomposed to investigate the reasons for poor performance. This requires extra print statements compared to the default code but may help to identify, for instance, cloud regimes that are too reflective or simulated too frequently at the expense of some of the other regimes. Produced with namelist_williams09climdyn_CREM.xml.
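The regridding to a common regular grid is performed with the ESMF library; as a minimal illustration of the underlying idea, the following numpy sketch bilinearly interpolates a field onto target coordinates, assuming for simplicity a regular source grid (curvilinear ocean grids require a generic regridder such as ESMF, which handles arbitrary source meshes):

```python
import numpy as np

def bilinear_regrid(lon_src, lat_src, field, lon_t, lat_t):
    """Bilinearly interpolate field(lat_src, lon_src) onto the 1-D target
    coordinate arrays lon_t/lat_t (e.g. a common 1-degree grid).

    lon_src/lat_src must be monotonically increasing 1-D arrays of a
    regular source grid; field has shape (lat_src.size, lon_src.size).
    """
    # Indices of the source cell containing each target coordinate
    ix = np.clip(np.searchsorted(lon_src, lon_t) - 1, 0, len(lon_src) - 2)
    iy = np.clip(np.searchsorted(lat_src, lat_t) - 1, 0, len(lat_src) - 2)
    # Fractional position of the target point within the source cell
    wx = (lon_t - lon_src[ix]) / (lon_src[ix + 1] - lon_src[ix])
    wy = (lat_t - lat_src[iy]) / (lat_src[iy + 1] - lat_src[iy])
    # Corner values of each enclosing cell
    f00 = field[np.ix_(iy, ix)]
    f01 = field[np.ix_(iy, ix + 1)]
    f10 = field[np.ix_(iy + 1, ix)]
    f11 = field[np.ix_(iy + 1, ix + 1)]
    WX, WY = np.meshgrid(wx, wy)
    return ((1 - WY) * ((1 - WX) * f00 + WX * f01)
            + WY * ((1 - WX) * f10 + WX * f11))
```

Once model and observations share the common grid, difference maps are a simple array subtraction.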

Southern Ocean diagnostics Southern Ocean mixed-layer dynamics and surface turbulent fluxes
Earth system models often show large biases in the Southern Ocean mixed layer. For example, Sterl et al. (2012) showed that in EC-Earth/NEMO the Southern Ocean is too warm and its salinity too low, while the mixed layer is too shallow. These biases are not specific to EC-Earth but are rather widespread. At the same time, values for Antarctic Circumpolar Current (ACC) transport vary between 90 and 264 Sv in CMIP5 models, with a mean of 155 ± 51 Sv. The differences are associated with differences in the ACC density structure.

Figure 14. Annual-mean difference between EC-Earth/NEMO and ERA-Interim sea surface temperatures (a), the World Ocean Atlas sea surface salinity (b), and the Argo float observations for ocean mixed-layer thickness (c), showing that in the Southern Ocean SSTs in EC-Earth are too high, sea surface salinity too fresh, and the mixed layer too shallow. The other diagnostics available in namelist_SouthernOcean.xml help in understanding these biases. Vertical sections of temperature (d) and salinity differences (e) reveal that the SST bias is mainly an austral summer problem, but also that vertical mixing is not able to penetrate a year-round warm layer below 80 m depth.
A namelist has been implemented in the ESMValTool to analyse these biases [namelist_SouthernOcean.xml]. With these diagnostics, polar stereographic (difference) maps can be produced to compare monthly/annual mean model fields with corresponding ERA-Interim data. The patch recovery technique is applied to regrid data to a common 1° × 1° grid. There are also scripts to plot the differences in the area-mean vertical profiles of ocean temperature and salinity between models and data from the World Ocean Atlas (Locarnini et al., 2010). The ocean mixed-layer thickness from models can be compared with that obtained from the Argo floats (Dong et al., 2008). Finally, the ACC strength, as measured by water mass transport through the Drake Passage, is calculated using the same method as in the CDFTOOLS package (CDFTOOLS, http://servforge.legi.grenoble-inp.fr/projects/CDFTOOLS). This diagnostic can be used to calculate the transport through other sections as well, but is presently only available for NEMO/ORCA1 output, for which all grid information is available. The required variables for the comparison with ERA-Interim are sea surface temperature (tos), downward heat flux (hfds, calculated from ERA-Interim by summing the surface latent and sensible heat flux and the net shortwave and longwave fluxes (hfls + hfss + rsns + rlns)), water flux (wfpe, calculated by summing precipitation and evaporation (pr + evspsbl)), and the wind stress components (tauu and tauv). For the comparison with the World Ocean Atlas 2009 data (WOA09), sea surface salinity (sos), seawater salinity (so), and temperature (to) are required variables. For the comparison with the Argo floats, the ocean mixed-layer thickness (mlotst) is required. Finally, the two components of seawater velocity (uo and vo) are required for the volume transport calculation. Some example figures from this set of diagnostic scripts are shown for EC-Earth in Fig. 14.
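The principle behind the transport calculation on the native grid can be sketched as follows; this is a simplified illustration, not the CDFTOOLS implementation, and assumes the normal velocity and the horizontal/vertical cell extents along the section are already extracted:

```python
import numpy as np

def section_transport(u, dy, dz):
    """Volume transport (Sv) through an ocean section on the native grid.

    u:  (nz, ny) velocity normal to the section in m/s
    dy: (ny,) horizontal extent of each cell along the section in m
    dz: (nz,) vertical extent of each cell in m

    The transport is the sum over all section cells of
    velocity * cell height * cell width, converted to Sverdrups.
    """
    transport_m3s = np.sum(u * dz[:, None] * dy[None, :])
    return transport_m3s / 1e6  # 1 Sv = 1e6 m^3/s

# Toy example: uniform 0.1 m/s flow through a 2-level, 3-cell section
acc = section_transport(np.full((2, 3), 0.1),
                        np.array([1e5, 1e5, 1e5]),
                        np.array([1000.0, 1000.0]))
```

Because dy and dz must describe the native u/v grid cells, the diagnostic is restricted to models for which this grid information is available, as noted above.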

Atmospheric processes forcing the Southern Ocean
One leading cause of SST biases in the Southern Ocean is systematic bias in surface radiation fluxes (Trenberth and Fasullo, 2010), coupled with systematic errors in macrophysical (e.g. cloud amount) and microphysical (e.g. frequency of mixed-phase clouds) cloud properties (Bodas-Salcedo et al., 2014).
A namelist has been implemented in the ESMValTool that compares model estimates of cloud, radiation, and surface turbulent flux variables over the Southern Ocean with suitable observations [namelist_SouthernHemisphere.xml]. Due to the lack of surface/in situ observations over the Southern Ocean, remotely sensed data can be subject to considerable uncertainty (Mace, 2010). While this uncertainty is not explicitly addressed in ESMValTool (v1.0), in future releases we will include a number of alternative satellite-based data sets for cloud variables (e.g. MISR, MODIS, ISCCP) as well as new methods under development to derive surface turbulent flux estimates constrained by observed TOA radiation flux estimates and atmospheric energy divergence derived from reanalysis products (Trenberth and Fasullo, 2008). Inclusion of multiple satellite-based estimates will provide some estimate of observational uncertainty over the region. Variables analysed include (i) total cloud cover (clt) and vertically integrated cloud liquid water and cloud ice water (clwvi, clivi), (ii) surface/(TOA) downward/outgoing total-sky and clear-sky shortwave and longwave radiation fluxes (rsds, rsdcs, rlds, rldscs/rsut, rsutcs, rlut, rlutcs), and (iii) surface turbulent latent and sensible heat fluxes (hfls, hfss). Observational constraints are derived from, respectively, cloud: CloudSat level 3 data (Stephens et al., 2002); radiation: CERES-EBAF level 3 Ed2 data; and surface turbulent fluxes: WHOI-OAFlux (Yu et al., 2008).
The following diagnostics are calculated with accompanying plots: (i) seasonal mean absolute-value and difference maps for model data versus observations covering the Southern Ocean region (30–65° S) for all variables; (ii) mean seasonal cycles using zonal means averaged separately over three latitude bands: (a) 30–65° S, the entire Southern Ocean, (b) 30–45° S, the subtropical Southern Ocean, and (c) 45–65° S, the mid-latitude Southern Ocean; (iii) annual means of each variable (models and observations) plotted as zonal means over 30–65° S; and (iv) scatterplots of seasonal mean downward (surface) and outgoing (TOA) longwave and shortwave radiation as a function of total cloud cover, cloud liquid water path, or cloud ice water path, calculated for the three regions outlined above. The data are regridded using a cubic interpolation method with the observation grid as a target. Figure 15 provides an example diagnostic, with the top panel showing covariability of seasonal mean surface downward shortwave radiation as a function of total cloud cover. To construct the figure, grid point values of cloud cover for each season covering 30 to 65° S are sorted into bins of 5 % increasing cloud cover. For each grid point the corresponding seasonal mean radiation value is used to obtain a mean radiation flux for each cloud cover bin. The lower panel plots the fractional occurrence of seasonal mean cloud cover from CloudSat and model data for the same spatial and temporal averaging as used in the upper panel. Observations from CERES-EBAF radiation plotted against CloudSat cloud cover are compared to an example CMIP5 model. From the covariability plot we can diagnose whether models exhibit a similar dependency between incoming surface shortwave radiation and cloud cover as seen in observations. We can further assess whether there is a systematic bias in surface solar radiation and whether this bias occurs at specific values of cloud cover.
Similar covariability plots are available for surface incoming longwave radiation and for TOA longwave and shortwave radiation, plotted, respectively, against cloud cover, cloud liquid water path, and cloud ice water path. Combining these diagnostics provides a comprehensive evaluation of simulated relationships between surface and TOA radiation fluxes and cloud variables.
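The binning procedure behind these covariability plots can be sketched as follows; the function name and bin handling are illustrative, not the namelist's actual code:

```python
import numpy as np

def bin_by_cloud_cover(clt, rsds, bin_width=5.0):
    """Mean radiation flux per cloud-cover bin.

    clt:  1-D array of grid-point cloud cover in % (0-100)
    rsds: 1-D array of the corresponding seasonal-mean radiation values

    Cloud cover is sorted into bin_width-% bins and the radiation values
    falling in each bin are averaged; empty bins are returned as NaN.
    """
    edges = np.arange(0.0, 100.0 + bin_width, bin_width)
    # digitize returns 1-based bin indices; clip keeps clt == 100 in-range
    idx = np.clip(np.digitize(clt, edges) - 1, 0, len(edges) - 2)
    means = np.full(len(edges) - 1, np.nan)
    for b in range(len(edges) - 1):
        sel = idx == b
        if sel.any():
            means[b] = rsds[sel].mean()
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, means
```

The fractional-occurrence histogram of the lower panel is simply the count per bin divided by the total number of grid points.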

Simulated tropical ocean climatology
An accurate representation of the tropical climate is fundamental for ESMs. The majority of solar energy received by the Earth is in the tropics, and the potential for thermal emission of absorbed energy back into space is also largest in the tropics due to the high column concentrations of water vapour at low latitudes (Pierrehumbert, 1995; Stephens and Greenwald, 1991). Coupled interactions between equatorial SSTs, surface wind stress, precipitation, and upper-ocean mixing are central to many tropical biases in ESMs. This is the case both with respect to the mean state and for key modes of variability influenced by, or interacting with, the mean state (e.g. ENSO; Choi et al., 2011). Such biases are often reflected in a "double ITCZ" seen in the majority of CMIP3 and CMIP5 CCMs (Li and Xie, 2014; Oueslati and Bellon, 2015). The double ITCZ bias, present in many ESMs, occurs when models fail to simulate a single, year-round ITCZ rainfall maximum north of the Equator. Instead, an unrealistic secondary maximum south of the Equator is present for part or all of the year. Such biases are particularly prevalent in the tropical Pacific, but can also occur in the Atlantic (Oueslati and Bellon, 2015). This double ITCZ is often accompanied by an overextension of the eastern Pacific equatorial cold tongue into the central Pacific, collocated with a positive bias in easterly near-surface wind speeds and a shallow bias in ocean mixed-layer depth (Lin, 2007). Such biases can directly impact the ability of an ESM to accurately represent ENSO variability (Guilyardi, 2006) and its potential sensitivity to climate change (Chen et al., 2015), with negative consequences for a range of simulated features, such as regional tropical temperature and precipitation variability, monsoon dynamics, and ocean and terrestrial carbon uptake (Iguchi, 2011; Jones et al., 2001).

Figure 15. Upper panel: covariability between incoming surface shortwave radiation (rsds) and total cloud cover (clt). Lower panel: fractional occurrence histograms of binned cloud cover; observations are CERES-EBAF (radiation) and CloudSat (cloud cover). The CanESM2 model from the CMIP5 archive is shown as an example for comparison to observations (the namelist runs on all CMIP5 models). CanESM2 generally reproduces the observed slope of rsds as a function of clt, although there is a systematic positive bias in the amount of shortwave radiation reaching the surface for most cloud cover values. A positive bias is also seen in the CanESM2 histogram of cloud occurrence, with a strong peak in seasonal cloud fraction of 90 % in most seasons. Produced with namelist_SouthernHemisphere.xml.
To assess such tropical biases with the ESMValTool, we have implemented a namelist with diagnostics motivated by the work of Li and Xie (2014): namelist_TropicalVariability.xml. In particular, we reproduce their Fig. 5 for models and observations/reanalyses, calculating equatorial mean (5° N–5° S) longitudinal sections of annual mean precipitation (pr), skin temperature (ts), horizontal winds (ua and va), and 925 hPa divergence (derived from the sum of the partial derivatives of the wind components extracted at the 925 hPa pressure level, that is, du/dx + dv/dy). Latitude cross sections of the model variables are plotted for the equatorial Pacific, Indian, and Atlantic oceans, with observational constraints provided by TRMM-3B43-v7 for precipitation, HadISST for SSTs, and the ERA-Interim reanalysis for temperature and winds. Latitudinal sections of absolute and normalized annual mean SST and precipitation are also calculated, spatially averaged for the three ocean basins. Normalization follows the procedure outlined in Fig. 1 of Li and Xie (2014), whereby values at each latitude are normalized by the tropical mean (20° N–20° S) value of the corresponding parameter (e.g. annual mean precipitation at a given location is divided by the 20° N–20° S annual mean value). Finally, to assess how models capture observed relationships between SST and precipitation, we calculate the covariability of precipitation against SST for specific regions of the tropical Pacific. This analysis includes calculation of the mean square error (MSE) between model SST/precipitation and observational equivalents. A similar regridding procedure as for the Southern Hemisphere diagnostics is applied here, based on a cubic interpolation method and using the observations as a target grid. The namelist as included in the ESMValTool (v1.0) runs on all CMIP5 models.
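The 925 hPa divergence du/dx + dv/dy can be computed with centred differences; the sketch below is an illustrative implementation for a regular lat-lon grid, with spherical metric factors (dx shrinking as cos(lat)) added as our own assumption about how the derivatives are evaluated:

```python
import numpy as np

EARTH_RADIUS = 6.371e6  # m

def divergence_925(u, v, lon, lat):
    """Horizontal divergence du/dx + dv/dy (1/s) of the 925 hPa wind.

    u, v: 2-D (lat, lon) wind components in m/s at 925 hPa
    lon, lat: 1-D coordinate arrays in degrees (regular grid)
    """
    lam = np.deg2rad(lon)
    phi = np.deg2rad(lat)
    # du/dx: zonal derivative, with dx = a * cos(lat) * dlon
    dudx = np.gradient(u, lam, axis=1) / (EARTH_RADIUS * np.cos(phi)[:, None])
    # dv/dy: meridional derivative, with dy = a * dlat
    dvdy = np.gradient(v, phi, axis=0) / EARTH_RADIUS
    return dudx + dvdy
```

For the equatorial band used here (5° N–5° S) the cos(lat) factor is close to 1, so the simple Cartesian form quoted in the text is a good approximation.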

Sea ice
Sea ice is a key component of the climate system through its effects on radiation and seawater density. A reduction in sea ice area results in increased absorption of shortwave radiation, which warms the sea ice region and contributes to further sea ice loss. This process is often referred to as the sea ice albedo feedback and is part of the Arctic amplification phenomenon. CMIP5 models tend to underestimate the decline in summer Arctic sea ice extent observed by satellites during the last decades (Stroeve et al., 2012), which may be related to the models' underestimation of the sea ice albedo feedback process (Boé et al., 2009). Conversely, in the Antarctic, observations show a small increase in March sea ice extent, while the CMIP5 models simulate a small decrease (Flato et al., 2013; Stroeve et al., 2012). It is therefore important that model sea ice processes are evaluated and improvements regularly assessed. Caveats have been noted with respect to the limitations of using only sea ice extent as a metric of model performance, as the sea ice concentration, volume, and drift, sea ice thickness and surface albedo, as well as sea ice processes such as melt pond formation or the summer sea ice melt, are all important sea-ice-related quantities. In addition, atmospheric forcings (e.g. wind, clouds, and snow) and ocean forcings (e.g. salinity and ocean transport) impact the sea ice state and evolution.
In ESMValTool (v1.0) the sea ice namelist includes diagnostics that cover sea ice extent and concentration [namelist_SeaIce.xml], but work is underway to include other variables and processes in future releases. An example diagnostic produced by the sea ice namelist is given in Fig. 17, which shows the time series of September Arctic sea ice extent from the CMIP5 historical simulations compared to observations from the National Snow and Ice Data Center (NSIDC), produced by combining concentration estimates created with the NASA Team algorithm and the Bootstrap algorithm (Peng et al., 2013), and SSTs from the HadISST data set, similar to Fig. 9.24 of Flato et al. (2013). Sea ice extent is calculated as the total area (km²) of grid cells over the Arctic or Antarctic with sea ice concentrations (sic) of at least 15 %. The sea ice namelist can also calculate the seasonal cycle of sea ice extent as well as polar stereographic contour and polar contour difference plots of Arctic and Antarctic sea ice concentrations. For the latter diagnostic, data are regridded to a common 1° × 1° grid using the patch recovery technique.
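The extent calculation itself reduces to a threshold and an area sum; a minimal sketch (function name ours) for one time step:

```python
import numpy as np

def sea_ice_extent(sic, area_km2, threshold=15.0):
    """Sea ice extent (km^2) for one hemisphere and time step.

    sic:      2-D array of sea ice concentration in %
    area_km2: 2-D array of the corresponding grid cell areas in km^2

    Extent is the total area of all cells whose concentration is at
    least the threshold (15 % in the namelist).
    """
    return area_km2[sic >= threshold].sum()
```

Applying this per month over the historical period yields the September time series of Fig. 17; note that the cell areas must come from the model's native grid description, not from an assumed regular grid.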

Continental dry bias
The representation of land surface processes and fluxes in climate models critically affects the simulation of near-surface climate over land. In particular, energy partitioning at the surface strongly influences surface temperature, and it has been suggested that temperature biases in ESMs can in part be related to biases in evapotranspiration. The most notable feature in the majority of CMIP3 and CMIP5 models is a tendency to overestimate evapotranspiration globally (Mueller and Seneviratne, 2014). A diagnostic to analyse the representation of evapotranspiration in ESMs has been included in the ESMValTool [namelist_Evapotranspiration.xml]. For comparison with the LandFlux-EVAL product, the modelled surface latent heat flux (hfls) is converted to evapotranspiration units using the latent heat of vaporization. The diagnostic then produces lat-lon maps of absolute evapotranspiration as well as bias maps (model minus reference product, after regridding data to the coarsest grid using area-conservative interpolation). In Fig. 18, the global pattern of monthly mean evapotranspiration is evaluated against the LandFlux-EVAL product. The evapotranspiration diagnostic is complemented by the Standardized Precipitation Index (SPI) diagnostic [namelist_SPI.xml], which gives a measure of drought intensity from an atmospheric perspective and can help relate biases in evapotranspiration to atmospheric causes such as the accumulated precipitation amounts. For each month, precipitation (pr) is summed over the preceding months (options for 3-, 6- or 12-monthly SPI). Then a two-parameter gamma distribution of cumulative probability is fitted to the strictly positive month sums, such that the probability of a non-zero precipitation sum being below a certain value x corresponds to Γ(x). The shape and scale parameters of the gamma distribution are estimated with a maximum likelihood approach.
Accounting for periods of no precipitation, occurring at a frequency q, the total cumulative probability of a precipitation sum below x becomes H(x) = q + (1 − q) · Γ(x). In the last step, a precipitation sum x is assigned to its corresponding SPI value by computing the quantile of the standard normal distribution N(0, 1) at probability H(x). The SPI of a precipitation sum x thus corresponds to the quantile of the standard normal distribution which is assigned by preserving the probability of the original precipitation sum, H(x). Mean and annual cycle are not meaningful since the SPI accounts for seasonality and transforms the data to a zero average in each month. Therefore the diagnostic focuses on lat-lon maps of annual or seasonal trends in SPI (unitless) when comparing models with observations.
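The SPI transformation for one calendar month can be condensed into a few lines with scipy; this is a sketch of the steps described above, and the clipping of H(x) away from 0 and 1 is a numerical safeguard of ours, not part of the described procedure:

```python
import numpy as np
from scipy import stats

def spi(precip_sums):
    """Standardized Precipitation Index for one calendar month.

    precip_sums: array of n-month precipitation sums (one per year).
    Fits a two-parameter gamma distribution (location fixed at 0) to the
    strictly positive sums by maximum likelihood, accounts for the
    zero-precipitation frequency q via H(x) = q + (1 - q) * Gamma_cdf(x),
    and maps each sum to the standard normal quantile at H(x).
    """
    x = np.asarray(precip_sums, dtype=float)
    q = np.mean(x == 0.0)                  # frequency of zero sums
    pos = x[x > 0.0]
    shape, _, scale = stats.gamma.fit(pos, floc=0.0)  # MLE, loc fixed at 0
    cdf = np.where(x > 0.0, stats.gamma.cdf(x, shape, scale=scale), 0.0)
    h = q + (1.0 - q) * cdf
    h = np.clip(h, 1e-6, 1.0 - 1e-6)       # avoid infinite quantiles
    return stats.norm.ppf(h)
```

By construction the resulting values are approximately standard normal within each month, which is why only trends (not means or annual cycles) are meaningful diagnostics.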

Runoff
Evaluation of precipitation is a challenge due to potentially large errors and uncertainty in observed precipitation data (Biemans et al., 2009; Legates and Willmott, 1990). An alternative or additional option to the direct evaluation of precipitation over land (such as that included in the global precipitation evaluation in Sect. 4.1.2) is the evaluation of river runoff, which can in principle be measured with comparatively small errors for most rivers. Routine measurements are performed for many large rivers, generating a large global database (e.g. available at the Global Runoff Data Centre; GRDC, Dümenil Gates et al., 2000). The length of available time series, however, varies between rivers, with large data gaps especially in recent years for many rivers. The evaluation of runoff against river gauge data can provide a useful independent measure of the simulated hydrological cycle. If both river flow and precipitation are given with reasonable accuracy, it will also provide an observational constraint on model surface evaporation, provided that the considered averaging time periods are long enough that changes in surface water storage are negligible (Hagemann et al., 2013), e.g. by considering climatological means of 20 years or more. For present climate conditions, ESMs often exhibit a dry and warm near-surface bias during summer over mid-latitude continents (Hagemann et al., 2004). Continental dry biases in precipitation exist in the majority of CMIP5 models over South America, the Midwest of the US, the Mediterranean region, central and eastern Europe, and western and South Asia (Fig. 4 of this paper and Fig. 9.4 of Flato et al., 2013). These precipitation biases often transfer into dry biases in runoff, but dry biases in runoff can sometimes also be caused by excessive evapotranspiration (Hagemann et al., 2013). In order to relate biases in runoff to biases in precipitation and evapotranspiration, the catchment-oriented evaluation in this section considers biases in all three variables. This means that the respective variables are spatially averaged over the drainage basins of large rivers.

Figure 18. Bias in evapotranspiration (mm day⁻¹) for July in a subset of CMIP5 models in reference to the LandFlux-EVAL evapotranspiration product. The global mean bias is indicated for each model as well as the RMSE. The comparison reveals the existence of biases in July evapotranspiration for a subset of CMIP5 models. All models overestimate evapotranspiration in summer, especially in Europe, Africa, China, Australia, western North America, and parts of Amazonia. Biases of the opposite sign (underestimation of evapotranspiration) can be seen in some other regions of the world, notably over parts of the tropics. For most regions, there is a clear correlation between biases in evapotranspiration and precipitation (see precipitation bias in Fig. 4). Produced with namelist_Evapotranspiration.xml.
Besides bias maps, a set of diagnostics to produce basin-scale comparisons of runoff (mrro), evapotranspiration (evspsbl), and precipitation (pr) has also been implemented in the ESMValTool [namelist_runoff_et.xml]. This namelist calculates biases in climatological annual means of the three variables for 12 large-scale catchment areas on different continents and in different climates. For total runoff, catchment-averaged model values are compared to climatological long-term averages of GRDC observations. Due to the incompleteness of these station data, a year-to-year correspondence of data cannot be achieved, so only climatological data are considered, as in Hagemann et al. (2013). Simulated precipitation is compared to catchment-averaged WATCH forcing data based on ERA-Interim (WFDEI) data (Weedon et al., 2014) for the period 1979–2010. Here, the GPCC-corrected WFDEI precipitation data are taken. Note that these data were recently extended until 2013. Evapotranspiration observations are estimated as the difference of the catchment-averaged WFDEI precipitation minus the climatological GRDC river runoff. As an example, Fig. 19 shows biases in runoff coefficient (runoff/precipitation) against the relative precipitation bias for the historical simulation of one of the CMIP5 models (MPI-ESM-LR).

Figure 19. Biases in runoff coefficient (runoff/precipitation) and precipitation for major catchments of the globe. The MPI-ESM-LR historical simulation is used as an example. Even though positive and negative precipitation biases exist for MPI-ESM-LR in the various catchment areas, the bias in the runoff coefficient is usually negative. This implies that the fraction of evapotranspiration generally tends to be overestimated by the model, independently of whether precipitation has a positive or negative bias. Produced with namelist_runoff_et.xml.
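For a single catchment, the quantities plotted in Fig. 19 amount to simple relative biases of the runoff coefficient and of precipitation; a minimal sketch (function name and dict layout are ours):

```python
def runoff_coefficient_bias(model, obs):
    """Relative biases for one catchment, in percent.

    model, obs: dicts with catchment-averaged climatological means of
    'mrro' (runoff) and 'pr' (precipitation) in the same units
    (e.g. mm/day).

    Returns (bias in runoff coefficient runoff/precipitation,
             bias in precipitation), both as percentages of the
    observed value.
    """
    rc_model = model["mrro"] / model["pr"]
    rc_obs = obs["mrro"] / obs["pr"]
    rc_bias_pct = 100.0 * (rc_model - rc_obs) / rc_obs
    pr_bias_pct = 100.0 * (model["pr"] - obs["pr"]) / obs["pr"]
    return rc_bias_pct, pr_bias_pct
```

A systematically negative runoff-coefficient bias across catchments, regardless of the sign of the precipitation bias, indicates an overestimated evapotranspiration fraction, as discussed for MPI-ESM-LR.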

Terrestrial biogeochemistry
A realistic representation of the global carbon cycle is a fundamental requirement for ESMs. In the past, climate models were directly forced by atmospheric CO2 concentrations, but since CMIP5, ESMs are routinely forced by anthropogenic CO2 emissions, the atmospheric concentration being inferred from the difference between these emissions and the ESM-simulated land and ocean carbon sinks. These sinks are affected by atmospheric CO2 and climate change, inducing feedbacks between the climate system and the carbon cycle (Arora et al., 2013; Friedlingstein et al., 2006). Quantification of these feedbacks is critical to estimate the future of these carbon sinks and hence atmospheric CO2 and climate change.
The diagnostics implemented in the ESMValTool to evaluate simulated terrestrial biogeochemistry are based on the study of Anav et al. (2013) and span several timescales: climatological means, intra-annual (seasonal cycle) and interannual variability, and long-term trends [namelist_anav13jclim.xml]. Further extending these routines, the diagnostics presented in Sect. 4.1.1 are also applied here to calculate quantitative performance metrics. These metrics assess how well both the land and ocean biogeochemical components of ESMs reproduce different aspects of the land and ocean carbon cycle, with an emphasis on variables controlling the exchange of carbon between the atmosphere and these two reservoirs. The analysis indicates some level of compensating errors within the models. Selecting several specific diagnostics within the namelist and applying them to additional key variables controlling the land or ocean carbon cycle can help to reduce the risk of missing such compensating errors. Figure 20 shows a portrait diagram similar to Fig. 3 of Anav et al. (2013), but for seasonal carbon cycle metrics against suitable reference data sets (see below).
For land, diagnostics of the land carbon sink net biosphere productivity (nbp) are essential. Although direct observations are not available, nbp can be estimated from atmospheric CO2 inversions (JMA and TRANSCOM) and, on the global scale, combined with observation-based estimates of the oceanic carbon sink (fgco2 from GCP; Le Quéré et al., 2015). In addition to net carbon fluxes, diagnostics for gross primary productivity of land (gpp), leaf area index (lai), vegetation (cVeg), and soil carbon pools (cSoil) are also implemented in the ESMValTool to assess possible error compensation in ESMs. Observation-based gpp estimates are derived from Model Tree Ensemble (MTE) upscaling data (Jung et al., 2009) from the network of eddy-covariance flux towers (FLUXNET; Beer et al., 2010). The leaf area index data set used for evaluation (LAI3g) is derived from the Global Inventory Modeling and Mapping Studies group (GIMMS) AVHRR normalized difference vegetation index (NDVI-017b) data. Finally, cSoil and cVeg are assessed as mean annual values over different large subdomains using the Harmonised World Soil Database (HWSD; Fischer et al., 2008) and the Olson-based vegetation carbon data set (Gibbs, 2006; Olson et al., 1985).

Marine biogeochemistry
Marine biogeochemistry models form a core component of ESMs and require evaluation for multiple passive tracers. The increasing availability of quality-controlled global biogeochemical data sets for the historical period (e.g. Surface Ocean CO2 Atlas Version 2; SOCAT v2, Bakker et al., 2014) provides further opportunity to evaluate model performance on multi-decadal timescales. Recent analyses of CMIP5 ESMs indicate that persistent biases exist in simulated biogeochemical variables, for instance as identified in ocean oxygen (Andrews et al., 2013) and carbon cycle fields derived from CMIP5 historical experiments. Some systematic biases in biogeochemical tracers can be attributed to physical deficiencies within ocean models (see Sect. 4.2), motivating further understanding of coupled physical-biogeochemical processes in the current generation of ESMs. For example, erroneous over-oxygenation of subsurface waters within the MPI-ESM-LR CMIP5 model has been attributed to excess ventilation and vertical mixing in mid- to high-latitude regions.

Figure 20. Relative space-time RMSE calculated from the 1986–2005 climatological seasonal cycle of the CMIP5 historical simulations over different sub-domains for net biosphere productivity (nbp), leaf area index (lai), gross primary productivity (gpp), precipitation (pr), and near-surface air temperature (tas). The RMSE has been normalized by the maximum RMSE in order to obtain a skill score ranging between 0 and 1. A score of 0 indicates poor performance in reproducing the phase and amplitude of the reference mean annual cycle, while a perfect score is equal to 1. The comparison suggests that there is no clearly superior model for all variables. All models have significant problems in representing some key biogeochemical variables such as nbp and lai, with the largest errors in the tropics, mainly because of a too weak seasonality. Similar to Fig. 18 of Anav et al. (2013) and produced with namelist_anav13jclim.xml.
A namelist is provided that includes diagnostics to support the evaluation of ocean biogeochemical cycles at global scales, as simulated by both ocean-only and coupled climate-carbon cycle ESMs [namelist_GlobalOcean.xml]. Supported input variables include surface partial pressure of CO2 (spco2), surface chlorophyll concentration (chl), surface total alkalinity (talk), and dissolved oxygen concentration (o2). These variables provide an integrated view of model skill with regard to reproducing bulk marine ecosystem and carbon cycle properties. Observation-based reference data sets include SOCAT v2 and ETH-SOM-FFN (Landschützer et al., 2014a, b) for surface pCO2, Sea-viewing Wide Field-of-view Sensor (SeaWiFS) satellite data for surface chlorophyll (McClain et al., 1998), climatological data for total alkalinity, and World Ocean Atlas 2005 climatological data (WOA05) with in situ corrections following Bianchi et al. (2012) for dissolved oxygen. Diagnostics calculate contour plots for climatological distributions and interannual or interseasonal (e.g. JJAS) variability, together with the difference between each model and a chosen reference data set. Such differences are calculated after regridding the data to the coarsest grid using an area-conservative interpolation. Monthly, seasonal, or annual frequency time series plots can also be produced, either globally averaged or for a selected latitude-longitude range. Optional extensions include the ability to mask model data to the same coverage as the observations, to calculate anomaly fields, and to overlay trend lines and running or multi-model means. Preprocessing routines are also included to accommodate native curvilinear grids, common in ocean model discretization (see Sect. 4.2.1), along with the ability to extract depth levels from 3-D input fields. An example plot is presented in Fig. 22, showing interannual variability in surface ocean pCO2 as simulated by a subset of CMIP5 ESMs (BNU-ESM, HadGEM2-ES, GFDL-ESM2M), expressed as the standard deviation of de-trended annual averages for the period 1992-2005. As an observation-based reference pCO2 field, ETH-SOM-FFN (1998-2011) is used, which extrapolates SOCAT v2 data using a two-step neural network method. As described in Landschützer et al. (2014a), ETH-SOM-FFN partitions monthly SOCAT v2 pCO2 observations into discrete biogeochemical provinces by establishing common relationships between independent input parameters using a self-organizing map (SOM). Non-linear input-target relationships, as derived for each biogeochemical province using a feed-forward network (FFN) method, are then used to extrapolate observed pCO2.

[Figure caption fragment: ...there is no clear bias common in most ESMs, except in the tropics where models simulate a lower CO2 source than that estimated by the inversion. Reproducing Fig. 6 of Anav et al. (2013) and produced with namelist_anav13jclim.xml.]
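The interannual variability metric used here (the standard deviation of de-trended annual averages) can be illustrated as follows; this is a minimal per-grid-point sketch, not the ESMValTool implementation:

```python
import numpy as np

def interannual_variability(annual_means):
    """Standard deviation of linearly de-trended annual means.

    annual_means: 1-D array with one value per year (e.g. annual-mean
    surface pCO2 at one grid point for 1992-2005)."""
    years = np.arange(len(annual_means), dtype=float)
    # Remove the linear trend before computing variability
    slope, intercept = np.polyfit(years, annual_means, 1)
    detrended = annual_means - (slope * years + intercept)
    return float(np.std(detrended))
```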
A diagnostic for oceanic net primary production (npp) is also implemented in the ESMValTool for the climatological annual mean and seasonal cycle, as well as for interannual variability over the 1986-2005 period [namelist_anav13jclim.xml]. Observations are derived from the SeaWiFS satellite chlorophyll data, using the Vertically Generalized Production Model (VGPM, Behrenfeld and Falkowski, 1997).

Tropospheric aerosols
Tropospheric aerosols play a key role in the Earth system and have a strong influence on climate and air pollution. The global aerosol distribution is characterized by a large spatial and temporal variability, which makes its representation in ESMs particularly challenging (Ghan and Schwartz, 2007). In addition, aerosol interactions with radiation (direct aerosol effect, Schulz et al., 2006) and with clouds (indirect aerosol effects, Lohmann and Feichter, 2005) need to be accounted for. Model-based estimates of anthropogenic aerosol effects are still affected by large uncertainties, mostly due to an incorrect representation of aerosol processes. Myhre et al. (2013) report a substantial spread in simulated aerosol direct effects among 16 global aerosol models and attribute it to diversities in aerosol burden, aerosol optical properties and aerosol optical depth (AOD). Diversities of up to a factor of three in black carbon (BC) burden, related to model disagreements in simulating deposition processes, were also found by Lee et al. (2013). Model meteorology can be a further source of diversity, since it affects atmospheric transport and aerosol lifetime. This in turn relates to the simulated essential climate variables such as winds, humidity and precipitation (see Sect. 4.1). Large biases also exist in simulated aerosol indirect effects (IPCC, 2013) and are often a result of systematic errors in both model aerosol and cloud fields (see Sect. 4.1.6).
To assess current biases in global aerosol models, the aerosol namelist of the ESMValTool comprises several diagnostics to compare simulated aerosol concentrations and optical depth at the surface against station data, motivated by the work of Pringle et al. (2010), Pozzer et al. (2012), and Righi et al. (2013) [namelist_aerosol_CMIP5.xml]. Diagnostics include time series of monthly or yearly mean aerosol concentrations, scatterplots with the relevant statistical indicators, and contour maps directly comparing model results against observations. The comparison is performed considering model and observations collocated in space and time.

Figure 22. Interannual variability in de-trended annual mean surface pCO2 (Pa) for the period 1998-2011 from an observation-based reference product (ETH-SOM-FFN; upper left) and three CMIP5 models (1992-2005). The spatial structure of interannual variability differs between individual CMIP5 ESMs; however, both BNU-ESM and GFDL-ESM2M are able to reproduce pronounced variability in surface ocean pCO2 within the equatorial Pacific, primarily associated with ENSO variability. Produced with namelist_GlobalOcean.xml.
In the current version of ESMValTool, these diagnostics are supplied with observational data from a wide range of station networks, including the Interagency Monitoring of Protected Visual Environments (IMPROVE) and CASTNET networks (North America), the European Monitoring and Evaluation Programme (EMEP, Europe), and the recently established Asian network (EANET). The AERONET data are also available for evaluating aerosol optical depth in continental regions and at a few remote marine locations. For evaluating aerosol optical depth, we also use satellite data, the primary advantage of which is near-global coverage, particularly over the oceans. Satellite data are, however, affected by uncertainties related to the algorithms used to process radiances into the relevant geophysical state variables. The tool currently implements data from the Multi-angle Imaging SpectroRadiometer (MISR, Stevens and Schwartz, 2012), MODIS, and the ESACCI-AEROSOL product (Kinne et al., 2015), which is a combination of ERS2-ATSR2 and ENVISAT-AATSR data. To calculate model biases against satellite data, regridding is performed using a bilinear interpolation to the coarsest grid. Aerosol optical depth time series over the ocean for the period 1850-2010 are shown in Fig. 23 for the CMIP5 models in comparison to MODIS and ESACCI-AEROSOL. Finally, more specific aerosol diagnostics have been implemented to compare aerosol vertical profiles of mass and number concentrations and aerosol size distributions, based on the evaluation work of Lauer et al. (2005) and Aquila et al. (2011). These diagnostics, however, use model quantities that were not part of the CMIP5 data request and are therefore not discussed here.
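The collocation of model output with station observations described above can be sketched schematically. Nearest-grid-cell sampling is used here as a simplified stand-in for the actual collocation procedure, and all function and variable names are illustrative, not ESMValTool code:

```python
import numpy as np

def collocate(model, lats, lons, stations):
    """Sample a model field at station locations (nearest grid cell).

    model: array (ntime, nlat, nlon); lats, lons: 1-D grid coordinates;
    stations: list of (lat, lon, time_index, observed_value) tuples.
    Returns paired (model_value, observed_value) arrays, ready for a
    scatterplot or bias statistics."""
    pairs = []
    for slat, slon, itime, obs in stations:
        i = int(np.argmin(np.abs(lats - slat)))  # nearest latitude index
        j = int(np.argmin(np.abs(lons - slon)))  # nearest longitude index
        pairs.append((model[itime, i, j], obs))
    m, o = np.array(pairs).T
    return m, o

def mean_bias(model_vals, obs_vals):
    """Mean model-minus-observation bias over the collocated points."""
    return float(np.mean(model_vals - obs_vals))
```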

Tropospheric trace gas chemistry and stratospheric ozone
In the past, climate models were forced with prescribed tropospheric and stratospheric ozone concentrations, but since CMIP5 some ESMs have included interactive chemistry and are capable of representing prognostic ozone (Flato et al., 2013). This allows models to simulate important chemistry-climate interactions and feedback processes. Examples include the increase in oxidation rates in a warmer climate, which leads to decreases in methane and its lifetime, or the increase in tropical upwelling (associated with the Brewer-Dobson circulation) in a warmer climate and corresponding reductions in tropical lower-stratospheric ozone as a result of faster transport and less time for ozone production (Butchart et al., 2010; Eyring et al., 2010). It is thus becoming important to evaluate the simulated atmospheric composition in ESMs. A common high bias in the Northern Hemisphere and a low bias in the Southern Hemisphere have been identified in tropospheric column ozone simulated by chemistry-climate models participating in the Atmospheric Chemistry Climate Model Intercomparison Project (ACCMIP), which could partly be related to deficiencies in the ozone precursor emissions.

Figure 23 (caption fragment): ...CMIP5 (2006-2010) simulations, compared with MODIS and ESACCI-AEROSOL satellite data. All models simulate a positive trend in AOD starting around 1950. Some models also show distinct AOD peaks in response to major volcanic eruptions, e.g. El Chichon (1982) and Pinatubo (1991). The models simulate quite a wide range of AODs, between 0.05 and 0.20 in 2010, which largely deviates from the observed values from MODIS and ESACCI-AEROSOL. A significant difference of about 0.05, however, also exists between the two satellite data sets, indicating observational uncertainty. Similar to Fig. 9.29 of Flato et al. (2013) and produced with namelist_aerosol_CMIP5.xml.

Analysis of CMIP5 models with respect to trends in total column ozone shows that the multi-model mean of the models with interactive chemistry is in good agreement with observations, but that significant deviations exist for individual models (Flato et al., 2013). Large variations in stratospheric ozone in models with interactive chemistry drive large variations in lower-stratospheric temperature trends. The results show that both ozone recovery and the rate of GHG increase determine future Southern Hemisphere summer-time circulation changes and are important to consider in ESMs. The namelists implemented in the ESMValTool to evaluate atmospheric chemistry can reproduce the analysis of tropospheric ozone and precursors of Righi et al. (2015) [namelist_righi15gmd_tropo3.xml, namelist_righi15gmd_Emmons.xml] and the study by Eyring et al. (2013) [namelist_eyring13jgr.xml]. The calculation of the RMSE, mean bias, and Taylor diagrams (see Sect. 4.1.1) has been extended to tropospheric column ozone (derived from tro3 fields), ozone profiles (tro3) at selected levels, and surface carbon monoxide (vmrco) (see Righi et al., 2015, for details). This enables a consistent calculation of relative performance for the climate parameters and ozone, which is particularly relevant given that biases in climate can impact on biases in chemistry and vice versa. In addition, diagnostics that evaluate tropospheric ozone and its precursors (nitrogen oxides (vmrnox), ethylene (vmrc2h4), ethane (vmrc2h6), propene (vmrc3h6), propane (vmrc3h8) and acetone (vmrch3coch3)) are compared to the observational data of Emmons et al. (2000). A diagnostic to compare tropospheric column ozone from the CMIP5 historical simulations to Aura MLS/OMI observations (Ziemke et al., 2011) is also included and shown as an example in Fig. 24.
This diagnostic also remaps the data to the coarsest grid using local area averaging in order to calculate differences. For the stratosphere, total column ozone (toz) diagnostics are implemented. As an example, Fig. 25 shows the CMIP5 total column ozone time series compared to the NIWA combined total column ozone database (Bodeker et al., 2005).
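Comparisons such as these are often summarized as area-weighted global averages computed only over grid cells with observational coverage. A minimal sketch of such an average, under the assumption of a regular latitude-longitude grid with cos(latitude) weights (not the ESMValTool implementation):

```python
import numpy as np

def area_weighted_mean(field, lats):
    """Area-weighted global mean with cos(latitude) weights.

    field: array (nlat, nlon), with NaN marking grid cells that have
    no observational coverage; such cells are excluded from the
    average. lats: 1-D array of latitudes in degrees."""
    # Broadcast cos(lat) weights to the full grid
    w = np.cos(np.deg2rad(lats))[:, None] * np.ones_like(field)
    valid = ~np.isnan(field)
    return float(np.sum(field[valid] * w[valid]) / np.sum(w[valid]))
```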

Linking model performance to projections
The relatively new research field of emergent constraints aims to link model performance evaluation with future projection feedbacks. An emergent constraint refers to the use of observations to constrain a simulated future Earth system feedback. It is called emergent because a relationship between a simulated future feedback and an observable element of climate variability emerges from an ensemble of ESM projections, potentially providing a constraint on the future feedback. Emergent constraints can help focus model development and evaluation on the processes underpinning uncertainty in the magnitude and spread of future Earth system change. Systematic model biases in certain forced modes, such as the seasonal cycle of snow cover or the interannual variability of tropical land CO2 uptake, appear to project in an understandable way onto the spread of future climate change feedbacks resulting from these phenomena (Hall and Qu, 2006; Wenzel et al., 2014).
To reproduce the analysis of Wenzel et al. (2014), which provides an emergent constraint on future tropical land carbon uptake, a namelist is included in ESMValTool (v1.0) to perform an emergent constraint analysis of the carbon cycle-climate feedback parameter (γLT) (Friedlingstein et al., 2006) [namelist_wenzel14jgr.xml]. This namelist only considers the CMIP5 ESMs that have provided the necessary output for the analysis; this criterion excludes most CMIP5 models, and only seven ESMs are therefore considered here. The namelist includes diagnostics which analyse the short-term sensitivity of atmospheric CO2 to temperature variability on interannual timescales (γIAV) for models and observations, as well as diagnostics for γLT from the models. The observed sensitivity γIAV is calculated by summing land (nbp) and ocean (fgco2) carbon fluxes, which are correlated with tropical near-surface air temperature (tas). Results from historical model simulations are compared to observation-based estimates of carbon fluxes from the Global Carbon Project (GCP, Le Quéré et al., 2015) and reanalysis temperature data from the NOAA National Climatic Data Center (NCDC, Smith et al., 2008). For diagnosing γLT from the models, nbp from idealized fully coupled and biogeochemically coupled simulations is used, as well as tas from fully coupled idealized simulations (see Fig. 26).

Figure 24 (caption fragment): ...(2005-2012). The values on top of each panel show the global (area-weighted) average, calculated after regridding the data to the horizontal grid of the model and ignoring the grid cells without available observational data. The comparison shows a high bias in tropospheric column ozone in the Northern Hemisphere and a low bias in the Southern Hemisphere in the CMIP5 multi-model mean. Similar to Fig. 13 of Righi et al. (2015) and produced with namelist_righi15gmd_tropo3.xml.
Emergent constraints of this type help to understand some of the underlying processes controlling future projection sensitivity and offer a promising approach to reduce uncertainty in multi-model climate projections.
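The regression step at the core of such an emergent constraint analysis can be sketched as follows. This is a schematic version of the approach of Wenzel et al. (2014): the across-model relation between γLT and γIAV is fitted linearly, and the observed γIAV is propagated through the fit. The published method additionally accounts for the scatter of the models about the fit line when constructing the conditional PDF:

```python
import numpy as np

def emergent_constraint(g_iav_models, g_lt_models, g_iav_obs, g_iav_obs_err):
    """Constrain a long-term feedback with an observable short-term one.

    Fits the across-model linear relation g_LT = a * g_IAV + b, then
    propagates the observed g_IAV (mean +/- 1 sigma) through the fit
    to give a constrained g_LT estimate. Error propagation here only
    reflects the observational uncertainty, not the model scatter."""
    a, b = np.polyfit(g_iav_models, g_lt_models, 1)
    g_lt_mean = a * g_iav_obs + b
    g_lt_err = abs(a) * g_iav_obs_err
    return g_lt_mean, g_lt_err
```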

Model development
As new model versions are developed, standardized diagnostic suites such as the one presented here allow model developers to compare their results against previous versions of the same model or against other models, e.g. the CMIP5 models. Such analyses help to identify aspects of a model that have either improved or degraded as a result of a particular model development. The benchmarking of ESMs using performance metrics (see Sect. 4.1.1) provides an overall picture of the quality of a simulation, whereas process-oriented diagnostics help determine whether improvements in simulation quality occur for the correct underlying physical reasons and point to paths for further model improvement.
The ESMValTool is intended to support modelling centres with quality control of their CMIP DECK experiments and the CMIP6 historical simulation, as well as other experiments from CMIP6-Endorsed Model Intercomparison Projects. A significant amount of institutional resources goes into running, post-processing, and publishing model results from such experiments, and it is important that centres can easily identify and correct potential errors in this process. The standardized analyses contained in the ESMValTool can be used to monitor the progress of CMIP experiments. While the tool is designed to accommodate a wide range of time axes and configurations, and many of the diagnostics may be run on control or future climate experiments, ESMValTool (v1.0) is largely targeted at evaluating AMIP and the CMIP historical simulations.

Integration into modelling workflows
The ESMValTool can be run as a stand-alone tool or integrated into existing modelling workflows. The primary challenge is to provide CF/CMOR-compliant data. Not all modelling centres produce CF/CMOR-compliant data directly as part of their workflow, although we note that more are doing so as the potential benefits are realized. For many groups, conversion to CF/CMOR standards involves significant post-processing of native model output. This may require some groups to run the ESMValTool on their model output after conversion to CF/CMOR, or to create intermediate "CMOR-like" versions of the data. Users who wish to use native model output can take advantage of the flexibility of the reformatting routines (see Sect. 3) to create scripts that convert these data into the CF/CMOR standard. As an example, reformat scripts for the NOAA-GFDL, EMAC and NEMO models are included with the initial release; these scripts convert the native model output for direct use with the ESMValTool. The reformatting capability may provide an alternative to the more expensive and complete "CMORization" process that is usually required to formally publish model data on the ESGF.
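A reformatting script of the kind described above essentially maps native variable names and units onto their CF/CMOR equivalents. The sketch below uses a hypothetical native naming scheme and handles only renaming and unit scaling; real reformat routines (e.g. for EMAC or NEMO) must also fix dimensions, coordinates, and attributes:

```python
# Illustrative mapping from hypothetical native variable names/units
# to their CMOR equivalents. The entries are examples, not taken from
# any specific model's reformat script.
NATIVE_TO_CMOR = {
    # native name: (CMOR name, CMOR unit, scale factor to CMOR unit)
    "t2m": ("tas", "K", 1.0),
    "precip_mm_day": ("pr", "kg m-2 s-1", 1.0 / 86400.0),
}

def cmorize_variable(native_name, values):
    """Rename and rescale one native variable to its CMOR-like form."""
    cmor_name, unit, scale = NATIVE_TO_CMOR[native_name]
    return cmor_name, unit, [v * scale for v in values]
```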

Running the ESMValTool alongside the ESGF
Large international model intercomparison projects such as CMIP stimulated the development of a globally distributed federation of data providers, supporting common data provisioning policies and infrastructures. ESGF is an international open-source effort to establish a distributed data and computing platform, enabling worldwide access to peta- (in the future exa-) byte-scale scientific climate data. Data can be searched via a globally distributed search index, with access possible via HTTP, OpenDAP, and GridFTP. To efficiently run the ESMValTool on CMIP model data and observations alongside the ESGF, the necessary data hosted by the ESGF have to be made locally accessible at the site where the ESMValTool is executed. There are various ways this might be achieved. One possibility is to run the ESMValTool separately at each site holding data sets required by the analysis and then combine the results. However, this is limited by the extent to which calculations can be performed without requiring data from another site. A more practical possibility is running the ESMValTool alongside a large store of replica data sets gathered from across the ESGF, so that all the required data are in one location. Certain large ESGF sites (e.g. DKRZ, BADC, IPSL, PCMDI) provide replica data set stores, and the ESMValTool has been run in such a way at several of these sites.
Replica data set stores do not provide a complete solution, however, as it is impossible to replicate all ESGF data sets at one site, so circumstances will arise when one or more required data sets are not available locally. The obvious solution is to download these data sets from elsewhere in the ESGF and store them locally whilst the analysis is carried out. The indexed search facility provided by the ESGF makes it easy to identify the download URL of such "remote" data sets, and a prototype of the ESMValTool (not included in v1.0) has been developed that performs this search automatically using esgf-pyclient (https://pypi.python.org/pypi/esgf-pyclient). If the search is successful, the prototype provides the user with the URL of each file in the data set, and the user (or system administrator) is then responsible for performing the download. The workflow of this prototype is illustrated in Fig. 27. Fully automated downloading of remote ESGF data sets may be provided by a future version of the ESMValTool, but for now it is preferable for a human to manage the process, given the large size of the files involved. A more complete coupling to the ESGF was originally planned for version 1.0, but was not possible due to an extended period of ESGF downtime.
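The ESGF indexed search mentioned above is exposed through a RESTful interface that esgf-pyclient wraps; a query for a missing data set can be sketched by building the search URL directly. The choice of index node and the facet values below are illustrative:

```python
from urllib.parse import urlencode

def esgf_search_url(base="https://esgf-node.llnl.gov/esg-search/search",
                    **facets):
    """Build an ESGF RESTful search query from facet constraints.

    Facets such as project, experiment, and variable map directly onto
    the ESGF search API parameters; esgf-pyclient issues equivalent
    requests under the hood. The index node URL is one of several
    possible choices."""
    params = dict(facets)
    # Request a JSON (Solr) response rather than the HTML search page
    params.setdefault("format", "application/solr+json")
    return base + "?" + urlencode(sorted(params.items()))
```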

Summary and outlook
The Earth System Model Evaluation Tool (ESMValTool) is a diagnostics package for routine evaluation of Earth system models (ESMs) with observations and reanalysis data, or for comparison with results from other models. The ESMValTool has been developed to facilitate the evaluation of complex ESMs at individual modelling centres and to help streamline model evaluation standards within CMIP. Priorities to date, included in ESMValTool (v1.0) and described in this paper, concentrate on selected systematic biases that were a focus of the European Commission's 7th Framework Programme "Earth system Model Bias Reduction and assessing Abrupt Climate change" (EMBRACE) project, the DLR Earth System Model Validation (ESMVal) project, and other collaborative projects, in particular: performance metrics for selected ECVs, coupled tropical climate variability, monsoons, Southern Ocean processes, continental dry biases and soil hydrology-climate interactions, atmospheric CO2 budgets, ozone, and tropospheric aerosol. We have applied the bulk of the diagnostics of ESMValTool (v1.0) to the entire set of CMIP5 historical or AMIP simulations. The namelist on emergent constraints for the carbon cycle has been additionally applied to idealized carbon cycle experiments and the emission-driven RCP 8.5 simulations.

Figure 26. (a) The carbon cycle-climate feedback (γLT) versus the short-term sensitivity of atmospheric CO2 to interannual temperature variability (γIAV) in the tropics for CMIP5 models. The red line shows the best-fit line across the CMIP5 simulations and the grey-shaded area shows the observed range of γIAV. (b) Probability distribution function (PDF) for γLT. The solid line is derived after applying the interannual variability (IAV) constraint to the models, while the dashed line is the prior PDF derived purely from the models before applying the IAV constraint. The results show a tight correlation between γLT and γIAV that enables the projections to be constrained with observations. The conditional PDF sharpens the range of γLT to −44 ± 14 GtC K−1, compared to −49 ± 40 GtC K−1 for the unconditional PDF. Similar to Fig. 9.45 of Flato et al. (2013) and reproducing the CMIP5 model results from Fig. 5 of Wenzel et al. (2014) with namelist_wenzel14jgr.xml.
ESMValTool (v1.0) can be used to compare new model simulations against CMIP5 models and observations for the selected scientific themes much faster than was previously possible. Model groups who wish to perform this comparison before submitting their CMIP6 historical simulations or AMIP experiments to the ESGF can do so, since the tool is provided as open-source software. In order to run the tool locally, observations need to be downloaded and, for tiers 2 and 3, reformatted with the help of the reformatting scripts that are included. Model output needs to be either in CF-compliant netCDF, or a reformatting routine needs to be written by the modelling group, following the examples provided for EMAC, the GFDL models, and NEMO.
Users of the ESMValTool (v1.0) results need to be aware that it includes only a subset of the broad range of model behaviour that the community aims to characterize, and the results of running the ESMValTool need to be interpreted accordingly. Over time, the ESMValTool will be extended with additional diagnostics and performance metrics. A particular focus will be to integrate additional diagnostics that can reproduce the analysis of the climate model evaluation chapter of IPCC AR5 (Flato et al., 2013) as well as the projections chapter. We will also extend the tool with diagnostics to quantify forcings and feedbacks in the CMIP6 simulations and to calculate metrics such as the equilibrium climate sensitivity (ECS), the transient climate response (TCR), and the transient climate response to cumulative carbon emissions (TCRE) (IPCC, 2013). While the inclusion of these diagnostics is straightforward, the evaluation of processes and phenomena to improve understanding of the sources of errors and uncertainties in models, which we also plan to enhance, remains a scientific challenge. The field of emergent constraints remains in its infancy, and more research is required on how to better link model performance to projections (Flato et al., 2013). In addition, an improved consideration of the interdependency of models in the evaluation of a multi-model ensemble (Sanderson et al., 2015a, b), as well as of internal variability, is required in ESM evaluation.
A critical aspect in ESM evaluation is the availability of consistent, error-characterized global and regional Earth observations, as well as accurate globally gridded reanalyses that are constrained by assimilated observations. Additional or longer records of observations and reanalyses will be used as they become available, with a focus on using obs4MIPs -including new contributions from the European Space Agency's Climate Change Initiative (ESA CCI) -and ana4MIPs data. The ESMValTool can consider observational uncertainty in different ways, e.g. through the use of more than one observational data set to directly evaluate the models, by showing the difference between the reference data set and the alternative observations, or by including an observed uncertainty ensemble that spans the observed uncertainty range (e.g. available for the surface temperature data set compiled for HadISST). Often the uncertainties in the observations are not readily available. Reliable and robust error characterization/estimation of observations is a high priority throughout the community, and obs4MIPs and other efforts that create data sets for model evaluation should encourage the inclusion of such uncertainty estimates as part of each data set.
The ESMValTool will be contributed to the analysis code catalogue being developed by the WGNE/WGCM climate model metrics panel. The purpose of this catalogue is to make the diversity of existing community-based analysis capabilities more accessible and transparent, and ultimately to develop solutions ensuring they can be readily applied to the CMIP DECK and the CMIP6 historical simulation in a coordinated way. We are currently exploring options to interface with complementary efforts, e.g. the PCMDI Metrics Package (PMP, Gleckler et al., 2016) and the Auto-Assess package under development at the UK Met Office. An international strategy for organizing and presenting CMIP results produced by various diagnostic tools is needed, and this will be a priority for the WGNE/WGCM climate metrics panel in collaboration with the CMIP Panel (http://www.wcrp-climate.org/index.php/wgcm-cmip/about-cmip).
This paper presents ESMValTool (v1.0), which allows users to repeat all the analyses shown. Additional updates and improvements will be included in subsequent versions of the software, which are planned to be released on a regular basis. The ESMValTool works on CMIP5 simulations and, given that the CMIP DECK and CMIP6 simulations will be in a similar format, it will be straightforward to run the package on these simulations. A limiting factor at present is the need to download all data to a local cache. This limitation has spurred development that allows the ESMValTool to run alongside the ESGF at one of the data nodes; a prototype that couples the tool to the ESGF exists (see Sect. 5.3). An additional limiting factor is that the model output from all CMIP models has to be mirrored to the ESGF data node where the tool is installed. This is facilitated by providing a listing of the variables and time frequencies used in ESMValTool (v1.0), which comprises a significantly smaller volume than the full data request for the CMIP DECK and CMIP6 simulations. This reduced set of data could be mirrored with priority.
Several technical improvements are required to make the software package more efficient. One current limitation is the lack of parallelization. Given the huge amount of data involved in a typical CMIP analysis, the tool can be highly CPU-time-intensive when run on a single processor; in future releases, the possibility of parallelizing the tool will be explored. Additional development work is ongoing to create a more flexible pre-processing framework, which will add operations such as ensemble averaging and regridding to the current reformatting procedures, as well as an improved coupling to the ESGF. Here, future versions of the ESMValTool will build as much as possible on existing efforts for the backend that reads and reformats data. In this regard, it would be helpful if an application programming interface (API) to this backend could be defined, for example by the WGCM Infrastructure Panel (WIP), allowing for flexible integration of diagnostics across different tools and programming languages in CMIP.
We aim to move ESM evaluation beyond the state of the art by investing in operational evaluation of physical and biogeochemical aspects of ESMs and in process-oriented evaluation, and by identifying the processes most important to the magnitude and uncertainty of future projections. Our goal is to support model evaluation in CMIP6 by contributing the ESMValTool as one of the standard documentation functions and by running it alongside the ESGF. In collaboration with similar efforts, we aim for a routine evaluation that provides comprehensive documentation of broad aspects of model performance and its evolution over time, and to make evaluation results available on a timescale that was not possible in CMIP5. This routine evaluation is not meant to replace further in-depth analysis of model performance, and to date it cannot strongly reduce uncertainties in global climate sensitivity, which remains an active area of research. However, the ability to routinely perform such evaluation will drive the quality and realism of ESMs forward and will leave more time to develop innovative process-oriented diagnostics, especially those related to feedbacks in the climate system that link to the credibility of model projections.

Code availability
ESMValTool (v1.0) is released under the Apache License, version 2.0. The latest version of the ESMValTool is available from the ESMValTool webpage at http://www.esmvaltool.org/. Users who apply the software in presentations or papers are kindly asked to cite this paper along with the software DOI (doi:10.17874/ac8548f0315) and version number. In addition, the ESMValTool will be further developed in a version-controlled repository that is accessible only to the development team. Regular releases are planned for the future. The wider climate community is encouraged to contribute to this effort and to join the ESMValTool development team to contribute additional, more in-depth diagnostics for ESM evaluation. A wiki page that describes ongoing developments is also available. Interested users and developers are welcome to contact the lead author.
The Supplement related to this article is available online at doi:10.5194/gmd-9-1747-2016-supplement.
Acknowledgements. The development of the ESMValTool (v1.0) was funded by the European Commission's 7th Framework Programme, under Grant Agreement number 282672, the "Earth system Model Bias Reduction and assessing Abrupt Climate change (EMBRACE)" project and the DLR "Earth System Model Validation (ESMVal)" and "Klimarelevanz von atmosphärischen Spurengasen, Aerosolen und Wolken: Auf dem Weg zu EarthCARE und MERLIN (KliSAW)" projects. In addition, financial support for the development of ESMValTool (v1.0) was provided by ESA's Climate Change Initiative Climate Modelling User Group (CMUG). We acknowledge the World Climate Research Program's (WCRP's) Working Group on Coupled Modelling (WGCM), which is responsible for CMIP, and we thank the climate modelling groups for producing and making available their model output. For CMIP the US Department of Energy's Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. We thank Björn Brötz (DLR, Germany) for his help with the release of the ESMValTool and Clare Enright (UEA, UK) for support with development of the ocean biogeochemistry diagnostics. We are grateful to Patrick Jöckel (DLR, Germany), Ron Stouffer (GFDL, USA) and to the two anonymous referees for their constructive comments on the manuscript.
The article processing charges for this open-access publication were covered by a Research Centre of the Helmholtz Association.
Edited by: S. Easterbrook