Practice and philosophy of climate model tuning across six U.S. modeling centers

Model calibration (or "tuning") is a necessary part of developing and testing coupled ocean-atmosphere climate models regardless of their main scientific purpose. There is an increasing recognition that this process needs to become more transparent for both users of climate model output and other developers. Knowing how and why climate models are tuned and which targets are used is essential to avoiding possible misattributions of skillful predictions to data accommodation and vice versa. This paper describes the approach and practice of model tuning for the six major U.S. climate modeling centers. While details differ among groups in terms of scientific missions, tuning targets and tunable parameters, there is a core commonality of approaches. However, practices differ significantly on some key aspects, in particular, in the use of initialized forecast analyses as a tool, the explicit use of the historical transient record, and the use of the present-day radiative imbalance vs. the implied balance in the pre-industrial as a target.

1 Introduction

Simulation has become an essential tool for understanding processes in the Earth system, interpreting observations and for making predictions over short (weather), medium (seasonal) and long (climate) terms. The complexity of this system is evident in the myriad processes involved (such as the microphysics of cloud nucleation, land surface heterogeneity, convective plumes, and ocean mesoscale eddies) and in the dynamic views provided by remote sensing. This complexity and the wide range of scales that need to be incorporated imply that simulations will necessarily include approximations to well-understood physics and empirical formulations for unresolved effects. The simulations are neither a straightforward encapsulation of some well-known theory, nor are they laboratory experiments probing the real world, though they have features of both (Schmidt and Sherwood, 2014). Despite this, climate and weather simulations have demonstrated useful predictive skill across many emergent diagnostics (Reichler and Kim, 2008; Flato et al., 2013; Bosilovich, 2013). Note that we distinguish fields or statistics in the model that arise from the interactions of multiple physical effects ("emergent properties") from those that are closely related to single processes or parameterizations.
Since the pioneering work in climate modeling in the mid-20th Century (e.g. Phillips, 1956; Manabe and Bryan, 1969; Hansen et al., 1983), climate models have increased enormously in scope and complexity, going from relatively crude discretizations of atmospheric dynamics to now far more detailed atmospheres, combined with ocean, sea ice, carbon cycles, and interactive composition in the atmosphere including chemistry and multiple aerosol species. As that complexity has grown, more processes are explicitly included and the parameterizations are pushed to a more detailed (and more fundamental) level, allowing for better constraints on unknown parameters. However, at the same time, the process of model development has become more convoluted and now involves many more components than it did originally. This has led, somewhat predictably, to an unfortunate reduction in transparency over time. It is worth expanding on why this matters. First, model development involves expert judgments which are inevitably subjective, and with different choices there would be differences in emergent responses. For instance, in the MPI model, Mauritsen et al. (2012) show that equally valid, but distinct, tunings can impact model sensitivity. This kind of behavior should therefore be reported more widely to improve the assessment of the robustness of specific responses. Second, models used as part of international assessment projects (such as the Coupled Model Intercomparison Project, Phase 5 (CMIP5)) are increasingly being weighted or subset in order to refine predictions. If the skill measure that is used to filter or weight models has been tuned for in some cases rather than in others, the subset or weighted average will be biased towards models that tuned for that measure over those that did not, and that may not correspond to better physics nor better predictions (Knutti et al., 2010).
Thus it has become increasingly clear that a more transparent process is necessary. A survey of modeling groups involved in CMIP5 (Hourdin et al., 2016, hereafter H16) provides a good background on tuning practices and makes a plea for better coordination of documentation of these issues. This paper is a more detailed follow-up for a subset of climate models associated with laboratories in the US (three of which were surveyed by H16, three of which were not). The six modeling centers that are the focus of this paper have all developed and maintain Earth system models that (at minimum) have a dynamic atmosphere, coupled ocean components and are global in scope. Additional components (such as ice sheets, the carbon cycle, atmospheric chemistry and aerosols) are also common. While two of the models discussed (NCEP CFS and GEOS-5 from NASA GMAO) are primarily used for short-term (daily to seasonal) predictions, there is sufficient overlap with the models focused on longer-term problems (decadal to multi-decadal periods) to warrant describing them all as 'climate models' below.
2 Why is climate model tuning necessary?
Climate and weather models consist of three levels of representation of physical processes: fundamental physics (such as conservation of energy, mass and momentum), approximations to well-known physical theories (the discretization of the Navier-Stokes equations, broadband approximations to line-by-line radiative transfer codes, etc.) and empirical approximations ("parameterizations") needed to match the phenomenology of unresolved or poorly understood sub-gridscale or excluded processes (Hourdin et al., 2016). The degree of approximation and complexity in the empirical parameterizations varies greatly across models, processes, and resolved scales. Many parameterizations employ an underlying paradigm that makes use of well known or well observed processes so that the fundamental dependence on the atmospheric state is approximated, albeit only at a phenomenological level.
Parameters in climate models vary widely in their physical interpretation. Some are well-determined physical values, such as the Coriolis parameter, the acceleration due to gravity, or the Stefan-Boltzmann constant. Some, such as reaction rates for chemical or microphysical processes, may be inferred from laboratory or field measurements (with some uncertainty).
Some emerge from the construction of parameterizations but do not correspond directly to well-defined physical processes, e.g., "erosion rates" for clouds (Tiedtke, 1993). Others emerge from the characterization of model sub-grid scale variations in the parameterizations, such as the "critical relative humidity" for cloud formation (Schmidt et al., 2006), or equivalent mixing rates for turbulent transport, and may be loosely approximated from either observations or higher resolution models (e.g. Siebesma and Cuijpers, 1995).
Individual parameterizations for a specific phenomenon are generally calibrated to process-level data using high resolution modeling and/or field campaigns to provide constraints. For instance, boundary layer parameterizations might be tuned to well-observed case studies (Pithan et al., 2016) or to DICE (http://appconv.metoffice.com/dice/dice.html).
However, in some cases, even when parameter values are well-constrained physically or experimentally, simulations can often be improved by choosing values that violate these constraints. For example, Golaz et al. (2011) find that cooling by interactions between anthropogenic aerosols and clouds in GFDL's AM3 model depends strongly on the volume-mean radius at which cloud droplets begin to precipitate. By altering CM3 after its CMIP5 configuration was established, as described in Donner et al. (2011), Golaz et al. (2013) found that its 20th-Century temperature increase could be simulated more realistically (a larger increase) using values for this threshold drop size smaller than observed (Pawlowska and Brenguier, 2003; Suzuki et al., 2013) (c.f. Section 4.1 below). Another example is the variation in the effective diffusion constants for momentum, moisture and temperature, which has been used to decrease large root-mean-square errors in tropical winds in the NCEP model.
There remain a number of parameters that are not strongly constrained by process-level observations or theory but that nonetheless have large impacts on emergent properties of the simulation. It is these additional degrees of freedom that are used to "tune" or calibrate the emergent properties of the model against a selected set of target observations. The decisions on what to tune, and especially what targets to tune for, undoubtedly involve value judgments (Hourdin et al., 2016; Winsberg, 2012; Schmidt and Sherwood, 2014; Intemann, 2015). Notably, there isn't any obvious consensus in the modeling community as to what extent parameter choices should be guided by conforming to process-level knowledge as opposed to optimizing emergent behaviors in climate models. At many centers, the philosophy for the most part has been to tune parameters in ways that make physical sense, with the expectation that in the long run that should be the best strategy. Increasing skill in climate models over time does support this approach (Reichler and Kim, 2008).
Additionally, climate simulations depend not only on parameter choices within an established model structure but also on the structural choices made in the parameterization itself. Examples include experimentation with alternate closures and triggers for the cumulus parameterization at GFDL during the development of GFDL AM3 (Benedict et al., 2013), and evaluation of two candidate parameterizations for cloud macrophysics and convection in NCAR CESM2 (Bogenschutz et al., 2013; Park, 2014). Theoretically, all such structural choices could be coded to vary with a parameter, and so there is no strong theoretical distinction between parameter and structural variations. In practice, however, perturbed physics ensembles (PPEs) do not span as wide a range of structural variations as multi-model ensembles of opportunity (Yokohata et al., 2012). These examples suggest that it can be hard to distinguish model tuning from model development (writ large) in practice, since both happen concurrently.
For our purposes, we define tuning as a change occurring within a fixed structural framework that does not involve adding new physics.
Targets for possible tuning fall into three classes. First there are targets that need to be satisfied in order for useful numerical experiments to be performed in the first place (usually related to the equilibration of model components with long timescales).
The most important of these is a requirement of near energy balance at the top of the atmosphere and surface in an initial state of a coupled model to prevent temperature drifts over time. Strictly speaking this is not tuning to an observed quantity, but rather a tuning to a situation that is approximately inferred to have held in the "pre-industrial" (PI). Note that while the concept of a pre-industrial period is a little elusive (Hawkins et al., 2017), we refer to conditions around the mid-19th Century (circa 1850). To avoid dealing with the lack of sufficient observational data from the 19th Century, some modeling groups (see below) alternatively choose to tune to present-day (PD) conditions, including an energy imbalance at the top-of-the-atmosphere (TOA) as inferred from ocean observations today (Loeb et al., 2009). Another tuning target is sometimes referred to as the radiative forcing perturbation (RFP) (or the effective radiative forcing) and is the change in net flux which occurs in a multi-year integration with specified, climatological SSTs when emissions (primary aerosols and short-lived gases), long-lived greenhouse gas concentrations (carbon dioxide, nitrous oxide, methane, and the halocarbons) and solar irradiance are changed from pre-industrial to present-day values.
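Operationally, the RFP described above is just a difference of area-weighted global-mean net TOA fluxes between two fixed-SST integrations. A minimal sketch of that bookkeeping, in which the grid, the function names, and the uniform toy flux values are all hypothetical and chosen only for illustration:

```python
import numpy as np

def global_mean(field, lats):
    """Area-weighted global mean on a regular lat-lon grid (weights ~ cos(lat))."""
    w = np.cos(np.deg2rad(lats))
    return float(np.sum(field * w[:, None]) / (w.sum() * field.shape[1]))

def radiative_forcing_perturbation(net_toa_pd, net_toa_pi, lats):
    """RFP: change in global-mean net TOA flux (W m^-2) between fixed-SST
    runs with present-day vs. pre-industrial forcing agents."""
    return global_mean(net_toa_pd, lats) - global_mean(net_toa_pi, lats)

# Toy example: spatially uniform fluxes, so the RFP is just their difference.
lats = np.linspace(-89.5, 89.5, 180)
pi_run = np.full((180, 360), 0.1)   # near-balanced pre-industrial net flux
pd_run = np.full((180, 360), 1.9)   # same run with present-day forcings
print(round(radiative_forcing_perturbation(pd_run, pi_run, lats), 2))  # 1.8
```

In practice the two integrations share climatological SSTs, so fast atmospheric adjustments are included in the diagnosed flux change, which is why the quantity is also called an effective radiative forcing.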
A second class of tuning targets are well-characterized climatological observations, which might include annual means, average seasonal cycles or interannual variance. A third potential class are observations of transient events (at daily to centennial scales) or trends. Some observational targets have important (and sometimes unrecognized) structural uncertainties and therefore any tuning to those targets risks over-fitting the model to imperfect data, potentially reducing skill in "out-of-sample" predictions (those for which the evaluation data either did not exist at the time of the prediction or were not used in model development or tuning). This is a particular problem for transient observations such as estimates of early 20th Century temperature changes (Thompson et al., 2008; Richardson et al., 2016), pre-1979 sea ice extent (Meier et al., 2012; Walsh et al., 2016), pre-1990 ocean heat content change (Levitus et al., 2000; Church et al., 2011), or water vapor trends (Dessler and Davis, 2010), which have all been corrected in recent years as non-climate artifacts in the raw observations have been found and adjusted for. In contrast, many climatologies over the satellite era are robust metrics whose estimates over any fixed period have not changed appreciably as understanding of the observations evolved. Models equipped for data assimilation or that are used for operational forecasts have the additional possibility of tuning parameters to improve skill scores in those forecasts at multiple time scales, whether 6 hours, daily, weekly, or even many months for seasonal forecasts of, for instance, the state of the tropical Pacific.
We note here a distinction between fields that are closely monitored during the model development process (many examples are given below) and specific tuning targets. Monitored diagnostics tend to be complex emergent diagnostics that do not depend in any simple way on adjustable parameters, and thus are difficult (or impractical) to tune for. For example, the range of pre-industrial global temperatures in CMIP5, [12.0, 14.8] °C, is noticeably wider than the uncertainty in that quantity (±0.5 °C; Jones et al., 2012). Changes in such a monitored field are kept track of, but unless the values stray beyond a nominal acceptable range no action to change the code would be taken. If the values do stray, the principal response is to examine in detail what has happened. The limitations of tuning are well-known (Mauritsen et al., 2012; Schmidt and Sherwood, 2014; Hourdin et al., 2016).
First, it provides remarkably little leverage in improving overall model skill once a reasonable part of parameter space has been identified; for instance, tuning has been unable to resolve the persistent so-called "double ITCZ" problem (Lin, 2007; Oueslati and Bellon, 2015). Second, improvements in one field are often accompanied by degradation in others, thus the final choice of parameters involves subjective judgments about the relative importance of different aspects of the simulations. For example, the Australian contribution to CMIP5 (ACCESS v.1) used a version of the UK Met Office atmosphere model with small modifications to mitigate problems in the tropics and Southern Hemisphere that affect Australian forecasts, at the expense of performance elsewhere (Bi et al., 2013). There are additionally many obvious biases in model simulations that persist across model generations, indicating that these aspects are stubbornly resistant to development changes in the model (including the tuning) (Masson and Knutti, 2011). Most discussions of tuning deal with explicit calibration of parameters to match a target observation. However, analysis of the CMIP3 ensemble (Kiehl, 2007; Knutti, 2008) suggested that there may have been some kind of implicit tuning related to aerosol forcing and climate sensitivity among a subset of models, with models with higher sensitivity having a tendency to have stronger (more negative) aerosol forcing (this situation was less evident in CMIP5 (Forster et al., 2013)). Both of these correlations, however, are rather low (CMIP3: 0.24; CMIP5: 0.19) and so do not provide evidence for a general tuning related to forcing and sensitivity. That models with accurate historical simulations must trade off forcing and sensitivity is not necessarily evidence they have been tuned to do so.
Since the CMIP3 models' aerosol forcings were not explicitly tuned to enforce the observed historical trend in temperature, the mechanisms that might explain this observation are unclear. With further data on the current top-of-atmosphere radiative imbalance (Allan et al., 2014; von Schuckmann et al., 2016), this issue will however need to be revisited for the latest generation of models. Model selection can also act as an implicit form of tuning, even though this might be seen by others as simple model development. In deciding between two versions of a dynamical core or convection parameterizations, skill in El Niño/Southern Oscillation (ENSO) variability or reductions of ocean drifts may play an important role. Conceivably, a modeling center may decide not to release or use a particular version because it fails to meet certain criteria perceived to be essential, though more generally this will simply spur further development. One candidate criterion would be a realistic simulation of the 20th Century; however, the wide spread in 20th Century trends in the CMIP5 ensemble (Forster et al., 2013, Fig. 7) indicates that this has not been generally applied (though see below for more detailed discussion of the use of historical changes).
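The weakness of those ensemble correlations is worth quantifying: with r ≈ 0.24, aerosol forcing would explain only about 6% of the inter-model variance in sensitivity. The per-model values below are invented purely for illustration (they are not CMIP numbers); only the sign convention, higher sensitivity paired with more negative forcing, follows the text:

```python
import numpy as np

# Hypothetical (aerosol forcing, climate sensitivity) pairs for an ensemble.
# These numbers are made up for illustration, not actual CMIP values.
aerosol_forcing = np.array([-1.2, -0.8, -1.5, -0.6, -1.1, -0.9, -1.4, -0.7])
climate_sens    = np.array([ 3.1,  2.6,  3.8,  2.4,  2.9,  3.3,  3.0,  2.7])

# Pearson correlation across the ensemble; r**2 is the explained variance.
r = np.corrcoef(aerosol_forcing, climate_sens)[0, 1]
print(f"r = {r:.2f}, variance explained = {r**2:.1%}")
```

A negative r here corresponds to the reported tendency (more negative forcing with higher sensitivity); even a perfectly diagnosed r of 0.24 would leave ~94% of the spread unexplained, which is the basis for the "no general tuning" conclusion.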
Within climate models, there is always a choice as to whether to tune a specific component (such as the atmosphere, sea ice, land surface or ocean) with tightly constrained boundary conditions, or to tune the coupled model as a whole. In practice, both approaches are taken, though the relative importance and computational resources available vary across groups. Tuning components is generally fast and efficient, but does not necessarily prove robust when those components are coupled. However, coupled models take a very long time to equilibrate and their quasi-stable states may be too far from the observed climate to be useful. Assuming that models conserve energy appropriately, all control runs will eventually drift to a quasi-steady state with a near-zero energy balance at the TOA and at the surface of the ocean. However, the realism of the final state is not guaranteed and indeed, given the long time constants in the ocean, might require many thousands of years of integration to get to the wrong answer. Thus a balance must be struck between approaches.

3 Specific practices
Each of the six U.S. modeling centers described below has specific missions and foci that drive different aspects of its modeling.
For instance, NASA GMAO and NCEP have operational data assimilation products for short-term weather, longer seasonal forecasts and reanalyses that form the core of their tasks. NCAR CESM, GFDL and NASA GISS have more long-term climate change issues at the forefront of their research, but each with a different mandate: respectively, to be a community model, to advance NOAA's mission goal to understand and predict changes in climate, and to help interpret and use NASA remote sensing products. The DOE ACME project has been tasked with a very specific role to serve DOE's energy planning and computational resource needs.
For each modeling group, we describe the principal targets and tuning strategies for their atmosphere-only GCM, their coupled ocean-atmosphere GCM, and additional earth system components as relevant. The specific models referred to are described in Table 1. We outline the commonalities of approaches and key differences in Section 4, and then discuss the implications and ways forward in Section 5.

DOE ACME

ACME v1 includes new ocean and sea-ice components (Ringler et al., 2013) as well as updated atmosphere and land components. ACME v1 is being developed at two horizontal resolutions. A low-resolution configuration includes an atmosphere at approximately 1° and an ocean with varying resolution between 60 and 30 km. The high-resolution configuration is based on a 1/4° atmosphere and an eddy-permitting ocean with resolution between 18 and 6 km.
Tuning is performed iteratively at the component levels and on the fully coupled system. Most of the component level tuning takes place in the atmosphere. The atmosphere is primarily tuned using short simulations (2 to 10 years) with climatological SSTs and sea ice boundary conditions, either for present-day (circa 2000) or pre-industrial conditions. The tuning targets a near zero TOA radiation balance for 1850 by adjusting cloud-related parameters. Overall simulation fidelity is another important aspect of the tuning process, with the goal of minimizing errors in important climatological fields such as sea level pressure, short- and long-wave cloud radiative effects, precipitation, near surface land temperature, surface wind stress, 300 hPa zonal wind, aerosol optical depth, zonal mean temperature and relative humidity. The magnitude of the aerosol indirect effects is also evaluated and adjusted if deemed to be inconsistent with the observed historical warming (specifically if it has a magnitude greater than 1.5 W m−2). Cess climate sensitivity (Cess et al., 1990) is monitored using idealized SST+4K simulations. To date, there have been no situations where the estimated sensitivity was deemed to be unacceptable based on expert judgment.
Should such a situation arise, the model would receive extra scrutiny to better understand what may have caused the climate sensitivity to change compared to previous developmental versions. The radiative imbalance in the 21st Century with observed SST must be positive, with a target range of 0.5 to 1 W m−2.
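The multi-field fidelity goal described above amounts to minimizing a combined, area-weighted error across climatological fields with incompatible units. One way such a score might be composed is sketched below; the normalization constants, field names, and toy data are all chosen arbitrarily for illustration and do not reflect any center's actual metric:

```python
import numpy as np

def weighted_rmse(model, obs, lats):
    """Area-weighted RMSE of a 2-D climatological field vs. observations."""
    w = np.cos(np.deg2rad(lats))[:, None] * np.ones_like(model)
    return float(np.sqrt(np.average((model - obs) ** 2, weights=w)))

def combined_score(fields):
    """Mean of per-field RMSEs, each normalized by a reference error so that
    different units (hPa, W m^-2, mm/day, ...) become comparable.
    `fields` maps name -> (model, obs, lats, reference_rmse)."""
    scores = [weighted_rmse(m, o, la) / ref for m, o, la, ref in fields.values()]
    return sum(scores) / len(scores)

# Toy usage with two invented fields; a score of 1.0 means "matches reference".
lats = np.linspace(-89.5, 89.5, 90)
obs = np.zeros((90, 180))
fields = {
    "slp":   (obs + 2.0, obs, lats, 2.0),   # RMSE 2 hPa vs. reference 2 hPa
    "swcre": (obs + 5.0, obs, lats, 10.0),  # RMSE 5 W m^-2 vs. reference 10
}
print(combined_score(fields))  # (2/2 + 5/10)/2 = 0.75
```

Because improvements in one field often degrade another, the weights (here, the reference errors) encode exactly the kind of subjective judgment about relative importance that the text discusses.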

Most of the tuning is performed using the low-resolution atmosphere. However, cloud parameterizations need to be retuned separately for the high-resolution atmosphere. Because of the cost of the high-resolution atmosphere, it is more effective to use short hindcast simulations (Ma et al., 2015) to first evaluate the parameter space.
Tuning is also performed with the fully coupled system using perpetual pre-industrial or present-day forcing. Ocean and sea-ice initial conditions are either from rest (Zweng et al., 2013) or derived from separate CORE experiments. Additional simulations are also conducted to estimate the equilibrium climate sensitivity.

GFDL
In developing the GFDL atmospheric model AM3 and coupled model CM3, parameter choices and some structural choices as to how to deploy parameterizations were guided by multiple goals. In addition to choosing parameters within plausible ranges suggested by observations, experiments, theory, or higher-resolution modeling, these goals included simulating thermodynamic and dynamical fields, as well as TOA regional short-wave and long-wave fluxes, as realistically as possible. The global and annual mean net TOA radiative flux in integrations with specified, present-day (1981-2000) sea surface temperatures (SSTs) was tuned to a slight positive imbalance (0.8 W m−2) within observational estimates (Loeb et al., 2009). Particular attention was also given to surface properties important for successfully coupling AM3 to models for sea ice (high-latitude surface energy balance) and ocean (wind stresses and implied ocean heat transports). Many of the changes in parameters from earlier GFDL models or nominal values in literature describing the model parameterizations are summarized in Donner et al. (2011). For example, the momentum source in the Alexander and Dunkerton (1999) parameterization for gravity wave drag was chosen based on the stratospheric circulation it yielded. To facilitate optimizing input parameters to this parameterization, the orographic wave parameterization was limited in the vertical extent of its application.
Additionally, the autoconversion threshold (volume-mean radius at which cloud droplets begin to precipitate), cloud erosion scales, and ice fall speeds in the Rotstayn (1997) and Tiedtke (1993) cloud microphysics and macrophysics parameterizations were tuned to improve regional patterns of TOA short-wave and long-wave fluxes, TOA short-wave and long-wave cloud radiative effects, the Earth's energy imbalance, precipitation, and implied ocean heat transports.
The choice of a closure based on convective available potential energy (CAPE) for the Donner (1993) deep convection parameterization was one such structural decision. Aspects of AM3 related to variability, including stationary wave patterns, relationships between the Niño-3 index and regional precipitation, relationships between the Northern Hemisphere Annular Mode and regional pressure and temperature patterns, tropical cyclones, and the tropical wave spectrum, were monitored during AM3 development. Optimal tunings for mean state and variability in some cases conflicted. In AM3, this was particularly evident for the tropical wave spectrum, including the Madden-Julian Oscillation. Deep convective closures and triggers which produced a realistic mean simulation did so at the expense of the tropical wave spectrum (Benedict et al., 2013).
AM3 includes prognostic aerosols based on emissions, transport, chemical processes, and dry and wet removal. An important aerosol tuning parameter is the strength of wet scavenging. In-cloud condensate fractions were prescribed to provide a reasonable simulation of the global mean and regional distribution of aerosol optical depth. These condensate fractions maintain relative solubilities among the various aerosols in AM3.
AM3 is the first GFDL model to include cloud-aerosol interactions. At the outset of this aspect of AM3 development, estimates of climate forcing by cloud-aerosol interactions ranged to -3 W m−2 (Lohmann and Feichter, 2005), and GFDL's AM2, modified to include cloud-aerosol interactions, yielded an associated climate forcing of -2.3 W m−2 (Ming et al., 2005). Since climate forcing by greenhouse gases is around 3 W m−2, the most extreme estimates of climate forcing by cloud-aerosol interactions would not be compatible with observed historical temperature increases. Given the approximate treatments of cloud-aerosol interactions in climate models, the possibility that some parameter combinations or formulations could lead to these extreme estimates could not be ruled out during model development. Indeed, Golaz et al. (2011) show that the magnitude of climate forcing by cloud-aerosol interactions depends strongly on the volume-mean drop radius at which cloud droplets begin to precipitate. Golaz et al. (2011) also find that assumptions regarding the sub-grid distribution of updraft speeds are an important control, though exerted through re-tuning for radiative balance as the distribution of updraft speeds is changed. The effective cloud-droplet radius and cloud droplet number concentration are both central to climate forcing by cloud-aerosol interactions and vary strongly with aerosol size distribution (Feingold, 2003; McFiggans et al., 2006). Ming et al. (2006), which is used to parameterize aerosol activation in AM3, supports a range of aerosol size distributions.
The TOA RFP (c.f. Section 2) was monitored during AM3 development, as was the Cess climate sensitivity. A configuration for which the ratio of the RFP to the Cess sensitivity was about 15% less than the value for AM2 (The GFDL Global Atmospheric Model Development Team, 2004) was selected for AM3. This imposes a bound on RFP which depends on AM3 sensitivity and the forcing/sensitivity ratio in AM2. Within the limitations of the Cess sensitivity, this ratio was used to compare, without coupling, the changes in global-mean surface temperature that might be expected from CM3 relative to CM2.
The basis for using this ratio as a tuning target is past experience that CM2 generally simulates historical temperature change reasonably well. Without coupling, this target, while not constraining forcing or sensitivity independently, aimed to exploit knowledge from earlier models about their joint behavior associated with realistic simulation of historical temperature change. The AM3 RFP is 0.99 W m−2, with the aerosol contribution about -1.6 W m−2. The coupled model CM3 was not further constrained with respect to its simulation of 20th Century climate change. Although the ratio of RFP to Cess sensitivity for AM3 is only about 15% less than for AM2, 20th Century temperature increases in CM3 are less than observed, while CM2.1 temperature increases are greater than observed. The uncoupled ratio is clearly limited in its ability to fully indicate temperature response in the coupled model. The CM3 coupled model was initialized from present-day ocean conditions and allowed to adjust to a pre-industrial, quasi-equilibrium state. Sea-ice albedos were generally more realistic than those used in CM2.1; the change was made possible by CM3's improved realism in regions with sea ice.
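The ratio-based target above can be illustrated with simple arithmetic. Only the AM3 RFP (0.99 W m−2) comes from the text; the Cess sensitivities and the AM2 RFP below are invented placeholders chosen to reproduce a ratio roughly 15% below AM2's:

```python
# Sketch of the RFP-to-Cess-sensitivity ratio comparison used as a target.
# Only the AM3 RFP (0.99 W m^-2) is from the text; other values are invented.
rfp_am3, cess_am3 = 0.99, 0.54   # Cess sensitivity in K per W m^-2 (placeholder)
rfp_am2, cess_am2 = 1.25, 0.58   # AM2 values (placeholder)

ratio_am3 = rfp_am3 / cess_am3
ratio_am2 = rfp_am2 / cess_am2

# Fractional change of the uncoupled forcing/sensitivity ratio relative to AM2.
change = ratio_am3 / ratio_am2 - 1
print(f"{change:+.0%}")  # about -15% with these placeholder numbers
```

Because the ratio constrains forcing and sensitivity only jointly, two configurations with very different individual values can share the same ratio, which is consistent with the text's observation that CM3 and CM2.1 bracket the observed 20th Century warming despite similar ratios.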
Note that the above description applies only to AM3/CM3. For CM4 (Zhao et al., 2016), development is ongoing, and the specific tuning practices will be documented in future papers.

NASA GISS
Tuning strategies in GISS ModelE2 are described in Schmidt et al. (2014). In the atmosphere-only simulation under 1850 pre-industrial conditions, the parameters in the cloud schemes that control the threshold relative humidity and the critical ice mass for condensate conversion are used to achieve global radiative balance and a global mean albedo of between 29 and 30%.
Additionally, parameters in the gravity wave drag scheme are chosen to optimize the simulation of the lower stratospheric seasonal zonal wind field and the minimum tropopause temperature. This also impacts high-latitude sea level pressure. In ocean-only simulations as described in the CORE protocol (Griffies et al., 2009), mixing parameters are chosen to minimize drift from observations in the basin-averaged temperature and salinity.
Upon coupling the ocean and atmosphere models, there is an initial drift to a quasi-stable equilibrium which is judged on overall terms for realism, including the overall skill in the climatological metrics for zonal mean temperature, surface temperatures, sea level pressure, short- and long-wave radiation fluxes, precipitation, and lower stratospheric water vapor. In transient simulations, the aerosol and cloud microphysics have free latitude to produce whatever forcing is calculated.
In simulations with interactive atmospheric composition, there are two specific tunings for ozone chemistry: the photolysis rate in the atmospheric window region for incoming solar radiation, and the temperature threshold for the formation of polar stratospheric clouds (and hence the heterogeneous chemistry associated with them). The former is tuned so that N2O and O3 fields in the lower tropical stratosphere match observations, while the latter can be used to ensure that the polar ozone hole timing is correct despite potential biases in polar vortex temperatures. With respect to dust aerosols, emissions are tuned so that the model can match retrieved aerosol optical depths for the present-day (Miller et al., 2006); similarly, tuning of the lightning parameterization (and associated source for NOx) is done against modern observations of flash rate and tropospheric ozone amounts. For the E2.1 model and subsequent CMIP6 submissions, all tuning is being done with pre-industrial and present-day fully interactive simulations (including chemistry, aerosols and indirect effects), and the non-interactive versions will use the composition derived from those simulations and the same tuning.

NASA GMAO
The Goddard Earth Observing System (GEOS) model is currently in use at the NASA GMAO at a wide range of resolutions. Key tuning targets are the daily TOA long-wave and short-wave distributions, which are monitored to ensure that performance does not degrade through the development or tuning process. The contribution of cloudy effects is approached by adjusting the parameters that describe the cloud radiative effect (cloud particle size and autoconversion rates). The clear-sky portion of the TOA fluxes is matched by tuning the parameters that govern the mean atmospheric humidity and the surface albedo over ice-covered surfaces. The free-atmosphere specific humidity is quite sensitive to the "critical relative humidity" specified in the cloud macro-physical scheme (Molod, 2012); although this parameter is largely dictated by observed subgrid-scale moisture variations, the fine tuning and the details of the vertical profile are tuned to match a consensus of reanalysis estimates of specific and relative humidity and SSM/I total precipitable water.
The boreal winter mean circulation as compared to reanalyses (as seen in the 200 hPa eddy height or the 300 hPa velocity potential) was found to be quite sensitive to the intensity of the hydrological cycle, largely dictated by the rates of re-evaporation or sublimation of rain and snow. These parameters are chosen so as to ensure agreement of the seasonal mean circulation with reanalysis, of the seasonal mean precipitation with observations from GPCP and TRMM, and of the cloud radiative effects with CERES and, at the surface, with SRB. The behavior of the atmosphere-ocean coupled system is particularly sensitive to the geographical distribution of the surface shortwave cloud radiative forcing in the tropics.
Additional observations of aerosol optical depth (from MODIS) and other chemical constituents (e.g. CO from the Tropospheric Emission Spectrometer (TES)) are used in the GMAO to validate the simulated turbulent and convective transport. One resolution-dependent parameter is the critical relative humidity in the cloud scheme (Molod, 2012). The second of the resolution-dependent parameters is the so-called Tokioka limit in the convective parameterization. Based on the expectation that the larger convective motions are resolved explicitly, and on evidence from global mesoscale model results, the parameters that govern the stochastic Tokioka limit are changed so as to restrict parameterized deep convection at higher resolutions.
At the higher resolutions (25 km and finer) the tuning parameters are chosen based on short-term forecasts and on the behavior of the model as part of the data assimilation system. Forecast skill scores, the fidelity of the spin-up of tropical cyclones and the innovation vector for data assimilation (observation-minus-forecast statistics) are the critical metrics for new tuning choices, and any new choices of tuning parameters are evaluated with an ensemble of forecasts. The analysis increments during both data assimilation and replay experiments provide the key guidance for choosing the parameters to tune. Under the general assumption that the mean analysis increments indicate systematic errors in the model physics (which is not always valid), correlations between the tendency term from any individual physical parameterization and the analysis increment reveal errors due to the behavior of that parameterization, and parameters of that scheme are adjusted so as to minimize the mean analysis increments.
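The increment-attribution idea can be sketched as follows. This is an illustrative calculation, not GMAO code: the function name and the synthetic fields are hypothetical, and a real application would use archived tendency and increment fields from the assimilation system.

```python
import numpy as np

def increment_correlation(tendency, increment):
    """Pearson correlation between one scheme's tendency and the
    analysis increment, computed per grid point over the time axis.

    tendency, increment: arrays of shape (ntime, nlat, nlon).
    Returns an (nlat, nlon) map; values near -1 suggest that the
    assimilation is systematically 'undoing' the scheme's tendency.
    """
    t = tendency - tendency.mean(axis=0)
    i = increment - increment.mean(axis=0)
    cov = (t * i).mean(axis=0)
    denom = t.std(axis=0) * i.std(axis=0)
    return cov / np.where(denom == 0.0, np.nan, denom)

# Hypothetical example: a scheme whose tendency opposes the increment
rng = np.random.default_rng(0)
inc = rng.normal(size=(100, 4, 8))
tend = -0.8 * inc + 0.1 * rng.normal(size=(100, 4, 8))
corr = increment_correlation(tend, inc)
```

A map of strongly negative values would then point at the grid points where the parameterization's behavior is a candidate for retuning.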
High-resolution forecasts are also evaluated and tuned based on comparisons with the spatial and temporal variability of high-resolution top-of-the-atmosphere fluxes and radar-derived precipitation. As with the lower resolutions, the parameters that are adjusted to meet the tuning targets are the autoconversion/ice-fall rates and the cloud drop size. In addition, high-resolution tuning includes adjustments of the Tokioka limit and of the timescale of adjustment in the convective parameterization. As an aside, we note that resolution decisions almost always affect tunings (and development), and the goal that parameterized physics or models can be independent of resolution, while a noble aim, is not yet a reality.
The ability to spin up tropical cyclones and match the correct track was found to be quite sensitive to the magnitude of low level drag. Based on theoretical considerations and the results of laboratory experiments, the model's function which relates surface stress to roughness height over the oceans (the "Charnock coefficient") was adjusted to decrease the drag at high wind speeds and resulted in substantial improvements in the simulation of tropical cyclones (Molod et al., 2013).

NCAR
The Community Earth System Model (CESM, Hurrell et al. (2013)) is a joint NCAR and university-wide activity, and governance takes place through a working group structure. Working groups are teams of scientists that contribute to the development of each individual component (atmosphere, land, ocean, sea-ice, land-ice, chemistry and bio-geochemistry) and to relevant topics (such as climate variability or climate change).
Tuning begins as a generally separate activity for each component within the working groups. During this initial phase of tuning, periodic pre-industrial control coupled simulations are performed as a check on the impact of each component's developments to date on the whole coupled system, and to ensure that features of the simulation have not significantly degraded.
The atmosphere model tuning strategy initially performs 'stand-alone' experiments using the AMIP protocol with interactive land and atmosphere components and with prescribed observed Sea Surface Temperatures (SSTs) and sea-ice distributions.
Initial development testing is performed using SSTs of the climatological period centered around the year 2000, for 5-10 year periods. This length of simulation is necessary because of the high variability in the Arctic. The first key measure of a simulation that will be appropriate to the fully coupled simulation is the TOA energy balance. Estimates of the observed present-day energy imbalance are of order 0.5-1.0 W m−2 (Loeb et al., 2009), and the aim is to achieve close to that through modification of cloud-related fields that have an impact on both the short-wave and long-wave components of the energy budget. The first quantitative assessment of simulation fidelity is given by summary RMSE and bias scores for a number of variables key to the fully coupled system, including surface stresses, precipitation, temperature, cloud forcings and surface pressure. A secondary assessment involves 'pre-industrial minus present day' simulations to determine the aerosol indirect effects that would be expected in historical coupled simulations. This involves ensuring that the net aerosol forcing is not greater in magnitude than about (negative) 1.5 W m−2, based on guidance from Myhre et al. (2013).
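As a sketch of how such summary scores are defined (not the actual CESM diagnostics package), the following computes an area-weighted global bias and RMSE for a single climatological field; the function name and the simple cos(latitude) weighting are illustrative assumptions.

```python
import numpy as np

def weighted_scores(model, obs, lat):
    """Global-mean bias and RMSE with cos(latitude) area weights.

    model, obs: (nlat, nlon) climatologies of one field.
    lat: latitudes of the grid rows, in degrees.
    """
    # Normalized area weights: each row weighted by cos(latitude)
    w = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(model)
    w = w / w.sum()
    diff = model - obs
    bias = (w * diff).sum()                 # mean signed error
    rmse = np.sqrt((w * diff**2).sum())     # root-mean-square error
    return bias, rmse
```

In practice such scores are computed for each target variable (surface stress, precipitation, temperature, etc.) and summarized in a single table to compare candidate tunings.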
In parallel to the atmosphere component activities, the ocean and ice working groups perform equivalent 'stand-alone' experiments with forcing provided by multiple cycles of the CORE forcing protocol (Griffies et al., 2009). The phenomena of key importance are the meridional overturning circulation (particularly in the North Atlantic), Gulf Stream separation, Drake Passage flow, equatorial thermocline depth and SSTs in the Pacific. The land tuning approach uses land-only configurations forced by bias-corrected reanalysis-based meteorological forcing products. Metrics of performance are generally assessed for leaf area index, gross primary productivity, river discharge, latent heat flux, and vegetation and soil carbon stocks. Other physical components of the coupled system, including land-ice and bio-geochemistry, are also developed and tuned in parallel within their respective working groups.
Ideally, a well-tuned atmosphere would translate into a configuration with SSTs, sea-ice and land conditions relevant to the pre-industrial, and in principle into a coupled system close to energy balance, i.e., with no net increase or decrease of energy in the whole coupled system. However, coupled-system biases in the surface distribution of SSTs and sea ice mean that tuning also needs to be performed in the fully coupled system.
Coupled model tuning brings together the individual fully active 'tuned' components and their associated working groups to perform a series of pre-industrial climate experiments. The same performance metrics applied in atmosphere AMIP simulations apply to the coupled simulation, namely a near-zero top-of-atmosphere energy imbalance. Achieving an equilibrium energy balance is the most challenging task in coupled CESM tuning. The difficulty lies in the spin-up and drift of the system. Two ocean initialization approaches are used. The first is to use an observed Levitus temperature and salinity state with the ocean at rest. The second approach is to initialize from the ocean state of a previously run simulation. This has the advantages of a spun-up ocean state (in particular the deep ocean) that is more 'familiar' with the overlying atmosphere component; however, it is undesirable from the perspective of simulation provenance. A combination of the two is used. If the equilibrium energy imbalance is greater than 0.1-0.2 W m−2, then the system needs to be retuned, again most commonly through minor adjustments of cloud radiative impact parameters. If the energy imbalance and surface temperature drifts are observed to be small in short decadal runs, then longer 50-100 year simulations are performed to determine whether the performance of the ocean-ice-only simulations translates to the fully coupled system.
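The retuning step can be caricatured as a one-dimensional root search on the simulated imbalance. This is a toy sketch, not CESM practice: `imbalance_of` stands in for running the coupled model with a given parameter value, the linear toy response is invented, and in reality each "function evaluation" is a multi-decade simulation judged on many metrics at once.

```python
def retune(imbalance_of, p0, p1, target=0.0, tol=0.1, max_iter=20):
    """Secant-style search for a parameter value that brings the
    simulated TOA imbalance to within `tol` W m-2 of `target`.

    imbalance_of: callable standing in for a coupled run returning
    the diagnosed mean TOA imbalance for a given parameter value.
    """
    f0, f1 = imbalance_of(p0) - target, imbalance_of(p1) - target
    for _ in range(max_iter):
        if abs(f1) <= tol:
            return p1, f1 + target
        # Secant update using the two most recent evaluations
        p0, p1 = p1, p1 - f1 * (p1 - p0) / (f1 - f0)
        f0, f1 = f1, imbalance_of(p1) - target
    return p1, f1 + target

def toy(p):
    # Invented linear response of imbalance (W m-2) to a cloud parameter
    return 2.5 * (p - 1.3)

p, imb = retune(toy, 1.0, 2.0)
```

Because each real evaluation is so expensive, groups in practice rely on experience-based manual adjustments rather than a literal iteration like this.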
For the coupled simulations to be considered successful, they have to satisfy many of the requirements outlined above in addition to reproducing the dominant ENSO mode of variability, also a very challenging task. For instance, the initial implementation of more advanced convection parameterizations in CAM6 gave rise to a degradation in ENSO performance, but with some tuning of those schemes, ENSO performance was enhanced. Another coupled issue that arose in constructing the CMIP6 version of the code was a persistent cold bias and excessive sea ice in the Labrador Sea, which was mitigated by more accurate routing of local river runoff. In previous versions (such as CCSM4), there were evaluations of the coupled model in historical transient mode, specifically of the September Arctic sea ice trend from 1979, which was improved after adjustments to the sea ice albedo formulation that affect the PI ice thickness (Gent et al., 2011). A "reasonable" historical temperature trend remains the primary metric of success, but no attempts are made to tune for it explicitly.

NCEP
In recent history two fully coupled climate models have become operational at NCEP: the Climate Forecast System (CFS) version 1 (Saha et al., 2006) and CFS version 2 (Saha et al., 2010, 2014). For the most part, the CFS and its predecessors (there have been global climate models at NCEP since 1995) have been developed in the same way as weather prediction models. Indeed, the atmospheric component of the CFS is taken from the Global Forecast System (GFS), the NCEP flagship model that makes weather forecasts from day 1 to day 15. Verification against independent future reality (the weather happening worldwide every day) shows the GFS and similar operational models elsewhere steadily improving their skill scores on independent data over the last 50 years. The daily verification skill scores are the dominant source for tracking model improvement. This is a powerful target for tuning, which confronts the model with real-time observations in evolving data assimilation (DA) systems and then verifies the forecasts, made from the initial conditions provided by these DA systems, against independent observations.
A new CFS is built by taking a snapshot of the latest state-of-the-art GFS as its atmospheric component, along with the state-of-the-art ocean, sea-ice and land models available at that time. In developing CFSv1 in 2002, a 'large' (≈10) number of candidate coupled ocean-atmosphere models were constructed, which were then run on a limited number of test cases, with differing vertical and horizontal resolutions, as well as with different physics parameterizations, such as convection and radiation schemes. The results were then judged, along with the normal verification metrics, on whether the 9-month predictions produced skillful ENSO predictions. Our goal at that time was to be competitive with the statistical/empirical models that were predominantly being used for ENSO predictions. After initial testing, the model version that gave the best ENSO predictions was used to make retrospective forecasts over a 20+ year period (going back to 1982) in order to calibrate (remove the systematic bias in) the model forecasts and to make a priori skill assessments. These were then used in subsequent real-time operational forecasts made by the CFS. Since it is very expensive to make retrospective forecasts over long periods (20-30 years) for every imaginable model configuration, the preliminary test over a set of limited cases was extremely important. The dominant changes that improved skill were associated with tunings of the convection as the number of model vertical layers was increased from 28 to 64.
Having achieved some success in the prediction of ENSO in seasonal forecasts out to 9 months in CFSv1, the goal for CFSv2 was to tackle sub-seasonal predictions, mainly of the MJO in the tropics. Useful prediction of the MJO was successfully extended from 5 days to nearly 21 days by improving the model physics and by using a high-resolution state-of-the-art data assimilation system to assimilate satellite radiance data directly. Also, greenhouse gas (GHG) concentration changes were implemented in the NCEP forecast system. While the NCEP focus is short-term (seasonal) climate prediction, it has been recognized that even for these predictions, the forecast needs to be warmer than a 'normal' that, by necessity, is based on past data. The increase in GHGs also played an important role in improving the assimilation of satellite radiance data: each satellite over the 1979-present history was calibrated using the GHG concentrations observed at the time that satellite was operational. The result was a reasonable upward temperature trend over the 1979-present period, much better than at the time of CFSv1, when the upward trend over land was brought about only by the warming in the initial global ocean conditions. As described in Saha et al. (2014), the seasonal prediction model may not be exactly the same as the model used for weather forecasts. In the absence of data assimilation, coupled ocean-atmosphere models can drift and produce, for instance, a very cold Pacific Ocean, in this case due to a boundary layer parameterization change in the weather model that produced more marine stratus clouds but became excessive in the fully coupled runs. This change was thus reversed in the seasonal simulations.
Development is now underway for the next model, CFSv3. NCEP/EMC has a strategic plan to unify the global forecast systems and develop a Unified Global Coupled System (UGCS) for both weather and seasonal climate prediction. This system will have six fully coupled model components, namely the atmosphere, ocean, sea-ice, land, waves and aerosols. It will also have a strongly coupled data assimilation system in each of these six components.

4 Commonalities and differences
As might be expected, the broad picture of tuning across the climate model groups is consistent. The key adjustable parameters are those associated with uncertain and poorly constrained processes such as clouds, convection, gravity wave drag, and ocean mixing. Common too is the broad array of targets against which the skill of the models is judged, e.g. the TOA short-wave and long-wave radiation, 500 hPa geopotential height, surface temperatures, sea level pressure, precipitation, etc.

However it is also abundantly clear that the procedures at each group are quite distinct and can reasonably be surmised to reflect different scientific priorities and missions, and thus will produce different outcomes.
The model groups also differ in whether they focus on pre-industrial conditions or present day simulations. The former has the benefit of being closer to climate stability, while the latter has substantially more observational data. The groups focusing on the pre-industrial are judging (mostly correctly) that the errors in the control simulation (whether run for pre-industrial or 20 present) are larger than the trends between those periods. A stark difference does exist between the models that have operational data assimilation products (NCEP and GMAO) and those who don't. The ability to assess improvements in fast physics based on short forecasts is an excellent resource that, even if the climate models were not run operationally in this way, should become a more widely used methodology (e.g. Hurrell et al., 2006). Recent experience with this mode of testing in the GISS model has shown very positive results for representation of the MJO and tropical convection (Del Genio et al., 2015). Groups also differ 25 on which metrics they monitor for whether the are within an acceptable range, or if a specific value is tuned for directly (for instance, as for the present day energy balance for some groups in Table 2).
There are also some clear commonalities in approaches. All groups focus on atmospheric models at first either in an AMIPstyle mode (annually varying modern SST/sea ice), or using a climatological approach (decadal mean observed ocean conditions and forcings), or in weather forecast mode. Tunings for atmospheric composition and key atmospheric diagnostics use 30 these experiments which have the advantage of fast equilibration times and reduced computational load. Tunings for ocean components can be done in stand-alone experiments, but often are done within the full coupled framework, with at least some Using atmosphere-only/AMIP simulations.
1 PD has to be warmer than PI.
2 However sensitivity and forcing were jointly constrained w.r.t. the previous model.
3 Set in simulations with non-interactive composition only.
4 It was a necessary criteria for CCSM4, but not specifically tuned for.
model groups tuning sea ice and ocean mixing parameterizations to produce acceptable sea ice cover and ocean circulation metrics.

Use of recent trends and present-day radiative imbalance
Because of the high importance and visibility of climate models' simulation of the historical period (PI to PD), model groups have to be particularly clear about how information that reflects the ongoing trends in temperature and ocean heat content has been used in the tuning process.
The descriptions above suggest that increasing knowledge over time about the current radiative imbalance has clearly influenced model development. Developers prior to CMIP3 (circa 2004) had a general expectation that net radiative forcing over the 20th century was positive, but they were not able to use a specific value for the present-day energy imbalance because oceanic analyses were not accurate enough: compare Levitus et al. (2000) to Allan et al. (2014), for instance. Thus an a posteriori quantitative test of the model imbalance in coupled runs compared to (improving) observations was a valid test of skill (Hansen et al., 2005). This may not be true for a large fraction of simulations in CMIP6.
We summarize the results in Table 2. None of the models described here use the temperature trend over the historical period directly as a tuning target, nor are any of the models tuned to set climate sensitivity to some pre-existing assumption. However, NCAR, GFDL and DOE do tune for a global radiative imbalance at near-present-day conditions. For instance, GFDL AM3 with observed SSTs was tuned to have a positive imbalance, with a magnitude less than about 1 W m−2, for 1981-2000.
As discussed above, the radiative imbalance can be affected in two ways: by adjusting internal parameters (mostly associated with clouds), and/or by using a different historical forcing. Four models adjust their historical aerosol forcing: GISS, though only in its non-interactive runs, aims for an indirect aerosol forcing of −1 W m−2 (Schmidt et al., 2014); NCAR CESM and DOE ACME tune for a substantive positive effective radiative forcing at near-present conditions (implying a limit of −1.5 W m−2 for aerosols); and GFDL AM3 constrained its ratio of Cess sensitivity to RFP to be close to its value in the prior-generation coupled model, which implied an aerosol forcing of around −1.6 W m−2 in AM3.
At least three of the model groups discussed here find a difference between the energy imbalance using year-2000 forcings together with observed SST and sea ice, and that in transient coupled simulations for the same time period and forcings. However, the differences in how this calculation is done can be important, and the implications for the coupled model simulations are unclear. For example, in the GISS-E2 model the decadal-mean imbalance (1996-2005) differs between the atmosphere-only and coupled configurations. Is the imbalance diagnosed in atmosphere-only mode a good predictor (even to zeroth order) of the coupled imbalance, the "committed warming" (at constant concentrations) for the model, or the historical trend? With a perfect coupled model, and perfect knowledge of the forcings, this might be the case, but the imperfections in both imply that tuning to the PD imbalance is less of a constraint than might be assumed.

Discussion and future approaches
As models are continually evaluated at the process level against an increasing number of observations, analyses often show that existing parameterizations lack enough flexibility to represent the coupling between the sub-gridscale and the environment in all relevant climate regimes. The response is often to increase the complexity of a parameterization, which comes at the cost of an increased number of tunable parameters. With that increase, the challenges faced by the developers also rise, as does the potential for "local minima" to occur, i.e. different parameter combinations having similarly good agreement according to standard GCM validation metrics (e.g. Taylor diagrams, climate-state mean biases, spatial correlations). If these distinct volumes of tuning parameter space lead to simulations that exhibit similarly good agreement with observations, there is no clear scientific reason to prefer one over another. But will our decisions on parameter combinations today have a noticeable impact on the simulated climate several centuries from now, or on climate sensitivity more broadly?
Specifically, does choosing different local minima in parameter phase space 'matter'?
With more combinations, is there room for improving regional biases in simulations while simultaneously making the tuning process more automated? These questions have motivated an effort, using the GISS model as a test-bed, to develop a more robust framework for assessing the true existence of local minima in a multi-dimensional parameter space (see also Hourdin et al. (2016)). This is being explored by incorporating situational or regime-dependent errors in observations, and regional biases in GCM fields, into weighted cost functions that define model "goodness". We hope that this endeavor will increase the objectivity of decisions on the most appropriate tuning parameters and either lead to improved metrics for diagnosing the fidelity of a particular model, or reveal the spread in simulated climate sensitivity arising from settling on very different, but seemingly optimal, combinations of tuning parameters.
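A minimal sketch of such a weighted cost function is given below. All names and weightings here are hypothetical illustrations, not the GISS implementation: the idea is simply that each diagnostic's misfit is normalized by a situation-dependent error estimate and weighted regionally before being summed into a single "goodness" scalar.

```python
import numpy as np

def goodness(fields, obs, sigma, region_weights):
    """Scalar cost: error-normalized, region-weighted mean-square
    misfit summed over diagnostic fields.

    fields, obs, sigma: dicts mapping a diagnostic name to arrays of
    the same shape; sigma holds regime-dependent observational error
    estimates. region_weights is an array of per-point weights.
    """
    cost = 0.0
    for name, sim in fields.items():
        z = (sim - obs[name]) / sigma[name]   # normalize by obs error
        cost += float(np.mean(region_weights * z**2))
    return cost
```

Different parameter combinations can then be ranked by this single number, and minima found for differently weighted cost functions can be compared to probe whether distinct "optimal" tunings really are equivalent.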
More generally, the large variety of approaches demonstrated among just these six models indicates that the documentation of tuning procedures across a multi-model ensemble like CMIP6 will be quite challenging. What role should the degree of tuning play when assessing coupled model skill? Should simulations be up-weighted in the ensemble because of a closer climatology to observations, or down-weighted because this closeness is partly due to accommodation? Should models that are tuned differently but have similar physics be treated as independent or not (Annan and Hargreaves, 2016; Knutti et al., 2017)? These questions play into more fundamental issues related to how one should think about an unstructured multi-model ensemble (e.g. Knutti et al., 2010, 2013).
At minimum, we recommend that all future model description papers (or systematic documentation projects such as ES-DOC, http://es-doc.org) include a list of tuned-for targets and monitored diagnostics, and describe clearly (as in Table 2) their use of historical trends and imbalances in the development process.
While we have only discussed tuning in the context of historical and modern simulations, it is vital to assess the credibility of models by examining their performance in out-of-sample situations. This is easy for the models with an operational weather forecast mode (at least for some aspects of the climate system), and participation in paleo-climate model tests by NCAR and GISS is also invaluable. Medium-term climate forecasts based on anticipated changes in forcings (such as the eruption of Mt. Pinatubo in 1991, or the rise in greenhouse gases) have been shown to have skill (Hansen et al., 1988, 1992; Hargreaves, 2010).
The importance (or lack thereof) of tuning always needs to be seen within that context. This paper alone cannot hope to answer all of the above questions, but we hope that it can contribute to a more transparent and more widely usable discussion.

Data and Code availability
The availability of code for the models discussed in this paper is laid out in Table 3, though note that some decisions/tunings mentioned above are associated with current development and may only be accessible with a collaborative agreement in place.

Acknowledgements. [...] instrumental for putting this paper together. Manuscript reviews by Larry Horowitz and Levi Silvers (GFDL) are appreciated. We would like to thank Steve Sherwood and an anonymous reviewer for constructive comments on the discussion version of the paper.