Calibrating Climate Models Using Inverse Methods: Case studies with HadAM3, HadAM3P and HadCM3

. Optimisation methods were successfully used to calibrate parameters in an atmospheric component of a climate model using two variants of the Gauss–Newton line-search algorithm: (1) a standard Gauss–Newton algorithm in which, in each iteration, all parameters were perturbed and (2) a randomised block-coordinate variant in which, in each iteration, a random sub-set of parameters was perturbed. The cost function to be minimised used multiple large-scale multi-annual average observations and was constrained to produce net radiative ﬂuxes close to those observed. These algorithms were used to calibrate the HadAM3 (third Hadley Centre Atmospheric Model) model at N48 resolution and the HadAM3P model at N96 resolution. For the HadAM3 model, cases with 7 and 14 parameters were tried. All ten 7-parameter cases using HadAM3 converged

Abstract. Optimisation methods were successfully used to calibrate parameters in an atmospheric component of a climate model using two variants of the Gauss-Newton linesearch algorithm: (1) a standard Gauss-Newton algorithm in which, in each iteration, all parameters were perturbed and (2) a randomised block-coordinate variant in which, in each iteration, a random sub-set of parameters was perturbed. The cost function to be minimised used multiple large-scale multi-annual average observations and was constrained to produce net radiative fluxes close to those observed. These algorithms were used to calibrate the HadAM3 (third Hadley Centre Atmospheric Model) model at N48 resolution and the HadAM3P model at N96 resolution.
For the HadAM3 model, cases with 7 and 14 parameters were tried. All ten 7-parameter cases using HadAM3 converged to cost function values similar to that of the standard configuration. For the 14-parameter cases several failed to converge, with the random variant in which 6 parameters were perturbed being most successful. Multiple sets of parameter values were found that produced multiple models very similar to the standard configuration. HadAM3 cases that converged were coupled to an ocean model and run for 20 years starting from a pre-industrial HadCM3 (3rd Hadley Centre Coupled model) state resulting in several models whose global-average temperatures were consistent with preindustrial estimates. For the 7-parameter cases the Gauss-Newton algorithm converged in about 70 evaluations. For the 14-parameter algorithm, with 6 parameters being ran-domly perturbed, about 80 evaluations were needed for convergence. However, when 8 parameters were randomly perturbed, algorithm performance was poor. Our results suggest the computational cost for the Gauss-Newton algorithm scales between P and P 2 , where P is the number of parameters being calibrated.
For the HadAM3P model three algorithms were tested. Algorithms in which seven parameters were perturbed and three out of seven parameters randomly perturbed produced final configurations comparable to the standard hand-tuned configuration. An algorithm in which 6 out of 13 parameters were randomly perturbed failed to converge.
These results suggest that automatic parameter calibration using atmospheric models is feasible and that the resulting coupled models are stable. Thus, automatic calibration could replace human-driven trial and error. However, convergence and costs are likely sensitive to details of the algorithm. metrics used opaque (Mauritsen et al., 2012;Hourdin et al., 2013), with the main approach being trial and error. Consequently, expensive person time is needed to calibrate or tune climate models. Methods that could automatically calibrate model parameters would allow easier development of parametrisations, objective discussion of the observed targets and more rapid development of climate models. Such an approach would also facilitate uncertainty analysis and would improve understanding of the contribution of parametrisation compared to resolved dynamics in model properties, including model error. Tett et al. (2013b) (T13 from here on) outlined an approach to model parameters calibration by considering it as an inverse optimisation problem for which the aim is to find the parameter values which produce an atmospheric model with the smallest error relative to a predetermined set of weighted observations. T13 focused on only two observations, global mean outgoing longwave and reflected shortwave radiation, and modified only four parameters. They were able to calibrate the model parameters to several different observational targets. In this paper we further develop the approach taken by T13 to increase the number of observations and parameters used. We then couple some of the resulting atmospheric models to an ocean model to test if the resulting coupled model is stable.
Various approaches have been taken to optimising model parameter values. Golaz et al. (2013) hand-tuned the GFDL CM3 model to radiation balance by adjusting several parameters in the cloud scheme, finding a significant impact on aerosol forcing but not on greenhouse gas forcing or on "Cess" climate sensitivity (Cess et al., 1990). They found very large differences during the 20th century due to the perturbed impact of aerosols. Bellprat et al. (2012Bellprat et al. ( , 2015 generated a model emulator for three climate variables from a regional model. From this emulator by Latin hypercube sampling they found the parameter combinations that minimised error. Their earlier work focused on five parameters while their recent paper used eight parameters and considered North American and European regions. They found the calibrated model improved the simulation of summers in both regions. Williamson et al. (2013) use a combination of emulation and ruling out implausible observations to construct models. They used four observational constraints: global average surface air temperature (SAT), Northern Hemisphere meridional temperature gradient, seasonal cycle of temperature in the Northern Hemisphere and global average precipitation. They found that SAT was the most important constraint. Later they included the strength of the Antarctic Circumpolar Current (ACC) in their analysis and found parameter combinations where the model had a good simulation of both the ACC and SAT (Williamson et al., 2014). Irvine et al. (2013) generated 200 variants of HadCM3 using a Latin hypercube experimental design and splitting each parameter range into 200 bins. They then ran the resulting coupled models and found that about 10 % were acceptable. Tomassini et al. (2015), using a low-resolution version of the MPI-ESM model, perturbed eight parameters randomly across their plausible range and generated coupled models with a broad range of global average temperatures. They then examined the different feedbacks and mechanisms for those feedbacks in their model, finding that four convective parameters related to convective mixing had strong impacts on both the mean tropical circulation and on climate sensitivity. Such brute force approaches become extremely expensive as the dimensionality of the problem increases, though the use of emulators may help.
Attempts have been made using data assimilation techniques to calibrate parameters. Such systems simultaneously estimate the atmospheric state and the parameter values. Schirber et al. (2013) reported on a study in which they used that approach but found no improvement in the model climatology. Ruiz and Pulido (2015) used a similar algorithm and found an improvement in medium-range forecast skill but did not report on the impact on model climatology.
Another approach is to use forecast error. Ollinaho et al. (2012) updated four parameter values and their covariances iteratively using a set of 3-day forecasts of ECHAM5 and found a modest reduction in forecast error. When they ran the model with observed sea surface temperature and seaice they found a reduction in top-of-atmosphere flux errors. They followed up this study with one in which they minimised forecast errors in the total energy (Ollinaho et al., 2014). They also applied the technique to the ECMWF forecasting system and found a modest change in parameter values and an increase in forecast skill in the tropics (Ollinaho et al., 2013).
The approach we consider is optimisation via direct evaluation of the model, something attempted by Jones et al. (2005) for a low-resolution version of HadCM3. Yang et al. (2013), building on Jackson et al. (2004), applied the SSAA algorithm to tune parameters in CAM5 to improve the simulation of the partitioning between convective and large-scale precipitation. Zou et al. (2014) applied a similar approach to an East Asian regional model by modifying seven parameters and optimising only mean precipitation. They found a significant improvement in both the rainfall pattern and daily rainfall distribution.
Here we update T13 to include a larger number of observations and parameters. The observations we use, such as T13 used, are multi-annual, large-spatial averages. As before we continue to use a Gauss-Newton algorithm but include a randomised block-coordinate variant where, on each iteration, a random sub-set of the parameters are perturbed.
Our objectives are as follows: 1. test how well a Gauss-Newton algorithm does in minimising error in the HadAM3 N48 model (Pope et al., 2000) with 7 and 14 parameters and multiple observations; 2. test for equifinality in which models with different parameter values have similar observed values (Beven and Freer, 2001); 3. see how coupled model variants of HadCM3 (Gordon et al., 2000), with the parameters taken from the optimisation, behave; 4. test these algorithms with the N96 HadAM3P model (Massey et al., 2015).
The remainder of this paper first describes the models, the optimisation method and the observational metrics used. We next describe results of optimisation, the properties of the atmospheric models and how the coupled models behave. We discuss our results before concluding.

Methods
In this section we outline our methods. We first describe two related atmospheric models we use. Next we outline the Gauss-Newton algorithm and a randomised blockcoordinate variant of it, deal with the need to regularise matrices and describe how the algorithm terminates. We then describe the choices we made in parameter selection and parameter perturbation as well as the observations and covariance matrices we used. Finally we describe how we evaluate the optimised configurations and estimate uncertainties in the parameter values.

Models
We use the N48 (3.75 • × 2.5 • ) resolution configuration of HadAM3, which uses a 360-day calendar, driven with the same package of forcings used by T13. Simulations were run from 1 December 1998 to 1 March 2005 (6.25 years), and the period 1 March 2000 to 30 February 2005 was compared with observations. In addition we use the N96 (1.875 • × 1.25 • ) configuration of HadAM3P (Massey et al., 2015) with a similar package of forcings to that used in the N48 configuration. This model was run from 1 December 1999 to 1 March 2005 (5.25 years) We use the standard landsurface dataset rather the time-varying dataset used in the N48 case, including both the direct and indirect effect of SO 2 aerosols on clouds (Jones et al., 2001), and used, after interpolation, the same ozone dataset as we used in the N48 case. Some results from the default configuration are described in Tett et al. (2013a).

Gauss-Newton and line-search
We build on the approach used by T13 which minimised an objective function which was the root mean square of the global average outgoing longwave radiation and reflected shortwave radiation. We extend this to a larger number of observations, taking account of both observational error and simulated internal variability. As we focus on large-scale, multi-annual averages we assume that both terms can be represented by multivariate Gaussian distributions characterised by covariance matrices C O (observational error) and C i (internal variability), respectively. If the model was perfect we would expect (S − O) ∼ N (0, C) where C = C O + 2C i . Therefore, the cost function (F (p)) depending on parameters (p) we minimise is as follows: where N is the number of observations, S is the simulated observations, and O is the target observations. This requires that C is invertible and, if necessary, we regularise it (see below). This way of defining F (p) allows for covariance between observations to be taken account of. For example internal variability might generate large correlation between total outgoing radiation and precipitation and so not weighting them would give greater weight to configurations with small error in outgoing radiation and precipitation than is justified. We also want to reduce the importance of observations with high uncertainty and, conversely, increase the weight of observations with small uncertainty. We follow Sexton and Murphy (2011) and generally use a crude estimate of observational error based on the difference between two different observational datasets. Our aim in this paper is the application of inverse methods to parameter calibration, not the production of good estimates of observational error. That is a matter for the groups that produce the observational datasets and so is beyond the scope of the work reported on here.
We estimate C i from 100 simulations of the standard N48 HadAM3 model configuration. Estimating observational error is more difficult. For the radiation observations we use the fractional error estimates from Loeb et al. (2009) and apply them to each regional value. For other datasets we define them as the difference between the default values and the equivalent from another observational dataset. We explored applying a covariance structure to the observational error but found this did not work well (see discussion), nor was there very strong objective justification for any covariance structure.
The Gauss-Newton algorithm is an iterative two-step algorithm. The first step is to compute the Jacobian J : where S i (p) is the ith simulated observation when the model is run with the vector of parameters p and p j is the j th parameter. We approximate this using finite differences (Nocedal and Wright, 2006): with p j being a suitably small perturbation to the j th parameter and e j the j th coordinate vector with the j th element being 1 and all other elements being 0 (e.g. (0, . . ., 0, 1, 0. . ., 0)). In order to avoid using parameter values outside the expert range, we chose, at each iteration, the sign of p j so as to perturb towards the middle of the allowed range. Note that p j is ideally chosen such that the Jacobian is above internal variability and not, as is common, to machine precision. The choice of p j follows our noise estimates, and we use ideas from implicit filtering techniques for derivative free optimisation (Nocedal and Wright, 2006, Sect. 9) Having computed the Jacobian, the algorithm proceeds by computing the line-search vector (s) to proceed along to minimise the cost function, F , through solving the linear problem: where H = J T C −1 J is the finite-difference approximation to the Hessian matrix (H = J T C −1 J ).
Having computed the line-search vector, s, we then evaluate F 2 (p) at several steps along it ("line-search"). The values, and number, of the line-search steps are defined when we describe the algorithms we trial later in the paper. If any of the chosen line-search parameter values are outside the expert-defined plausible range, we project these to the appropriate boundary. The minimum value of F 2 from the line search is used as the starting point for the next iteration.
The Gauss-Newton algorithm can be modified to include an additional constraint by modifying the cost function to the following: where O c and S c are the values we wish to constrain, and µ is an user choice to be decided on after experimentation This can be rewritten in the same form as Eq. (1): Building on ideas of Nesterov (2012) and Kim and Lee (2008), we also tried a randomised block-coordinate version of Gauss-Newton in which, in each iteration, P rand different parameters were chosen at random and used in both the Gauss-Newton and line-search steps. Non-perturbed parameters used the values from the previous iteration.

Scaling and regularisation
Our algorithm could suffer from using ill-conditioned matrices in two places.
First, if the Hessian matrix is singular or ill-conditioned, defined as having a condition number greater than 10 10 , we use a Tihkonov regularisation (Nocedal and Wright, 2006) in which we add a small multiple of an identity matrix to the Hessian matrix. We iteratively increase the identity matrix scaling by a factor of 10 starting with 10 −7 until the regularised Hessian is no longer ill-conditioned or the scaling is 10 −2 . In the latter case our algorithm terminates with an error. This regularisation introduces a scale dependence into the algorithm. Each time we compute the Jacobian, we scale all parameters whose magnitudes are less than 1 so they have magnitude 1 and invert this scaling when computing the linesearch direction.
Secondly, we also regularise C. Rather than adding the identity matrix, we scale the diagonal of the covariance matrix by increasing factors of 2 until the condition number of the entire matrix is less than 5 times the condition number of the diagonal matrix. We apply this regularisation after scaling all values and before computing the Jacobian. For the bulk of our work C is well conditioned, so this regularisation is not applied.

Algorithm termination
We need criteria to terminate the algorithm. Classical Gauss-Newton terminates when sufficiently close to the stationary point of the cost function (F (p)), and so F stops reducing (Nocedal and Wright, 2006). However, the climate is a chaotic system which introduces noise into the model evaluations. Therefore, the algorithm may continue to iterate even when it is not making any significant progress or terminate because of not improving due to this noise.
The algorithm terminates on iteration k when one, or more, of the following occurs: 1. F (p k ) − F (p k−1 ) < c, where p k are the parameter values at iteration k. That is, F (p) has not reduced by a critical amount c.

2(S
where S k is the simulated observations at iteration k and c i is a critical value from a χ 2 distribution with N degrees of freedom. This checks that the new and previous simulated observations (S) are statistically similar.
value from a χ 2 distribution with N degrees of freedom. This checks that the current simulated observations are in statistical agreement with the target observations.
In our implementation c, c i and c o are all choices to be made in the algorithm.
For the random variant of the algorithm, if the cost function did not reduce by c, then the algorithm was restarted from the previous best parameter set by rerunning that case and another set of random perturbations. If the error then Table 1. Parameters, default values, and allowed ranges and perturbations. Shown for each parameter name are the component of HadAM3 they are from, the default value, allowed range, perturbations used in HadAM3-7 cases ( 1 ) and perturbations used in all HadAM3-14 cases and HadAM3P cases ( 2 ). For more information on the parameters see Yamazaki et al. (2013) failed to reduce by c the algorithm would terminate. This means that the random variant will require at least two iterations before it terminates. This approach results in some duplicate simulations, although because of model chaos the simulated observations differ. Some inefficiency results from this which could be reduced by keeping track of all cases that have run and not rerunning those cases. For ease of implementation we did not do this. Future work could implement such an optimisation.

Parameter selection and step size
We used up to 14 parameters from the analysis of Yamazaki et al. (2013) but restricted our analysis to parameters that varied continuously. Some of those parameters are "metaparameters" in that changes in one affected other parameters. We used the same algorithms as Williamson et al. (2013) did to modify parameters from the meta-parameters. Ranges of allowed parameter values were taken from Murphy et al. (2004). We carried out three cases: 1. We adjusted seven parameters using HadAM3.
Step sizes for the Jacobian calculation were taken from T13 for ENTCOEFF, VF1, CT and RHCRIT. For the remaining three parameters we used 10 % of their range.
2. We adjusted 14 parameters, again, using HadAM3. To compute the step size for the additional parameters we set the value to the upper or lower range value that was most different from the standard value. Then for all 14 parameters, we computed Where d i was greater than 100 we reduced p i by approximately √ d i /100. And where d i was less than 100 we increased p i , limiting the increase to 50 % of the allowed range, so that d i would, assuming linearity, be greater than 100.
3. We adjusted 7 and 13 parameters using HadAM3P using the same step sizes as in the 14-parameter HadAM3 cases.
Parameters, ranges, default values and step sizes for the Jacobian computations are shown in Table 1.

Observations, covariance matrices and optimisation choices
Here we describe the choices we made in our optimisation study.
We focus on large-scale properties of the climate system and so consider the northern hemispheric extra-tropical (θ > 30 • N) and tropical (30 • S ≥ θ ≤ 30 • N) means and the southern hemispheric extra-tropical (θ < 30 • S) means. We do this for seven variables, all as an average over the 5-year period March 2001 to February 2005 (inclusive), for the following: Land air temperature (LAT) Land temperature has an impact on simulated biology, evaporation, snow and other important parts of the Earth system with changes in it being a significant impact from climate change. We use the observed CRU TS Vn 3.21 dataset (Harris et al., 2014) and the HadAM3 N48 land-sea mask to determine land, and restrict to data north of 60 • S. For a second estimate of LAT we use ERA-Interim data (Dee et al., 2011).
Land precipitation (LP) This is a key measure of the hydrological cycle. We also use the CRU TS Vn 3.21 dataset, the HadAM3 N48 land-sea mask, and restrict to data north of 60 • S. For a second estimate we use the vn6 GPCC dataset (Schneider et al., 2011). All simulated and observed values were converted to millimetres per day (mm day −1 ).
Mean sea level pressure (SLP) We use this as a measure of the planetary scale circulation. To correct for model mass loss we used sea-level pressure differences between the global-average value and the extra-tropical Northern Hemisphere and tropics. We did not include the southern extra-tropics as that provided no new information and consequently made the covariance matrix uninvertable. We used values from ERA-Interim as observations and, for a second estimate used, the NCEP reanalysis (Kalnay et al., 1996). All observations and simulations were converted to hPa.
Reflected shortwave radiation (RSR) This measures the reflectivity of the Earth and is driven by clouds, snow, sea-ice and other surface properties. We compute values, and uncertainties, from the vn2.8 EBAF dataset (updated from Loeb et al., 2009).
Outgoing longwave radiation (OLR) This is a measure of the outgoing thermal radiation from the Earth and is driven by atmospheric temperatures and clouds. We also use the vn2.8 EBAF dataset.
Temperature at 500 hPa (T500) This gives an estimate of the temperature lapse rate. We use ERA-Interim data as observations and for a second estimate use the NCEP reanalysis.
Relative humidity at 500 hPa (q500) This provides a measure of mid-troposphere water vapour, which is an important greenhouse gas. We also estimate values from ERA-Interim and use the NCEP reanalysis as a second estimate.
See Table 2 for target values used in all our studies. We need to estimate a total covariance matrix (C) and a covariance matrix for internal variability (C i ). We estimated observational uncertainty for each regional OLR and RSR from the fractional uncertainties in Loeb et al. (2009). For other datasets we estimated the standard deviation (SD) as the difference between two different datasets. We assumed no correlation in observational error, so C O is diagonal. The diagonal values of C O were significantly larger than the equiva-lent values of C i (internal variability), so observational error is the dominant term in the total error-covariance matrix (C).
We also applied a constraint (see Sect. 2.6) in order to generate atmospheric models that had a net radiative flux close to the observed value which double-counts the OLR and RSR observations (or at least their sum). After some experimentation we settled on a value for µ of 0.01, corresponding to an observational error of 0.015 W m −1 , close to the observational error of about 0.2 W m −1 that Tett et al. (2013c) estimated from the difference of observational datasets.
When producing the datasets for the 7-parameter cases we made two errors in the computation of C. First we computed it as C i + C O and secondly we mis-specified the three precipitation components. Given the focus of our work was on optimisation rather then the exact definition of the cost function, we do not believe these errors are very significant.

Evaluation
We evaluate the inverse approach in several different ways. For the algorithm we consider the expected number of iterations, evaluations and final error, following the approach of T13 of using a strategy of repeatedly running the Gauss-Newton algorithm after it failed until convergence. This gives the expected number of model evaluations (E): where E c , E f and f are the mean number of evaluations (or simulations) for studies that were comparable to, or better than, the standard configurations, the mean number of evaluations for studies that failed and the fraction that failed, respectively. The expected number of iterations is computed similarly except that iterations rather than simulations are used.
The line-search component of the algorithm has a selection effect as it takes the parameter combination that produced the smallest cost function. Due to chaos in the model which leads to pseudo-random noise, this will lead to a selection effect as the smallest cost-function values may have arisen by chance. To avoid this effect and to examine the properties of the resulting models, we take the final optimised parameter sets and for each one run an ensemble of two simulations from December 1998 to April 2010. Each simulation was started from the same initial state but with a different small perturbation. We compare results of these independent simulations for 2000-2005 with the standard configuration and each other and look for evidence of equifinality (Beven and Freer, 2001), in which different parameter combinations produce models that appear similar. For greater out-of-sample comparison we compare differences between the 2005-2010 and 2000-2005 periods from the observations we use and the standard and independent optimised simulations.
For the HadAM3 cases we also carry out 20-year simulations of HadCM3 (Gordon et al., 2000) using the converged  parameter sets. We compare results from the last 10 years of the 20-year simulation with the standard control simulation of HadCM3, all started from the same initial state corresponding to about 5000 years of spinup.

Parameter covariance
Assuming that the parameter perturbations are small, we can compute the covariance matrix for the parameter error (C p ). We do this following Nocedal and Wright (2006) by a linear transformation of the total observational covariance matrix: where P is a transformation matrix = (J T C −1 J) −1 J T C −1 . From these parameter error covariance matrices we can compute a distance between two parameter sets (p i and p j ) as follows: where C p i and C p j are the parameter error covariance matrices for sets i and j , respectively; d 2 ij is roughly χ 2 dis-tributed, though given the crudeness of our observational error estimates we err on the conservative side. So we claim that if d 2 ij > 100 then the parameter sets are different.

Results
In this section we present our results. We tried several different algorithms using the HadAM3 and HadAM3P atmospheric models. We first present numerical results on the convergence behaviour of those algorithms, then compare some aspects of the climatologies of the modified models with the standard mode. Finally, we report on results of variants of the coupled atmosphere-ocean HadCM3 model that uses the optimal parameter sets from the HadAM3 test cases.

Atmospheric model convergence
We carried out several case studies. The first was one in which we perturbed seven parameters using the Gauss-Newton algorithm. Using 14 parameters, we tested the Gauss-Newton algorithm and two random parameter variants. Finally we tested three algorithms using the HadAM3P  model configuration. In no case did the algorithm terminate because the cost function was small. Given the crudeness of the observational covariance used in our cost function, we do not draw any inference from this. That would require a much better estimate of observational error than we made. Instead we take the pragmatic view that a perturbed model is comparable to (substantially better than) the standard configuration, in the simulation of the observations we used, if the cost function is less than 120 (80) % of standard models' cost function. We stress that this is a subjective choice that we made.

HadAM3 7-parameter case
For the 7-parameter (HadAM3-7) trials, we generated 12 random initial parameter choices by selecting values from their extreme limits (Table 3). For this algorithm we tried out five line-search evaluations at scalings of 1.0, 0.7, 0.3, 0.1 and 0.01 of the search vector and required F (p) to reduce to keep iterating (i.e. c = 0). Two cases failed in the first iteration with a model error, with the remaining 10 cases terminated when they failed to make progress. All those 10 had cost values similar to the default model's value of 5.0 (Fig. 1a). These cases took between 3 and 12 iterations, requiring 37 to 145 model evaluations, to terminate. As in our earlier study (T13) the cost function reduces rapidly over the first one to two iterations with slow reduction after that (Fig. 1a). We carried out five line searches partially to test if any of the scalings on the search vector were preferred. We found no strongly preferred scaling value (Table 4). In the rest of the paper we use scalings of 1, 0.7 and 0.3 on the search vector.

HadAM3 14-parameter cases
We trialled three related algorithms to perturb 14 parameters. The algorithms we tested were the standard Gauss-Newton algorithm (HadAM3-14) and two variants with random perturbations. In one we perturbed six random parameters (HadAM3-14r6) and the other eight (HadAM3-14r8). For each algorithm we did five studies with each one being started from the same random extreme parameter choices ( Table 3). As described above we corrected the error in the computation of C and adjusted the parameter perturbations (Table 1). For the random variants we required that the cost function reduce by 0.2 to continue iterating. Many of the simulations failed due to being marginally unstable, in which case we perturbed parameters by about 1 part in 1000 and reran that case. An operational system would restart the model with a small perturbation to a previous state and run past the failure point.
Unlike the HadAM3-7 cases the HadAM3-14 cases did not all produce cost function values comparable to the default model (Fig. 1b), with three cases failing and two succeeding. The successful cases took between four (74) and six (108) iterations(evaluations), with the unsuccessful cases taking one to four iterations. Neither of the successful cases are obviously better than the standard configuration.
Next we turn to the HadAM3-14r6 cases. This algorithm performed well, with four out the five cases succeeding taking between 6 (60) and 9 (87) iterations (evaluations). Three of the cases had cost functions less than the standard configuration but not substantially so (Fig. 1b). In contrast the HadAM3-14r8 algorithm performs poorly, with only one case having a cost function comparable with the standard configuration. This case took 5 (61) iterations (evaluations) to terminate. The unsuccessful cases took four iterations to terminate.

HadAM3P cases
The HadAM3P cases differ from the standard configuration not only in increased resolution but in the addition of a cloud anvil parametrisation and the indirect effects of aerosols on cloud optical properties (Massey et al., 2015). One approach to model development would be to take the parameters from the previous model version and then re-calibrate the parameters using inverse methods with the new model. We tested three algorithms with all cases starting from the default HadAM3 parameters. Our comparison case is the default HadAM3P configuration.
Unless stated otherwise all studies used the same choices of covariance matrices, observations, parameter perturbations and other choices as the HadAM3 14-parameter studies (Table 1). So, for example, at each iteration the cost function would need to reduce by 0.2 for the algorithm to continue. The three algorithms were as follows: HadAM3P-13r6 The diffusion parameter was kept at its default HadAM3P values but all remaining 13 parameters were changed, with 6 being chosen, at random, in each iteration.
HadAM3P-7 Here the same parameters as used in the HadAM3 seven-parameter cases were perturbed and termination occurred immediately if the cost function did not decrease by 0.2.
HadAM3P-7r3 As HadAM3P13r6 but with, at each iteration, three parameters, of the seven used in the sevenparameter HadAM3 case, perturbed at random.
The standard configuration of HadAM3P (Fig. 1c) is substantially worse, using our metric, than the standard HadAM3 configuration (Fig. 1b). Starting from the standard parameters the cost function reduces less than for the HadAM3 cases which all started from extreme parameter choices. The HadAM3P-7 and HadAM3P-7r3 cases produced configurations comparable with the standard HadAM3P model. The HadAM3P-7r3 study took 5 iterations with 31 evaluations. The HadAM3-7 case took 3 iterations also needing 31 evaluations of the model. The HadAM3P-13r6 case failed to converge and needed 3 iterations to fail.

Algorithm performance
For each algorithm we tested using HadAM3 we characterised its performance using Eq. (6). For each of the three HadAM3P algorithms we only carried out one case, so algorithm performance is evaluated from that single case.
As discussed earlier there is a potential selection effect in that from the line-search evaluations we chose the one case with minimum error. To examine the effect of this we compared the average cost from the optimised cases with the independent runs and with the cost values for the standard cases. Note that the independent and optimised cases have identical parameter sets but the 14-and 7-parameter algorithms use slightly different cost functions. The mean cost from the independent simulations is, except for the HadAM3P-7r3 algorithm, larger than the mean cost for the optimised simulations (Table 5). The mean difference between the independent and best optimisation depends on the algorithm but ranges from 0.2 to 0.6 (5 to 15 %) of the cost function for the standard configurations.
The expected number of iterations increases from the HadAM3-7 to HadAM3-14 algorithms but does not double. Our earlier work (T13) found that the median number of iterations for optimisation using two observations and four parameters required between three and five iterations. This suggests that the cost of increasing the number of parameters is not excessive, with the iteration count increasing less than P (the number of parameters). As each iteration needs P model evaluations then the total number of iterations likely increases between P and P 2 .
The six-random-parameter (HadAM3-14r6) algorithm worked well with an average cost function slightly better than the standard configuration (Table 5). Though requiring 60 % more iterations than the seven-parameter case, it only has an additional 20 % more expected evaluations for twice as many parameters. Random selection of 6 parameters has many less expected evaluations than perturbing all 14 parameters on each iteration. However, perturbing 8 parameters at random performs very much worse than perturbing 6 at random or all 14 parameters. We will explore possible reasons for this later. For the HadAM3P cases the HadAM3P-13r6 algorithm failed while both the HadAM3P-7r3 and HadAM3P-7 algorithms succeeded.
To summarise this subsection we find that a relatively simple Gauss-Newton algorithm works well to automatically calibrate parameters in an atmospheric model. The algorithm did not reduce the error to zero and so terminated when it stopped improving. We found that the expected number of iterations increases, though less than linearly, as we increased the number of parameters. Random selection of 6 out of 14 parameters worked well though random selection of 8 from 14 worked poorly. We were also able to reduce the cost function of the HadAM3P model relative to the standard configuration of that model.

Atmospheric model evaluation
We now investigate the behaviour of the optimised HadAM3 and HadAM3P models by first focusing on the optimal parameters, then examining the simulation of the target observations in the independent simulations before comparing the model fields of key variables with observations. We aim to test for equifinality (Beven and Freer, 2001), where different parameter sets can lead to very similar outputs. This could arise from multiple minima or a single broad flat minima.
We normalise the parameter values by their expert-based plausible ranges, with 0 being the minimum and 1 the maximum. We find for both the 7-and 14-parameter HadAM3 case studies that many of the parameters have a broad range of optimal values (Fig. 2). For each parameter we test if the distribution of optimal values is significantly different from a 0-1 uniform distribution using a Kolmogorov-Smirnov test. For 3 parameters (RHCRIT, ENTCOEFF and CW_LAND), in the 7-parameter case, we can reject this null hypothesis. For the 14-parameter cases we can reject the null hypothesis of a uniform distribution for the same 3 parameters and, in addition, an additional 5 parameters have distributions inconsistent with a uniform distribution. These results suggest that minimising the cost function does provide a weak constraint on some individual parameters.
Using Eqs. (7) and (8) we can test if the optimised 7parameter values are within parameter error of one another. We compute, for each optimised parameter set, the Jacobian from the final iteration and use this to compute squared distances between all 10 parameter sets. We find that the minimum value of this is 1.7 × 10 10 . It may be that the Jacobian has significant noise contamination, so we repeat the calculations with the mean parameter error covariance and Jacobian. We find minimum distances squared of 10 9 and 3×10 8 , respectively. This suggests that the 10 parameter sets found through optimisation are all significantly different from one another.
We now consider how the independent simulations behave for the successfully optimised HadAM3-7-and -14, and HadAM3P parameter sets. These, to remind the reader, are two simulations run with the same parameter set as the successful optimised case. All model observation differences are normalised by the diagonal elements of the covariance matrix which is dominated by our crude estimate of observational error.
For the HadAM3 7-and 14-parameter cases the optimised simulations are, for many target observations, similar to the standard configuration (Fig. 3) with little scatter across the best cases. The 14-parameter cases have larger scatter than the 7-parameter cases, suggesting the additional parameters lead to more ways to produce an optimised model. The medians are generally, though not always, a small improvement (closer to zero) on the standard cases. However, for the optimised and standard parameter sets several simulated observations are outside the ±2σ uncertainty range, suggesting that further model improvement would need better representation of processes either through new parametrisations or higher resolution. Reflected shortwave radiation biases show the greatest variation across the optimised cases with Northern Hemisphere extra-tropical land air temperature and tropical RH at 500 hPa, also showing large variation across the optimised cases.
We now turn to the two optimised HadAM3P cases. These configurations have, like the standard HadAM3P, smaller biases in land air temperature across the three large regions we consider. This is particularly so in the northern hemispheric extra-tropics, suggesting that enhanced resolution improves this particular observation. However, this model has a much worse simulation of precipitation in the tropics, even with tuning, than does the HadAM3 case. Optimising the parameters does reduce biases in the HadAM3P model but not enough to support the claim that is better than its lowerresolution and computationally cheaper HadAM3 cousin.
Comparison of the optimised cases with the initial extreme random parameter choices gives a sense of how important variation in the parameters is for those observational biases. One thing that stands out is that large-scale biases in the tropics (Fig. 4) are sensitive to parameter values. In contrast biases in extra-tropical relative humidity at 500 hPa are insensitive to changes in parameter values, suggesting this is driven by the large-scale resolved dynamics rather than parameterisation. In the extra-tropics biases in RSR and OLR are the most sensitive to parameter variation, with temperature at 500 hPa, MSLP and northern hemispheric precipitation being least sensitive. That would suggest that the behaviour of these latter variables are mainly driven by the large-scale resolved dynamics rather than the parametrisations.
We now examine how the bias changes when we consider a period outside the period we used to calibrate the model. Here we compare changes in bias between March 2005-February 2010and March 2000-February 2005. We normalise by the expected internal variability. For most observations and optimised configurations the bias does not significantly change between the two periods (Fig. 5), with the standard configurations and optimised cases behaving similarly. However, the extra-tropical relative humidity shows significant changes in bias between the two periods, with all simulations showing a significant increase in bias. As all models behave similarly this suggests either a lack of homogenisation in the ERA-Interim reanalysis or some systematic bias in all models.
So far we have focused on large-scale biases. We use Taylor diagrams (Taylor, 2001) to examine how fields from the independent simulations compare with the observations. We focus on the same fields and observational datasets as those used to compute the biases described above. Taylor diagrams summarise field similarity by computing field correlations and centred field SDs. We use the normalised variant where the centred field SDs are scaled by the equivalent values from  the observed field we use. This allows us to compare fields with different units. We find that for land air temperature, 500 hPa temperature and outgoing long wave radiation there is little variation in the location on the Taylor diagram ( Fig. 6a and b). For SLP patterns the scatter does not appear much greater than would be expected by chance for both HadAM3 and HadAM3P. Precipitation is generally slightly worse for the HadAM3 optimised cases than the standard configuration with spread to smaller correlations and larger RMS differences. For the HadAM3P configuration the optimisation slightly improves the spatial patterns of precipitation (Fig. 6a). For RSR, pattern correlations and centred RMS differences show the largest spread across the variables we consider with some of the seven-parameter optimised cases an improvement on the standard configuration. For HadAM3P the centred RSR patterns are worse than the standard HadAM3P case. The optimised HadAM3 cases for Relative Humidity at 500 hPa scatter around the standard cases with some better and some worse, though as with other variables the differences are small. Overall, the HadAM3 optimised and standard values are very similar.

Coupled model results
To test if calibrating atmospheric parameters results in reasonable coupled models, we took the calibrated parameters  from all successful 7-and 14-parameter cases in a set of control simulations of HadCM3. The surface temperature adjusts in the first decade (Fig. 7a), though the deep ocean is still adjusting during the 20-year simulations (Fig. 7b). Williamson et al. (2013) estimated that pre-industrial temperatures were 13.6 • C, with a robust error estimate of ±0.5 • C. We claim that a coupled model is "good" if the global and time average of its surface air temperature for years 10-19 is consistent with that estimate. The standard configuration is, just, within  this range and as noted by Gordon et al. (2000) HadCM3 is somewhat too cool. For the HadAM3-7 cases we find that eight of the parameter combinations produce temperatures within the target range (Table 5). For the HadAM3-14 cases five out of seven parameter combinations give temperatures within the target range. For all four of the cases that failed to produce coupled models, this was because they are too cold rather than too warm. As we start from the standard configuration we may be more able, in the 20-year simulations we did, to identify cooling rather than warming biases. Though all atmospheric models were constrained to be in rough energy balance, the individual fluxes are less constrained. For three of the cases that cooled, RSR rapidly increases over the first 5 years with OLR decreasing over the same period. However, the RSR increases by more than the OLR decreases, so the coupled model is out of balance and cools (Fig. 7). This may be due to negative cloud feedbacks in these model configurations.
The remaining coupled models show a range of OLR and RSR values but are generally stable. We now examine if there is any relationship between properties in the atmospheric model simulation and the coupled model simulation. Above we showed that RSR changes were somewhat larger than OLR changes and, across the optimised parameter sets, RSR variability was larger, relative to its uncertainty, than OLR variability (Fig. 3). Thus, we focus on relationships between global-average RSR and various properties of the coupled models. We examine the 10-year global average for 2001-2010 from the independent atmospheric simulations and years 10-19 from the control simulations.
For surface air temperature and volume average ocean temperature, there is a relationship between atmospheric model RSR and coupled model values, with an increase in atmospheric RSR leading to cooling in the coupled model (Fig. 8), though with some scatter around this general relationship. Uncertainties on the regression are small. We also find an inverse relationship between the strength of the Atlantic Meridional Overturning Circulation (AMOC) in the control simulation and the atmospheric RSR, likely because cold models have a stronger AMOC. Similar results hold true for northern hemispheric snow area and sea ice area. For land precipitation the scatter is too large to conclude there is a strong linear relationship. We repeated this analysis using OLR from the atmospheric simulations and found similar, though opposite signed and weaker, results. This likely arises from the constraint on the net flux, meaning that enhanced RSR must be balanced by reduced OLR. Note that the range of atmospheric RSR values is within the estimated uncertainty estimate for RSR (See Fig. 1 of T13) and so all cases (after running the atmospheric optimisation) are "good".

Discussion
Our results suggest that calibrating the atmospheric component of a coupled model to multiple observations is computationally feasible, with the resulting coupled models behaving well much, but not all, of the time. However, we found that calibration of 14 parameters was less successful than that of 7 parameters. We now investigate potential reasons for this by looking at the Jacobian matrices from all 7-and non-random 14-parameter studies. We also examine the Jacobian of the HadAM3P 7-parameter cases to see if changing resolution affects the Jacobian, which might explain the failure of the HadAM3P-13r6 case.
We computed Jacobians for each iteration with the parameters normalised by their range so that 0 (1) is the minimum (maximum) value and normalised each bias by its simulated internal variability. To see which parameters have the strongest effect on simulated observations, we compute the mean, over all iterations, of the absolute Jacobian values. We compare this to internal variability by comparison with a folded normal distribution (Leone et al., 1961) using a 90 % critical value. To derive the parameters for this distribution, we assume that the underlying normal distribution arises from the difference of two random distributions with unit variance and zero mean (σ = √ 2, µ = 0). We see that in the 7-parameter cases (Fig. 9a) that all parameters, except ICE_SIZE, have an significant impact on net flux and the cost function (F ). ICE_SIZE affects both OLR and RSR outside the northern hemispheric extratropics, but changes in OLR and RSR must offset one another, leading to a small impact on net flux and on F . All parameters affect RSR in the tropics and almost all affect it in the extra-tropics. In contrast, tropical OLR is significantly affected by only three parameters (ICE_SIZE, VF1 and ENTCOEFF) with the remaining four parameters having little impact on OLR. Northern hemispheric extra-tropical land precipitation and SLP and tropical SLP are not significantly affected by any of the parameter perturbations. In the Southern Hemisphere, land temperature is only weakly affected by changes in VF1, while precipitation is not signif-icantly impacted. These likely reflect the small land area in the Southern Hemisphere and the resulting increase in internal variability. In terms of relative importance we see that changes in the ENTCOEFF parameter has the most impact on the cost function with RSR being most affected by parame-  ter changes. In contrast we see that ICE_SIZE has the least impact on the cost function and extra-tropical land precipitation and pressure gradients, being unaffected by parameter perturbations.
Examining the 14-parameter Jacobians (Fig. 9b), we see that 4 of the additional 7 parameters have a significant impact on the cost function. However, of these only DYNDIFF has a more than small impact on the cost function. For these six parameters, that generated small or insignificant perturbations to the cost function, our preliminary tuning (see above) had led to parameter perturbations that were large relative to the range (Table 1). As with the seven-parameter cases, ENTCOEFF has the largest impact, with RHCRIT the second most important. However, from this larger set of parameters all simulated observations, except Southern Hemisphere land temperature and precipitation, are affected. The CHARNOCK, ICE_SIZE and ALPHAM parameters have no significant impact on the cost function. Further, the CHARNOCK parameter was perturbed by about one-third of its range, meaning there is little freedom to further perturb it.
The mean of the absolute Jacobians between the 14-and 7-parameter cases shows some differences in detail (compare Fig. 9a with b) suggesting that the Jacobians are, as expected, not constant. More detailed examination of this (not shown) suggests that within an individual study, after the first itera-tion, the Jacobians are fairly stable but within different parts of parameter space the Jacobians differ even if the final states appear quite similar.
Looking at the absolute Jacobians from the HadAM3-7 computations (Fig. 9c) we see differences from the two HadAM3 results, with VF1 and RHCRIT no longer having a significant impact on the cost function. This likely arises from the smaller impacts on the net flux than in HadAM3, which has, in our constrained optimisation, a large effect on the cost function. In contrast the effect of ENTCOEFF and CT on the cost function is much larger in HadAM3P than it is in HadAM3.
Regarding the poor performance of the HadAM3-14r8 algorithm, it is unclear at this stage precisely what has caused it, given that HadAM3-14r6 behaves very well. We speculate that this may be caused by noise contamination, and that the fewer parameters we perturb in the algorithm, the smaller the chance of seeing the effect of noise. Alternatively there could be instability in the randomised algorithmic variant, again due to noise. We note that if the cost function is smooth and accurate derivatives were available, one can easily observe improving rates of convergence for randomised block Gauss-Newton variants the more parameters one chooses in the block (Eizenberg, 2015). As part of the development of our approach we carried out four trial cases where we started from parameter sets (Table 3) with the largest climate sensitivities. We present results from them to explore the sensitivity of our results to changes in the algorithm and cost function. We also carried out a case parallel to the HadAM3-7 cases where we started the optimisation with the standard HadAM3 parameters and used the correct cost function calculation (as done for the 14parameter cases). The five cases are as follows:  Figure 10. (a) Cost as function of iteration for four trial 7-parameter and stdopt cases (lines and symbols). Also shown are the cost functions for the standard model for each case (coloured horizontal lines) with the colour corresponding to the trial case. Large symbols for each trial case show cost from independent atmospheric simulation with other details, as in Fig. 1. (b) Scatter plot (symbols as a) of optimal minus standard configurations for 2000-2010 mean RSR from atmospheric-only simulations against the coupled control mean for years 10-19 of 1.5 m temperature. Vertical grey region show estimates of the difference between pre-industrial global-average air temperatures and the standard configuration; configurations in this region are "good".
All trials (Fig. 10) converged to states with cost functions similar, though slightly larger, than the reference model. Independent simulations have cost functions slightly larger than the cases from the optimisation, with the difference being largest for trial7#15m. However, all cases produced models that cooled and had temperatures outside the range of acceptable coupled models. This suggests that the Gauss-Newton algorithm converges for a range of cost functions but not necessarily to a case that produces an acceptable coupled model. Starting from the default parameter set, we found that the algorithm produced a model with a slightly improved cost function, taking four iterations before terminating. The resulting coupled model is just outside the acceptable range. Zhang et al. (2015) reported successful optimisation of the IAP LASG version 2 atmospheric model. They focused on only seven parameters and, unlike us, used a root-meansquare error between simulation and observations normalised by the SD of the standard simulation. They considered a broader range of variables than we did, though they used the older ERBE data rather than the recent CERES data. Unlike us they screened out three of the parameters using the Morris (1991) method. Starting from the default parameter set, they improved their skill score by a small amount, although unlike us they did not test if this was a selection effect. Their best algorithm took about 60 iterations, broadly consistent with our expected number of about 70 iterations.
Various other studies have attempted to produce stable coupled models. Yamazaki et al. (2013) used emulation to find parameter sets that would be expected to produce, in HadCM3, RSR and OLR, values that, relative to the standard configuration of HadCM3, are within the uncertainty limit of Tett et al. (2013c). They found global average temperatures of 289.9 ± 3.6 K, which is a range larger than that of the CMIP3 and CMIP5 ensembles. The uncertainty estimate used in their study includes several sources of uncertainty in addition to observational error. Restricting their analysis to model configurations that have RSR and OLR values within 20 % of that uncertainty range, which has a net TOA flux range of ±1.1 W m −2 , they found those models had a broad range of climate sensitivities and a global mean temperature range of 286-291 K (Fig. 5 of Yamazaki et al., 2013). Irvine et al. (2013) used a Latin hypercube design to produce 200 versions of the coupled atmosphere-ocean HadCM3 model, with 8 parameters being perturbed. They ran each version for 20 years, estimated the final equilibrium temperature and discarded cases which were outside the range 13.6 ± 2 • C. From their 200 initial versions they found 20 cases that met that criteria. How does the computational cost of this compare with our approach of perturbing the atmospheric model then coupling the perturbed atmosphere to the ocean model? The nearest cases we have are the HadAM3-7 cases which need an expected 68 evalua-tions (Table 5), each of 6.25 years for a total of 425 years of atmospheric simulation. As the atmospheric model is about half the cost of the coupled model, this is equivalent to about 210 years of coupled simulation. We then need to carry out 10/8 coupled model simulations, each of 20 years, to get one that is within observational uncertainty for a grand total of 225 coupled-model years. This is approximately the same computational resource that Irvine et al. (2013) need but produces coupled models in better agreement with the pre-industrial temperature estimates.

Conclusions
Using multi-annual, large-spatial-scale observations, we have automatically calibrated HadAM3 and HadAM3P. Much of the time we ended up with models that have similar cost functions to the standard configuration, or for HadAM3P, better than the standard configuration. We used two variants of the Gauss-Newton algorithm. One in which all parameters were varied and a second random blockcoordinate variant in which a sub-set of the parameters, chosen at random on each iteration, were varied. For the studies in which we perturbed 7 parameters in HadAM3 we found that all cases converged, taking an average number of 68 evaluations for a total of 425 simulated years.
For the 14-parameter cases we used both the standard Gauss-Newton algorithm and a variant where a random number of parameters were selected. We tried two random cases. One in which 6 parameters were perturbed and an another in which 8 were perturbed. For each algorithm five studies starting from the same initial parameter choices were carried out. We find large differences in the performance of these algorithms, with the 6 random perturbation algorithm performing best, the 8 random perturbation cases worst and the standard Gauss-Newton algorithm performing intermediately. The 6-random case needs an expected number of 82 evaluations (or 512 simulated years) and, on average, produces models that are slightly better than the standard configuration. We found considerable sensitivity to the number of random parameters in the total number of iterations needed to produce acceptable models. This suggests that further work is needed to determine how many parameters should be perturbed.
As discussed above, the poor performance on the 14parameter case seems to be due to some of the parameter perturbations having only a small impact on the cost function, leading to noise contamination of the line-search vector and causing the algorithm to head in random directions. The poor performance of the random variant that perturbed 8 out of the 14 parameters at random may also be due to noise contamination arising again from unimportant parameters being included, similarly to the full 14-parameter case, or causing some kind of algorithm instability. We recall that Eizenberg (2015) illustrated numerically that for smooth problems with available derivatives, the randomised variants' rates of convergence improve continuously the more parameters are included in the blocks being perturbed.
We also found that several different parameter combinations led to models that were broadly comparable with the standard configurations. This suggests that HadAM3 exhibits equifinality (Beven and Freer, 2001), with different parameter sets leading to models that appear similar. Further, many, though not all, of the resulting coupled models are consistent with pre-industrial temperatures, without any need for flux correction. This is a significant advance on previous work using perturbed physics models, which have generally had to flux-correct the resulting models.
If these techniques could be successfully applied to stateof-the-art models it would be practical to do the following: generate perturbed models to test if an observationally constrained ensemble has a narrow range of climate feedbacks; add new parametrisations of processes to a model then recalibrate the model; explore the effect of changing resolution without large changes in the simulation of large-scale climate.
Though our algorithm works reasonably well for a modest number of parameters, it would benefit from a better understanding of the effect of noise on it. Both the line-search through a selection effect and the computation of the Jacobian/Hermitian matrices are affected by noise. A better algorithm would identify parameters that did not appear to impact the cost function and remove them from the analysis, as done by Zhang et al. (2015). Another potential approach might be to update the components of the Jacobian from the previous iterations values depending on the relative amount of noise contamination in them. We hope that our derivative-based experience with randomised block variants of Gauss-Newtonwhere the rates improve the larger the size of the block of parameters being perturbed -would then be observed here as well. This would further allow us to quantify the trade-offs of the lower evaluation cost per iteration of the small-block randomised variants against their respective global rates of convergence.
Our work focused on optimisation rather than the cost function. We used a cost function based on crude estimates of observational uncertainty and a subjective choice of largescale observations. Future work would benefit from much better estimates of observational uncertainty and an objective means of selecting observations. One approach might be to choose observations of which we have good evidence matter for climate feedbacks or other properties of the model we are concerned about.
Nevertheless, our results suggest that it is possible and computationally feasible to automatically calibrate the atmospheric component of a climate model and generate a plausible coupled model. We implemented and developed the algorithms described above using bash shell scripts and ipython (Pérez and Granger, 2007) with the numpy (Van Der Walt et al., 2011), pandas (http:// pandas.pydata.org/) and iris (http://scitools.org.uk/iris/docs/latest/ index.html) modules. Each iteration was managed using Grid Engine with runs of the climate models each being followed by a job that computed the simulated observables. A final job in each iteration tested for termination and, if required, set up the next iteration. Visualisation was done using Matplotlib (Hunter, 2007)  Number of converged cases with cost function similar to standard model. N Control Number of coupled control simulations consistent with pre-industrial temperatures (Williamson et al., 2014).