The nonlinMIP intercomparison project: physical basis, experimental design and analysis principles

nonlinMIP aims to quantify and understand, at regional scales, climate responses that are non-linear under CO2 forcing (mechanisms for which doubling the CO2 forcing does not double the response). Non-linear responses can be large at regional scales, with important implications for understanding mechanisms and for GCM emulation techniques (e.g. energy balance models and pattern-scaling methods). However, these processes are hard to explore using traditional experiments, which explains why they have had little attention in previous studies. Some single-model studies have established novel analysis principles and some physical mechanisms. There is now a need to explore robustness and uncertainty in such mechanisms across a range of models.


Introduction
Robust climate impacts assessments require, at regional scales, understanding of physical mechanisms of climate change in GCM projections. A further, pragmatic requirement for impacts assessments is the ability to emulate (using fast but simplified climate models) GCM behaviour for a much larger range of policy-relevant scenarios than may be evaluated using GCMs directly. These two requirements may be combined into a single question: what is the simplest conceptual framework, for a given well-defined model application, that has quantitative predictive power and captures the key mechanisms behind GCM scenario projections?
Often, a choice has been to assume some form of linearity. In studies of the global energy balance, linearity is often assumed in the form of a constant climate feedback parameter. This parameter may be used to quantify feedbacks in different models (e.g. Zelinka et al., 2013) or, in emulation methods, to parameterise global energy balance models (e.g. Huntingford and Cox, 2000). In understanding or emulating regional patterns of climate change, it is often assumed that regional climate change is roughly proportional to global mean warming. In emulation work, this is termed 'pattern scaling' (Mitchell, 2003; Santer et al., 1990; Tebaldi and Arblaster, 2014), but this assumption may also be applied either explicitly or implicitly in understanding mechanisms. Sometimes, patterns of change per K of global warming are quantified; often, physical mechanisms are studied for a single period of a single forcing scenario (implicitly assuming that the understanding is relevant for other periods or scenarios). The use of pattern scaling is prevalent in studies of climate impacts.
While these approximations appear to work well under many circumstances, significant limitations are increasingly being revealed in such assumptions. These are of two types: different timescales of response, and non-linear responses. In discussing this, a complication arises in that different linearity assumptions exist. Henceforth we define 'linear' as meaning 'consistent with linear systems theory', i.e. responses that are linear in model forcing (where doubling the forcing doubles the response). This is different from assuming that regional climate change is proportional to global mean warming, as in pattern scaling.
Even in a linear system (where responses are linear in forcing), the relationship between two system outputs (e.g. between global-mean temperature and regional sea surface temperature, SST) will in general not be linear. This is due to different timescales of response in different locations and/or variables (section 3.1). Examples include lagged surface ocean warming due to a connection with the deeper ocean (Chadwick et al., 2013; Held et al., 2010; Williams et al., 2008; Manabe et al., 1990; Andrews and Ringer, 2014) or the direct response of precipitation to forcings (Andrews et al., 2010; Allen and Ingram, 2002; Mitchell et al., 1987; Bony et al., 2014). One (generally false, but potentially acceptable) assumption of pattern scaling, then, is that regional climate responds over the same timescale as global-mean temperature. Different timescales of response are especially important in understanding and predicting behaviour under mitigation and geoengineering scenarios (or over very long timescales).
Non-linear system responses (e.g. Schaller et al., 2013) are more complex to quantify, understand and predict than those of linear systems (section 3.2). Some examples have been known for some time, such as changing feedbacks through retreating snow/sea-ice or increasing water vapour (Colman and McAvaney, 2009; Jonko et al., 2013; Meraner et al., 2013; Hansen et al., 2005), or the behaviour of the Atlantic Meridional Overturning Circulation. More recently, substantial non-linear precipitation responses have been demonstrated in spatial patterns of regional precipitation change in two Hadley Centre climate models with different atmospheric formulations (Good et al., 2012; Chadwick and Good, 2013). This is largely due to simultaneous changes in pairs of known robust pseudo-linear mechanisms (Chadwick and Good, 2013). Regional warming has been shown to be different for a first and second CO2 doubling, with implications primarily for impact assessment models or studies combining linear energy balance models with pattern scaling (Good et al., 2015). Non-linearity has also been demonstrated in the response, under idealised geoengineering scenarios, of ocean heat uptake, sea-level rise, and regional climate patterns, with different behaviour found when forcings are decreasing than when they are increasing (Bouttes et al., 2013; Schaller et al., 2014; Bouttes et al., 2015).
Geosci. Model Dev. Discuss., doi:10.5194/gmd-2016-56, 2016. Manuscript under review for journal Geosci. Model Dev. Published: 25 April 2016. © Author(s) 2016. CC-BY 3.0 License.
Investigation of these mechanisms at regional scales has been constrained by the type of GCM experiment typically analysed. Most previous analyses (e.g. Solomon et al., 2007) have used results from transient forcing experiments, where forcing changes steadily through the experiment. There are three main problems with this approach. First, information about different timescales of response is masked. This is because the GCM response at any given time in a transient forcing experiment is a mixture of different timescales of response (Good et al., 2013; Held et al., 2010; Li and Jarvis, 2009), ranging from short-timescale responses (e.g. ocean mixed layer response from forcing change over the previous few years) through to long-timescale behaviour (including deeper ocean responses from forcing changes multiple decades to centuries earlier). Secondly, in transient forcing experiments, non-linear behaviour is hard to separate from linear mechanisms. For example, in an experiment where CO2 is increased by 1% per year for 140 years ('1pctCO2'), we might find different spatial patterns at year 70 (at 2xCO2) than at year 140 (at 4xCO2). This could be due to nonlinear mechanisms (due to the different forcing level and associated different climate state). However, it could also be due to linear mechanisms: year 140 follows 140 years of forcing increase, so includes responses over longer response timescales than at year 70 (only 70 years of forcing increase). Thirdly, signal/noise ratios of regional climate change can be relatively poor in such experiments.
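The correspondence between years 70 and 140 of 1pctCO2 and the 2xCO2 and 4xCO2 levels follows from compound growth, and can be checked with a few lines of arithmetic. The sketch below is purely illustrative; the function name is our own and is not part of any CMIP tool.

```python
import math

def co2_ratio_1pct(year: int, rate: float = 0.01) -> float:
    """CO2 relative to the control after `year` years of compound growth."""
    return (1.0 + rate) ** year

# Years needed to reach a doubling and a quadrupling at 1% per year.
years_to_double = math.log(2.0) / math.log(1.01)
years_to_quadruple = math.log(4.0) / math.log(1.01)

print(f"ratio at year 70:  {co2_ratio_1pct(70):.3f}")
print(f"ratio at year 140: {co2_ratio_1pct(140):.3f}")
print(f"years to 2x: {years_to_double:.1f}, years to 4x: {years_to_quadruple:.1f}")
```

The ratios come out very slightly above 2 and 4, which is why years 70 and 140 are described as being "near" 2xCO2 and 4xCO2.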
These three issues may be addressed by the use of idealised abruptCO2 GCM experiments (Forster et al., 2012; Zelinka et al., 2013; Jonko et al., 2013; Good et al., 2013; Good et al., 2012; Chadwick and Good, 2013; Chadwick et al., 2013; Bouttes et al., 2013; Gregory et al., 2004): experiments in which CO2 forcing is instantaneously changed, then held constant. In these abruptCO2 experiments, responses over different timescales are separated from each other. Further, responses at different forcing levels may be directly compared, e.g. by comparing the response in abrupt2xCO2 and abrupt4xCO2 experiments over the same timescale: both have identical forcing time histories, apart from the larger forcing magnitude in abrupt4xCO2. Thirdly, high signal/noise is possible: averages may be taken over periods of 100 years or more (after the initial ocean mixed layer adjustment, change is gradual in such experiments). Recent work (Good et al., 2015; Good et al., 2012; Good et al., 2013; Zelinka et al., 2013; Bouttes et al., 2015) has established that these experiments contain global and regional-scale information quantitatively traceable to more policy-relevant transient experiments and, equivalently, that they form the basis for fast simple climate model projections traceable to the GCMs. In other studies (e.g. Frolicher et al., 2014), pulse experiments have been used to separate different timescales of response (where forcing is abruptly increased, then abruptly returned to the control state). We use abruptCO2 experiments because they offer greater signal/noise in the change signal (important for regional-scale studies), and also for consistency with the CMIP6 DECK abrupt4xCO2 experiment.
The CMIP5 abrupt4xCO2 experiments have thus been used widely, including for quantifying GCM forcing and feedback behaviour (Gregory et al., 2004; Zelinka et al., 2013), and for traceable emulation of GCM projections of global-mean temperature and heat uptake (Good et al., 2013; Stott et al., 2013). Abrupt4xCO2 is also part of the CMIP6 DECK protocol (Meehl et al., 2014).
NonlinMIP builds on the CMIP5 and CMIP6 DECK designs to explore non-linear responses (via additional abruptCO2 experiments at different forcing levels). It also explores responses over slightly longer timescales (extending the CMIP5 abrupt4xCO2 experiment by 100 years).

Relating abruptCO2 to gradual forcing scenarios: the step-response model
In using the highly idealised abruptCO2 experiments, it is essential that their physical relevance (traceability) to more realistic gradual forcing experiments is determined. We cannot a priori reject the possibility that some GCMs could respond unrealistically to the abrupt forcing change. A key tool here is the step-response model (described below). This response-function method aims to predict the GCM response to any given transient-forcing experiment, using the GCM response to an abruptCO2 experiment. Such a prediction may be compared with the GCM transient-forcing simulation, as part of a traceability assessment (discussed in detail in section 5). The step-response model may potentially serve as a basis for GCM emulation. The method description below also serves to illustrate the assumptions of linear system theory.
The step-response model represents the evolution of radiative forcing in a scenario experiment by a series of step changes in radiative forcing (with one step taken at the beginning of each year). The method makes two linear assumptions. First, the response to each annual forcing step is estimated by linearly scaling the response in a CO2 step experiment according to the magnitude of radiative forcing change. Second, the response $y_i$ at year $i$ of a scenario experiment is estimated as a sum of responses to all previous annual forcing changes (see Figure 1 of Good et al., 2013 for an illustration):

$$y_i = \sum_{j=0}^{i} w_{i-j}\, x_j$$

where $x_j$ is the response of the same variable in year $j$ of the CO2 step experiment. $w_{i-j}$ scales down the response from the step experiment ($x_j$) to match the annual change in radiative forcing during year $i-j$ of the scenario (denoted $\Delta F_{i-j}$):

$$w_{i-j} = \frac{\Delta F_{i-j}}{\Delta F_s}$$

where $\Delta F_s$ is the radiative forcing change in the CO2 step experiment. All quantities are expressed as anomalies with respect to a constant-forcing control experiment.
This approach can in principle be applied at any spatial scale for any variable for which the assumptions are plausible (e.g. Chadwick et al., 2013).
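The weighted sum described above is a discrete convolution of the annual forcing increments with the step-experiment response. A minimal sketch for a single time series is given below. The function name, the zero-based year indexing and the numerical values (including the 7.4 W m-2 step forcing) are illustrative assumptions, not values or code from the nonlinMIP protocol.

```python
import numpy as np

def step_response_predict(x_step, dF_scenario, dF_step):
    """Predict y_i = sum_{j=0..i} w_{i-j} * x_j, with w_k = dF_scenario[k] / dF_step.

    x_step      : response in each year of the CO2 step experiment (anomalies)
    dF_scenario : annual forcing increments of the scenario, one per year
    dF_step     : forcing change applied in the step experiment
    Index 0 is taken to be the first year (an indexing assumption).
    """
    x = np.asarray(x_step, dtype=float)
    w = np.asarray(dF_scenario, dtype=float) / dF_step
    n = w.size
    if x.size < n:
        raise ValueError("step experiment must be at least as long as the scenario")
    # Full convolution has 2n-1 terms; only the first n are causal predictions.
    return np.convolve(w, x[:n])[:n]

# Hypothetical step response: a single exponential approach to 5 K.
dF_step = 7.4  # assumed step forcing in W m-2, for illustration only
x_step = 5.0 * (1.0 - np.exp(-np.arange(1, 141) / 30.0))

# Sanity check: a scenario that is itself one full step reproduces x_step.
single_step = np.zeros(140)
single_step[0] = dF_step
y_check = step_response_predict(x_step, single_step, dF_step)

# A scenario that reaches the same total forcing in 140 equal annual steps.
ramp = np.full(140, dF_step / 140.0)
y_ramp = step_response_predict(x_step, ramp, dF_step)
```

Under these assumptions the ramp response in the final year is lower than the step response at the same forcing, because the most recent forcing increments have had little time to act.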

Linear mechanisms: different timescales of response
Even in a linear system, regional climate change per K of global warming will evolve during a scenario simulation. This happens because different parts of the climate system have different timescales of response to forcing change.
This may be due to different effective heat capacities. For example, the ocean mixed layer responds much faster than the deeper ocean, simply due to a thinner column of water (Li and Jarvis, 2009). However, some areas of the ocean surface (e.g. the Southern Ocean and southeast subtropical Pacific) show lagged warming, due to a greater connection (via upwelling or mixing) with the deeper ocean (e.g. Manabe et al., 1990; Williams et al., 2008). The dynamics of the ocean circulation and vegetation may also have their own inherent timescales (e.g. vegetation change may lag global warming by years to hundreds of years; Jones et al., 2009).
At the other extreme, some responses to CO2 forcing are much faster than global warming: such as the direct response of global mean precipitation to forcings (Allen and Ingram, 2002;Andrews et al., 2010;Mitchell et al., 1987) and the physiological response of vegetation to CO2 (Field et al., 1995).
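To illustrate how different timescales alone can change regional patterns per K of global warming, the sketch below builds a purely linear toy system from a fast and a slow exponential response. The timescales (4 and 200 years), the mixing weights and the forcing ramp are hypothetical choices made only for illustration.

```python
import numpy as np

years = np.arange(300)

def exp_step_response(tau):
    """Response to a unit forcing step for a component with e-folding time tau."""
    return 1.0 - np.exp(-(years + 1) / tau)

fast = exp_step_response(4.0)     # e.g. a mixed-layer-like timescale (assumed)
slow = exp_step_response(200.0)   # e.g. a deep-ocean-like timescale (assumed)

# Hypothetical step responses: the 'region' is weighted towards the slow component.
global_step = 0.7 * fast + 0.3 * slow
region_step = 0.3 * fast + 0.7 * slow

# Forcing increments: ramp up for 100 years, then stabilise.
dF = np.where(years < 100, 0.01, 0.0)

def respond(step, increments):
    """Linear response: convolution of forcing increments with a step response."""
    return np.convolve(increments, step)[: len(increments)]

# Regional change per unit of global change, for the original and doubled forcing.
ratio = respond(region_step, dF) / respond(global_step, dF)
ratio_doubled = respond(region_step, 2 * dF) / respond(global_step, 2 * dF)

print(ratio[9], ratio[99], ratio[299])
```

In this toy system the regional change per unit global change rises throughout the simulation, yet doubling the forcing leaves the ratio unchanged at every time: the evolution is a timescale effect, not a non-linearity.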
In a linear system, patterns of change per K of global warming are sensitive to the forcing history. For example, Figure 1 illustrates a scenario where forcing is ramped up, then stabilised. Three periods are highlighted, which may have different patterns of change per K of global warming, due to different forcing histories: at the leftmost point, faster responses will be relatively more important, whereas at the right, the slower responses have had some time to catch up. This is illustrated in Figure 2 for sea-level rise. The blue curves show that for RCP2.6, global-mean warming ceases after 2050, while sea-level rise continues at roughly the same rate throughout the century. This is largely because deep ocean heat uptake is much slower than ocean mixed-layer warming. By design, abruptCO2 experiments separate different timescales of GCM response to forcing change. This is used, for example, to estimate radiative forcing and feedback parameters for GCMs (Gregory et al., 2004): plotting radiative flux anomalies against global mean warming can separate 'fast' and 'slow' responses (see e.g. Figure 3).
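The regression of radiative flux anomalies against global mean warming can be sketched as an ordinary least-squares line fit. In the example below the data are synthetic (generated with an assumed forcing of 7.0 W m-2 and feedback parameter of 1.2 W m-2 K-1), and the function name is our own; it is a sketch of the general idea rather than the exact procedure of any cited study.

```python
import numpy as np

def gregory_fit(delta_T, delta_N):
    """Fit delta_N = F - lam * delta_T by least squares.

    Returns (F, lam): the intercept estimates the 'fast' response (forcing),
    and the negated slope estimates the 'slow' temperature-mediated feedback.
    """
    slope, intercept = np.polyfit(delta_T, delta_N, 1)
    return intercept, -slope

# Synthetic annual means standing in for an abrupt4xCO2 experiment (assumed values).
rng = np.random.default_rng(0)
true_forcing, true_lambda = 7.0, 1.2
dT = np.linspace(1.0, 5.0, 150)
dN = true_forcing - true_lambda * dT + rng.normal(0.0, 0.1, dT.size)

forcing_est, lambda_est = gregory_fit(dT, dN)
print(f"forcing estimate: {forcing_est:.2f}, feedback estimate: {lambda_est:.2f}")
```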

Non-linear responses
Nonlinear mechanisms arise for a variety of reasons. Often, however, it is useful to describe them as state-dependent feedbacks. For example, the snow-albedo feedback becomes small at high or low snow depth. Sometimes, nonlinear mechanisms may be better viewed as simultaneous changes in pairs of properties. For example, convective precipitation is broadly a product of moisture content and dynamics (Chadwick and Good, 2013; Chadwick et al., 2012; Oueslati et al., 2016; Bony et al., 2014). Both moisture content and atmospheric dynamics respond to CO2 forcing, so in general we might expect convective precipitation to have a nonlinear response to CO2 forcing. Of course, more complex nonlinear responses exist, such as for the Atlantic Meridional Overturning Circulation.
In contrast to linear mechanisms, nonlinear mechanisms are sensitive to the magnitude of forcing. For example, the two points highlighted in Figure 4 may have different patterns of change per K of global warming, due to nonlinear mechanisms.
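The 'pairs of properties' view can be made concrete with a toy product. If a quantity is the product of two factors that each respond linearly to forcing, its own response contains a cross term that grows with the square of the forcing. The numbers below are hypothetical fractional changes chosen only to illustrate the arithmetic; they are not taken from any GCM.

```python
def product_change(m0, q0, dm, dq):
    """Change in P = m * q when m and q change by dm and dq."""
    return (m0 + dm) * (q0 + dq) - m0 * q0

# Baseline 'dynamics' (m) and 'moisture' (q), normalised to 1 (assumed).
m0, q0 = 1.0, 1.0
# Hypothetical linear responses of each factor to one unit of forcing.
dm, dq = -0.1, 0.2

resp_1 = product_change(m0, q0, dm, dq)          # one unit of forcing
resp_2 = product_change(m0, q0, 2 * dm, 2 * dq)  # two units of forcing

# A linear system would give resp_2 == 2 * resp_1; the residual is the cross term.
nonlinear_part = resp_2 - 2 * resp_1
print(resp_1, resp_2, nonlinear_part)
```

Algebraically the residual equals 2 * dm * dq, so it vanishes only if one of the two factors does not respond to the forcing.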
An example is given in Figure 5, which shows the albedo feedback declining with increased global temperature, due to declining snow and ice cover, and the remaining snow and ice being in areas of lower solar insolation (Colman and McAvaney, 2009).
The 'doubling difference' is the difference in response between the first and second CO2 doublings (Figure 6). Radiative forcing is assumed identical for each successive CO2 doubling (forcing is approximately linear in log[CO2]; Myhre et al., 1998). With this assumption, a linear system would have zero doubling difference everywhere. Therefore, the doubling difference is used as a measure of nonlinearity. The question of which abruptCO2 experiments to compare, and over which timescale, is discussed in section 5.
In some GCMs, the forcing per CO2 doubling has been shown to vary with CO2 (Colman and McAvaney, 2009; Jonko et al., 2013). However, this variation depends on the specific definition of forcing used (Jonko et al., 2013). Currently this is folded into our definition of nonlinearity. If a robust definition of this forcing variation becomes available in future, it could be used to scale out any difference in forcing between pairs of abruptCO2 experiments, to calculate an 'adjusted doubling difference'.
As an example, Figure 7 maps the response to abrupt2xCO2 and abrupt4xCO2, and the doubling difference, for precipitation in HadGEM2-ES over the ocean (taken from Chadwick and Good, 2013). The nonlinearities are large: comparable in magnitude to the responses to abrupt2xCO2, albeit with a different spatial pattern.
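A doubling difference map of this kind can be computed from three experiments averaged over the same period. The sketch below assumes annual-mean arrays with time as the first axis and zero-based year indices, so that `slice(50, 150)` selects years 50-149; the function name and the synthetic fields are illustrative assumptions.

```python
import numpy as np

def doubling_difference(ctrl, abrupt2x, abrupt4x, start=50, stop=150):
    """Doubling difference = (4x - 2x) - (2x - control), averaged over a time window.

    Each input has shape (time, ...) and holds annual means; the same window
    is used for all three experiments so that timescales are matched.
    """
    window = slice(start, stop)
    c = np.mean(ctrl[window], axis=0)
    a2 = np.mean(abrupt2x[window], axis=0)
    a4 = np.mean(abrupt4x[window], axis=0)
    first_doubling = a2 - c     # response to the first CO2 doubling
    second_doubling = a4 - a2   # response to the second CO2 doubling
    return second_doubling - first_doubling

# Synthetic 2x3 'maps' over 150 years (hypothetical values).
shape = (150, 2, 3)
ctrl = np.zeros(shape)
a2 = np.full(shape, 1.0)
a4_linear = np.full(shape, 2.0)   # second doubling equals the first
a4_nonlin = np.full(shape, 2.5)   # second doubling is larger than the first

dd_linear = doubling_difference(ctrl, a2, a4_linear)
dd_nonlin = doubling_difference(ctrl, a2, a4_nonlin)
```

For the linear synthetic case the doubling difference is zero everywhere, whereas the second case returns the excess response of the second doubling.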

Experimental design
nonlinMIP is composed of a set of abruptCO2 experiments (the primary tools), plus a CO2-forced transient experiment. AbruptCO2 experiments are driven by changes in atmospheric CO2 concentration: CO2 is abruptly changed, then held constant. These build on the CMIP5 and CMIP6 DECK protocols (the required runs from these are detailed in Table 1). The additional nonlinMIP runs (Table 2) are assigned three priority levels. Three options for participation are: 1) only the 'essential' simulation; 2) all 'high priority' plus the 'essential' simulations; or, preferably, 3) all simulations. The experiments in Table 1 are required in all cases. All experiments must be initialized from the same year of a pre-industrial control experiment, except for abrupt4xto1x (see Table 2). A typical analysis procedure is outlined in section 5.
The nonlinMIP design is presently limited to CO2 forcing, although the same principles could be applied to other forcings.

Basic analysis principles
This section outlines the general principles behind analysis of nonlinMIP results. The primary idea is to find where the step-response model (section 2) breaks: since the step-response model is based on a linear assumption, this amounts to detecting non-linear responses.
The aim is to focus subsequent analysis. If non-linearities in a quantity of interest are found to be small, then analysis may focus on understanding different timescales of response from a single abruptCO2 experiment: linearity means that the physical response (over a useful range of CO2 concentrations) is captured by a single abruptCO2 experiment. This represents a considerable simplification. If, on the other hand, non-linearities are found to be important, the focus shifts to understanding the different responses in different abruptCO2 experiments.
The choice of which abruptCO2 experiments to focus on, and over which timescales, is discussed below.

First step: check basic traceability of abrupt4xCO2 to the transient-forced response near 4xCO2
The test described here is recommended as a routine analysis of the CMIP6 DECK experiments (even if nonlinMIP experiments are not performed). The aim is to confirm whether the abruptCO2 experiments contain realistic physical responses in the variables of interest, as previously done for global-mean temperature and heat uptake for a range of CMIP5 models (Good et al., 2013), for regional-scale warming and ocean heat uptake (Good et al., 2015; Bouttes et al., 2015) and for other global-mean quantities for HadCM3 (Good et al., 2011). This also rules out the most pathological non-linearities (e.g. if the response to an abrupt CO2 change in a given GCM was unrealistic). Although this test has been done for a range of models and variables, traceability cannot be assumed to hold for all models and variables.
The linear step-response model should first be used with the abrupt4xCO2 response, to predict the response near year 140 of the 1pctCO2 experiment (i.e. near 4xCO2). This prediction is then compared with the actual GCM 1pctCO2 result. This should first be done for global mean temperature: this assessment has been performed for a range of CMIP5 models (Good et al., 2013; see Figure 8), giving an idea of the level of accuracy expected. If the abruptCO2 response is fundamentally unrealistic, it is likely to show up in the global temperature change. This approach may then be repeated for spatial patterns of warming, and then for the quantities of interest. Abrupt4xCO2 is used here as it has larger signal/noise than abrupt2xCO2, yet is representative of forcing levels in a business-as-usual scenario by 2100.
However, the tests may also be repeated using abrupt2xCO2, but compared with year 70 of the 1pctCO2 experiment (i.e. at 2xCO2).
The step-response model emulation under these conditions should perform well for most cases: the state at year 140 of the 1pctCO2 experiment is very similar to that of abrupt4xCO2 (same forcing, similar global-mean temperature), so errors from non-linear mechanisms should be minimal. If large errors are found, this may imply caution about the use of abruptCO2 experiments for these variables, or perhaps point to novel non-linear mechanisms that may be understood by further analysis.
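A minimal sketch of this first traceability test is given below, assuming that forcing is linear in log(CO2) so that every year of 1pctCO2 adds the same fraction, log(1.01)/log(4), of the abrupt4xCO2 forcing. The function names, the ten-year comparison window, the synthetic abrupt4xCO2 response and the stand-in 'GCM' series are all hypothetical; in practice the comparison would use model output for both experiments.

```python
import numpy as np

def predict_1pct_from_abrupt4x(x_abrupt4x, n_years=140):
    """Step-response prediction of 1pctCO2 from an abrupt4xCO2 response."""
    x = np.asarray(x_abrupt4x, dtype=float)
    if x.size < n_years:
        raise ValueError("abrupt4xCO2 series is shorter than the requested period")
    # Assumption: forcing linear in log(CO2), giving equal annual weights.
    w = np.full(n_years, np.log(1.01) / np.log(4.0))
    return np.convolve(w, x[:n_years])[:n_years]

def relative_error(predicted, gcm, window=slice(130, 140)):
    """Relative error of the prediction, averaged over a window near 4xCO2."""
    p = np.mean(predicted[window])
    g = np.mean(gcm[window])
    return (p - g) / g

# Sum of the annual weights: close to 1, since year 140 is near 4xCO2.
total_weight = 140 * np.log(1.01) / np.log(4.0)

# Synthetic abrupt4xCO2 warming with a fast and a slow component (assumed).
t = np.arange(1, 141)
x_4x = 3.0 * (1.0 - np.exp(-t / 4.0)) + 2.0 * (1.0 - np.exp(-t / 200.0))
pred = predict_1pct_from_abrupt4x(x_4x)

# Stand-in for the GCM 1pctCO2 result: here simply 5% larger than the prediction.
gcm_stand_in = 1.05 * pred
err = relative_error(pred, gcm_stand_in)
print(f"total weight: {total_weight:.4f}, relative error: {err:.3f}")
```

The same functions could be reused with an abrupt2xCO2 response, 70 years and weights of log(1.01)/log(2) to repeat the comparison at 2xCO2.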

Second step: characterising nonlinear responses
Having established some level of confidence in the abruptCO2 physical response, the second step is to look for nonlinear responses. This first involves repeating the tests from step 1 above, but for different parts of the 1pctCO2 and 1pctCO2 ramp-down experiments, and using different abruptCO2 experiments for the step-response model.
An example is given in Figure 9 (but for different transient-forcing experiments). This shows results for global-mean precipitation in the HadCM3 GCM (Good et al., 2012). Here, the step-response model prediction using abrupt4xCO2 (red curves) only works where a transient-forced experiment is near to 4xCO2. Similarly, the prediction using abrupt2xCO2 (blue curves) works only near 2xCO2.
Having identified some non-linear response, and highlighted two or more abruptCO2 experiments to compare (in the previous example, abrupt2xCO2 and abrupt4xCO2), the nonlinear mechanisms may be studied in detail by comparing the responses in the different abruptCO2 experiments over the same timescale (e.g. via the doubling difference, as in Figures 6 and 7). This allows non-linear mechanisms to be separated from linear mechanisms, which is not possible in a transient-forcing experiment (Good et al., 2012; Chadwick and Good, 2013; Good et al., 2015).

Conclusions
There is a need to quantify and understand, at regional scales, nonlinear mechanisms of climate change. This is difficult to do using transient model experiments alone, for two reasons: contamination due to different timescales of response, and noise from internal variability. This paper outlines the basic physical principles behind the nonlinMIP design, and the method of establishing traceability from abruptCO2 to gradual forcing experiments, before detailing the experimental design and finally some general analysis principles that should apply to most studies based on this dataset.

Data availability
Results will be made available as part of the CFMIP project, within the sixth model intercomparison project, CMIP6.
Once some confidence is established in traceability of the abruptCO2 experiments to transient-forcing scenarios, the step-response model has other roles: to explore the implications, for different forcing scenarios, of physical understanding gleaned from abruptCO2 experiments; and to help separate linear and nonlinear mechanisms (section 5).
Linear and non-linear mechanisms, and the relevance of abruptCO2 experiments
Here we discuss further, with examples, the distinction between linear and nonlinear mechanisms, when they are important, and the relevance of abruptCO2 experiments.
AbruptCO2 experiments may be used to separate nonlinear from linear mechanisms. This can be done by comparing the responses at the same timescale in different abruptCO2 experiments. Figure 6 compares abrupt2xCO2 and abrupt4xCO2 experiments over years 50-149. A 'doubling difference' is defined, measuring the difference in response to the first and second CO2 doublings. In most current simple climate models (e.g. Meinshausen et al., 2011), the radiative forcing from each successive CO2 doubling is assumed identical (because forcing is approximately linear in log[CO2]; Myhre et al., 1998).
Away from the forcing level of the abruptCO2 experiment used (4xCO2 for the red curves, 2xCO2 for the blue curves), quite large errors are seen, and the predictions with abrupt2xCO2 and abrupt4xCO2 are quite different from each other. This implies that there are large non-linearities in the precipitation response in this GCM, and that they may be studied by comparing the responses in the abrupt2xCO2 and abrupt4xCO2 experiments.

Figure 1. Schematic illustrating a situation where linear mechanisms can cause climate patterns to evolve. This represents a scenario where forcing (black line) is ramped up, then stabilised.

Figure 3. Illustrating a method (Gregory et al., 2004) for separating 'fast' and 'slow' responses to radiative forcing change. Figure adapted (labels in rectangles overlaid) from Zelinka et al. (2013). Global-mean cloud-induced SW flux anomalies against global warming, for the CanESM2 model (black and grey represent two methods of calculating cloud-induced fluxes). This also illustrates one test of traceability of abrupt4xCO2 to 1pctCO2 responses: the linear fit to the abrupt4xCO2 response (straight lines) passes through the 1pctCO2 response near 4xCO2 (i.e. near year 140 of that experiment).

Figure 4. Schematic illustrating the point that nonlinear mechanisms can cause climate patterns to differ at different forcing (and hence global temperature) levels.

Figure 6. Defining the 'doubling difference'. Doubling difference = ∆42 - ∆21 (the difference in response between the first and second CO2 doublings). This is defined for a specific timescale after the abrupt CO2 change: in this example, it is the mean over years 50-149.

Figure 9. Finding nonlinear responses in transient forcing experiments (figure from Good et al., 2012). Left: where CO2 is increased by 1% per year, then stabilised at 2x pre-industrial levels. Right: where CO2 is increased by 2% per year for 70 years, then decreased by 2% per year for 70 years. Black: GCM. Red: step-response model using the abrupt4xCO2 response. Blue: step-response model using the abrupt2xCO2 response.