nonlinMIP contribution to CMIP6: model intercomparison project for non-linear mechanisms: physical basis, experimental design and analysis principles (v1.0)


Abstract. nonlinMIP provides experiments that account for state-dependent regional and global climate responses. The experiments have two main applications: (1) to focus understanding of responses to CO2 forcing on states relevant to specific policy or scientific questions (e.g. change under low-forcing scenarios, the benefits of mitigation, or from past cold climates to the present day), or (2) to understand the state dependence (non-linearity) of climate change, i.e. why doubling the forcing may not double the response. State dependence (non-linearity) of responses can be large at regional scales, with important implications for understanding mechanisms and for general circulation model (GCM) emulation techniques (e.g. energy balance models and pattern-scaling methods). However, these processes are hard to explore using traditional experiments, which explains why they have had so little attention in previous studies. Some single-model studies have established novel analysis principles and some physical mechanisms. There is now a need to explore robustness and uncertainty in such mechanisms across a range of models (point 2 above), and, more broadly, to focus work on understanding the response to CO2 on climate states relevant to specific policy/science questions (point 1).
nonlinMIP addresses this using a simple, small set of CO2-forced experiments that are able to separate linear and non-linear mechanisms cleanly, with a good signal-to-noise ratio, while being demonstrably traceable to realistic transient scenarios. The design builds on the CMIP5 (Coupled Model Intercomparison Project Phase 5) and CMIP6 DECK (Diagnostic, Evaluation and Characterization of Klima) protocols, and is centred around a suite of instantaneous atmospheric CO2 change experiments, with a ramp-up-ramp-down experiment to test traceability to gradual forcing scenarios. In all cases the models are intended to be used with CO2 concentrations rather than CO2 emissions as the input. The understanding gained will help interpret the spread in policy-relevant scenario projections.
Here we outline the basic physical principles behind nonlinMIP, and the method of establishing traceability from abruptCO2 to gradual forcing experiments, before detailing the experimental design, and finally some analysis principles. The test of traceability from abruptCO2 to transient experiments is recommended as a standard analysis within the CMIP5 and CMIP6 DECK protocols.
A further, pragmatic requirement for impact assessments is the ability to emulate (using fast but simplified climate models) GCM behaviour for a much larger range of policy-relevant scenarios than may be evaluated using GCMs directly. These two requirements may be combined into a single question: what is the simplest conceptual framework, for a given well-defined model application, that has quantitative predictive power and captures the key mechanisms behind GCM scenario projections?
Often, one choice has been to assume some form of linearity. In studies of the global energy balance, linearity is often assumed in the form of a constant climate feedback parameter. This parameter may be used to quantify feedbacks in different models (e.g. Zelinka et al., 2013) or, in emulation methods, to parameterize global energy balance models (e.g. Huntingford and Cox, 2000). In understanding or emulating regional patterns of climate change, it is often assumed explicitly that regional climate change is roughly proportional to global-mean warming. In emulation work, this is termed "pattern scaling" (Santer et al., 1990; Mitchell, 2003; Ishizaki et al., 2012; Tebaldi and Arblaster, 2014), but this assumption may also be applied implicitly in understanding mechanisms. Often, physical mechanisms are studied for a single period of a single forcing scenario or in a single high-forcing experiment such as abrupt4xCO2 (implicitly assuming that the understanding is relevant for other periods or scenarios). The use of pattern scaling is prevalent in studies of climate impacts.
While these approximations appear to work well under many circumstances, significant limitations are increasingly being revealed in such assumptions. These limitations are of two types: different timescales of response, and non-linear responses. In discussing this, a complication arises in that different linearity assumptions exist. Henceforth, we define "linear" as meaning "consistent with linear systems theory", i.e. responses that are linear in model forcing (where doubling the forcing doubles the response). This is different from assuming that regional climate change is proportional to global-mean warming, as in pattern scaling.
Even in a linear system (where responses are linear in forcing), the relationship between two system outputs (e.g. between global-mean temperature and regional sea surface temperature, SST) will in general not be linear. This is due to different timescales of response in different locations and/or variables (Sect. 3.1). Examples include lagged surface ocean warming due to a connection with the deeper ocean (Manabe et al., 1990; Williams et al., 2008; Held et al., 2010; Chadwick et al., 2013a; Andrews and Ringer, 2014) or the direct response of precipitation to forcings (Mitchell et al., 1987; Allen and Ingram, 2002; Andrews et al., 2010; Bala et al., 2010; Bony et al., 2014). One (generally false, but potentially acceptable) assumption of pattern scaling is that regional climate responds over the same timescale as global-mean temperature. Different timescales of response are especially important in understanding and predicting behaviour under mitigation and geo-engineering scenarios (or over very long timescales).
Non-linear system responses (e.g. Schaller et al., 2013) are more complex to quantify, understand and predict than those of linear systems (Sect. 3.2). Some examples have been known for some time, such as changing feedbacks through retreating snow/sea ice or increasing water vapour (Hansen et al., 2005; Colman and McAvaney, 2009; Jonko et al., 2013; Meraner et al., 2013). Some palaeoclimate evidence supports the idea that climate sensitivity increases with warming (Caballero and Huber, 2013; Shaffer et al., 2016), which is important for the risk of high-end global warming (Bloch-Johnson et al., 2015). The non-linear behaviour of the Atlantic Meridional Overturning Circulation is another example (Hofmann and Rahmstorf, 2009; Ishizaki et al., 2012).
More recently, substantial non-linear precipitation responses have been demonstrated in spatial patterns of regional precipitation change in two Hadley Centre climate models with different atmospheric formulations (Good et al., 2012; Chadwick and Good, 2013). This is largely due to simultaneous changes in pairs of known robust pseudo-linear mechanisms. Regional warming has been shown to be different for a first and second CO2 doubling, with implications primarily for impact assessment models or studies combining linear energy balance models with pattern scaling. Non-linearity has also been demonstrated in the response under idealized geoengineering scenarios, of ocean heat uptake, sea-level rise, and regional climate patterns, with different behaviour found when forcings are decreasing than when they are increasing (Schaller et al., 2014).
Investigation of these mechanisms at regional scales has been constrained by the type of GCM experiment typically analysed. Most previous analyses (e.g. Solomon et al., 2007) have used results from transient-forcing experiments, where forcing changes steadily through the experiment. There are three main problems with this approach. First, information about different timescales of response is masked. This is because the GCM response at any given time in a transient-forcing experiment is a mixture of different timescales of response (Li and Jarvis, 2009; Held et al., 2010), ranging from short-timescale responses (e.g. ocean mixed-layer response from forcing change over the previous few years) through to long-timescale behaviour (including deeper ocean responses from forcing changes multiple decades to centuries earlier). Second, in transient-forcing experiments, non-linear behaviour is hard to separate from linear mechanisms. For example, in an experiment where CO2 is increased by 1 % per year for 140 years (1pctCO2), we might find different spatial patterns at year 70 (at 2xCO2) than at year 140 (at 4xCO2). This could be due to non-linear mechanisms (due to the different forcing level and associated different climate state). However, it could also be due to linear mechanisms: year 140 follows 140 years of forcing increase, and therefore includes responses over longer timescales than at year 70 (only 70 years of forcing increase). Third, signal-to-noise ratios of regional climate change can be relatively poor in such experiments. These three issues may be addressed by the use of idealized abruptCO2 GCM experiments: experiments where CO2 forcing is instantaneously changed, then held constant. The simplified forcing in such experiments simplifies the understanding of physical mechanisms of response.
Geosci. Model Dev., 9, 4019-4028, 2016, www.geosci-model-dev.net/9/4019/2016/
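As a quick arithmetic check on the 1pctCO2 example above, compounding a 1 % annual increase gives concentrations close to the 2x and 4x levels at years 70 and 140. Here $C_0$ (the initial concentration) and $C_n$ (the concentration after $n$ years) are symbols introduced only for this illustration, and the values are rounded:

$$\frac{C_{70}}{C_0} = 1.01^{70} \approx 2.01, \qquad \frac{C_{140}}{C_0} = 1.01^{140} \approx 4.03$$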
In these abruptCO2 experiments, responses over different timescales (fast and slow responses) are separated from each other. Further, responses at different forcing levels may be directly compared, e.g. by comparing the response in abrupt2xCO2 and abrupt4xCO2 experiments over the same timescale: both have identical forcing time histories, apart from the larger forcing magnitude in abrupt4xCO2. Finally, high signal-to-noise is possible: averages may be taken over periods of 100 years or more (after the initial ocean mixed-layer adjustment, change is gradual in such experiments). Recent work (Good et al., 2012; Zelinka et al., 2013; Bouttes et al., 2015; Good et al., 2015) has established that these experiments contain global- and regional-scale information quantitatively traceable to more policy-relevant transient experiments and, equivalently, that they form the basis for fast simple climate model projections traceable to the GCMs. In other studies (e.g. Frolicher et al., 2014), pulse experiments have been used to separate different timescales of response (where forcing is abruptly increased, then abruptly returned to the control state). We use abruptCO2 experiments because they offer greater signal-to-noise in the change signal (important for regional-scale studies), as well as for consistency with the CMIP6 (Coupled Model Intercomparison Project Phase 6) DECK (Diagnostic, Evaluation and Characterization of Klima) abrupt4xCO2 experiment.
The CMIP5 abrupt4xCO2 experiments have thus been used widely, including for quantifying GCM forcing and feedback behaviour (Gregory et al., 2004; Zelinka et al., 2013), and for traceable emulation of GCM projections of global-mean temperature and heat uptake (Stott et al., 2013). Abrupt4xCO2 is also part of the CMIP6 DECK protocol (Meehl et al., 2014). nonlinMIP builds on the CMIP5 and CMIP6 DECK designs to explore non-linear responses (via additional abruptCO2 experiments at different forcing levels). It also explores responses over slightly longer timescales, extending the CMIP5 abrupt4xCO2 experiment by 100 years.
2 Relating abruptCO2 to gradual forcing scenarios: the step-response model
In using the highly idealized abruptCO2 experiments, it is essential that their physical relevance (traceability) to more realistic gradual forcing experiments is determined. We cannot a priori reject the possibility that some GCMs could respond unrealistically to the abrupt forcing change. A key tool here is the step-response model (described below). This (Hasselmann et al., 1993) is a response-function method, which aims to predict the GCM response to any given transient-forcing experiment, using the GCM response to an abruptCO2 experiment. Such a prediction may be compared with the GCM transient-forcing simulation, as part of a traceability assessment (discussed in detail in Sect. 5).
Once some confidence is established in traceability of the abruptCO2 experiments to transient-forcing scenarios, the step-response model has other roles: to explore the implications (for different forcing scenarios) of physical understanding gleaned from abruptCO2 experiments, to help separate linear and non-linear mechanisms (Sect. 5), and potentially as a basis for GCM emulation. The method description below also serves to illustrate the assumptions of linear system theory.
The step-response model represents the evolution of radiative forcing in a scenario experiment by a series of step changes in radiative forcing (with one step taken at the beginning of each year). The method makes two linear assumptions. First, the response to each annual forcing step is estimated by linearly scaling the response in a CO2 step experiment according to the magnitude of radiative forcing change. Second, the response $y_i$ at year $i$ of a scenario experiment is estimated as a sum of responses to all previous annual forcing changes (see Fig. 1 of Good et al., 2013, for an illustration):
$$y_i = \sum_{j=0}^{i} w_{i-j}\, x_j,$$
where $x_j$ is the response of the same variable in year $j$ of the CO2 step experiment. $w_{i-j}$ scales down the response from the step experiment ($x_j$) to match the annual change in radiative forcing during year $i-j$ of the scenario (denoted $\Delta F_{i-j}$):
$$w_{i-j} = \frac{\Delta F_{i-j}}{\Delta F_s},$$
where $\Delta F_s$ is the radiative forcing change in the CO2 step experiment. All quantities are expressed as anomalies with respect to a constant-forcing control experiment. This approach can in principle be applied at any spatial scale for any variable for which the assumptions are plausible (e.g. Chadwick et al., 2013a).
Figure 1. Schematic illustrating a situation where linear mechanisms can cause climate patterns to evolve. This represents a scenario where global-mean radiative forcing (black line) is ramped up, then stabilized. At the time indicated by the left red oval, responses with shorter timescales are relatively important, due to the recent increase in forcing. At the time marked by the right-hand oval, forcing has been stabilized for an extended period, so the responses with longer timescales (such as sea-level rise) have had more time to respond to the initial forcing increase.
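The step-response model described above amounts to a discrete convolution of the annual forcing weights with the step-experiment response. The following is a minimal sketch under stated assumptions, not the authors' code: the function name `step_response_predict`, the exponential test response and the forcing value 7.4 are all hypothetical and chosen only for illustration.

```python
import numpy as np


def step_response_predict(x_step, dF_scenario, dF_step):
    """Estimate a scenario response from an abrupt (step) experiment.

    x_step      : response in each year of the step experiment (anomaly vs control)
    dF_scenario : annual change in scenario radiative forcing, one value per year
    dF_step     : radiative forcing change in the step experiment
    """
    x = np.asarray(x_step, dtype=float)
    w = np.asarray(dF_scenario, dtype=float) / dF_step  # weights w = dF / dF_step
    n = w.size
    if x.size < n:
        raise ValueError("step experiment is shorter than the scenario")
    # The first n terms of the full convolution are y_i = sum_{j<=i} w[i-j] * x[j].
    return np.convolve(w, x[:n])[:n]


# Hypothetical step response: a single exponential relaxation (illustrative only).
years = np.arange(10)
x_demo = 3.0 * (1.0 - np.exp(-(years + 1) / 4.0))

# A scenario with the full step forcing applied in year 0 reproduces the step response.
dF_single = np.zeros(10)
dF_single[0] = 7.4  # hypothetical forcing value
y_single = step_response_predict(x_demo, dF_single, dF_step=7.4)

# A ramp of ten equal annual steps, each one tenth of the step forcing.
y_ramp = step_response_predict(x_demo, np.full(10, 0.74), dF_step=7.4)
```

The sketch handles a one-dimensional series (e.g. a global mean); applying it to a gridded field would mean repeating the same sum at every grid point.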

Linear mechanisms: different timescales of response
Even in a linear system, regional climate change per kelvin of global warming will evolve during a scenario simulation. This happens because different parts of the climate system have different timescales of response to forcing change. This may be due to different effective heat capacities. For example, the ocean mixed layer responds much faster than the deeper ocean, simply due to a thinner column of water (Li and Jarvis, 2009). However, some areas of the ocean surface (e.g. the Southern Ocean and south-east subtropical Pacific) show lagged warming, due to a greater connection (via upwelling or mixing) with the deeper ocean (e.g. Manabe et al., 1990; Williams et al., 2008). The dynamics of the ocean circulation and vegetation may also have their own inherent timescales (e.g. vegetation change may lag global warming by years to hundreds of years; Jones et al., 2009). At the other extreme, some responses to CO2 forcing are much faster than global warming, such as the direct response of global-mean precipitation to forcings (Mitchell et al., 1987; Allen and Ingram, 2002; Andrews et al., 2010) and the physiological response of vegetation to CO2 (Field et al., 1995). In a linear system, patterns of change per kelvin of global warming are sensitive to the forcing history. For example, in Fig. 1 a scenario is illustrated where forcing is ramped up, then stabilized. Two periods are highlighted, which may have different patterns of change per kelvin of global warming, due to different forcing histories: at the leftmost point, faster responses will be relatively more important, whereas at the right the slower responses have had some time to catch up. A key example is the different responses of global-mean warming and global-mean sea-level rise under Representative Concentration Pathway 2.6 (RCP2.6), as shown in Figs. SPM.7 and SPM.9 of the IPCC Fifth Assessment Report (IPCC, 2013).
Under RCP2.6, global-mean warming ceases after 2050, when radiative forcing is approximately stabilized (corresponding qualitatively to the period when the black line is horizontal in Fig. 1). In contrast, sea-level rise continues at roughly the same rate throughout the century. Therefore, in RCP2.6, the sea-level rise per kelvin of global warming increases after 2050. This is largely because the timescale of deep ocean heat uptake is much longer than that of ocean mixed-layer warming.
By design, abruptCO2 experiments separate GCM responses with different timescales (i.e. separating faster responses from slower responses): the response of a given variable in year Y of the experiment corresponds to the response of that variable over the timescale Y. This is used, for example, to estimate radiative forcing and feedback parameters for GCMs (Gregory et al., 2004): plotting radiative flux anomalies against global-mean warming can separate "fast" and "slow" responses. For example, the top-of-atmosphere outgoing shortwave flux shows a rapid initial change before the global-mean temperature has had time to respond.
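The regression of radiative flux anomaly against global-mean warming mentioned above can be sketched as a straight-line fit. This is a minimal illustration under assumptions that are not taken from the text: the net top-of-atmosphere flux anomaly is written as N = F - lam * dT (downward positive), the function name `gregory_regression` is hypothetical, and the data are synthetic rather than GCM output. The intercept at zero warming represents the fast response (the forcing estimate), and the slope represents the slow, temperature-mediated response.

```python
import numpy as np


def gregory_regression(delta_T, N):
    """Fit N = F - lam * delta_T by least squares and return (F, lam).

    delta_T : global-mean warming in each year of an abrupt experiment (K)
    N       : net top-of-atmosphere flux anomaly, downward positive (W m-2)
    """
    slope, intercept = np.polyfit(np.asarray(delta_T, dtype=float),
                                  np.asarray(N, dtype=float), 1)
    # Intercept: flux anomaly at zero warming (fast response / forcing estimate).
    # Slope: flux change per kelvin of warming; sign flipped to give lam.
    return intercept, -slope


# Synthetic annual means (hypothetical numbers, not GCM output).
dT_demo = np.linspace(1.0, 5.0, 150)
N_demo = 7.0 - 1.2 * dT_demo
F_est, lam_est = gregory_regression(dT_demo, N_demo)
```

With real model output the points scatter around the line because of internal variability, and any curvature in the relationship is one indication of state-dependent feedbacks.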

Non-linear responses
Non-linear mechanisms arise for a variety of reasons. Often, however, it is useful to describe them as state-dependent feedbacks. For example, the snow-albedo and sea-ice-albedo feedbacks become small at high or low snow depth (Hall, 2004; Eisenman, 2012). Soil moisture-temperature feedbacks can also be state dependent (Seneviratne et al., 2006, 2010): feedback is small when soil moisture is saturated, or so low that moisture is tightly bound to the soil (in both regimes, evaporation is insensitive to change in soil moisture). Sometimes, non-linear mechanisms may be better viewed as simultaneous changes in pairs of properties. For example, convective precipitation is broadly a product of moisture content and dynamics (Chadwick et al., 2013b; Chadwick and Good, 2013; Bony et al., 2014; Oueslati et al., 2016). Both moisture content and atmospheric dynamics respond to CO2 forcing, so in general we might expect convective precipitation to have a non-linear response to CO2 forcing. In addition, the Clausius-Clapeyron equation introduces some non-linearity in the increase of specific humidity with warming. Of course, more complex non-linear responses exist, such as for the Atlantic Meridional Overturning Circulation.
In contrast to linear mechanisms, non-linear mechanisms are sensitive to the magnitude of forcing. For example, the two points highlighted in Fig. 2 may have different patterns of change per kelvin of global warming, due to non-linear mechanisms. In contrast, linear mechanisms would cause no difference in the patterns of change per kelvin of global warming between the two points in Fig. 2, because the two scenarios have the same forcing history apart from a constant scaling factor.
An example is the snow-ice albedo feedback, which tends to change in magnitude with increased global temperature, due to declining snow and ice cover, and the remaining snow and ice being in areas of lower solar insolation (Colman and McAvaney, 2009).
Figure 2. Schematic illustrating the point that non-linear mechanisms can cause climate patterns to differ at different forcing (and hence global temperature) levels. This represents two different scenarios, whose forcing time series is identical apart from a constant scale factor (the higher forcing scenario has about twice the forcing of the lower scenario).
AbruptCO2 experiments may be used to separate non-linear from linear mechanisms. This can be done by comparing the responses at the same timescale in different abruptCO2 experiments. Figure 3 compares abrupt2xCO2 and abrupt4xCO2 experiments over years 50-149. A "doubling difference" is defined, measuring the difference in response to the first and second CO2 doublings. In most current simple climate models (e.g. Meinshausen et al., 2011), the radiative forcing from each successive CO2 doubling is assumed identical (because forcing is approximately linear in log[CO2]; Myhre et al., 1998). With this assumption, a linear system would have zero doubling difference everywhere. Therefore, the doubling difference is used as a measure of non-linearity. The question of which abruptCO2 experiments to compare, and over which timescale, is discussed in Sect. 5.
In some GCMs, the forcing per CO2 doubling has been shown to vary with CO2 (Colman and McAvaney, 2009; Jonko et al., 2013). However, this variation depends on the specific definition of forcing used (Jonko et al., 2013). Currently, this is folded into our definition of non-linearity. If a robust definition of this forcing variation becomes available in the future, it could be used to scale out any difference in forcing between pairs of abruptCO2 experiments, to calculate an "adjusted doubling difference".
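A doubling difference of the kind described above could be computed along the following lines. This is a sketch under assumptions: it takes the doubling difference to be the response to the second doubling (abrupt4xCO2 minus abrupt2xCO2) minus the response to the first (abrupt2xCO2 minus control), averaged over years 50-149 counted from 1; the function name, the array layout (time first) and the toy fields are hypothetical.

```python
import numpy as np


def doubling_difference(ctrl, a2x, a4x, first_year=50, last_year=149):
    """Second-doubling response minus first-doubling response, as time means.

    Each input is an array of annual means with time as the first axis,
    e.g. shape (years, lat, lon). Years are counted from 1 (an assumption).
    """
    window = slice(first_year - 1, last_year)  # years 50-149 -> indices 49..148

    def time_mean(field):
        return np.asarray(field, dtype=float)[window].mean(axis=0)

    first_doubling = time_mean(a2x) - time_mean(ctrl)
    second_doubling = time_mean(a4x) - time_mean(a2x)
    return second_doubling - first_doubling


# Toy fields on a 2 x 3 grid (hypothetical values, not GCM output).
shape = (150, 2, 3)
ctrl_demo = np.zeros(shape)
a2x_demo = np.full(shape, 1.0)
a4x_linear = np.full(shape, 2.0)     # each doubling adds the same change
a4x_nonlinear = np.full(shape, 2.5)  # second doubling adds more than the first

dd_linear = doubling_difference(ctrl_demo, a2x_demo, a4x_linear)
dd_nonlinear = doubling_difference(ctrl_demo, a2x_demo, a4x_nonlinear)
```

In the linear toy case the result is zero everywhere, which is the behaviour expected of a linear system when the forcing per doubling is identical.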

Experimental design
nonlinMIP is composed of a set of abruptCO2 experiments (the primary tools), plus a CO2-forced transient experiment.
AbruptCO2 experiments are driven by changes in atmospheric CO2 concentration: CO2 is abruptly changed, then held constant. These build on the CMIP5 and CMIP6 DECK protocols (the required runs from these are detailed in Table 1). The additional nonlinMIP runs (Table 2) are assigned three priority levels. The three options for participation are (1) only the "essential" simulation, (2) all "high priority" plus the "essential" simulations, or, preferably, (3) all simulations. The experiments in Table 1 are required in all cases. All experiments must be initialized from the same year of a pre-industrial control experiment, except for abrupt4xto1x (see Table 2). A typical analysis procedure is outlined in Sect. 5.
Figure 3 (caption fragment). This is defined for a specific timescale after the abrupt CO2 change; in this example, it is for means over years 50-149.
The nonlinMIP design is presently limited to CO2 forcing, although the same principles could be applied to other forcings.

Basic analysis principles
This section outlines the applications and general principles behind analysis of nonlinMIP results. First, some general applications are introduced, before giving more detail on how one particular application (quantifying and understanding non-linear change) may be analysed. The addition of the abrupt2xCO2 experiment to the standard DECK abrupt4xCO2 permits quantifying and understanding climate change due to CO2 for three main applications:
3. non-linear change (the difference between 2 and 1).
Applications 1 and 2 are expected to be of the widest interest to the community, as they could be analysed using the same methods as have already been used extensively to study the response in the CMIP5 abrupt4xCO2 experiment, but for climate states more relevant to the policy questions outlined in (1) and (2). Useful signal-to-noise ratios should be possible because ∼100-year means may be analysed (e.g. over years 50-149, where climate is relatively stable as it follows the initial ocean mixed-layer warming). Application 3 is more specialized, and is discussed in more detail below. The abrupt0.5xCO2 experiment permits analogous work, extending the relevance to colder past climates, and exploring one aspect of how past change may differ from future change. It also allows non-linear mechanisms to be studied with greater signal-to-noise ratio:
4. change under past cold climates (abrupt0.5xCO2 - piControl)
5. non-linear change: like 3, but with larger signal-to-noise ratio ([abrupt4xCO2 - abrupt2xCO2] - [piControl - abrupt0.5xCO2]).
(Table fragment) To test traceability of the abruptCO2 experiments to more realistic transient-forcing conditions. Adding the ramp-down phase explores physics relevant to mitigation and geo-engineering scenarios.
In quantifying non-linear change (applications 3 or 5 above), the primary idea is to find where the step-response model (Sect. 2) breaks: since the step-response model is based on a linear assumption, this amounts to detecting non-linear responses.
The aim is to focus subsequent analysis. If non-linearities in a quantity of interest are found to be small, then analysis may focus on understanding different timescales of response from a single abruptCO2 experiment: linearity means that the physical response (over a useful range of CO2 concentrations) is captured by a single abruptCO2 experiment. This represents a considerable simplification. If, on the other hand, non-linearities are found to be important, the focus shifts to understanding the different responses in different abruptCO2 experiments. The choice of which abruptCO2 experiments to focus on, and over which timescales, is discussed below.

First step: check basic traceability of abrupt4xCO2 to the transient-forced response near 4xCO2
The test described here is recommended as a routine analysis of the CMIP6 DECK experiments (even if nonlinMIP experiments are not performed). The aim is to confirm whether the abruptCO2 experiments contain realistic physical responses in the variables of interest, as previously done for global-mean temperature and heat uptake for a range of CMIP5 models, for regional-scale warming and ocean heat uptake (Good et al., 2015), and for other global-mean quantities for HadCM3 (Good et al., 2011). This also rules out the most pathological non-linearities (e.g. if the response to an abrupt CO2 change in a given GCM was unrealistic). Although this test has been done for a range of models and variables, traceability cannot be assumed to hold for all models and variables. The linear step-response model should first be used with the abrupt4xCO2 response, to predict the response near year 140 of the 1pctCO2 experiment (i.e. near 4xCO2). This prediction is then compared with the actual GCM 1pctCO2 result. This should first be done for global-mean temperature: this assessment has previously been performed for a range of CMIP5 models, giving an idea of the level of accuracy expected. If the abruptCO2 response is fundamentally unrealistic, it is likely to show up in the global temperature change. This approach may then be repeated for spatial patterns of warming, and then for the quantities of interest. Abrupt4xCO2 is used here as it has larger signal-to-noise than abrupt2xCO2, yet is representative of forcing levels in a business-as-usual scenario by 2100. However, the tests may also be repeated using abrupt2xCO2, but compared with year 70 of the 1pctCO2 experiment (i.e. at 2xCO2).
The step-response model emulation under these conditions should perform well for most cases: the state at year 140 of the 1pctCO2 experiment is very similar to that of abrupt4xCO2 (same forcing, similar global-mean temperature), so errors from non-linear mechanisms should be minimal. If large errors are found, this may imply caution about the use of abruptCO2 experiments for these variables, or perhaps point to novel non-linear mechanisms that may be understood by further analysis.
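The traceability test described above can be sketched for a global-mean series as follows. Several choices here are assumptions made for illustration and are not specified in the text: forcing is taken to be linear in log CO2 (so each 1 % annual increase is a fraction ln(1.01)/ln(4) of the 4xCO2 step forcing), "near year 140" is taken as the mean over years 121-140, the error is reported as a relative difference, and the function names are hypothetical.

```python
import numpy as np


def predict_1pctco2(x_abrupt4x, n_years=140):
    """Step-response prediction of a 1pctCO2 series from an abrupt4xCO2 series."""
    x = np.asarray(x_abrupt4x, dtype=float)
    if x.size < n_years:
        raise ValueError("abrupt4xCO2 series is shorter than the 1pctCO2 run")
    # Assumed: forcing linear in log(CO2), so every annual step carries the
    # same fraction of the 4xCO2 step forcing.
    w = np.full(n_years, np.log(1.01) / np.log(4.0))
    return np.convolve(w, x[:n_years])[:n_years]


def traceability_error(x_abrupt4x, y_1pct, first_year=121, last_year=140):
    """Relative error of the prediction, averaged over a window near 4xCO2."""
    pred = predict_1pctco2(x_abrupt4x, n_years=last_year)
    window = slice(first_year - 1, last_year)  # years counted from 1
    gcm_mean = np.asarray(y_1pct, dtype=float)[window].mean()
    return (pred[window].mean() - gcm_mean) / gcm_mean


# Toy check (hypothetical data): a variable that responds instantly and
# linearly to forcing, so the prediction should match the "GCM" exactly.
x_instant = np.ones(150)
y_instant = (np.log(1.01) / np.log(4.0)) * np.arange(1, 141)
err_demo = traceability_error(x_instant, y_instant)
```

For real GCM output the error would be judged against internal variability and against the accuracy found in earlier applications of this test.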

Second step: characterising non-linear responses
Having established some level of confidence in the abruptCO2 physical response, the second step is to look for non-linear responses. This first involves repeating the tests from step 1 above, but for different parts of the 1pctCO2 and 1pctCO2 ramp-down experiments, and using different abruptCO2 experiments for the step-response model. An example is given in Fig. 4 (but for different transient-forcing experiments). This shows results for global-mean precipitation in the HadCM3 GCM (Good et al., 2012), under an idealized simulation where forcing is ramped up at a constant rate for 70 years, then ramped down at the same rate for 70 years. Here, the step-response model prediction using abrupt4xCO2 (red curves) is close to the actual GCM simulation (black) only where the transient-forced simulation is near 4xCO2 (i.e. near year 70). Similarly, the prediction using abrupt2xCO2 (blue curves) works only near 2xCO2 (near years 35 or 105). Otherwise, quite large errors are seen, and the predictions with abrupt2xCO2 and abrupt4xCO2 are quite different from each other. This implies that there are large non-linearities in the global-mean precipitation response in this GCM, and that they may be studied by comparing the responses in the abrupt2xCO2 and abrupt4xCO2 experiments.

Table 2. nonlinMIP experimental design. Three options are only the "essential" simulation, all "high priority" plus the "essential" simulations, or all simulations. The experiments in Table 1 are required in all cases.
(Experiment name missing in source.) Role: to diagnose non-linear responses (in combination with abrupt4xCO2); assess climate response and (if appropriate) make climate projections with the step-response model at forcing levels more relevant to mid- or low-forcing scenarios.
Abrupt0.5xCO2 (essential). Description: as abrupt4xCO2 (see Table 1), but at half pre-industrial CO2 concentration. Role: to diagnose non-linear responses (in combination with abrupt4xCO2 and abrupt2xCO2); offers greater signal-to-noise ratios for regional precipitation change than if just abrupt2xCO2 was used; also relevant to palaeoclimate studies.
Extend both abrupt2xCO2 and abrupt4xCO2 by 100 years (high priority). Role: permit improved signal-to-noise ratio in diagnosing some regional-scale non-linear responses; explore longer timescale responses than in CMIP5 experiment; permit step-response model scenario simulations from 1850 to 2100; allow traceability tests (via the step-response model) against most of the 1pctCO2 ramp-up-ramp-down experiment; provide a baseline control for the abrupt4xto1x experiment.
1pctCO2 ramp-down (medium priority). Description: initialized from the end of 1pctCO2; CO2 is decreased by 1 % per year for 140 years (i.e. returning to pre-industrial conditions). Role: to test traceability of the abruptCO2 experiments to more realistic transient-forcing conditions; adding the ramp-down phase explores a much wider range of physical responses, providing a sterner test of traceability; relevant also to mitigation and geoengineering scenarios.
Abrupt4xto1x (medium priority). Description: initialized from year 100 of abrupt4xCO2; CO2 is abruptly returned to pre-industrial levels, then held constant for 150 years. Role: quantify non-linearities over a larger range of CO2 (quantifies responses at 1xCO2); assess non-linearities that may be associated with the direction of forcing change.
Abrupt8xCO2 (medium priority). Description: as abrupt4xCO2, but at 8× pre-industrial CO2 concentration; only 150 years required here. Role: quantify non-linearities over a larger range of CO2.
Having identified some non-linear response, and highlighted two or more abruptCO2 experiments to compare (in the previous example abrupt2xCO2 and abrupt4xCO2), the non-linear mechanisms may be studied in detail by comparing the responses in the different abruptCO2 experiments over the same timescale (e.g. via the doubling difference, as in Fig. 3). This allows non-linear mechanisms to be separated from linear mechanisms (Good et al., 2012; Chadwick and Good, 2013; Good et al., 2015), which is not possible in a transient-forcing experiment. It is expected that analysis will focus on the 100-year period over years 40-139 of the experiments (the relatively stable period after the initial ocean mixed-layer warming).
In the same spirit as other CMIP5 and CMIP6 idealized experiments, nonlinMIP will help understand non-linear mechanisms by isolating the signal of non-linear mechanisms more effectively. This occurs in two ways. First, the forcing is simplified compared to the time-dependent RCP projections (the latter feature multiple forcings of evolving strength). The simplified forcing means that alternative mechanisms (from different forcing agents or linear mechanisms) may be ruled out by design. Second, contamination of the signal from internal variability may be reduced, as averages of around 100 years are possible.
Figure 4 (Good et al., 2012). Time series of global-mean precipitation change under two experiments. Left: where CO2 is increased by 1 % per year, then stabilized at 2× pre-industrial levels. Right: where CO2 is increased by 2 % per year for 70 years, then decreased by 2 % per year for 70 years. Black: GCM. Red: step-response model using the abrupt4xCO2 response. Blue: step-response model using the abrupt2xCO2 response.
The magnitude of internal variability may also be estimated at the different levels of CO2 forcing. This could be used to help explore changes in variability with warming (Seneviratne et al., 2006; Screen, 2014), and to assess significance of any signal of non-linear change in the time-mean climate. Internal variability could be estimated from years 40 to 139 of the experiments (after the initial warming of the ocean mixed layer), after removing a fitted linear trend.
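The variability estimate described above (years 40 to 139, linear trend removed) might be sketched as follows. The function name, the choice of standard deviation as the variability measure, and the correction for the two fitted trend parameters are assumptions made for illustration; years are counted from 1.

```python
import numpy as np


def internal_variability(series, first_year=40, last_year=139):
    """Standard deviation of annual means after removing a fitted linear trend."""
    y = np.asarray(series, dtype=float)[first_year - 1:last_year]
    t = np.arange(y.size)
    slope, intercept = np.polyfit(t, y, 1)   # least-squares linear trend
    residual = y - (slope * t + intercept)
    return residual.std(ddof=2)              # two parameters were fitted


# Toy series (hypothetical, not GCM output): a slow drift with and without
# an alternating +1/-1 fluctuation superimposed.
n = 150
drift = 0.01 * np.arange(n)
wiggle = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
sd_trend_only = internal_variability(drift)
sd_with_noise = internal_variability(drift + wiggle)
```

Repeating the calculation for each abruptCO2 experiment gives one variability estimate per forcing level, which can then be compared across levels.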

Conclusions
These experiments can help improve climate science and consequent policy advice in a number of ways. The focus is on understanding mechanisms (given the idealized nature of the experiments). A further application, however, is that energy balance models could be tuned to the different experiments, to explore the importance, for projections, of state dependence of feedback parameters (Hansen et al., 2005;Colman and McAvaney, 2009;Caballero and Huber, 2013). Also, if certain regions are found to show strongly non-linear behaviour in these experiments, this could help focus assessment of impact tools like pattern scaling or time shifting (e.g. Herger et al., 2015).
Probably of widest interest is the fact that the additional experiments will allow understanding work to focus on climate states more directly relevant to discrete policy/science questions (the benefits of mitigation; impacts of scenarios consistent with the Paris Agreement; or understanding past cold climates; see start of Sect. 5). These questions may show important differences, due to state dependence (non-linearity) of mechanisms, but for many cases the nature of the non-linearity may not need to be assessed. A classical example is the snow-albedo feedback: the strength of this would be different in a warm vs. a cold world (due to different baseline snow cover), but if the focus is on understanding the warm world, the first priority is to study experiments representative of the warm world (with the correct climate state).
There is also a need to quantify and understand, at regional scales, non-linear mechanisms of climate change; that is, do the above science/policy questions give significantly different answers (e.g. different patterns of rainfall change), and why? This is difficult to do using transient model experiments alone, for two reasons: contamination due to different timescales of response, and noise from internal variability.
This paper outlines the basic physical principles behind the nonlinMIP design, and the method of establishing traceability from abruptCO2 to gradual forcing experiments, before detailing the experimental design and finally some general analysis principles that should apply to most studies based on this dataset.

Data availability
Results will be made available as part of the CFMIP project, within the sixth model intercomparison project, CMIP6.