Overview of experiment design and comparison of models participating in phase 1 of the SPARC Quasi-Biennial Oscillation initiative (QBOi)

The Stratosphere-troposphere Processes And their Role in Climate (SPARC) Quasi-Biennial Oscillation initiative (QBOi) aims to improve the fidelity of tropical stratospheric variability in general circulation and Earth system models by conducting coordinated numerical experiments and analysis. In the equatorial stratosphere, the QBO is the most conspicuous mode of variability. Five coordinated experiments have therefore been designed to (i) evaluate and compare the verisimilitude of modelled QBOs under present-day conditions, (ii) identify robustness (or, alternatively, the spread/uncertainty) in the simulated QBO response to commonly imposed changes in model climate forcings (e.g., a doubling of CO2 amounts), and (iii) examine model dependence of QBO predictability. This paper documents these experiments and the recommended output diagnostics. The rationale behind the experimental design and choice of diagnostics is presented. To facilitate scientific interpretation of the results in other planned QBOi studies, consistent descriptions of the models performing each experiment set are given, with those aspects particularly relevant for simulating the QBO tabulated for easy comparison.

emphasis in particular on the non-orographic gravity wave drag (GWD) parameterizations used by almost all of the QBOi models. Closing remarks including future plans follow in Section 6.

Scientific rationale
A crucial test of our understanding and ability to model the QBO occurred around the beginning of 2016 when the QBO cycle was unexpectedly disrupted for the first time since its discovery in the late 1950s (Dunkerton et al., 2016; Newman et al., 2016; Osprey et al., 2016; Coy et al., 2017). The well-established QBO paradigm, originating from the 1960s, of alternate eastward and westward momentum deposition from vertically propagating equatorial waves (Baldwin et al., 2001) could not account for this disruption. Despite the fact that the QBO is normally highly predictable (Pohlmann et al., 2013; Scaife et al., 2014) the disruption was completely missed by seasonal forecasts, and this failure illustrates the difficulty models have in capturing the complex phenomenology of the QBO and its full range of variability. Similar disruptions have only very rarely been seen in multi-decadal simulations and from just a few models with QBO-like oscillations (e.g., Osprey et al., 2016).
It is possible that the models may be over-tuned to ensure that they capture the mean behaviour of selected metrics (e.g., mean period and amplitude) of the present-day QBO. Furthermore, the disruption itself raises the possibility that the real QBO is less robust than previously thought, although it has since returned to its usual cycling as predicted.
With the advent of non-orographic GWD parameterizations and/or the use of increased vertical resolution in the stratosphere, a growing number of global models have been able to reproduce QBO-like variability in the equatorial stratosphere (e.g., Takahashi, 1996; Scaife et al., 2000; Hamilton et al., 2001; Giorgetta et al., 2002; Shibata and Deushi, 2005; Anstey et al., 2010; Kawatani et al., 2010; Orr et al., 2010; Lott and Guez, 2013; Richter et al., 2014; Rind et al., 2014; McCormack et al., 2015). However, common deficiencies exist in all current simulations, notably with QBO winds often being unrealistically weak in the lowermost stratosphere and having unrealistically small cycle-to-cycle variability (e.g., Schenzinger et al., 2017). The simulated QBOs can also be quite "fragile", which is to say sensitive to many different aspects of model formulation depending on the model. For example, the QBO in the Canadian Middle Atmosphere Model (AGCM3-CMAM) is sensitive to the balance of resolved and parameterized wave forcing, while in different versions of the Met Office Unified Model (MetUM) the QBO is sensitive to the specification of stratospheric ozone (Butchart et al., 2003; Bushell et al., 2010) and/or the parameterized gravity waves (Bushell et al., 2010; Kim et al., 2013). Sensitivity to vertical resolution has been reported by numerous studies, for example by Giorgetta et al. (2006) for the Middle Atmosphere version of the ECHAM5 (MAECHAM5) model and by Geller et al. (2016) for the NASA Goddard Institute for Space Studies (GISS) climate model. In addition, Yao and Jablonowski (2015) identified a sensitivity to the choice of dynamical core.
Other key questions concerning simulation of the QBO lie with its possible synchronisation with other modes of variability such as the annual cycle (e.g., Rajendran et al., 2016) and El Niño-Southern Oscillation (e.g., Christiansen et al., 2016), with the QBO's predictability (e.g., Pohlmann et al., 2013; Scaife et al., 2014) and finally with the robustness of the QBO response to climate change (e.g., Kawatani and Hamilton, 2013; Schirber et al., 2015).
Phase 1 of QBOi focuses on reducing these uncertainties in simulated QBOs by conducting coordinated experiments that will allow for more rigorous intercomparison of models than is otherwise possible from individual studies. The aim is to address the ability of GCMs to capture the QBO in the present climate, to predict its behaviour under climate-change forcings, and to predict its evolution when initialized with observations (i.e., hindcasts). Anstey et al. (2015) and Hamilton et al. (2015) briefly describe a set of five QBO experiments which are designed to be simple and accessible to a wide range of modelling groups. The motivation and specific goals for each of these experiments are presented below with the technical specifications given in Appendix A. The aim is for modelling groups to perform all five experiments. Even if this is not possible, it is important that the same model version is used for the subset of experiments that are conducted, i.e., there should be no tuning of free parameters between experiments. Use of the same model version for the different experiments is crucial for learning the most from this study. The model version used should be that which the group considered gave the "best" representation of the QBO under present-day conditions (e.g., in Experiment 1 or similar preparatory simulations). Of course there are situations when two different versions of a model might be used to perform the experiment set, such as when high and low resolution versions or alternative non-orographic GWD parameterizations are available. In these situations the results would then be treated for the purpose of the QBOi multi-model analysis as if they were obtained from two separate models (although interpretation of results will need to be aware of, and test for sensitivity to, the possible dominance of the results by one particular family of models).
All experiments are for AGCMs apart from an option to perform Experiment 5 with a coupled ocean, which is denoted as Experiment 5A (see below).

Experiment list and goals
Present-day climate
The first two experiments are designed with the goal of identifying and distinguishing the properties of and mechanisms underlying the variety of model simulations of the QBO in present-day conditions:
The main differences between these two experiments are expected to arise from the differences between their specified SSTs. These experiments will allow an evaluation of the accuracy of modelled QBOs under present-day climate conditions, employing the diagnostics and metrics discussed in Section 4. The impact of interannually varying forcing (e.g., Figure 2) on the

Climate projections
Two further experiments are designed to subject the modelled QBOs (i.e., the QBO simulated by the present-day experiments) to an external forcing similar to that typically applied for climate projections:
-Experiment 3 (2×CO2 timeslice): Identical to Experiment 2, but with a change in CO2 concentration and specified SSTs appropriate for a 2×CO2 world (100 years or ensemble of 3×30 years).
These experiments will allow the response (i.e., 2×CO2 - 1×CO2 and 4×CO2 - 1×CO2) of the QBO, its forcing mechanisms, and its impact/influence to be evaluated using the same diagnostics and metrics used in the analysis of Experiments 1 and 2. Key questions that will be addressed are:

-What is the spread/uncertainty of the forced model response?
-Do different models cluster in any particular way?
-Can a connection/correlation be made between QBOs with similar metrics/diagnostics in present-day climate and their response to CO2 forcing?
The motivation is to investigate what aspects of modelled QBOs determine the spread, or uncertainty, of the QBO response to CO2 forcing. These aspects are considered high priority by QBOi in order to reduce uncertainty in future projections.
These experiments will also provide context for the uncertainty in climate change projections of QBO behaviour among the state-of-the-art GCMs being used in CMIP6.
Furthermore, the possibility was noted in Section 2 that models may be over-tuned to ensure that they capture the behaviour of the present-day QBO. If so, then a large multi-model spread in the forced response may indicate that such tuning constitutes, in effect, an "overfitting" of models to present-day conditions.

QBO hindcasts
The goal of the final experiment is to evaluate and compare the predictive skill of modelled QBOs in a retrospective hindcast context, quantify this predictive capability in multiple models, and study the model processes driving the evolution of the QBO:
-Experiment 5 (hindcasts): A set of initialized QBO hindcasts of 9-12 months using the observed SSTs and forcings specified as in Experiment 1. Specified start dates are 1 May and 1 November for the years 1993-2007 (i.e., 15 years, 30 start dates) with initial atmospheric conditions obtained from reanalyses (at least 3-member ensemble).
Because of the prescribed SSTs these are not true prediction experiments; nonetheless they provide an important test of how well models can predict the evolution of the QBO from specified initial conditions that reasonably sample the full range of QBO phases, despite some clustering of the 1 May initial profiles (Figure 3).
[Figure 3 caption, partly recovered: ... (Dee et al., 2011). The two profiles shown in coloured lines (May 1993 and November 2005, taken as representative of eastward and westward QBO phases in the lower stratosphere, respectively) are those used in offline comparison of the gravity-wave drag parameterizations presented in Section 5.1.]
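As a quick consistency check on the Experiment 5 specification, the start dates can be enumerated programmatically. The sketch below is illustrative only: the function name `hindcast_start_dates` and its arguments are hypothetical and not part of any QBOi tooling; the dates and the minimum ensemble size are taken from the experiment description above.

```python
from datetime import date


def hindcast_start_dates(first_year=1993, last_year=2007,
                         month_days=((5, 1), (11, 1))):
    """Return the hindcast start dates in chronological order.

    month_days holds (month, day) pairs; the defaults correspond to
    1 May and 1 November as stated for Experiment 5.
    """
    return [date(year, month, day)
            for year in range(first_year, last_year + 1)
            for month, day in month_days]


START_DATES = hindcast_start_dates()
MIN_ENSEMBLE_SIZE = 3  # "at least 3-member ensemble"
# Lower bound on the number of hindcast integrations per model.
MIN_RUNS = len(START_DATES) * MIN_ENSEMBLE_SIZE
```

With the default arguments this yields 30 start dates, so a 3-member ensemble implies at least 90 hindcast integrations per model.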
Key questions that will be addressed are:
-How does prediction skill vary among models, and to what extent, and for how long, are models able to predict the QBO evolution correctly at different vertical levels and different phases of the QBO?
-How does the forecast skill relate to the behaviour of the QBO in Experiment 1? Are realistic QBO simulations in a multi-decadal simulation well correlated with skillful long-term deterministic predictions?
-Do the models that cluster and/or do well in the prediction experiments cluster in the CO2 forcing experiments?
One aim is to investigate which aspects of modelled QBOs determine the quality of QBO prediction and therefore where development needs to be focused for model improvement. The hindcast framework can also be helpful for directly assessing model changes, possibly driving improvements in free-running models. Further motivation for these experiments is to investigate the possibility of using the hindcast results to narrow the range of plausible models for climate change experiments.
It is recognised that some groups may already have completed operational seasonal hindcasts for the period 1993-2007 using a coupled ocean-atmosphere model, and therefore for the QBOi multi-model analysis an acceptable alternative (or addition) to Experiment 5 is:
-Experiment 5A (hindcasts): A set of initialized QBO hindcasts of 9-12 months identical to Experiment 5 apart from replacing the specified SSTs with a coupled ocean model appropriately initialized (at least 3-member ensemble).
Full comparison with the other models providing Experiment 5 output will nonetheless depend on most of the diagnostics discussed in Section 5 being available from those groups providing Experiment 5A output.

Process studies
A secondary purpose of Experiment 5 is to investigate and evaluate differences in wave dissipation and momentum deposition, so as to understand the processes driving the QBO in each model and separate the contributions from resolved and unresolved waves (e.g., Scaife et al., 2000; Shibata and Deushi, 2005). Due to the initialization of the hindcasts, each model will have essentially the same initial basic state, and its evolution immediately after the start of the forecast will allow the properties of wave dissipation and momentum deposition to be compared and contrasted between different models given a near-identical basic state. Specifying the same observed SST in all models (rather than allowing each model to predict its own SST evolution) facilitates the comparison as it eliminates any differences resulting from the evolving ocean. Short periods of additional high frequency diagnostics are requested to maximize the benefits of the multi-model comparison.

The diagnostics requested by QBOi draw on those requested by other major multi-model intercomparison projects, in particular DynVarMIP (Gerber and Manzini, 2016a), though they have been specifically tailored through community discussion for the analysis of the QBO in Experiments 1-5. The requested diagnostics are described in this section; additional technical information on how they should be formatted and uploaded to the shared QBOi repository is available in the Supplement.
For ease of comparison among models most output variables are requested on a standard set of 30 pressure levels: 1000, 925, 850, 700, 600, 500, 400, 300, 250, 200, 175, 150, 120, 100, 85, 70, 60, 50, 40, 30, 20, 15, 10, 7, 5, 3, 2, 1.5, 1.0 and 0.4 hPa.

Spatial and temporal resolution
These are adapted from the extended levels set requested by DynVarMIP for CMIP6 (e.g., Gerber and Manzini, 2016a) to obtain a vertical resolution in the upper tropical troposphere and lower stratosphere (i.e., between 200 hPa and 40 hPa) of 1.0 to 1.5 km. There are two exceptions, however:
-Data to be used for calculating equatorial wave spectra (6-hourly instantaneous fields) should be provided at vertical resolution equivalent to the model resolution to ensure accurate calculation of QBO wave forcing (e.g., Kim and Chun, 2015a); see below for further details.
-To reduce data volume, daily-mean 3-dimensional (3D) variables are requested for only the 8 pressure levels used by CMIP5: 1000, 850, 700, 500, 250, 100, 50 and 10 hPa. These data will be used mainly to examine the QBO influence on other regions of the atmosphere [e.g., on the North Atlantic Oscillation (NAO)] and higher vertical resolution is not considered necessary.
Horizontal resolution should be the same as the model but if data volume is an issue then a reduced grid is acceptable, provided the reduction method is documented.
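The stated spacing of roughly 1.0 to 1.5 km between 200 hPa and 40 hPa can be checked approximately from the level list itself. The sketch below converts the standard levels to log-pressure altitude. It assumes a constant scale height of 7 km and a reference pressure of 1000 hPa, neither of which is specified in the text, and the function names are illustrative.

```python
import math

# Standard 30-level QBOi pressure set (hPa), as listed in the text.
QBOI_LEVELS_HPA = [
    1000, 925, 850, 700, 600, 500, 400, 300, 250, 200,
    175, 150, 120, 100, 85, 70, 60, 50, 40, 30,
    20, 15, 10, 7, 5, 3, 2, 1.5, 1.0, 0.4,
]


def log_pressure_height_km(p_hpa, scale_height_km=7.0, p_ref_hpa=1000.0):
    """Log-pressure altitude z = H ln(p_ref / p); H = 7 km is an assumption."""
    return scale_height_km * math.log(p_ref_hpa / p_hpa)


def layer_spacings_km(levels_hpa, p_bottom_hpa, p_top_hpa, scale_height_km=7.0):
    """Thickness of each layer between adjacent levels within the given range."""
    selected = [p for p in levels_hpa if p_top_hpa <= p <= p_bottom_hpa]
    heights = [log_pressure_height_km(p, scale_height_km) for p in selected]
    return [upper - lower for lower, upper in zip(heights, heights[1:])]


# Layer thicknesses (km) in the upper troposphere / lower stratosphere.
UTLS_SPACINGS_KM = layer_spacings_km(QBOI_LEVELS_HPA, 200.0, 40.0)
```

Under these assumptions the nine layers between 200 hPa and 40 hPa are roughly 0.9 to 1.6 km thick, broadly in line with the quoted range. A smaller scale height, as might suit the cold tropical tropopause region, gives proportionally thinner layers.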
To examine the daily-mean and monthly-mean QBO zonal-mean momentum budget, terms making up the TEM zonal momentum equation (e.g., Andrews et al., 1987, p. 127-130) are requested following the recipe given by Gerber and Manzini (2016a, Appendix A3), but also see their corrigendum (Gerber and Manzini, 2016b). In particular note the importance of calculating the individual terms from 6-hourly or higher frequency data (e.g., every time step) and the need for sufficient vertical resolution (e.g., the standard pressure levels listed above) for accurate estimates of the vertical derivatives. Furthermore, to examine the wavenumber-frequency spectra of the equatorial waves (e.g., Horinouchi et al., 2003; Lott et al., 2014) instantaneous values of 3D winds and temperature are requested every 6 hours on model levels or on pressure levels at roughly equivalent vertical resolution to the model levels but, to reduce data volumes, only for levels between 100 hPa and 0.4 hPa and for latitudes between 15°N and 15°S. For ease of analysis, pressure levels at model-level resolution are preferred over actual model levels.
Table 1. Climate and variability. Monthly and daily means, with 2D indicating a longitude-latitude-time (XYT) field and 3D indicating a longitude-latitude-pressure-time (XYPT) field. XY is typically the model's horizontal output grid and P is the standard 30-level set of diagnostic pressure levels described in Section 4.1.

Output period
Monthly-mean output is requested for the full duration of all experiments and all ensemble members. Likewise, for Experiment 5 daily-mean output is requested for the full duration of each ensemble member. On the other hand, for Experiments 1-4 daily-mean output is only requested for the first 30 years and/or the first ensemble member.
High-frequency (6-hourly) diagnostics for calculating equatorial wave spectra are requested for the following periods and ensemble members for each experiment:
Table 2. Dynamics. (a) Monthly-mean and daily-mean fields and contributions to zonal-mean zonal momentum equation (YPT). (b) Monthly-mean tendencies and fluxes from parameterized gravity waves (XYPT). (c) Daily-mean sources for orographic and non-orographic gravity waves (XYT). P is the standard 30-level set of diagnostic pressure levels described in Section 4.1 (also Table 1).

Requested output variables
Similarly to DynVarMIP (Gerber and Manzini, 2016a), the requested variables are separated into three categories: standard variables (Table 1) for diagnosing the climate and variability in the models, dynamical variables (Table 2) for analysing momentum transport and budgets, and thermodynamic quantities (Table 3). In addition a fourth category of variables (Table 4) will enable the equatorial wave spectra (e.g., Horinouchi et al., 2003; Lott et al., 2014) to be compared among the models.
Table 3. Thermodynamics. Monthly-mean and daily-mean zonal-mean fields (YPT). P is the standard 30-level set of diagnostic pressure levels described in Section 4.1 (also Table 1).
Alternatively the data can be provided on actual model levels, although in this case the data required for conversion between model and pressure levels must also be provided.
Section 3 it is also clear that for an AGCM to participate in these experiments it must be configured with a number of essential characteristics (e.g., land-ocean contrast, annual cycle, and a radiation scheme that can accommodate changes in CO2 amounts). Apart from this QBOi does not impose any restrictions on the representation in participating models of any physical process or, indeed, chemical process for those models with interactive ozone. Of course, participating models are expected to properly resolve the stratosphere with an average vertical resolution of the order of 2 km or less between 100 hPa and 1 hPa and an upper boundary somewhere above that (cf., high and low top results in Osprey et al., 2013). However, it is not strictly necessary for a model to display QBO-like variability in the equatorial stratosphere as additional insight can be gained by comparing models with and without this property.
Models with QBO-like variability but without a properly resolved stratosphere (e.g., with upper boundary below 1 hPa) are also considered since, again, this potentially provides guidance on the level of stratospheric detail that is required in order to reproduce a QBO. There are 17 models or model-versions participating in phase 1 of QBOi (i.e., data from 17 models has been uploaded or is planned for upload to the shared QBOi repository; see Supplement for details of this repository). These models are listed in Table 5 along with the institutes and investigators using the models and their contact information. The model names given refer to the names used in the repository while the information given in Tables 6 and 7 refers specifically to the configuration and parameter settings used by each model when producing the uploaded data.
More comprehensive descriptions of the individual models can be found in the references given in the last column of Table 5.
Table 6 (caption, partly recovered). For spectral models the horizontal resolution is given in terms of triangular truncation of spectral coefficients, from which a grid spacing can be estimated as described in the Figure 5 caption. For example, T63 ~ 2.8° × 2.8°, T159 ~ 1.125° × 1.125°, and T255 ~ 0.7° × 0.7°, corresponding roughly to grid lengths 310 km, 130 km and 80 km, respectively. Upper boundary altitude is given in terms of pressure and log-pressure altitude as described in the Figure 4 caption.
Table 7 (caption, partly recovered). Schemes marked † are non-orographic GWD parameterizations based on a wave-spectrum approach, while in schemes marked ‡ the wave spectrum is treated as a collection of monochromatic waves. For the models using the Warner and McIntyre (1999) scheme (HadGEM2-A, HadGEM2-AC, UMGA7, UMGA7gws, and UMGC2), the use of the scheme to generate a QBO is described in Scaife et al. (2002). For IFS43r1, the use of the Scinocca (2003) scheme to generate a QBO is described in Orr et al. (2010). For MPI-ESM-MR, the use of the Hines (1997a, b) scheme is described in Schmidt et al. (2013). The abbreviation in square brackets for each scheme (2nd column; "[WM]", "[H]" or "[L]") denotes the type of dissipation used in the scheme as labelled in Figure 7. "Fixed" in column 3 refers to sources of parameterized gravity waves that are not linked to any other model physical variable (see Footnote 1, Section 5). Note however that "fixed" includes sources that vary in time and/or space in a prescribed way, as well as stochastically (e.g., as is done in the ECHAM5sh model).
It should be noted that common model development history can lead to a lack of full independence among models. The extent to which shared development history affects model independence can be difficult to assess and varies among models (e.g., Knutti et al., 2013). Apart from describing those aspects of model formulation that are expected to be relevant to the QBO (Tables 6 and 7), detailed consideration of model independence is outside the scope of this paper. However, note that out of the 17 QBOi models, there are two pairs of models that are identical in all respects but one: HadGEM2-A and UMGA7 use fixed sources for their non-orographic gravity wave parameterizations, while their counterparts HadGEM2-AC and UMGA7gws, respectively, use parameterized gravity wave sources; this distinction is described in more detail below.
Properties of the models (Tables 6 and 7) that are of particular relevance for simulating a QBO are:
-Horizontal resolution: This is likely to have a significant impact on the development and evolution of wave sources in the tropical troposphere, which are important for forcing the QBO. Horizontal resolution may also affect the propagation and breaking of large-scale Rossby waves propagating from the extratropics, which are now known to affect the QBO (e.g., Osprey et al., 2016). Figure 5 (see also column 2 of Table 6) shows the horizontal resolution of each model and how the differences in horizontal resolution compare to the differences in stratospheric vertical resolution.
[Figure 5 caption, partly recovered: ... Table 6 for total number of vertical levels in each model. The horizontal grid spacing is estimated by calculating the average of the zonal and meridional grid spacings, $(\Delta\lambda + \Delta\phi)/2$, and converting this to a value in km at the equator. For spectral models with triangular truncation we assume $\Delta\lambda = \Delta\phi = \frac{2}{3} \cdot 180^\circ/(T + 1)$ as an estimate of the transform grid resolution, where T is the truncation wavenumber as given in column 2 of Table 6.]

-Timestep: The increasing use of inherently stable advection schemes such as semi-implicit semi-Lagrangian methods allows for longer timesteps than are possible, say, with a more traditional Eulerian advection. While this can lead to significant savings in computing requirements, particularly at higher spatial resolution, an adverse effect is the filtering or damping of high frequency equatorial waves (e.g., Shutts and Vosper, 2011) that can potentially make a significant  Table 6 for the different dynamical timesteps used by the participating models.
-Numerical advection scheme: Model dependence of the QBO on numerical advection schemes generally arises through a sensitivity of the wave propagation characteristics and, perhaps more importantly, the strength of the Brewer-Dobson circulation (Butchart, 2014) due to the tropical upwelling which opposes the descending QBO cycles in the standard paradigm (Baldwin et al., 2001).
-Parameterized sub grid-scale waves (non-orographic gravity waves): A very significant development in models that has led to increased success in simulating QBO-like variability has been the introduction of non-orographic GWD parameterizations. Early schemes focused on parameterizing the (vertical) propagation and dissipation of sub grid-scale waves from spatially and temporally fixed sources while more recent developments have included parameterized sources too (e.g., Beres et al., 2005; Choi and Chun, 2011; Lott and Guez, 2013; Schirber et al., 2014; Bushell et al., 2015).
Broadly speaking there have been two approaches to parameterizing the propagation and dissipation. The first, followed by Hines (1997a, b) and Warner and McIntyre (1996), aims to represent a broad spectrum of unresolved gravity waves generated by a variety of sources, while the alternative method is to represent the wave spectrum by a finite number, or collection, of monochromatic waves such as described by Lindzen (1981) or Alexander and Dunkerton (1999). All models or model-versions participating in QBOi, with the exception of MIROC-AGCM-LL, include at least one parameterization of non-orographic GWD, with the superscripts † or ‡ in the second column of Table 7 indicating, respectively, whether the spectrum or collection of monochromatic waves method is used. A comparison of how the different schemes attenuate parameterized eastward and westward momentum fluxes of non-orographic gravity waves propagating upward through typical wind profiles with opposite phases of the QBO is shown in Figure 7 and described in detail in Section 5.1, below. Five of the 17 models [60LCAM5, CESM1(WACCM5-110L), HadGEM2-AC, LMDz6 and UMGA7gws] have extended their non-orographic GWD parameterizations to include parameterized gravity wave sources (see Footnote 1). References giving details of these extended parameterizations are listed in column 3 of Table 7. In most cases this has simply involved replacing an ersatz "fixed" source with one that is more physically based, although for the LMDz6 model the previously-used Hines scheme was replaced with a new GWD parameterization (Lott et al., 2012; Lott and Guez, 2013). There are two pairs of models that are identical except for their gravity wave source being fixed / parameterized: UMGA7 / UMGA7gws and HadGEM2-A / HadGEM2-AC. Hence it will be possible to assess the impact these model developments have on the simulation of the QBO and how it responds to changes in climate forcings, at least for a small subset of the participating models.
-Convection: An important source of equatorial waves in the models is convection and its associated diabatic heating. Gravity wave source parameterizations also typically couple the generation of parameterized GWD to parameters obtained from the convection schemes such as the precipitation (e.g., Lott and Guez, 2013). The different convection schemes used by the participating models are listed in column 4 of Table 7 for easy comparison.
Footnote 1: A "source parameterization" denotes a gravity wave source that is coupled with other physical fields in the model, such as precipitation or deep convective heating, and therefore varies temporally and spatially. In contrast, "fixed" gravity wave sources are not coupled to other physical fields. Fixed sources are often constant in time, although this category could also include sources that have a prescribed temporal variation (e.g., seasonal cycle) or are stochastic.
Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2017-187. Manuscript under review for journal Geosci. Model Dev. Discussion started: 26 October 2017. © Author(s) 2017. CC BY 4.0 License.
-Ozone climatology and feedbacks when interactive chemistry is included: Although differences in ozone climatologies can potentially impact on simulated QBOs (e.g., Bushell et al., 2010), precise specifications for the ozone forcing were not included in the experiment descriptions (Section 3; Appendix A) to allow for the inclusion of models with prognostic ozone and also to keep the experiment specifications as simple as possible. Therefore for those models without ozone chemistry there are some variations among the ozone climatologies that have been prescribed. Figure 6 illustrates these variations in the tropics, for the ozone used in the timeslice experiments (Experiments 2-4).
[Figure 6 caption, partly recovered: ... Table 5 indicates which models have performed these experiments), for models that do not include ozone chemistry (as indicated in Table 7). Each vertical profile is an average over the 5°S-5°N latitude band, zonal mean, and annual mean.]
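The grid-spacing estimate described in the Figure 5 caption for spectral models can be written out as a short calculation. This is a minimal sketch under stated assumptions: the 2/3 factor is read from the caption as extracted, whereas the examples quoted for T63, T159 and T255 (about 2.8°, 1.125° and 0.7°) match 180°/(T + 1), so the factor is left as a parameter. The Earth radius of 6371 km and the function names are also assumptions introduced here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def spectral_grid_spacing_deg(truncation, factor=2.0 / 3.0):
    """Estimated grid spacing (degrees) for triangular truncation T.

    factor=2/3 follows the Figure 5 caption as extracted; factor=1.0
    reproduces the T63 ~ 2.8 degree example quoted for the tables.
    """
    return factor * 180.0 / (truncation + 1)


def equatorial_spacing_km(dlon_deg, dlat_deg, radius_km=EARTH_RADIUS_KM):
    """Average the zonal and meridional spacings and convert to km at the equator."""
    mean_deg = 0.5 * (dlon_deg + dlat_deg)
    return math.radians(mean_deg) * radius_km


# Spacing in degrees and km for the three truncations quoted in the text,
# using factor=1.0 so that the quoted examples are reproduced.
QUOTED_EXAMPLES = {
    t: (spectral_grid_spacing_deg(t, factor=1.0),
        equatorial_spacing_km(spectral_grid_spacing_deg(t, factor=1.0),
                              spectral_grid_spacing_deg(t, factor=1.0)))
    for t in (63, 159, 255)
}
```

With `factor=1.0` the three truncations give spacings close to the quoted 310 km, 130 km and 80 km; with the default 2/3 factor the same truncations give spacings one third smaller, so it is worth confirming which convention a given table or figure uses before comparing models.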

Offline comparison of non-orographic gravity wave drag schemes
As noted above, non-orographic GWD parameterizations have been important for the generation of a QBO in many climate models (as Table 7 indicates, only one of the QBOi models does not use parameterized GWD). The non-orographic GWD

[Figure 7 caption, partly recovered: The offline calculations are performed using ERA-Interim equatorial zonal and meridional winds and temperatures for 1 May 1993 (top) and 1 November 2005 (bottom). The middle panels show results for the case where the momentum flux is set to 1 mPa at 100 hPa (≈ 16 km). The right panels show results for the case where the models' own launch amplitudes and launch heights are used. Note that the results in the right-hand panel for MRI-ESM2 and UMGA7gws have been multiplied by 0.1 and 0.6, respectively, and the GWD profiles plotted using dotted lines (see Appendix B). The labels in parentheses to the right of the model names denote the type of GWD scheme: "F" or "P" for fixed or parameterized sources; "H" for Hines, "WM" for Warner-McIntyre, or "L" for Lott et al. (2012) for the type of dissipation used. Note that "WM" here includes both the Warner and McIntyre (1999) and Scinocca (2003) schemes (Table 7), which are both implementations of the Warner and McIntyre (1996) ...]
Vertical profiles of zonal-mean GWD for the 1 mPa experiment are shown in the middle panels of Figure 7. Results for the 10 mPa experiment (not shown) are quite similar to the 1 mPa results but are larger by a factor of ten, confirming that to a good first approximation the GWD at these heights scales linearly with the MF at 100 hPa. This is perhaps not too surprising given that critical level absorption by the background winds, as opposed to nonlinear dissipation resulting from the exponential growth with height of the gravity wave amplitudes, is the primary cause of the momentum flux deposition in these highly sheared wind profiles. The results of the third experiment are shown in the right panels. Compared to the 1 mPa results, these show much more inter-model spread.
Since the source specifications used in this experiment are the ones that produce each model's best QBO, the larger inter-model spread in the third experiment reflects model-dependent biases in, for instance, the mean winds and temperatures, and resolved waves that must be overcome by tuning of the gravity wave sources.
The GWD profiles between 20 and 40 km are approximately Gaussian in form and can be simplified by fitting the zonal-mean GWD to a function of the form $A \exp[-((z - B)/C)^2]$. The three fit parameters are shown in the insets in the middle and right panels. The increase in inter-model spread of the maximum GWD (fit parameter A) in the experiment using the models' launch amplitudes and heights is more readily seen. As observed (not simulated) precipitation is used in the offline calculations for two of the models using parameterized gravity wave sources (LMDz6 and UMGA7gws), the results in the right-hand panels of Figure 7 may not accurately reflect what the models themselves would produce. Hence the parameterized-source and fixed-source results in the right-hand panels are not entirely comparable. A case in point is the rather large difference in the peak GWD in the UMGA7 (fixed source) and UMGA7gws (parameterized source) results; for this reason the UMGA7gws results have been scaled to fit on the plot. Note also in the 1 mPa experiment that the GWD peaks are wider in the vertical and weaker for the models that use Hines than for the others. This is consistent with the vertical smoothing of the momentum fluxes that is conventionally applied in the Hines scheme before the GWD is computed. The differences in the 1 mPa Hines results are a consequence of the different amount of smoothing used by the different models; if the smoothing is removed from the offline calculation, the 1 mPa Hines results for the different models are identical. In summary, the offline comparison shows that most of the inter-model differences in the parameterized GWD in the equatorial stratosphere arise from the differences in their launch height and launch amplitude, not from differences in the wave dissipation mechanism and the shape of the assumed launch spectrum.
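As an illustration of the Gaussian fitting step, the following minimal sketch recovers the parameters A, B and C from a sampled GWD profile. The method used to obtain the fits in Figure 7 is not stated here; the sketch assumes an ordinary least-squares fit to the logarithm of the GWD, which only works for strictly positive values. The function name `fit_gaussian_profile` and the synthetic profile are hypothetical and serve only to show the idea.

```python
import math


def fit_gaussian_profile(z_km, gwd):
    """Fit gwd(z) ~ A * exp(-((z - B) / C)**2) via a quadratic fit to ln(gwd).

    With x = z - z0 (z0 is the mean height), ln(gwd) = a*x**2 + b*x + c,
    where a = -1/C**2, so A, B and C follow from the quadratic coefficients.
    Assumption: only strictly positive GWD samples are used.
    """
    pts = [(z, math.log(g)) for z, g in zip(z_km, gwd) if g > 0.0]
    if len(pts) < 3:
        raise ValueError("need at least three positive GWD values")
    z0 = sum(z for z, _ in pts) / len(pts)  # centre heights for conditioning

    # Normal equations of the least-squares quadratic, as an augmented matrix.
    s = [sum((z - z0) ** k for z, _ in pts) for k in range(5)]
    t = [sum(y * (z - z0) ** k for z, y in pts) for k in range(3)]
    m = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]

    # Gauss-Jordan elimination with partial pivoting.
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(3):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [x - f * y for x, y in zip(m[r], m[i])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    if a >= 0.0:
        raise ValueError("profile has no maximum; Gaussian fit undefined")

    width = math.sqrt(-1.0 / a)                  # C
    centre = z0 - b / (2.0 * a)                  # B
    amplitude = math.exp(c - b * b / (4.0 * a))  # A
    return amplitude, centre, width


# Synthetic profile between 20 and 40 km (illustrative values, arbitrary units).
heights = [20.0 + k for k in range(21)]
profile = [0.5 * math.exp(-((z - 30.0) / 4.0) ** 2) for z in heights]
A_fit, B_fit, C_fit = fit_gaussian_profile(heights, profile)
```

A nonlinear least-squares fit on the GWD itself would weight the peak more heavily than this log-space fit, so the two approaches can differ for profiles that depart from a Gaussian shape.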

Closing remarks and future plans
The QBO is arguably the most conspicuous and regular mode of variability observed anywhere in the atmosphere that is not directly related to either the annual or diurnal cycles. At a fundamental level, and for current conditions, it can be considered to be purely an atmospheric dynamical mode of variability, despite possible external influences from variability in the oceans, the solar cycle or changes in atmospheric composition. Therefore the primary goals of phase 1 of QBOi are achievable using atmosphere-only global models that are computationally relatively inexpensive to run. To date (July 2017) output from 17 models/model versions (Table 5) has been uploaded, or is planned for uploading, to the shared database.
The goals of phase 1 of QBOi are to:
- Compare, for present-day conditions, the accuracy of the morphology of the simulated QBOs across models, and relate this to differences between models in the representation of the forcing mechanisms (e.g., terms contributing to the zonal-mean zonal momentum equation) and other model properties such as resolution and sources of waves.
- Compare how the morphology of the simulated QBOs and QBO forcing mechanisms respond to climate change (i.e., a doubling and quadrupling of CO2 amounts) and identify which aspects of these responses are robust.
- Compare QBO predictive skill between models and its dependence on the QBO's initialized phase, the underlying state of the atmosphere and/or properties of the individual models (e.g., why was there an absence of skill in predicting the disruption of the QBO in 2016?).
Phase 1 of QBOi therefore addresses the challenges associated with modelling, predicting the evolution of, and projecting long-term changes in the QBO. Results from planned studies are expected to inform requirements for future model development leading to more accurate representations of the QBO and its variability in the individual models and across the multi-model ensemble. Benefits, however, are likely to extend well beyond this and range from potential enhancements in skill in seasonal to decadal predictions resulting from concomitant improvements in QBO-extratropical dynamical teleconnections, to better capabilities for assessing the consequences of geoengineering proposals involving the injection of aerosol into the equatorial stratosphere where its redistribution away from the tropics is likely to be significantly influenced by the QBO.
Beyond phase 1, QBOi is expected to focus more on QBO extratropical dynamical teleconnections and couplings to other aspects of the climate system. In this respect QBOi again differs from those multi-model activities like CMIP and CCMI that are largely policy-driven and hence place considerable emphasis on continually updating projections using the latest generation of models. Instead the developing consensus in the QBOi community, which has emerged primarily from the September 2016 QBO workshop (see Anstey et al., 2017, for a workshop summary), is to build on the experiments described in this paper though, of course, results from phase 1 studies are expected to feed through into improving the representation of the QBO in the next generation of models. Some new coordinated studies that have been proposed for future endorsement by QBOi include:
- Increasing the ensemble size of Experiment 1 ("AMIP") to examine the robustness across models of possible synchronisation between ENSO events and the QBO (e.g., Christiansen et al., 2016).
- Extending Experiment 2 (present-day time slice) to increase the sample size to examine QBO teleconnection robustness in an idealised framework in which there is no other externally forced variability, apart from the annual and diurnal cycles.
- Repeating Experiment 2 (present-day time slice) with idealised perpetual El Niño / La Niña SST anomalies to examine the interaction of ENSO and QBO teleconnections.
- Empirically separating the effects of stratospheric and tropospheric climate change on the QBO by modifying Experiments 3 and 4 (future time slice) such that the increases in CO2 amount (~forcing stratospheric climate change only) and SSTs (~forcing tropospheric climate change only) are applied separately.
- Examining the impact of ozone on the QBO either through prescribed ozone perturbations or through ozone feedbacks for those models that can rerun with and without ozone chemistry.
The above list is by no means exhaustive and other possible extensions of the research plans for QBOi include more idealized studies comparing simulations using only "dynamical cores" (e.g., Yao and Jablonowski, 2015) or perhaps simulations in which the QBO is artificially removed (e.g., by turning off the non-orographic GWD parameterization in the tropics). However, in line with current QBOi practices, details of any new coordinated studies will again be formulated through community discussion at forthcoming QBOi workshops, and will depend on the outcomes of the phase 1 studies.

Discussion started: 26 October 2017. © Author(s) 2017. CC BY 4.0 License.

Code and data availability. For information on the code availability for the individual models considered in this paper see the appropriate references given in Table 5. Details of the QBOi data repository and how to access it are provided in the Supplement.

The corresponding external forcings for the CMIP5 AMIP experiment (e.g., radiative trace gas concentrations, aerosol distributions, solar irradiance, and appropriate forcings from explosive volcanoes) can be found here:
http://cmip-pcmdi.llnl.gov/cmip5/forcing.html#amip
apart from ozone which, for high-top models, can be obtained from:
https://groups.physics.ox.ac.uk/climate/osprey/QBOi_O3/
Initial conditions are not prescribed and it is left to individual groups to use whatever is appropriate for their model and to include any spin-up if this is considered necessary.

A2 Experiment 2 - 1×CO2
Experiment 2 is similar to Experiment 1 but with a repeated annual cycle for the SSTs and sea ice amounts plus all the other forcings (i.e., there is no interannual variability or any secular changes in the forcings). It can either be a 1-3 member ensemble of 30-year simulations or preferably a single 100-year (or longer) simulation. The long single integration has the additional potential of providing information on very low frequency variations.
Ideally the external annual cycle forcings should be 30-year climatologies based on Experiment 1 although, as these are generally not readily available, a suitable alternative is to apply annually repeating forcings based on the 2002 CMIP5 forcings. The year 2002 is well removed from any explosive volcanic eruptions and the ENSO and Pacific Decadal Oscillation (PDO) are both in their neutral phases, and hence conditions in this year can be considered as a useful proxy for the multi-year mean for most quantities. However, 2002 ozone amounts are likely to be strongly perturbed because of the Southern Hemisphere sudden stratospheric warming (e.g., Shepherd et al., 2005) and for ozone a 2D climatological field representative of the 1990s is preferable. For SSTs and sea ice amounts CMIP5 1988-2007 climatologies are available from:
http://www-pcmdi.llnl.gov/projects/amip/AMIP2EXPDSN/BCS/amipbc_dwnld.php
As Experiment 2 is the control for Experiments 3 and 4 (2×CO2 and 4×CO2, respectively) the average CO2 amount for 2002 should be used as the baseline 1×CO2 amount.
Although the use of different length climatologies for different forcings is not ideal and does not provide direct comparison to the 30-year period of Experiment 1, the observed dependence of the QBO on a changing climate through this period appears to be negligible. Thus for QBOi the benefits of the simpler experimental set-up are considered to far outweigh any possible disadvantages. Nonetheless it is important to emphasize that the same idealised set of climatologies and forcings is to be used throughout Experiments 2-4, that is, apart from the changes to the CO2 amounts and SSTs described below.
As with Experiment 1, atmospheric initial conditions are not prescribed.
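The repeated annual cycle used in Experiment 2 can be illustrated with a short sketch that averages a multi-year monthly series into a 12-month climatology and then tiles it over the length of the simulation. This is only an assumption about how a group might construct such a forcing from its own data, not the procedure used to produce the CMIP5 climatologies referred to above; the function names and the synthetic series are hypothetical.

```python
def monthly_climatology(monthly_values, first_month=1):
    """Average a monthly time series into a 12-value mean annual cycle.

    first_month is the calendar month (1-12) of the first sample.
    """
    if not 1 <= first_month <= 12:
        raise ValueError("first_month must be in 1..12")
    sums = [0.0] * 12
    counts = [0] * 12
    for i, value in enumerate(monthly_values):
        month = (first_month - 1 + i) % 12
        sums[month] += value
        counts[month] += 1
    if min(counts) == 0:
        raise ValueError("series must cover every calendar month")
    return [s / c for s, c in zip(sums, counts)]


def repeat_annual_cycle(climatology, n_years):
    """Tile a 12-month climatology so that every simulated year is identical."""
    return list(climatology) * n_years


# Synthetic 20-year monthly series (illustrative numbers only): a seasonal
# cycle plus a small year-to-year drift that the climatology averages out.
series = [290.0 + month + 0.1 * year for year in range(20) for month in range(12)]
clim = monthly_climatology(series)
forcing = repeat_annual_cycle(clim, 30)  # e.g., a 30-year time-slice run
```

Because every year of the tiled forcing is identical, any interannual variability in the resulting simulation is internally generated.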

A3 Experiments 3 and 4 - 2×CO2 and 4×CO2
Experiments 3 and 4 are the same as Experiment 2 but for "2×CO2" and "4×CO2" climates, respectively. Again these can either be a 1-3 member ensemble of 30-year simulations, or preferably a single 100-year simulation, after allowing for a suitable spin-up to the new climate (without a coupled ocean this is expected to be fairly rapid, though for the 4×CO2 experiment this can be of order five years). Compared to the amount specified for Experiment 2 the CO2 concentration should be either doubled (Experiment 3) or quadrupled (Experiment 4) with a corresponding idealized adjustment made to the SSTs of a spatially uniform perturbation of +2 K for 2×CO2 and +4 K for 4×CO2. Sea ice amounts should be kept the same as in Experiment 2.
All other forcings in these two experiments should be exactly the same as in Experiment 2 including the amounts of all radiatively active greenhouse gases other than CO2. If ozone is prescribed (i.e., if the model does not have interactive chemistry) then this too should be exactly the same as in Experiment 2. Alternatively if the model does have interactive chemistry then the source gases and/or emissions should be kept exactly the same as in Experiment 2. This idealized set-up for Experiments 3 and 4 is appropriate as these are sensitivity experiments and not attempts to predict specific periods in the future.
As with Experiment 1 atmospheric initial conditions are not prescribed, but note the need to allow for spin-up to the new climates.
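The idealized perturbations for Experiments 3 and 4 amount to a scaling of the baseline CO2 amount and a spatially uniform SST offset. The sketch below encodes this relative to the 1×CO2 control; the class and function names are hypothetical, and the baseline CO2 value and SSTs in the example are illustrative placeholders, not the values prescribed for QBOi.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeSliceForcing:
    co2_factor: float    # multiple of the baseline 1xCO2 amount
    sst_offset_k: float  # spatially uniform SST perturbation (K)


# Experiment number -> idealized forcing change relative to the control.
EXPERIMENTS = {
    2: TimeSliceForcing(co2_factor=1.0, sst_offset_k=0.0),  # 1xCO2 control
    3: TimeSliceForcing(co2_factor=2.0, sst_offset_k=2.0),  # 2xCO2, +2 K
    4: TimeSliceForcing(co2_factor=4.0, sst_offset_k=4.0),  # 4xCO2, +4 K
}


def perturbed_forcing(experiment, baseline_co2, baseline_sst_k):
    """Return (CO2 amount, SST values) for the requested time-slice experiment.

    Sea ice and all other forcings are left untouched, as they are held fixed.
    """
    spec = EXPERIMENTS[experiment]
    co2 = baseline_co2 * spec.co2_factor
    sst = [t + spec.sst_offset_k for t in baseline_sst_k]
    return co2, sst


# Placeholder baseline: 370.0 (ppmv) and two SST grid-point values in kelvin.
co2_2x, sst_2x = perturbed_forcing(3, 370.0, [300.0, 290.0])
co2_4x, sst_4x = perturbed_forcing(4, 370.0, [300.0, 290.0])
```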

A4 Experiment 5 - QBO hindcasts
These are atmosphere-only experiments, initialized from reanalysis data, providing multiple ensembles of short integrations from a relatively large set of start dates sampling different phases of the QBO. The prescribed start dates (i.e., atmospheric initial conditions) are 1st May and 1st November for the years 1993-2007 (i.e., 15 years with a total of 30 start dates). The duration of each hindcast should be at least 6 months but preferably 9-12 months.
As with Experiment 1 the boundary conditions and external forcings should be the same as those specified for the CMIP5 AMIP experiment (Taylor et al., 2012). CMIP5 interannually varying sea ice and SSTs can be obtained from:
http://www-pcmdi.llnl.gov/projects/amip/AMIP2EXPDSN/BCS/amipbc_dwnld.php
while the CMIP5 external forcings for radiative trace gas concentrations, aerosols, solar, explosive volcanoes, etc., can be obtained from:
http://cmip-pcmdi.llnl.gov/cmip5/forcing.html#amip
Ozone forcing datasets appropriate for use in high-top models are available from:
https://groups.physics.ox.ac.uk/climate/osprey/QBOi_O3/
Initial data for the hindcasts should be taken from the ERA-Interim reanalysis (Dee et al., 2011) which can be downloaded from:
http://apps.ecmwf.int/datasets
Registration is required; if downloading many start dates from this site, it may be easier to use the "batch access" method described on the site, although interactive download of each date is also possible. Data are available on either standard pressure levels or original model levels, and in either GRIB or netCDF formats. The ensemble is expected to be generated by perturbing the initial conditions by a small anomaly, which need do no more than change the bit pattern of the simulation. For some models this is possible through stochastic physics; however, each group should use an ensemble generation method that is most appropriate to their model and that is most readily available to them.
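The prescribed hindcast start dates can be enumerated programmatically, which is convenient when scripting batch retrieval of initial conditions. The following minimal sketch lists the 1st May and 1st November dates for 1993-2007; the function name `hindcast_start_dates` is hypothetical.

```python
from datetime import date


def hindcast_start_dates(first_year=1993, last_year=2007):
    """Return the 1 May and 1 November start dates for each year, in order."""
    return [date(year, month, 1)
            for year in range(first_year, last_year + 1)
            for month in (5, 11)]


start_dates = hindcast_start_dates()
```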

A5 Experiment 5A - QBO forecasts
This experiment is as Experiment 5, but using a coupled ocean-atmosphere model and predicting the SST, instead of specifying observed values. External forcings should also be fixed at the initial start time so as not to use future information. This is then a true forecast experiment for the QBO, and can be compared with the results of Experiment 5. Some groups may already have performed these hindcasts as part of their operational seasonal forecasts but note that for QBOi purposes it is important that the majority of the diagnostics discussed in Section 4 are available for a full comparison to Experiment 5 results.

Appendix B: Offline non-orographic gravity wave drag calculations
This appendix provides details about the offline GWD calculations shown in Figure 7. The background equatorial winds and temperatures are from a single day (daily mean) of ERA-Interim data on a 1° longitude grid and on pressure levels at the ECMWF model-level resolution.
For models that use "fixed" gravity wave sources (e.g., AGCM3-CMAM), the calculations are straightforward and simply involve computing the GWD above the launch height. Since these models all use a horizontally isotropic gravity wave source, the MF in a single azimuth is set to either 1 or 10 mPa for the first two experiments. All fixed-source calculations are done using offline versions of the Scinocca (2003), Hines (1997a, b) and Warner and McIntyre (1999) non-orographic GWD schemes using each model's parameter settings. Results for the third offline experiment, in which the models' own source amplitudes (i.e., momentum flux for Scinocca, root-mean-square (RMS) winds for Hines) and launch heights are used, are validated by comparing to results from QBOi Experiment 5 for models that provided daily-mean GWD. With the exception of one model, the agreement is reasonably good, which is all that can be expected given that the resolution of the models differs from that used in the offline calculations. For MRI-ESM2 the offline results for the third experiment are ten times larger than the Experiment 5 results, and have been scaled in the right panels of Figure 7. The reason for this large discrepancy is unknown. For models that tie their non-orographic gravity wave sources to parameterized processes in the troposphere (referred to in the Figure 7 caption as parameterized sources), the calculations are more involved.
For the models that were able to perform the offline calculations for parameterized-source schemes (LMDz6, UMGA7gws and HadGEM2-AC) the procedure was as follows. For LMDz6, daily precipitation observations were used to generate an ensemble of monochromatic waves. The background winds and temperatures are held fixed in time using either the 1st May or 1st November data. A similar procedure is used for the other two models, except that the launch momentum fluxes in HadGEM2-AC are obtained by sampling from the Experiment 1 result for the month, since the source parameterization in HadGEM2-AC requires convective heating profiles not provided by observations. As momentum flux is not prescribed for these models, tuning of the gravity wave parameters is required to achieve the desired MF at 100 hPa for the first two experiments, such that $(|\mathrm{MF}_{\mathrm{east}}| + |\mathrm{MF}_{\mathrm{west}}|)/2 = 1$ or 10 mPa at 100 hPa. Due to time constraints the NCAR group, which also ties its GWD scheme to convection in the 60LCAM5 and CESM1(WACCM-L110) models, was unable to participate in this comparison.
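The normalisation condition used for the first two offline experiments can be written as a simple rescaling. The sketch below assumes that the momentum flux at 100 hPa responds linearly to a single source-amplitude parameter, which is an illustrative simplification; in practice the tuning of a parameterized-source scheme may require iteration. All function names and the example flux values are hypothetical.

```python
def mean_abs_flux(mf_east_mpa, mf_west_mpa):
    """Return (|MF_east| + |MF_west|) / 2 in mPa."""
    return 0.5 * (abs(mf_east_mpa) + abs(mf_west_mpa))


def source_scale_factor(mf_east_mpa, mf_west_mpa, target_mpa):
    """Factor by which the fluxes must be scaled to reach the target mean flux.

    Assumption: the 100 hPa flux is proportional to the source amplitude.
    """
    current = mean_abs_flux(mf_east_mpa, mf_west_mpa)
    if current <= 0.0:
        raise ValueError("momentum flux must be non-zero to rescale")
    return target_mpa / current


# Illustrative untuned fluxes at 100 hPa (mPa); the westward flux is negative.
east, west = 3.0, -1.0
factor = source_scale_factor(east, west, target_mpa=1.0)
tuned_mean = mean_abs_flux(factor * east, factor * west)
```

Note that the condition constrains only the mean of the eastward and westward magnitudes, so any east-west asymmetry produced by the source parameterization is preserved by the rescaling.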