The Arctic Predictability and Prediction on Seasonal-to-Interannual TimEscales (APPOSITE) data set version 1

Recent decades have seen significant developments in climate prediction capabilities at seasonal-to-interannual timescales. However, until recently the potential of such systems to predict Arctic climate had rarely been assessed. This paper describes a multi-model predictability experiment


Introduction
Unprecedented climate change in the Arctic has opened up opportunities for business in diverse sectors such as fossil fuel and mineral extraction, shipping and tourism, but has also put pressure on local communities, who are dependent on the ice for their livelihoods (Emmerson and Lahn, 2012; Stephenson et al., 2013). The need for these stakeholder groups to avoid hazardous sea ice and weather conditions has increased demand for Arctic sea ice forecasts at seasonal-to-interannual timescales (Eicken, 2013; Jung et al., 2016). These local interests and a growing appreciation of the importance of the Arctic in mid-latitude weather phenomena (Jung et al., 2014) have motivated the development of seasonal sea ice prediction systems (e.g. Sigmond et al., 2013; Chevallier et al., 2013; Wang et al., 2013; Peterson et al., 2014), which are initialised from observations.
Published by Copernicus Publications on behalf of the European Geosciences Union.
It has previously been shown that these sea ice prediction systems exhibit significant skill in predicting summer sea ice extent a season ahead (Guemas et al., 2016), but diagnosing the source of forecast errors is problematic. Forecast errors may be due to both inadequate representation of important physical processes in the model (such as melt ponds; Schröder et al., 2014) and inadequate knowledge of initial-state conditions, such as sea ice thickness (Day et al., 2014a; Msadek et al., 2014; Massonnet et al., 2015), which is not currently used to initialise operational forecasts. Sea ice predictability is also inherently limited by chaotic, unpredictable atmospheric variability (Blanchard-Wrigglesworth et al., 2011b; Holland et al., 2010), which will lead to irreducible errors in sea ice predictions at seasonal and longer timescales, fundamentally limiting the timescale at which sea ice will be predictable (Tietsche et al., 2016). If the skill of a given forecast system is already close to this fundamental limit, it will not be possible to further increase the lead time at which the forecast is skilful.
To determine whether there is the potential to improve the operational prediction systems, we consider a more idealised situation. The "perfect model" approach to estimating predictability involves producing initial-value ensemble predictions with a general circulation model (GCM), which are verified against the model itself rather than against observations of the real world (following Griffies and Bryan, 1997b). It is therefore not hampered by changes to the observational network over time or changes in predictability due to secular climate change, which hamper this kind of analysis in the real world (Collins, 2002). Such studies provide an estimate of the predictive skill obtainable in a world with a perfect model and complete observations. However, such estimates are not necessarily an upper bound for the limit of predictability in the real world because important predictability mechanisms may be missing (Eade et al., 2014). There is an ongoing discussion in the literature on this point (e.g. Shi et al., 2015).
The perfect model approach has previously been used to quantify and understand predictability of coupled modes of climate variability, such as the Atlantic Meridional Overturning Circulation (AMOC) (e.g. Griffies and Bryan, 1997a; Collins, 2002; Pohlmann et al., 2004) and the El Niño-Southern Oscillation (ENSO) (Collins et al., 2002), leading to the development of operational seasonal-to-decadal prediction systems based on atmosphere-ocean climate models (e.g. Smith et al., 2007; Jin et al., 2008).
Using this approach, Collins et al. (2006) demonstrated that the timescale on which the AMOC is predictable varies from model to model. These inter-model differences in predictability arise because different GCMs have different representations of the underlying physical equations and parameters. It is therefore likely that there will be inter-model differences in predictability for other climate variables, so it is important to conduct such analyses in multiple GCMs. The APPOSITE model intercomparison was designed to diagnose the limit of initial-value predictability of Arctic sea ice in multiple GCMs. Previous studies had estimated this limit in individual climate models, but with slightly different experiment designs (such as Blanchard-Wrigglesworth et al., 2011b; Holland et al., 2010; Koenigk and Mikolajewicz, 2009; Tietsche et al., 2013). All these experiments demonstrated initial-value sea ice predictability on seasonal-to-interannual timescales; however, because they focussed on slightly different variables and averaging periods, and because the experimental protocols were inconsistent between the studies, it was not clear whether the results of these studies were consistent (Guemas et al., 2016). For the APPOSITE ensemble a consistent protocol was followed to ensure that it was possible to intercompare models, so that any differences in predictability were only the result of differences in the inherent predictability of the models themselves. The first results of this project were presented in Tietsche et al. (2014).
The primary aim of this paper is to provide a detailed description of the APPOSITE experiment, archived at the British Atmospheric Data Centre (BADC) (Day et al., 2015). We also present an updated assessment of the limit of Arctic sea ice extent and volume predictability, initially presented in Tietsche et al. (2014), including more models than were available at the time of that publication. In addition we consider an open question in Arctic prediction: to what extent is sea ice predictability state dependent? In this study we consider whether sea ice extent and volume predictability is different when initialised from high and low states compared to states close to the model climatology.
The paper is outlined as follows: Sect. 2 describes the experiment in detail as well as the mean state of the models used; Sect. 3 includes an update of the results of Tietsche et al. (2014) and the state dependence analysis, followed by the conclusions in Sect. 4. Additional details of the data set, archived at the BADC, are included as Appendix A.

Description of the simulations
Seven different coupled climate models performed simulations for APPOSITE (see Table 1). Six of these models followed the same experimental protocol, which is described in Sects. 2.1 and 2.2. For practical reasons one model, CanCM4, followed a slightly different protocol, which is described in Sect. 2.3.

Control simulations
Predictability of the climate system changes with mean climate (DelSole et al., 2014), complicating the assessment of predictability in a transient climate. This is likely to be particularly acute in the Arctic where the sea ice climate changes rapidly in transient simulations (Holland et al., 2010). The APPOSITE experimental protocol therefore asked for both control simulations and ensemble predictions to be conducted in GCMs with forcing fixed at present-day values.
Since the perfect model approach uses initial conditions generated by the model itself, present-day control simulations with each model were run under fixed present-day radiative forcings. For practical reasons the year that the forcings correspond to differs between models: either 1990, 2000 or 2005 (see Table 1). Apart from MPI-ESM, which was initialised from year 2005 of the CMIP5 historical simulation, all models were initialised in a static state from present-day ocean temperature and salinity profiles (e.g. Conkright et al., 2002). The period of spin-up varied from model to model, but is at least 100 years. Each model was integrated for at least 100 further years to fully sample the model's climate, drift and internal variability. Data from the spin-up period of each model were not archived. However, it is worth noting that despite more than a century of spin-up, some of these simulations still have significant drifts in the mean sea ice extent and volume time series (see Fig. 1). These drifts are accounted for by the predictability metrics we use in Sect. 3 and are not expected to significantly influence the estimate of predictability.
All of the models are coupled atmosphere-ocean-sea ice GCMs and each has a fully prognostic sea ice component. These account for variations in sea ice due to both thermodynamic and advective processes that result from stress internal to the sea ice as well as through interaction with the atmosphere and ocean. Like all components of the GCMs, the sea ice models have both structural and conceptual differences, the most significant of which are their treatment of sea ice dynamics, such as the local ice thickness distribution, as well as vertical heat flux through the ice and heat exchange at the ice-ocean interface. Except for HadGEM1.2, E6F and MIROC5.2, the versions of the models used were those submitted to the Coupled Model Intercomparison Project Phase 5 (CMIP5). These models have been well tested and evaluated against observations and their strengths and weaknesses are well documented (see references in Table 1). However, in order to facilitate understanding of the differences in sea ice predictability, we present the differences in their sea ice mean state and variability.
Although not designed to robustly assess the realism of each model's climate, this analysis shows that sea ice mean state and variability in the control runs differ considerably from model to model and from the observations (see Figs. 2, 3 and 4). Before calculating the standard deviation, shown in Fig. 4, a linear trend was removed from the sea ice extent and volume time series for each model. The wide range of sea ice climates in GCMs is well known (e.g. Arzel et al., 2006; Flato et al., 2013); however, the wide variety in interannual variability exhibited by the different models is likely to be just as important for determining the inherent predictability exhibited by each model. Indeed, looking across the models, the interannual variability of summer sea ice extent in each model appears to be negatively correlated with its mean, in line with previous studies (Goosse et al., 2009; Holland et al., 2008). This does not appear to be the case for winter. It should also be noted that whilst the climate of each model is very well sampled here (over 100 years), the observational time series, at a length of 35 years, is much shorter.
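The detrending step described above can be illustrated with a minimal sketch. The function name, the synthetic series and the choice of a sample standard deviation (ddof=1) are assumptions made for illustration; the paper only states that a linear trend was removed before the standard deviation was calculated.

```python
import numpy as np

def detrended_std(series):
    """Standard deviation of a yearly series after removing a least-squares linear trend."""
    series = np.asarray(series, dtype=float)
    years = np.arange(series.size)
    # Fit a straight line to the series and subtract it.
    slope, intercept = np.polyfit(years, series, deg=1)
    residual = series - (slope * years + intercept)
    # ddof=1 is an assumption; the paper does not specify the normalisation.
    return residual.std(ddof=1)

# Synthetic example: a drifting September extent series (values are made up).
rng = np.random.default_rng(0)
sie = 7.0 - 0.01 * np.arange(100) + rng.normal(0.0, 0.4, size=100)
print(detrended_std(sie), sie.std(ddof=1))
```

For a series with drift, the detrended value is never larger than the raw standard deviation, which is why the drift would otherwise inflate the apparent interannual variability.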

Ensemble predictions
To diagnose the inherent predictability in each of these models, we performed a suite of ensemble predictions. The number of start dates selected from the control run differs from model to model and ranges between 8 and 18, depending on the resource limitations of each modelling centre. Whilst participating groups were responsible for choosing their own start dates, they were encouraged to pick them so that a range of high, low and medium sea ice extent and volume states were captured, in order that any dependence of sea ice predictability on the size of the initial state anomaly could be assessed (see Sect. 3.4). They were also encouraged to keep start dates well spaced in time, so that they could be considered independent (see Fig. 1). The minimum spacing between start dates is 3 years in the case of GFDL-CM3, and longer in other models.
For each start date an ensemble of between 8 and 16 members was generated, again depending on the resource limitations of each modelling centre. The initial conditions were taken from the control run of each model and each ensemble member differs only by a perturbation to the sea surface temperature field. The perturbation used to generate the ensemble takes the form of randomly generated spatially uncorrelated noise, applied to each grid cell. This noise is sampled from a Gaussian distribution with a standard deviation of 10⁻⁴ K. Each ensemble member starts with a slightly different realisation of this noise. Such a perturbation is so small that it is equivalent to assuming perfect knowledge of the initial conditions. For a given start date, differences in the evolution of each ensemble member are solely determined by the chaotic nature of the simulated climate system. Note that different initialisation methods, such as lagged atmospheric conditions, may lead to slightly different predictability estimates (see Hawkins et al., 2016). For each start date the ensemble was run for 3 years, with the exception of MIROC5.2, which was run for 3.5 years.
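As an illustration of this perturbation method, the following sketch adds spatially uncorrelated Gaussian noise with a standard deviation of 10⁻⁴ K to a sea surface temperature field. It is not the code used by any modelling centre: the function name, the grid shape, the field values and the seeding are all hypothetical.

```python
import numpy as np

def perturbed_sst_members(sst, n_members, sigma=1e-4, seed=0):
    """Return n_members copies of an SST field, each with independent white noise added."""
    rng = np.random.default_rng(seed)
    sst = np.asarray(sst, dtype=float)
    # Independent draw for every member and every grid cell (no spatial correlation).
    noise = rng.normal(0.0, sigma, size=(n_members,) + sst.shape)
    return sst[None, ...] + noise

# Hypothetical 4 x 8 SST field in kelvin, taken to be the control-run state.
sst = np.full((4, 8), 271.35)
members = perturbed_sst_members(sst, n_members=8)
print(members.shape)
```

Each member differs from the control state by far less than any observational uncertainty, so the ensemble spread that develops later reflects the chaotic growth of these tiny differences.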
A minimum contribution for models to be included in the APPOSITE experiment was to submit a control run and predictability experiments started on 1 July, which allows an assessment of seasonal predictions of the late-summer sea ice conditions, when the sea ice is at its lowest extent and human activity in the Arctic Ocean is largest. Although we restrict our analysis to the simulations started in July, some groups have also submitted simulations started in January, May and November (see Table 1 for details). Note that operational dynamical seasonal predictions, such as GloSea5 and ECMWF-System 5, are more commonly started in May. We decided to start our simulations later due to the presence of an early summer predictability barrier, which might lead to a sharply decreased skill in predicting the late-summer sea ice extent minimum (Blanchard-Wrigglesworth et al., 2011a; Day et al., 2014b).

CanCM4 transient experiments
The set of simulations with the CanCM4 model uses a different protocol, in order to facilitate direct comparison of these simulations with the CanSIPS operational seasonal prediction system, which uses the same climate model (Sigmond et al., 2013). The CanCM4 simulations were different in two key respects. Firstly, they were run under a transient climate, with observed historical forcing agents prescribed. Secondly, initial-value ensembles were generated every year and only run for 1 year. In all other regards, such as the method of ensemble generation, these simulations are the same as the other APPOSITE perfect model simulations.

Perfect model intercomparison
An inter-model comparison of Arctic sea ice predictability, using four climate models, was published in Tietsche et al. (2014). Here we present an update of this study, including the MIROC5.2, E6F and CanCM4 climate models.

Metrics
Two predictability metrics, as defined by Collins (2002), were used to quantify predictability in this study. These make use of the fact that in a perfect model study, such as this, any ensemble member may be chosen as "the truth" or "the forecast". Therefore it is possible to increase the effective sample size by taking each member as "the truth" in turn, and comparing it with every other member as "the forecast". The first metric, the normalised root-mean-square error (NRMSE), compares forecast RMSE to the climatological variability:

$$\mathrm{NRMSE}(t) = \sqrt{\frac{\left\langle \left( x_{kj}(t) - x_{ij}(t) \right)^{2} \right\rangle_{i,j,k \neq i}}{2\sigma^{2}}}, \qquad (1)$$

where $\langle \cdot \rangle_{i}$ denotes the expectation value, to be calculated by summing over the specified index with appropriate normalization, and $x_{ij}(t)$ is the sea ice extent at lead time $t$ for the $i$th member of the $j$th ensemble. The $\sigma$ in the denominator is the standard deviation of the control run for the appropriate month, calculated from the whole archived time series (shown in Fig. 1) after the linear trend has been removed (values shown in Fig. 4). The value of the denominator is equivalent to the climatological RMSE between two independent realisations, which is the limit that the RMSE term in the numerator will approach over time. Therefore the NRMSE will approach a limit of 1. The model is said to show significant predictability when the NRMSE is significantly lower than 1, as calculated using an F test, following Collins (2002).
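A minimal sketch of a pairwise NRMSE calculation is given below. It assumes that the squared difference is averaged over all start dates and all ordered pairs of distinct ensemble members, and that the normalisation is the RMSE between two independent realisations, i.e. the square root of 2σ². The array layout (start date, member, lead time) and the function name are hypothetical.

```python
import numpy as np

def nrmse(x, sigma):
    """Pairwise NRMSE per lead time for x with shape (start, member, lead)."""
    x = np.asarray(x, dtype=float)
    n_members = x.shape[1]
    # Differences between every pair of members within each start date.
    diff = x[:, :, None, :] - x[:, None, :, :]      # (start, member, member, lead)
    off_diag = ~np.eye(n_members, dtype=bool)       # exclude a member paired with itself
    mse = (diff ** 2)[:, off_diag, :].mean(axis=(0, 1))
    # sigma may be a scalar or one value per lead time.
    return np.sqrt(mse / (2.0 * np.asarray(sigma, dtype=float) ** 2))

# Synthetic check: members drawn independently with unit variance give values near 1.
rng = np.random.default_rng(1)
x = rng.normal(size=(50, 8, 3))
vals = nrmse(x, 1.0)
print(vals)
```

Under these assumptions the pairwise mean squared difference equals twice the unbiased ensemble variance, so the same quantity can be obtained from `x.var(axis=1, ddof=1)` without forming all pairs.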
The second metric is the anomaly correlation coefficient (ACC). This is defined as

$$\mathrm{ACC}(t) = \frac{\left\langle \left( x_{ij}(t) - \mu_{j} \right) \left( x_{kj}(t) - \mu_{j} \right) \right\rangle_{i,j,k \neq i}}{\left\langle \left( x_{ij}(t) - \mu_{j} \right)^{2} \right\rangle_{i,j}}, \qquad (2)$$

where $\mu_{j}$ is the climatological mean at the time of the $j$th ensemble prediction. The anomalies are calculated relative to a time varying climatology to take into account any drifts in the control run; otherwise, ACC values for models with larger drifts would be biased high. For the $j$th start date, the climatology $\mu_{j}$ is the value of the linear fit to the control run time series at the corresponding point in time. Note that we chose to use the whole time series for each model (after the spin-up period), shown in Fig. 1, to define this reference climatology. For the impact of such choices on the estimate of predictability, see Hawkins et al. (2016). At some lead time, both of these metrics become insignificantly different from their asymptotic limit (0 for ACC and 1 for NRMSE), and the lead time at which this happens can be used to define the limit of predictability. For each lead time, significance is calculated using an F test or t test in the case of the NRMSE and ACC metrics respectively, where for each model the degrees of freedom used in the test is the number of start dates multiplied by the number of ensemble members run for that model. It appears that the NRMSE metric is more conservative than the ACC metric and becomes insignificantly different from its limit at an earlier lead time (see Fig. 5). Thus using both metrics gives some spread in the estimate of the time when the limit of predictability is actually reached.
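The ACC can be sketched in the same pairwise style. The form used here is an assumption consistent with the description in the text: the numerator averages the product of anomalies over all ordered pairs of distinct members and all start dates, and the denominator is the mean squared anomaly. The array layout, the function name and the synthetic data are hypothetical.

```python
import numpy as np

def acc(x, mu):
    """Pairwise anomaly correlation per lead time.

    x  : array of shape (start, member, lead)
    mu : reference climatology of shape (start, lead), e.g. a linear fit to the control run
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    anom = x - mu[:, None, :]
    n_members = x.shape[1]
    # Product of anomalies for every pair of members within each start date.
    prod = anom[:, :, None, :] * anom[:, None, :, :]   # (start, member, member, lead)
    off_diag = ~np.eye(n_members, dtype=bool)          # exclude self-pairs
    numerator = prod[:, off_diag, :].mean(axis=(0, 1))
    denominator = (anom ** 2).mean(axis=(0, 1))
    return numerator / denominator

rng = np.random.default_rng(2)
signal = rng.normal(size=(10, 4))                      # one anomaly per start date and lead
perfect = np.tile(signal[:, None, :], (1, 5, 1))       # all members identical
noise = rng.normal(size=(200, 8, 2))                   # members fully independent
print(acc(perfect, np.zeros((10, 4))), acc(noise, np.zeros((200, 2))))
```

Identical members give an ACC of 1 and independent members give values close to 0, which matches the asymptotic behaviour described above.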

Fixed forcing experiments
Although sea ice extent predictability decreases rapidly during the first year, with the exception of EC-Earth, all models (and both metrics) show significant levels of predictability for the first year (see Fig. 5). After the first year of simulation, two of the models, MIROC5.2 and GFDL-CM3, show significant levels of predictability at all later lead times. At the other end of the predictability spectrum, E6F is only intermittently predictable after the first year. Predictability in E6F (and to a lesser extent HadGEM1.2) has a strong seasonal cycle with months surrounding the winter extent maximum significantly predictable until the end of the simulation and no significant summer predictability after the first year.
Sea ice volume is much more predictable than sea ice extent in all models. Apart from E6F, all models exhibit significant predictability in all 3 years of the simulations. In a prognostic predictability analysis with decadal simulations, Germe et al. (2014) similarly found that winter sea ice extent was predictable out to 7 years in their model, compared to 3 years in summer, and found that volume was predictable out to 9 years ahead. It is therefore likely that the winter sea ice extent predictability horizon may be significantly beyond the 3 years simulated in these experiments.

CanCM4 transient experiments
… volume (see Fig. 5). It is possible that the CanCM4 model actually has inherently lower levels of initial-value predictability than the other models. However, there are reasons to expect that both metrics will indicate lower levels of predictability, not because of inherently lower levels of initial-value predictability, but because of using the shorter control run associated with the transient protocol employed by CanCM4.
In the case of NRMSE, detrending a short time series is likely to significantly reduce the climatological variance, since without multiple ensemble members to estimate the forced trend, some internal variability is removed in attempting to remove the forced trend (see Hawkins et al., 2016).
We believe that the ACC values are lower than the estimates of other models for the following reason. The reference climate (which is a linear fit to the control run) is a much better fit to the data, with lower residuals, in the case of the short CanCM4 transient control run than it is for the long fixed forcing control runs. This is because, in general, the long control runs have large decadal anomalies which are not well approximated by a linear fit. Therefore the CanCM4 simulations will exhibit lower persistence than would be found if the same model had been run for a longer period in the fixed forcing setup, simply as a result of differing accuracy of the linear fit in each case.

State dependence of predictability
As mentioned in Sect. 2.2, start dates for the ensembles were chosen to sample low, medium and high sea ice extents and volume states in each model's control run. In order to estimate whether starting in different positions of model state space has an impact on predictability, we calculated the anomaly correlation and NRMSE metrics again but only selecting start dates according to whether they were started from a month of the control run with a low, medium or high state. This was done for most models by choosing the two lowest states, two highest states or two states closest to the mean of the control runs. E6F had three start dates in each class and CanCM4 had seven in each, as a result of these models having more start dates than other models. In general, the high states are larger than 0.8 standard deviations above the mean and the low states lower than 0.8 standard deviations below the mean. To assess the dependence of sea ice extent predictability, the start dates were binned by sea ice extent and, to assess the dependence of volume predictability, they were binned by volume. The ACC and NRMSE were recalculated for each of these bins (see Fig. 6).
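The selection of low, medium and high start dates can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used for Fig. 6: the function name, the use of anomalies relative to a supplied climatology, and the example values are hypothetical.

```python
import numpy as np

def bin_start_dates(values, climatology, n_per_bin=2):
    """Pick the indices of the lowest, highest and closest-to-climatology start states."""
    anomalies = np.asarray(values, dtype=float) - np.asarray(climatology, dtype=float)
    order = np.argsort(anomalies)
    low = order[:n_per_bin]                              # most negative anomalies
    high = order[-n_per_bin:]                            # most positive anomalies
    medium = np.argsort(np.abs(anomalies))[:n_per_bin]   # smallest absolute anomalies
    return {name: sorted(idx.tolist())
            for name, idx in (("low", low), ("medium", medium), ("high", high))}

# Made-up initial sea ice extents for eight start dates and a constant climatology.
initial_extent = [5.1, 6.8, 6.0, 7.4, 5.9, 6.3, 4.7, 7.0]
bins = bin_start_dates(initial_extent, climatology=6.0)
print(bins)
```

With few start dates the bins can overlap, so a check that the three index sets are disjoint is advisable before recomputing the metrics on each subset.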
According to Fig. 6, whether the predictability changes with the distance of the initial state from the mean extent and volume appears to depend on the metric. For states initialised close to the mean sea ice volume climatology, the ACC metric decreases much more rapidly with lead time than the high or low cases, appearing to recover towards the end of the simulations. Indeed, the multi-model mean ACC falls dramatically in the medium case compared to the low and high years. However, similar features are not present when using the NRMSE metric, with the mean NRMSE increasing with lead time at a similar rate across the high, medium and low cases. We therefore believe that this behaviour is a statistical artefact of the ACC metric, for the following reason. For start dates initialised close to climatology, the numerator of the ACC metric (Eq. 2) will fluctuate between positive and negative values as the ensemble members diverge, more frequently than when initialised from a large anomaly. When started from a large anomaly, the ensemble members will agree more strongly on the sign. This leads to lower ACC in the medium cases. Similar behaviour is observed when experiments are binned by high, low and medium initial sea ice extent (not shown). With so few data points it is not possible to robustly test the statistical significance of this finding, so this result should only be seen as an indicator.
Although we show that there is little evidence of sea ice predictability depending on the distance of the prediction's initial state from the climatological mean, this does not mean that the predictability is not state dependent. For example, years where anomalous atmospheric circulation patterns, which are unlikely to be predictable at seasonal timescales, play a role in driving large sea ice anomalies (e.g. summer 2007; Serreze and Stroeve, 2015) will be poorly predicted even in a perfect prediction system. Hawkins et al. (2016) also demonstrate that the rate of ensemble divergence can vary from start date to start date in perfect model simulations.

Conclusions
We have presented the experimental protocol for the APPOSITE Arctic sea ice predictability multi-model intercomparison, and described the archive of model simulations which contributed to it. The mean state and variability of Arctic sea ice cover in the models were presented and compared to observed estimates. We utilise this database to assess the limit of initial-value Arctic sea ice extent and volume predictability from each of the models, updating the results of Tietsche et al. (2014) to include three more models.
The results of this analysis of perfect model predictability can be summarised as follows.
-There is significant intermodel spread in the timescale at which summer sea ice extent is predictable, with some models not showing any interannual or longer timescale predictability, and others showing significant predictability throughout all months of the 3-year simulations.
-Sea ice volume is generally more predictable than sea ice extent.
Furthermore, because prediction ensembles were started from high, medium and low sea ice states, we were able to assess the state dependence of sea ice predictability for the first time. We found little evidence of sea ice predictability depending on the distance of the prediction's initial state from the climatological mean.
These data are archived at the BADC (Day et al., 2015) and have been used in a number of sea ice predictability studies. These have (i) quantified the predictability horizon for Arctic sea ice forecasts (Tietsche et al., 2014, and this study), (ii) demonstrated the existence of a spring "predictability barrier" for sea ice predictions (Day et al., 2014b), (iii) highlighted the development of sea ice thickness initialisation as a crucial step towards skilful seasonal predictions (Day et al., 2014a), (iv) quantified the sources of irreducible forecast error in Arctic predictions (Tietsche et al., 2016), and (v) investigated the initial state dependence of sea ice predictability (this study). This data set has therefore helped fill key knowledge gaps in sea ice prediction research.
However, important questions on Arctic sea ice predictability still remain. For example, a clear understanding of why predictability varies from model to model and to what extent it depends on the models' mean climate remains elusive. We feel that it will be necessary to expand this set of predictability experiments in order to answer this question robustly. We hope that by making these data available, other researchers will be able to utilise them to answer these and other open questions.
As well as enabling the results of the APPOSITE project to be reproduced and allowing the community to utilise these simulations for Arctic sea ice research, this archive could also be further utilised to improve understanding of predictability of other variables on seasonal-to-interannual timescales, such as Antarctic sea ice cover (e.g. Holland et al., 2013) or even ENSO (e.g. Collins et al., 2002). Further details of the data archive can be found in Appendix A.

Discussion of protocol
Having presented a summary of the results of the APPOSITE model intercomparison project (MIP), it is natural to consider the suitability of the protocol and suggest ways in which a future protocol might be improved. Analyses pertinent to this question were described in Hawkins et al. (2016), and we will use these examples in this discussion.
A number of methods exist for generating initial value ensembles in coupled models. Perfect model studies have generally used simple methods, including white noise perturbations of SST (as used in APPOSITE), or atmosphere or state lagged methods (where state vectors from adjacent days are used to initialise the model), although more complex methods exist. Hawkins et al. (2016) conducted experiments to determine the impact of these simple methods on ensemble spread in a set of 6-month long experiments with the MPI-ESM. They found that the state lagged and atmosphere lagged approaches generated more ensemble spread in both sea ice extent and volume than did the SST white noise perturbation. This finding suggests that using the same perturbation method for each model, as was done in APPOSITE, is important, although it is not clear a priori whether one method is better than the others.
Given that all modelling centres work with finite computing resources, a pertinent question both for future perfect model studies and for operational forecasting is how many ensemble members and start dates are required to robustly assess the inherent predictability of a model. Hawkins et al. (2016) present an analysis with the HadGEM1.2 APPOSITE simulations, where they subsample from the 16 ensemble members and 10 start dates to investigate the sensitivity of September sea ice extent and volume predictability metrics when using fewer start dates and members. RMSE seems quite insensitive to the number of members and start dates, certainly for values above the eight start dates and eight members, which was suggested as a minimum in the APPOSITE protocol. However, the ACC monotonically increases with ensemble size and, as we have shown in Sect. 3.4, is highly sensitive to small numbers of start dates. Hawkins et al. (2016) conclude that even with 16 members (the most submitted to APPOSITE), probabilistic measures of predictability were not reliable.
The choice of ensemble size also depends on the particular question the experiment is trying to address; for example, when designing an experiment to investigate how predictability depends on the initial state, increasing the number of start dates, at the expense of ensemble members, might be a worthwhile trade-off.
As discussed in Sect. 3.4, in order to investigate the dependence of predictability on the initial state, we decided to pick high, low and medium states rather than randomly selecting them. Our analysis in this section demonstrates that some metrics, particularly ACC, could be very sensitive to this choice and that manually choosing start dates in this way may bias the overall estimate of model predictability, compared to a random selection. Therefore, we would recommend that studies focussed solely on understanding intermodel differences in predictability use a random selection approach to choosing start dates.
Geosci. Model Dev., 9, 2255-2270, 2016 (www.geosci-model-dev.net/9/2255/2016/)
A length of 3 years was decided upon for the APPOSITE predictability simulations. This was chosen both for pragmatic computational resource reasons and based on previous studies, which indicated that the limit of sea ice extent predictability was under 3 years (e.g. Blanchard-Wrigglesworth et al., 2011b). Although this is certainly the case in some models, sea ice extent appears to be predictable past this point in others (see Fig. 5). It is also certainly the case that sea ice extent in some regions, such as the North Atlantic, is predictable past 3 years (Day et al., 2014b). Therefore, similar future studies should consider extending simulations for longer in order to capture the predictability horizon for all models.
A significant problem we encountered was dealing with drift in the control simulations. Many of the control simulations were not in an equilibrium state, and had significant drifts in sea ice extent and volume (Fig. 1). Predictability metrics such as the ACC and NRMSE are dependent on the method used for choosing the reference climatology (see Hawkins et al., 2016); therefore, we would recommend running the control runs to equilibrium so that a more stable model climate is used both for initialising ensembles and as a reference.
The set of diagnostics we asked for was generally sufficient for our analysis goal of quantifying and understanding seasonal-to-interannual sea ice predictability, with a couple of exceptions. Firstly, Tietsche et al. (2014) utilised process-based tendencies to relate errors in sea ice thickness to their mechanical and thermodynamical processes in HadGEM1.2 and MPI-ESM. These diagnostics were not available from the other models and we would recommend saving such diagnostics as part of a future predictability study. Secondly, although the focus was on seasonal-to-interannual timescales, saving daily sea ice data has been very useful in studying the predictability of user relevant metrics, such as the position of the sea ice edge on these timescales (Goessling et al., 2016). Recently, Notz et al. (2016) presented a recommended set of diagnostics for CMIP6, with diagnostics designed to close the sea ice heat, momentum and mass budgets. Diagnostics are binned into three tiers indicating the relative priority of each diagnostic. A future sea ice predictability MIP could use their list as a starting point (see the Supplement for a full list of recommended diagnostics as well as the experiment description, which was distributed to the APPOSITE project participants).
APPOSITE required participants to prepare their data files so that they meet the following constraints.
-Data files are in netCDF file format and ideally conform to the climate and forecast (CF) metadata convention (outlined on the website http://cf-pcmdi.llnl.gov).
In instances where it was not possible to produce fully CF compliant netCDF files, participants were required to follow the CMOR variable naming convention.
-There must be only one output variable per file.
-The file names have to follow the file naming convention outlined below.
Each variable is contained in a single directory of a directory tree with the following structure:
The Supplement related to this article is available online at doi:10.5194/gmd-9-2255-2016-supplement.

Figure 1. Time series of monthly mean September sea ice extent (sie, left column) and sea ice volume (siv, right column) in each model's control simulation (blue) with the line of best fit to data (black). Vertical grey lines indicate start years used to initialise simulations. Values on the time axis are model clock times, and do not correspond to the actual run length of the simulation.

Figure 3. Average sea ice thickness in present-day model control simulations and from PIOMAS (Schweiger et al., 2011).

Figure 4. Seasonal cycle of monthly mean sea ice extent (a), volume (b) and standard deviation of sea ice extent (c) and volume (d) in present-day model control simulations. The HadISST observations of sea ice extent and PIOMAS reconstruction of ice volume are included as a reference. These data were linearly detrended prior to calculating the variance.
Figure 5. (a, b) Lead-time dependence of SIE NRMSE and SIV NRMSE for all models. (c, d) Lead-time dependence of SIE ACC and SIV ACC for all models. September and March are marked by thin grey vertical lines. Dashed lines represent the averages across models. Circles indicate where metrics do not indicate significant predictability (at 95 %). Updated from Tietsche et al. (2014).

Figure 6. Top row: NRMSE of sea ice extent, but calculated only for start dates with anomalously low, medium or high sea ice volume, relative to the control run climatology. Bottom row: as the top row but for the ACC metric. The black dashed line shows the multi-model average of each metric and grouping. The number of start dates in the low, medium and high bins is two for all models except E6F (three) and CanCM4 (seven).

Table 1. Details of simulations submitted to the APPOSITE database.
where runtype is "ctrl" or "pred" for the control run or ensemble predictions respectively, model is the name of the climate model (e.g. hadgem1_2, mpiesm), variable is the CMOR name for a given climate variable and submodel&frequency indicates the model subcomponent and frequency (e.g. Amon, Aday, Omon and Oday). Files are named using the following convention:
<variable>_<submodel&frequency>_<model>_<runtype>_<run>_<time>.nc
where run is a concatenated string including the start year, prediction start month and ensemble member number for ensemble predictions (e.g. 2005Jul3), or simply contains "1" for a control run.
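The file naming convention can be illustrated with a small helper. This is a sketch, not a tool distributed with the archive: the function names are hypothetical, and the format of the time string used in the example ("200507-200806") is an assumption, since the text does not specify it.

```python
def prediction_run_label(start_year, start_month, member):
    """Build the run string for an ensemble prediction, e.g. 2005Jul3."""
    return f"{start_year}{start_month}{member}"

def apposite_filename(variable, table, model, runtype, run, time):
    """Assemble <variable>_<submodel&frequency>_<model>_<runtype>_<run>_<time>.nc."""
    if runtype not in ("ctrl", "pred"):
        raise ValueError("runtype must be 'ctrl' or 'pred'")
    return f"{variable}_{table}_{model}_{runtype}_{run}_{time}.nc"

# Example: monthly near-surface air temperature from a HadGEM1.2 prediction.
# The time range string is an assumed format, used only for illustration.
run = prediction_run_label(2005, "Jul", 3)
name = apposite_filename("tas", "Amon", "hadgem1_2", "pred", run, "200507-200806")
print(name)
```

Because model names such as hadgem1_2 contain underscores, splitting a file name on underscores does not recover the fields unambiguously; building names from known fields, as here, avoids that problem.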