An evaluation of current capabilities of modelling low-frequency climate variability

A crucial performance test of Earth system models is their ability to simulate the climate mean state and variability. Here we concentrate on the representation of inter-annual to multi-decadal variability in 12 CMIP5 climate model simulations. The reference climate is provided by two 20th century reanalysis data sets of monthly mean near-surface air temperature. The spectral decomposition is based on Randomised Multi-Channel Singular Spectrum Analysis (RMSSA). The main results are as follows. First, the total spectra for the two reanalysis data sets are remarkably similar at all time scales, except that the spectral power of decadal variability (10-30 yr) differs in these data by about 30 %. Second, none of the 12 coupled climate models closely reproduces all aspects of the reference spectra, although some models represent many aspects well. For instance, the IPSL-CM5B-LR model is close to the reanalyses but has too little multi-decadal variability, and the HadGEM2-ES model is close to the reanalyses except for notable over-activity at periods around 10 yr.


Introduction
The ultimate goal in developing Earth system models (ESM) is to exploit the inherent predictability of the Earth system to reduce weather and climate related uncertainties in our daily life, and to guide societies in making sustainable choices (e.g., Slingo and Palmer 2011; Meehl et al. 2014). Prediction tools are very complex, and their testing goes hand-in-hand with their development. A crucial performance test of ESMs is related to their ability to simulate well the observed climate mean state and the variability around the mean.
Here we focus on ESMs of today and how they represent inter-annual to multi-decadal climate variability. This is a very broad range of temporal scales, and it is associated with a multitude of spatial scales. Generally speaking, spectral misrepresentations appear either as a lack of variability in a model or as over-activity of a model at some temporal scales. Conclusions about model deficiencies based on spectral differences are very scale dependent, and some general guidance can be obtained by thinking about the mechanisms of natural climate variability (e.g., Ghil 2002). Essentially, short time scale variability (below 2 yr) in the model spectrum of near-surface air temperature is most likely related to the representation of internal variability and to the ability of the coupled models to realistically capture the observed temperature distribution (Flato et al., 2013). These include processes in the Earth system component models (atmosphere, ocean, etc.) as well as in their mutual coupling.
The historical simulations were extracted from the CMIP5 data archive, and they follow the CMIP5 experimental protocol (Taylor et al. 2012). The 20th century simulations use the historical record of climate forcing factors such as greenhouse gases, aerosols, solar variability, and volcanic eruptions. We used a single ensemble member of each model, and the model data sets were interpolated onto a common grid of 144 × 73 points.
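The interpolation onto the common grid can be sketched as follows. The text does not state the interpolation method, so this NumPy sketch assumes simple separable (bilinear) interpolation; the function and variable names are ours:

```python
import numpy as np

def regrid_to_common(field, src_lat, src_lon, n_lat=73, n_lon=144):
    """Interpolate one (lat, lon) field onto the common 144 x 73
    (2.5 degree) grid by separable linear interpolation."""
    tgt_lat = np.linspace(-90.0, 90.0, n_lat)
    tgt_lon = np.linspace(0.0, 360.0 - 360.0 / n_lon, n_lon)
    # first interpolate along latitude, column by column
    tmp = np.empty((n_lat, src_lon.size))
    for j in range(src_lon.size):
        tmp[:, j] = np.interp(tgt_lat, src_lat, field[:, j])
    # then along longitude, row by row
    out = np.empty((n_lat, n_lon))
    for i in range(n_lat):
        out[i, :] = np.interp(tgt_lon, src_lon, tmp[i, :])
    return out
```

Applied to, e.g., a 192 × 94 Gaussian-gridded 20CR field, this returns a 73 × 144 array on the common grid.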
As a reference, we used two reanalysis data sets: the 20th Century Reanalysis V2 data (hereafter 20CR) provided by the NOAA/OAR/ESRL PSD (Compo et al., 2011), and the ERA-20C data provided by ECMWF (Poli et al., 2013). The data sets are produced using an ensemble of perturbed reanalyses, and the final data set corresponds to the ensemble mean. In 20CR, only surface pressure observations are assimilated, and the observed monthly sea-surface temperature and sea-ice distributions from HadISST1.1 (Rayner et al., 2003) are used as boundary conditions (Compo et al., 2011). In ERA-20C, observations of surface pressure and surface marine winds are assimilated (Poli et al., 2013). Unlike 20CR, it uses a more recent sea-surface temperature and sea-ice cover analysis, HadISST2 (Rayner et al., 2006). Both 20CR and ERA-20C are forced by the historical record of changes in climate forcing factors (greenhouse gases, volcanic aerosols, and solar variations). In order to be consistent with the climate model simulations, the same time period is used (1901-2005, i.e., 1260 monthly mean fields). 20CR has a ∼2.0 degree horizontal resolution, and we have used Gaussian-gridded (192 × 94) data from 3-hour forecast values. The horizontal resolution of ERA-20C is approximately 125 km (T159) on a grid of 360 × 181 points.

Data processing
Some pre-processing of the data sets is needed before applying RMSSA and statistical significance testing. The data sets were standardised (i.e., the time series of each grid point was mean-centered and divided by its standard deviation) to avoid overweighting the grid points with higher variance. This adds weight to the lower-latitude variability, where ENSO-type variability is pronounced. On the other hand, with no scaling the higher-latitude variability would dominate because of the larger amplitude variations there. Furthermore, each data set was de-trended, and the dominating annual cycle was estimated using Seasonal-Trend Decomposition (STL; Cleveland et al., 1990) and removed from the original time series.
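As a rough illustration of this pre-processing chain, the following NumPy sketch de-trends, removes an annual cycle, and standardises one grid-point series. Note that the study uses STL for the annual cycle; a simple monthly climatology stands in for it here, and the function name is ours:

```python
import numpy as np

def preprocess(series):
    """Pre-process one grid-point series of monthly means (length a
    multiple of 12): de-trend, remove the mean annual cycle, and
    standardise to zero mean and unit variance."""
    t = np.arange(series.size)
    # de-trend: subtract the least-squares linear fit
    slope, intercept = np.polyfit(t, series, 1)
    anom = series - (slope * t + intercept)
    # remove the annual cycle (monthly climatology stands in for STL)
    clim = anom.reshape(-1, 12).mean(axis=0)
    anom = anom - np.tile(clim, series.size // 12)
    # standardise so that high-variance grid points do not dominate
    return (anom - anom.mean()) / anom.std()
```

For the data sets of the study, this would be applied to each of the grid-point series of the 1260-month record before the dimension reduction step.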
After the pre-processing, the dimension reduction step of RMSSA was applied so that approximately 3 % of the original dimensions of the 20CR data, 0.8 % of ERA-20C, and about 5 % of the climate model data were retained; the different percentages are due to the different volumes of the original data. The lag window in the analysis was 20 yr (240 months). The total spectrum of each data set was obtained from this analysis.
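The dimension reduction and lag-augmentation steps can be sketched as follows. The study's implementation is in R; this Python sketch assumes a Gaussian random projection (one common choice for randomised methods, not necessarily the one used here), and all names are illustrative:

```python
import numpy as np

def rmssa_spectrum_input(X, k, M=240, seed=0):
    """Sketch of the RMSSA dimension-reduction step: project the
    (time x gridpoints) data matrix X onto k random directions,
    build the lag-augmented trajectory matrix with window M months,
    and take its SVD."""
    rng = np.random.default_rng(seed)
    n_t, n_s = X.shape
    # random projection: k << n_s compressed channels
    R = rng.standard_normal((n_s, k)) / np.sqrt(k)
    Y = X @ R                                   # (n_t, k)
    # augmented (trajectory) matrix: M lagged copies of each channel
    n_rows = n_t - M + 1
    A = np.concatenate([Y[i:i + n_rows] for i in range(M)], axis=1)
    # SVD of A yields the space-time EOFs and PCs of the compressed data
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U * s, Vt, s                         # ST-PCs, ST-EOFs, singular values
```

The squared singular values measure the variance carried by each space-time mode, from which a total spectrum can be assembled.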
In the statistical significance test, the first 50 PCs of each data set were retained as input. Those PCs explain 80 % of the variability in 20CR, 75 % in ERA-20C, and 70 %-80 % in the climate model data sets. A total of 1000 realisations of red noise surrogate data sets were generated, and confidence intervals (90 % and 95 %) for the oscillatory modes were estimated. We note that the transformation to PCs may interfere with the detection of weak signals, as demonstrated by Groth and Ghil (2015).
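The red-noise surrogates can be illustrated with the following sketch, which fits an AR(1) model to one PC and simulates surrogates with matched lag-1 autocorrelation and variance. It is a minimal stand-in for the full Monte-Carlo MSSA machinery of Allen and Robertson (1996), and the names are ours:

```python
import numpy as np

def ar1_surrogates(pc, n_sur=1000, seed=0):
    """Generate red-noise (AR(1)) surrogate series matched to a
    principal component's lag-1 autocorrelation and variance."""
    rng = np.random.default_rng(seed)
    x = pc - pc.mean()
    # lag-1 autocorrelation and innovation std of the fitted AR(1)
    phi = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])
    sig = np.sqrt(x.var() * (1.0 - phi**2))
    n = x.size
    sur = np.empty((n_sur, n))
    sur[:, 0] = rng.standard_normal(n_sur) * x.std()
    eps = rng.standard_normal((n_sur, n)) * sig
    for t in range(1, n):
        sur[:, t] = phi * sur[:, t - 1] + eps[:, t]
    return sur
```

Projecting each surrogate onto the MSSA eigenvectors and taking the 90th and 95th percentiles of the surrogate variances would then give the confidence intervals against which the data modes are tested.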
In the following section, we will compare the spectral properties of the reanalysis and model data sets. Furthermore, we will test the spectra of each data set against a red noise null hypothesis in order to distinguish signal from noise. Finally, we will compare the spatial patterns of an oscillatory mode with a 3.5 yr period as represented by the different data sets.

Spectral similarity of the two reanalysis data sets
We first demonstrate the spectral similarity of the two reanalysis data sets. Figure 1 displays, in terms of explained variance, the leading 30 ST-PCs for 20CR and ERA-20C and the corresponding power spectra. The decomposition reveals that variance is distributed in a very similar way in 20CR and ERA-20C. The ordering of component pairs is not identical, but there is a very clear correspondence of the spectral peaks. For instance, the 1.7 yr peak in the 20CR components 29-30 corresponds to components 24-25 in ERA-20C. In summary, the trend components with multi-decadal periods (components 1 and 2) explain 6 % and 5.3 % of the variance in 20CR and ERA-20C, respectively, with very similar spectra (the length of the time series, 105 years, of course restricts the correct identification of multi-decadal oscillations). The spectral peak at 3.5 yr and the broad one around 5 yr are very similar in components 3-6, although the 10-20 yr variability associated with ENSO in 20CR appears as the separate components 9-10 in ERA-20C (i.e., ENSO variability has a decadal component in 20CR). The similarity of 20CR and ERA-20C is striking in the total spectra (Fig. 2a). Variability shorter than about 10 years is captured similarly by the two reanalyses. 20CR is somewhat more lively on the decadal scale (10-30 yr) and has about 30 % higher spectral density there. One explanation for this difference may be the decadal variability of ENSO, which is present in 20CR (Fig. 1, components 5-6) but missing from ERA-20C.
The statistical significance testing of the periodicities in 20CR (Fig. 2b) and ERA-20C (Fig. 2c) reveals that nearly the same periods rise above the red noise in the two data sets (the red dots above the area covered by the vertical bars). It is logical that the frequency corresponding to the annual cycle is present in the red noise surrogates while it is removed from the data sets (and therefore the red dots are far below the bars). The periods rising above the red noise at the 5 % and 10 % significance levels are tabulated in Table 2, columns 1 and 2, and are nearly the same in 20CR and ERA-20C. Our conclusion is therefore that the low-frequency climate variability in the near-surface air temperature is very similar in the two reanalyses.

Evaluation of the simulated spectra
The climate model simulation data were processed in exactly the same way as the reanalysis data. The total spectrum of each model is shown in Figure 3, with the reanalysis spectra in the background as a reference (dotted lines). The spectra are objective measures of model performance. We evaluate subjectively how the models reproduce the reanalysis spectra, and adopt the following terminology: a model can be under-active (over-active) if its spectral density is lower (higher) than in the reanalysis spectra, or on-target with respect to spectral density. This enables us to make an overall evaluation of the current capabilities of climate models to represent low-frequency variability.
The subjective evaluation is summarised in Table 3. In representing multi-decadal variability, three models are on-target while the rest are under-active. In decadal variability, the majority of the models are on-target, while four are over-active. Only one model has both of these variabilities on-target (a). In ENSO variability, two models (i and g) seem to have both the 3.5 and 5 yr periods on-target. Five models have one of these periods on-target, and the rest of the models are either under- or over-active.
With respect to the inter-annual variability, four models seem to be over-active, while the rest are on-target.
We are not going to rank the models, but two models (i and g) seem to perform particularly well and have four out of five key periodicities on-target. The IPSL-CM5B-LR model (i) is close to the reanalyses all the way from inter-annual and ENSO variability to the decadal scale of 20 years, and only seems to have too little multi-decadal variability. The HadGEM2 model (g), on the other hand, is reasonably close to the reanalyses across the spectrum, except for the notable over-activity at periods around 10 yr. CanESM2 (a) is the only model that closely reproduces both the decadal and multi-decadal variability.
Overall, the clearest signal here is that the models generally seem to lack multi-decadal variability, and some models are over-active in representing decadal and inter-annual variability. In ENSO time scales, only two models are on-target.

Significance test by Monte-Carlo MSSA
Table 2 contains the periods rising above the red noise at the 5 % and 10 % significance levels. The periods for 20CR and ERA-20C are in columns 1 and 2. Both data sets have five periodicities significant at the 5 % level, four of which are common and one specific. Additionally, there are three common periodicities significant at the 10 % level. In terms of statistical significance, 20CR and ERA-20C behave in a very similar manner.
Table 2 gives the impression that the number of statistically significant periodicities is, in general, very large in the models. In this respect, the climate model data are dissimilar to the reanalysis data. There is only one model (model k) which has fewer significant periodicities than the reanalyses, and three models (g, h, j) are somewhat close to the reanalyses with 11-13 significant periodicities. The rest of the models have many more periodicities (up to 30). Interestingly, the HadGEM2 model (g) has the 10 yr peak, discussed earlier, significant at the 5 % level.
The numbers of models that correctly detected either the 20CR or ERA-20C periodicities were seven for the 1.7 yr period, five for 2.2 yr, four for 3.5 yr, seven for 3.6 yr, four for 4.2 yr, and two for 5.2 yr. In addition to these, there were many "false alarms", as seen in Table 2.

Spatial patterns of the 3.5 yr mode
The oscillatory mode with a 3.5 yr period was significant in 20CR and ERA-20C and in seven climate models. The spectral peak at 3.5 yr has some power on both sides of the peak. Therefore, a 3-4 yr mode would perhaps be a more appropriate name for the peak. Oscillations at these periods are usually attributed to ENSO (e.g., Kleeman, 2008). We illustrate here how this periodicity appears in the different data sets. Since the 3.5 yr period is not reproduced by all of the climate models, we have in these cases chosen a periodicity that is close to 3.5 years.
We have calculated the reconstructed components (RC) corresponding to the 3.5 yr mode in order to visualise the associated spatial temperature anomaly patterns. The result is, at each grid point, a time series of the original (centered) data corresponding to the 3.5 yr mode, scaled according to its original variance. We therefore have an animation of the global 3.5 yr mode for each data set for the whole time period.
To synthesise the animations, we have calculated composite maps of the 3.5 yr mode for each data set (Fig. 4). The patterns are composites of eight instances when the mode is in its maximum positive phase in the Niño3.4 region (120°W-170°W, 5°N-5°S). The positive events are defined as averages over the winter months (November-March).
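The compositing step above can be sketched as follows. For brevity, this illustrative sketch averages over the n strongest monthly positive phases rather than full November-March winter means, and the names are ours:

```python
import numpy as np

def composite_positive_phase(rc, nino34_idx, n_events=8):
    """Composite map of a reconstructed component over its strongest
    positive phases: average the RC field over the n_events time steps
    with the largest Nino3.4-mean amplitude.
    rc: (time, lat, lon) RC fields; nino34_idx: per-step area mean."""
    top = np.argsort(nino34_idx)[-n_events:]
    return rc[top].mean(axis=0)
```

The resulting (lat, lon) map corresponds to one panel of Fig. 4 for a given data set.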
The top row of Fig. 4 displays the patterns for the reanalyses. One can see typical El Niño related temperature anomalies, such as positive anomalies in the equatorial Pacific Ocean, South America, and northwestern North America. There is also a cold anomaly over the northern parts of the Eurasian continent, a dipole structure in western Antarctica, and a warm Africa.
The patterns are remarkably similar in 20CR and ERA-20C, with somewhat larger amplitudes in 20CR, except for the stronger dipole in ERA-20C. Next, we will try to subjectively assess model performance in reproducing the spatial patterns of the 3.5 yr oscillation.
All models produce a warm pool in the equatorial Pacific Ocean and South America. The amplitude is larger than in the reanalyses or the area extends too far to the west in five models (a, b, d, e, h). Unlike in the reanalyses, the pool extends to the Atlantic in four models (a, d, e, h). The warm anomaly in the South American land area is too weak in five models (c, f, g, j, k).
The warm anomaly in northwestern North America is present in all the models. In the reanalyses, the anomaly is strictly confined to land areas, but in most models it is either misplaced or extends to the adjacent sea areas and the Eurasian continent.
The cold anomaly over the Eurasian continent is not well represented in the model data.There is a weak cold anomaly in three models (f, i, l), a weak warm anomaly in one model (c), and a mixture of cold and warm areas in the rest.
The warm pool in the Amundsen Sea, related to the warm-cold dipole around western Antarctica, is present in nine models (a, e, f, g, h, i, j, k, l). The cold pool in the Weddell Sea is present in five models (c, d, f, g, k). Three models (f, g, k) represent both pools, and thus the dipole structure is well represented. In Africa, the anomaly is on the positive side in all models.
In Table 3, there are three over-active models with respect to the 3.5 yr period (a, d, e). These are associated with large amplitudes in Fig. 4. Additionally, two more models (b, h) seem to have large amplitudes in Fig. 4. Both of these have higher spectral density than the reanalyses, but were nevertheless assessed as "on-target" in Table 3. Four models were under-active in Table 3 (c, f, j, k). These correspond to the low-amplitude maps in Fig. 4. In summary, the patterns of Fig. 4 support the subjective analysis of the total spectra in Table 3.

Discussion
Table 2 shows that there is a much larger number of significant periodicities in the model data than in the reanalyses. This seems to imply that "nature" tends to produce only a few but statistically significant periodicities, and the other potential periodicities are somehow "worn out" via non-linear interactions and feedbacks so that they cannot be distinguished from red noise. We can think of two possible causes for this. First, it may simply be because the reanalysis data represent an ensemble mean while the model data are individual simulations. This difference may appear as a difference in the number of significant periodicities.
Second, it may be that something in the models prevents this wearing-out process from occurring, so that we observe an excessive number of periodicities above the noise level, as if the models were missing some non-linear processes. However, MPI-ESM (model k) has fewer significant periodicities than observed, and to our knowledge this model does not fundamentally differ from the other models. Therefore, it is unclear what exactly lies behind this result.
We have not discussed the interpretation of the 1.7 yr inter-annual variability. Based on the related spatial patterns, it may be a harmonic of the ENSO variability (2 × 1.7 yr = 3.4 yr).

Conclusions
In this study, we decomposed the 20th century near-surface temperature variability into its inter-annual to multi-decadal eigenmodes. We used two state-of-the-art 20th century reanalysis data sets and 12 historical climate model simulations extracted from the CMIP5 data archive. The analysis was performed using randomised multi-channel singular spectrum analysis (RMSSA), which is particularly well suited to high-dimensional time series analysis. The statistical significance of the identified eigenmodes was estimated with Monte-Carlo simulations. The main conclusions are as follows:
-The spectral properties of the two reanalyses (20CR and ERA-20C) are remarkably similar, the only notable difference being the different spectral density in the decadal-scale variability (10-30 yr). Also, nearly the same periodicities rise above the red noise at the 5 % and 10 % significance levels.
-The majority of the climate models are under-active in representing multi-decadal climate variability (> 30 yr), some models are over-active in decadal (10-30 yr) or inter-annual (< 2 yr) variability, and only two models are on-target in both the 3.5 yr and 5 yr ENSO variabilities.
-The IPSL-CM5B-LR and the HadGEM2-ES are the closest to the total spectra of the reanalyses.
Relaxation of the uni-variate nature of the present study remains a subject for future research. It would be very interesting to extend the set of variables, since the use of random projections allows efficient data compression. Of special interest would be to study the behaviour of variables directly linked with coupling processes, such as heat, momentum, and moisture fluxes over the oceans.

Data and code availability
All data used in this study were downloaded from open sources. The RMSSA algorithm and the statistical significance testing are implemented using GNU-licensed free software from the R Project for Statistical Computing (www.r-project.org). Our RMSSA with significance testing is briefly presented in the following. Testing the MSSA components against a red-noise null hypothesis requires orthogonal input vectors, which are obtained by first calculating a conventional PCA and retaining a set of dominant PCs. Therefore, some additional calculation steps are included in the RMSSA algorithm: the SVD of the lower-dimensional matrix P is calculated to obtain the principal components (PCs, calculated as UD^(1/2)). The PCs fulfil the orthogonality constraint exactly. The PCs that explain a large part of the variance of the data set (e.g., the first 50) are retained to form the matrix T, whose columns are the PCs. Next, the augmented matrix A_PC is constructed from T, and the SVD is calculated as in Eq. (A2).
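The steps above might be sketched in Python as follows (the study's implementation is in R). The matrix names follow the text; the D^(1/2) scaling is taken literally from the description, and the function names are ours:

```python
import numpy as np

def retained_pcs(P, n_pc=50):
    """SVD of the lower-dimensional matrix P; the PCs are U D^(1/2),
    and the leading n_pc of them form the matrix T (columns = PCs),
    which are exactly orthogonal."""
    U, d, Vt = np.linalg.svd(P, full_matrices=False)
    T = U[:, :n_pc] * np.sqrt(d[:n_pc])
    return T

def augmented_matrix(T, M):
    """Construct the augmented matrix A_PC from T: M lagged copies
    of each retained PC side by side, as input to the second SVD
    (Eq. A2)."""
    n_rows = T.shape[0] - M + 1
    return np.concatenate([T[i:i + n_rows] for i in range(M)], axis=1)
```

The second SVD is then applied to A_PC, and the surrogate data sets are passed through the same two steps for the significance test.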
Finally, a large number of red-noise processes (i.e., surrogate data sets) are generated, and the confidence limits for the MSSA eigenmodes are determined. This significance test (Monte-Carlo MSSA) is described in detail in Allen and Robertson (1996).
Author contributions. HJ suggested the study and mostly wrote the article. TS implemented the methods, performed all computations, and wrote the data and method descriptions. JS supported the method development, and JR the climate model data analysis.

Figure 1. ST-PCs 1-30 of 20CR and ERA-20C monthly near-surface temperature 1901-2005 and their spectra. The lag window length M used in RMSSA is 20 years (240 months). The annual cycle and linear trend are removed from the original data set before applying RMSSA. The proportion of the remaining variance explained by each component is also presented in the figure.

Figure 3. Total spectrum of each climate model. The total spectrum is the sum of the spectral densities of all components (ST-PCs) related to each data set. For comparison, the total spectra of the reanalysis data sets are plotted in each subfigure. The unit of the spectra is arbitrary.

Table 2.
Approximate periodicities (in years) detected by the Monte-Carlo significance test with a 20 year lag window length. Similar periodicities among the data sets are aligned. Numbers in the table are in bold if the significance level is 5 % and in italics if 10 %. Dominant periodicities of the oscillations are estimated with MTM.