Climate models are extremely complex pieces of software. They reflect the best knowledge of the physical components of the climate; nevertheless, they contain several parameters that are too weakly constrained by observations and can potentially lead to a simulation crashing. Recently, a study by Lucas et al. (2013) showed that machine learning methods can be used to predict which combinations of parameters can lead to a simulation crash and hence which processes described by these parameters need refined analyses. In the current study we reanalyse the data set used in that research using a different methodology. We confirm the main conclusion of the original study concerning the suitability of machine learning for the prediction of crashes. We show that only three of the eight parameters indicated in the original study as relevant for the prediction of a crash are indeed strongly relevant, three others are relevant but redundant, and two are not relevant at all. We also show that the variance due to the split of data between training and validation sets has a large influence both on the accuracy of the predictions and on the relative importance of the variables; hence only a cross-validated approach can deliver a robust estimate of performance and of the relevance of variables.
The development of realistic climate models is one of the most important areas of research, owing to the dangers posed by global warming. It is by no means a trivial task, since it involves the parameterisation of many processes that are not resolved directly within the model. Lucas et al. (2013) showed that certain combinations of these parameters lead to failure of a model, even though each individual parameter has a reasonable value. The authors of that study performed 540 simulations with randomly varied combinations of 18 parameters of the Parallel Ocean Program (POP2) (Smith et al., 2010) module in the Community Climate System Model Version 4 (CCSM4) (UCAR, 2010). About 10 % of these simulations crashed due to numerical instabilities. They then applied machine learning methods to attribute the failures to the parameters of the model. To this end they used support vector machine (SVM) (Vapnik, 1995) classification to quantify and predict the probability of failure as a function of the values of the 18 POP2 parameters. The causes of the simulation failures were determined through a global sensitivity analysis. Combinations of eight parameters related to ocean mixing and viscosity, from three different POP2 parameterisations, were identified as the major sources of the failures. These eight parameters were indicated as targets for more detailed research.
These results are somewhat disappointing, since the number of parameters is still rather high. Hence we decided to check whether a more elaborate method of analysis could decrease this number further. We observed a potential weak point of the analysis performed by Lucas et al. (2013): it did not fully take into account that the apparent importance of a variable for classification may in fact be the result of a spurious fluctuation. The problem is most acute when the sample used by a machine learning algorithm is small. In such a case random fluctuations may introduce spurious correlations within the data, which can be utilised by the classification algorithm for model building. An appropriate procedure should be applied to minimise the influence of such random correlations on the final results.
Lucas et al. (2013) also analysed the impact of the decision variable used for the classification on the quality of the results. While the models were built as an ensemble of learners trained on bootstrap samples of the training set, the evaluation of the classification performance was based on a single split of the data into a training set and a test set. This set-up was due to the construction of the study – the simulations for the validation set were performed after the predictions had been made. While this is a very honest method for the verification of the predictions, it precludes the estimation of the statistical uncertainty of the result. In particular, it is impossible to say whether the observed differences between the classification accuracies obtained for different decision functions are significant or whether they arise from statistical fluctuations.
The current study is devoted to a reanalysis of the data. It aims at minimising the influence of random fluctuations on the final results. Our goal was to establish all the variables that truly contribute to the final result of a simulation, i.e. whether the simulation finished successfully or crashed. To this end we use contrast variables, which carry no information on the decision variable, apply the Boruta algorithm for all-relevant feature selection, and perform extensive Monte Carlo sampling. We also compare the quality of classification for several subsets of variables used for the prediction of the simulation result, as a parallel check of the relevance of the variables.
As in the original work, we rely on machine learning algorithms to identify the parameters that critically influence the fate of a simulation. The fundamental idea is that, if the classification algorithm can predict the result of a simulation, i.e. successful completion or a crash, using only the information on the values of certain combinations of selected parameters, then these parameters are indeed responsible for the result. In the original paper the authors performed true prediction and achieved a high degree of accuracy, thereby demonstrating the true predictive power of this approach. On the other hand, this set-up precludes the estimation of statistical uncertainty for some of their findings. In particular, the discussion of the prediction accuracy in Sects. 4.4 and 4.5 is based on a single split of the data into training and test sets and ignores the possibility that the effects may depend on the particular split.
In the current study we know all the results beforehand; thus we are limited to virtual predictions only. In this approach we split the entire data set into training and validation sets. We then build a model using the training set and check its quality by performing a virtual prediction on the validation set and comparing the predicted results with the true ones. One can take advantage of virtualisation to obtain information about the probability distribution of the results. To this end one can perform multiple virtual experiments, with different splits between training and validation sets, and perform a classification experiment on each of these splits. The results of the individual trials will differ in most cases, allowing one to draw conclusions not only about mean values but also about the variance and even the shape of the probability distribution. Lucas et al. (2013) used this approach for the sensitivity analysis, utilising ensembles of SVM (Vapnik, 1995) learners for classification. Each member of the ensemble was obtained using a different subsample of the training set. The classifier was then used to predict the simulation result for the validation set.
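The repeated-split scheme can be sketched in a few lines. The sketch below is a Python/scikit-learn analogue on synthetic data (the sample sizes merely mirror the 540 runs and 18 parameters of the study, which itself was carried out in R); it illustrates how multiple virtual experiments yield a distribution of scores rather than a single number.

```python
# Virtual prediction: repeatedly split known data into training and
# validation sets and record a score on each split, producing a whole
# distribution of performance values. Synthetic stand-in data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(540, 18))               # stand-in for the 18 POP2 parameters
y = (X[:, 0] + X[:, 1] > 2.2).astype(int)    # synthetic "crash" labels

aucs = []
for seed in range(30):                       # 30 different splits, as in the study
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=180, stratify=y, random_state=seed)
    model = RandomForestClassifier(n_estimators=300, random_state=seed)
    model.fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_va, model.predict_proba(X_va)[:, 1]))

# Mean, variance and range are all available from the 30 virtual trials.
print(min(aucs), float(np.mean(aucs)), max(aucs))
```

The spread between the minimum and maximum AUC over the 30 splits is precisely the split-induced variance that a single train/test split cannot reveal.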
We used a different classification algorithm, namely Random Forest (Breiman, 2001), and instead of the sensitivity analysis we applied the all-relevant feature selection algorithm Boruta (Kursa et al., 2010). All computations were performed in the R environment for statistical modelling (R Development Core Team, 2008), using the randomForest package for classification (Liaw and Wiener, 2002) and the Boruta package for feature selection (Kursa and Rudnicki, 2010). Interestingly, some of the authors of Lucas et al. (2013) have recently used Random Forest in their analysis of the results of the CAM5 model applied to the study of the Madden–Julian oscillation, where it served to analyse the influence of the model parameters on selected diagnostic variables (Boyle et al., 2015).
Random Forest is an ensemble algorithm based on decision trees. To ensure low correlation between the elementary learners, each tree is grown on a different random subsample of the original data set. Moreover, each split in a tree is built using only a random subset of the predictor variables. The number of variables in this subset influences the balance between bias and variance for the training set; the default value for classification tasks is the square root of the total number of variables, and it is usually a very robust choice. Random Forest is a robust “off-the-shelf” algorithm that is easily applicable to various classification and regression tasks. It has only a few control parameters and usually does not need fine-tuning for the particular problem under scrutiny. In many cases its performance is comparable to or even better than that of state-of-the-art classifiers, and it rarely fails. A big advantage of the algorithm is that it estimates both the classification error and the importance of variables by internal cross-validation. To estimate the latter, it measures how much the accuracy of the base learners decreases when the information about the variable in question is removed from the system.
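The importance mechanism described above can be illustrated with a minimal example: permute one predictor and measure how much the accuracy drops. This is a Python/scikit-learn sketch on synthetic data, not the R randomForest code used in the study; the data and variable roles are invented for illustration.

```python
# Permutation importance: shuffle one predictor at a time and measure the
# resulting drop in accuracy; an uninformative variable shows no drop.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 0).astype(int)            # only the first variable matters

forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=1).fit(X, y)

# oob_score_ is the internal (out-of-bag) accuracy estimate mentioned above.
print(forest.oob_score_)

imp = permutation_importance(forest, X, y, n_repeats=10, random_state=1)
# The truly informative variable suffers the largest accuracy drop.
print(imp.importances_mean.argmax())
```

The out-of-bag estimate plays the role of the "internal cross-validation" referred to in the text: each tree is evaluated on the samples it was not trained on.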
Protocol of the first test. An example list of important variables from a single Boruta run is shown. Core variables are highlighted in boldface; the contrast variable is shown in red.
The Boruta algorithm infers the relevance of features from the Random Forest importance estimate. To this end it extends the information system with variables that are non-informative by design – the so-called contrast variables – and compares the apparent importance of the original variables with that of the non-informative ones. It does this multiple times, using different realisations of the non-informative variables, and performs a statistical test. The algorithm finds both strongly and weakly relevant variables. The notions of strong and weak relevance were introduced by Kohavi and John (1997) in the context of the ideal classification algorithm. The features are strongly relevant when they carry information about the decision that cannot be obtained from any other variable, and weakly relevant when they are informative but redundant, i.e. the same information is also carried by other variables.
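The contrast-variable idea can be sketched by hand. The snippet below shows a single, simplified iteration (Boruta performs many iterations and applies a statistical test), written in Python with synthetic data rather than with the Boruta R package used in the study.

```python
# One iteration of the shadow-feature idea behind Boruta: add permuted
# copies of every variable, then count a real variable as a "hit" only if
# its importance exceeds that of the best shadow variable.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # V0 strong, V1 weaker signal

shadows = rng.permuted(X, axis=0)                # shuffle each column: link to y destroyed
X_ext = np.hstack([X, shadows])                  # original + contrast variables

forest = RandomForestClassifier(n_estimators=500, random_state=2).fit(X_ext, y)
real_imp = forest.feature_importances_[:6]
shadow_max = forest.feature_importances_[6:].max()

hits = np.where(real_imp > shadow_max)[0]        # candidates for relevance
print(hits)
```

In the full algorithm the hit counts accumulated over many such iterations feed a statistical test that classifies each variable as confirmed, rejected, or tentative.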
The climate simulation data set is highly biased towards successful completion of the simulation: only 46 cases out of 540 are failures. Such unbalanced data sets are often difficult for classification, because automatic selection of the majority class results in good, but useless, classification accuracy. In such a case no information is gained, and hence one cannot perform feature selection. In the first test of the current study this problem was avoided by application of the following protocol (see Fig. 1). Firstly, 11 balanced subsamples of the training set were constructed; each subsample consisted of all objects from the minority class (failed simulations) and 1/11 of the majority class (successful simulations). In order to check the specificity of the feature selection, each data set was extended with contrast variables. To this end each original variable was duplicated, and its values were randomly permuted between all objects. In this way a set of contrast variables was obtained that have the same distributions of values as the original variables but, by construction, carry no information on the decision variable.
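One way to realise the balancing step is sketched below. The assumption that the 11 slices of the majority class are disjoint is ours, and the Python code is merely an illustration of the protocol, not the original R implementation.

```python
# Build 11 balanced subsamples: each keeps all 46 failed runs and takes
# roughly 1/11 of the 494 successful runs. Indices are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n_fail, n_ok = 46, 494                         # class counts in the data set

fail_idx = np.arange(n_fail)                   # all minority-class objects
ok_idx = rng.permutation(np.arange(n_fail, n_fail + n_ok))

# Assumption: majority-class objects are split into 11 disjoint chunks.
subsamples = [np.concatenate([fail_idx, chunk])
              for chunk in np.array_split(ok_idx, 11)]

print(len(subsamples), sorted({len(s) for s in subsamples}))
```

Each resulting subsample holds roughly 90 objects, about half of them failures, which removes the class imbalance that would otherwise defeat the feature selection.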
Protocol of the second test.
Summary of the results of the Boruta run. The importance of the variables is shown, sorted in increasing order. Variables coloured in green are those classified as relevant; variables coloured in red are irrelevant. The blue boxes correspond to the minimal (sMin), median (sMed) and maximal (sMax) importance achieved in each run by the contrast variables. One can observe a wide range of maximal-importance values that can be achieved by random variables; in particular, in many iterations it can be higher than the importance of truly relevant variables.
The second test probing the importance of variables analysed the influence of the variables used for model building on the prediction quality. The first experiment revealed four variables that were classified as important by Boruta in all, or nearly all, of the 660 trials. These variables were considered to form a core variable set, and the model built on them was used as a reference. We then examined whether removing one of the core variables decreases, or adding another variable increases, the classification quality, measured by the area under the ROC curve (AUC). The extension of the core set was examined for the three variables that were classified in the first test as important significantly more often than the randomised variables.
The test was performed similarly to the one reported in the original study (see Fig. 2). The data set was randomly split into a training set containing 360 objects and a validation set containing 180 objects. The split was performed separately for the minority and majority classes, so the number of minority-class objects in each training set was 32 and in each validation set it was 14. The randomForest function from the identically named R package was used to perform the classification and error estimation. The procedure was repeated 30 times, and the results of the 30 repetitions were analysed.
The number of trees in the forest (parameter ntree in both the randomForest and Boruta functions) was set to 5000, both for feature selection with Boruta and for classification with randomForest. In both cases the number of variables examined at each split was equal to the square root of the total number of variables. In our experience these settings are fairly robust; we have examined them internally on multiple data sets (Rudnicki et al., 2015). Moreover, we checked in initial trials that they do not influence the results. The number of trees was set 10 times higher than the default to ensure that the importance estimates in Random Forest converge to their asymptotic values; the same number of trees was used for classification for consistency.
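The protocol of the second test (comparing the core model against models with one core variable removed or one candidate variable added, over repeated stratified splits) can be sketched as follows. This Python/scikit-learn sketch uses synthetic data and illustrative variable indices; the study itself used the randomForest R package with ntree = 5000.

```python
# Ablation test: score the core model, a model with one core variable
# dropped, and a model with one candidate added, on repeated splits.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(540, 6))
y = (X[:, 0] + X[:, 1] + 0.2 * X[:, 2] > 2.0).astype(int)

core = [0, 1, 2]                     # stand-ins for the core variable set
variants = {"core": core,
            "core-0": [1, 2],        # drop one core variable
            "core+3": core + [3]}    # add one candidate variable

scores = {name: [] for name in variants}
for seed in range(10):
    tr, va = train_test_split(np.arange(540), test_size=180,
                              stratify=y, random_state=seed)
    for name, cols in variants.items():
        m = RandomForestClassifier(n_estimators=300, random_state=seed)
        m.fit(X[np.ix_(tr, cols)], y[tr])
        p = m.predict_proba(X[np.ix_(va, cols)])[:, 1]
        scores[name].append(roc_auc_score(y[va], p))

means = {k: float(np.mean(v)) for k, v in scores.items()}
print(means)
```

Averaging over many splits is essential here: as the results below show, a single split can make any one of the variants look best.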
Summary of results. The variables indicated as important by Lucas et al. (2013) are marked with
The summary of the results of the study is presented in Table 1. Variables V1 and V2 were deemed important in all 660 cases. Variables V13 and V14 were deemed important in nearly all cases – 593 and 623 cases, respectively. All these variables were also indicated as most important by Lucas et al. (2013). However, the results do not agree so well for the other variables. Lucas et al. (2013) indicated variables V4, V5, V16 and V17 as important, but their influence on the final result was much weaker than that of the first group. In the current study variables V4 and V16 were deemed important by Boruta for 44 and 66 subsamples, respectively. In both cases the number is significantly higher than the average for the random variables, which was obtained as 25
Hence the first experiment confirmed the importance of variables V1 and V2 and showed that the importance of V13 and V14 is nearly universal; it also confirmed the weak importance of variables V4 and V16. On the other hand, the importance of variables V5 and V17 was not confirmed by our method; instead, variable V9 was found to be weakly important. An example result of a Boruta run for an interesting sample is presented in Fig. 3. In this sample the importance was confirmed for variables V9 and V16, whereas variable V13 was deemed irrelevant. The importance of V4 was higher than that of the highest random variable, but only barely so, and hence the final decision of Boruta was “tentative”. One should note that the importance returned by Boruta is the averaged importance obtained from the underlying Random Forest algorithm; it is not directly interpretable in terms of the fraction of variance explained by a given variable.
One should note that Boruta is an all-relevant feature selection algorithm that aims at finding both strongly and weakly relevant variables, as defined by Kohavi and John (1997). The second test aimed at discerning between strongly and weakly relevant variables. In the case of V1 and V2, removal of the variable from the core data set resulted in a dramatic drop of the AUC, confirming that these variables are truly informative (see Table 1 and Fig. 2). In the case of V14 the difference in AUC – referenced further as
AUC obtained in the simulation study, grouped by the subset of variables used for model building. The labels are coded in the following way: C – core set of variables
AUC obtained in the simulation study, grouped by the split between training and validation sets.
Results of experiment 2. Average AUC obtained for all tested models, as well as examples for five interesting cases: no. 1 – the sample with the lowest AUC for the core model; no. 12 – the sample with the highest AUC obtained in the study; nos. 6, 22 and 30 – samples with the core model close to the mean, which show the variance of the AUC for the other models. The highest AUC value obtained in the experiment is highlighted in boldface.
One should note that the results of the second test were highly variable and depended strongly on the split of the data between training and validation sets. This is illustrated in Fig. 5, and examples of the results for several samples are given in Table 2. The highest AUC obtained in the experiment was 0.990, for the model built using the core variables and V16 in sample no. 12. In the same sample the AUC for the model built from core-V2 was 0.888. On the other hand, for sample no. 1 the highest AUC was obtained for the model built on core
The average of the cross-validated AUC obtained for the three strongly important variables – namely V1, V2 and V13 – was 0.924. The highest average AUC was obtained for the model built using five variables, namely
A single run of the Boruta algorithm in the first test took 2 min on a server equipped with an Intel Xeon E5620 2.4 GHz CPU. The entire protocol took less than 24 h on a single CPU core. The second test is far less computationally demanding: a single run of the randomForest function takes less than 20 s on the same CPU; therefore, the computations for the entire protocol take less than 10 min. This effort is negligible in comparison with the time required to run the 540 simulations of the climate model itself.
The results of the study are mostly in good agreement with those of Lucas et al. (2013); however, the importance of the variables is not identical. The most important difference concerns variable V13. This variable is more important than V14 in the SVM-based model of Lucas et al. (2013), whereas our analysis deems it relevant but redundant. One should note, however, that in the first test V13 was deemed relevant in nearly 90 % of cases, only slightly less often than V14. Only the second test revealed that V13 contains mostly redundant information and on average does not improve the quality of Random Forest predictions. The difference is most likely due to the underlying classifier used in each approach. The SVM is essentially a linear classifier, which can be applied to non-linear problems via a non-linear, continuous kernel transformation. Random Forest, on the other hand, is based on non-linear and discrete decision trees. Figure 2 in Lucas et al. (2013) suggests that the decision space of the system under scrutiny is discontinuous. Random Forest can treat such systems more efficiently using fewer variables, whereas SVM needs higher-dimensional spaces to build a hyperplane separating the two classes. We have observed such effects in other systems, for example in our earlier study of the recognition of musical instruments (Kursa et al., 2009). The other differences are less important, since they involve variables with marginal relevance.
Our reanalysis of the results of the 540 simulations is in general qualitative agreement with the results of Lucas et al. (2013). The results of the simulations can be predicted with fairly good accuracy using the machine learning approach, and the two different methods give very similar results. The cross-validated AUC reported by Lucas et al. (2013) for an ensemble of SVM classifiers was 0.93; in the current study the average cross-validated AUC obtained for the three strongly important variables was 0.924.
We have shown by cross-validation that the AUC reported for the prediction experiment performed by Lucas et al. (2013) falls within the range of values that can be expected in such a prediction; however, one should not assign any weight to the particular value obtained. If the split between the training and test sets had been made differently, the resulting AUC for prediction could have been any value between 0.88 and 0.99.
The three most important conclusions for the climate modelling community are the following. Firstly, the efforts on improving the numerical stability of simulations should be concentrated on three parameters of the CCSM4 parallel ocean model, namely
Wiesław Paja performed most computations and drafted the first version of the manuscript; Mariusz Wrzesien and Rafał Niemiec performed computations and contributed to the writing. Witold R. Rudnicki designed the experiments and wrote the manuscript. Edited by: S. Easterbrook