What should we do when a model crashes? Recommendations for global sensitivity analysis of Earth and environmental systems models

Sheikholeslami, Razi; Razavi, Saman; Haghnegahdar, Amin

doi:https://doi.org/10.5194/gmd-12-4275-2019

Articles | Volume 12, issue 10


Methods for assessment of models

10 Oct 2019


Abstract

Complex, software-intensive, technically advanced, and computationally demanding models, presumably with ever-growing realism and fidelity, have been widely used to simulate and predict the dynamics of the Earth and environmental systems. The parameter-induced simulation crash (failure) problem is typical across most of these models despite considerable efforts that modellers have directed at model development and implementation over the last few decades. A simulation failure mainly occurs due to the violation of numerical stability conditions, non-robust numerical implementations, or errors in programming. However, the existing sampling-based analysis techniques such as global sensitivity analysis (GSA) methods, which require running these models under many configurations of parameter values, are ill equipped to effectively deal with model failures. To tackle this problem, we propose a new approach that allows users to cope with failed designs (samples) when performing GSA without rerunning the entire experiment. This approach deems model crashes as missing data and uses strategies such as median substitution, single nearest-neighbor, or response surface modeling to fill in for model crashes. We test the proposed approach on a 10-parameter HBV-SASK (Hydrologiska Byråns Vattenbalansavdelning modified by the second author for educational purposes) rainfall–runoff model and a 111-parameter Modélisation Environmentale–Surface et Hydrologie (MESH) land surface–hydrology model. Our results show that response surface modeling is a superior strategy, out of the data-filling strategies tested, and can comply with the dimensionality of the model, sample size, and the ratio of the number of failures to the sample size. Further, we conduct a “failure analysis” and discuss some possible causes of the MESH model failure that can be used for future model improvement.


Dates

Received: 19 Jan 2019 – Discussion started: 04 Feb 2019 – Revised: 28 Aug 2019 – Accepted: 08 Sep 2019 – Published: 10 Oct 2019

1 Introduction

1.1 Background and motivation

Since the start of the digital revolution and subsequent increases in computer processing power, the advancement of information technology has led to the significant development of modern software programs for dynamical Earth system models (DESMs). The current-generation DESMs typically span upwards of several thousand lines of code and require huge amounts of data and computer memory. The flip side of the growing complexity of DESMs is that running these models will pose many types of software development and implementation issues such as simulation crashes and failures. The simulation crash problem happens mainly due to violation of the numerical stability conditions needed in DESMs. Certain combinations of model parameter values, an improper integration time step, inconsistent grid resolution, or lack of iterative convergence, as well as model thresholds and sharp discontinuities in model response surfaces, all associated with imperfect parameterizations, can cause numerical artifacts and stop DESMs from properly functioning.

When model crashes occur, the accomplishment of automated sampling-based model analyses such as sensitivity analysis, uncertainty analysis, and optimization becomes challenging. These analyses are often carried out by running DESMs for a large number of parameter configurations randomly sampled from a domain (parameter space) (see, e.g., Raj et al., 2018; Williamson et al., 2017; Metzger et al., 2016; Safta et al., 2015). In such situations, for example, the model's solver may break down because of implausible combinations of parameters (the “unlucky parameter set” as termed by Kavetski et al., 2006), failing to complete the simulation. It is also possible that a model will be stable against the perturbation of a single parameter, while it may crash when several parameters are perturbed simultaneously. “Failure analysis” is a process that is performed to determine the causes that have led to such crashes while running DESMs. Before achieving a conclusion on the most important causes of crashes, it is necessary to check the software code of the DESMs and confirm if it is error-free (e.g., if a proper numerical scheme has been adopted and correctly coded in the software). This often requires investigating both the software documentation and a series of nested modules. However, the existence of numerous nested programming modules in typical DESMs can make the identification and removal of all software defects tedious. In addition, as argued by Clark and Kavetski (2010), the numerical solution schemes implemented in DESMs are sometimes not presented in detail. This is one important reason why detecting the causes of simulation crashes in DESMs is usually troublesome. For example, Singh and Frevert (2002) and Burnash (1995) described the governing equations of their models without explaining the numerical solvers that were implemented in their codes.

Importantly, the impact of simulation crashes on the validity of global sensitivity analysis (GSA) results has often been overlooked in the literature, wherein simulation crashes have been commonly classified as ignorable (see Sect. 1.2). As such, a surprisingly limited number of studies have reported simulation crashes (examples related to uncertainty analysis include Annan et al., 2005; Edwards and Marsh, 2005; Lucas et al., 2013). This is despite the fact that these crashes can be very computationally costly for GSA algorithms because they can waste the rest of the model runs, prevent the completion of GSA, or inevitably introduce ambiguity into the inferences drawn from GSA. For example, Kavetski and Clark (2010) demonstrated how numerical artifacts could contaminate the assessment of parameter sensitivities. Therefore, it is important to devise solutions that minimize the effect of crashes on GSA. In the next subsection, we critically review the very few strategies for handling simulation crashes that have been proposed in the literature and identify their shortcomings.

1.2 Existing approaches to handling simulation crashes in DESMs

We have identified, as outlined below, four types of approaches in the modeling community to handle simulation crashes. The first two are perhaps the most common approaches (based on our personal communications with several modellers); however, we could not identify any publication that formally reports their application.

After the occurrence of a crash, modellers commonly adopt a conservative strategy to address this problem by altering or reducing the feasible ranges of parameters and restarting the experiment in the hope of preventing a recurrence of the crashes in the new analyses.
Instead of GSA that runs many configurations of parameter values, analysts may choose to employ local methods such as local sensitivity analysis (LSA) by running the model only near the known plausible parameter configurations.
Some modellers may adopt an ignorance-based approach by using only a set of “good” (or behavioral) outcomes and responses in sampling-based analyses and ignoring unreasonable (or non-behavioral) outcomes such as simulation crashes. This can be done in conjunction with defining a performance metric to choose which simulations to exclude from the analysis (see, e.g., Pappenberger et al., 2008; Kelleher et al., 2013).
The most rigorous approach seems to be a non-substitution approach that tries to predict whether or not a set of parameter values will lead to a simulation crash. Webster et al. (2004), Edwards et al. (2011), Lucas et al. (2013), Paja et al. (2016), and Treglown (2018) are among the few studies that aimed at developing statistical methods to predict if a given combination of parameters can cause a failure. For example, Lucas et al. (2013) adopted a machine-learning method to estimate the probability of crash occurrence as a function of model parameters. They further applied this approach to investigate the impact of various model parameters on simulation failures. A similar approach is based on model preemption strategies, in which the simulation performance is monitored while the model is running and the model run is terminated early if it is predicted that the simulation will not be informative (Razavi et al., 2010; Asadzadeh et al., 2014).

The above approaches have some major limitations in handling simulation crashes in the GSA context because of the following.

Locating the regions of the parameter space responsible for crashes (i.e., “implausible regions”) is difficult and requires analyzing the behavior of the DESMs throughout the often high-dimensional parameter space. Implausible regions usually have irregular, discontinuous, and complex shapes and are thus too effortful to identify. Additionally, altering or reducing the parameter space by excluding the implausible regions changes the original problem at hand.
It is well known that local methods (e.g., LSA) can provide inadequate assessments that can often be misleading (see, e.g., Saltelli and Annoni, 2010; Razavi and Gupta, 2015).
Ignoring the crashed runs in GSA may only be seen as relevant when using purely random (and independent) samples (i.e., Monte Carlo method). In such cases, if the model crashes at a given parameter set, one may simply exclude that parameter set or generate another random parameter set (at the expense of increased computational cost) that results in a successful simulation.
Some efficient sampling techniques follow specific spatial arrangements; examples include the variance-based GSA proposed by Saltelli et al. (2010) or STAR-VARS in Razavi and Gupta (2016b). In GSA enabled with such structured sampling techniques, we cannot ignore crashed simulations because excluding sample points associated with simulation crashes will distort the structure of the sample set, causing inaccurate estimation of sensitivity indices. As a result, the user may have to redo part of or the entire experiment depending on the GSA implementation.
The implementation of the non-substitution procedures necessitates significant prior efforts to identify a number of model crashes based on which a statistical model can be built to predict and avoid simulation failures in the subsequent model runs. Such procedures can easily become infeasible in high-dimensional models, as they would require an extremely large sample size to ensure adequate coverage of the parameter space for characterizing implausible regions and building a reliable statistical model. These strategies can be more challenging when a model is computationally intensive. For example, to determine which parameters or combinations of parameters in a 16-dimensional climate model were predictors of failure, Edwards et al. (2011) used 1000 evaluations (training samples) to construct a statistical model to identify parameter configurations with a high probability of failure in the next 1087 evaluations (2087 model runs in total). As pointed out by Edwards et al. (2011), although 2087 evaluations might impose high computational burdens, a much larger sample size spreading out over the parameter space is required to guarantee reasonable exploration of the 16-dimensional space.

These shortcomings and gaps motivated our investigation to develop effective and efficient crash-handling strategies suitable for GSA of the DESMs, as introduced in Sect. 2.

1.3 Scope and outline

The primary goal of this study is to identify and test practical “substitution” strategies to handle the parameter-induced crash problem in GSA of the DESMs. Here, we treat model crashes as missing data and investigate the effectiveness of three efficient strategies to replace them using available information rather than discarding them. Our approach allows the user to cope with failed simulations in GSA without knowing where they will take place and without rerunning the entire experiment. The overall procedure can be used in conjunction with any GSA technique. In this paper, we assess the performance of the proposed substitution approach on two hydrological models by coupling it with a variogram-based GSA technique (VARS; Razavi and Gupta, 2016a, b).

The rest of the paper is structured as follows. We begin in the next section by introducing our proposed solution methodology for dealing with simulation crashes. In Sect. 3, two real-world hydrological modeling case studies are presented. Next, in Sect. 4, we evaluate the performance of the proposed methods across these real-world problems. The discussion is presented in Sect. 5, before drawing conclusions and summarizing major findings in Sect. 6.

2 Methodology

2.1 Problem statement

We denote the output of each model run (realization) by $y(X)$, which corresponds to a $d$-dimensional input vector $X = \{x_{1}, x_{2}, \dots, x_{d}\}$, where $x_{i}$ $(i = 1, 2, \dots, d)$ is a factor that may be perturbed for the purpose of GSA (e.g., model parameters, initial conditions, or boundary conditions). Running a GSA algorithm usually requires generating $n$ realizations of a simulation model using an experimental design $X^{s} = {\{X_{1}, X_{2}, \dots, X_{n}\}}^{T}$, forming an $n \times d$ sample matrix. Then, the model responses will form an output space as $Y = {\{y (X_{1}), y (X_{2}), \dots, y (X_{n})\}}^{T}$. Here, we deem simulation crashes as missing data and consider the model mapping of $X^{s} \rightarrow Y$ as an incomplete data matrix. For a given $Y \in R^{n \times 1}$ with missing values, let the vector $Y_{a}$ consist of the $n_{a}$ locations in the input space for which, in the given $Y$, the model responses are available, and let the vector $Y_{m}$ consist of the remaining $n_{m}$ locations ($n_{m} = n - n_{a}$) for which, in the given $Y$, the model responses are missing due to simulation crashes. For convenience of expression and computation, we use the NAN$_{j}$ symbol to represent the $j$th missing value in vector $Y$. The main goal now is to develop and test data recovery methods that can be used to substitute model crashes $Y_{m}$ using available information (i.e., $Y_{a}$ and $X^{s}$).
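The bookkeeping in this problem statement can be illustrated with a minimal Python sketch. It assumes that crashed runs are recorded as NaN in the response vector; the function name `split_available_missing` and the toy numbers are hypothetical and serve only to show how the available and missing parts are separated.

```python
import numpy as np

def split_available_missing(X_s, Y):
    """Split an n-by-d sample matrix and its n responses into the
    available part (successful runs) and the missing part (crashes).
    Assumption: a crashed run is stored as NaN in Y."""
    X_s = np.asarray(X_s, dtype=float)
    Y = np.asarray(Y, dtype=float)
    missing = np.isnan(Y)                 # True where the model crashed
    X_a, Y_a = X_s[~missing], Y[~missing]  # sample points with responses
    X_m = X_s[missing]                     # sample points that crashed
    return X_a, Y_a, X_m, missing

# Toy experimental design with n = 4 runs and d = 2 factors (made-up values)
X_s = np.array([[0.1, 0.2], [0.4, 0.9], [0.7, 0.3], [0.8, 0.8]])
Y = np.array([0.55, np.nan, 0.61, np.nan])

X_a, Y_a, X_m, mask = split_available_missing(X_s, Y)
n_a, n_m = Y_a.size, int(mask.sum())       # n_m = n - n_a
```

Any of the substitution strategies then only needs `X_a`, `Y_a`, and `X_m` to produce fill-in values.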

2.2 Proposed strategy for handling model crashes in GSA

We propose and test three techniques adopted from the “incomplete data analysis” for missing data replacement – the process known as imputation (Little and Rubin, 1987). Our techniques do not account for the mechanisms leading to crashes because identifying such mechanisms can be very challenging (Liu and Gopalakrishnan, 2017). Therefore, only the non-missing responses and the associated sample points are included in our analysis to infill model crashes for GSA, as described in the next subsections.

2.2.1 Median substitution

In sampling-based optimization, one may assign a very poor objective function value (e.g., a very large objective function in the minimization case) to a crashed solution, similar to the big M method for handling optimization constraints (Camm et al., 1990). Our first strategy in the GSA context adopts such an approach. However, since replacing crashes with a big value can magnify the effect of the crashed runs in GSA, instead we suggest choosing a measure of central tendency such as mean or median to minimize the impact of the implausible parameter configurations on the GSA results. If the distribution of the model responses is not highly skewed, imputing the crashes with the mean of the non-missing values may work. However, if the distribution exhibits skewness, then the median may be a better replacement because the mean is sensitive to outliers. Therefore, we used the median substitution technique for the experiments reported in this paper. In general, this strategy treats each model response as a realization of a random function and ignores the covariance structure of the model responses. Also, a shortcoming of this technique is that while it preserves the measure used for the central tendency of Y, it can distort other statistical properties of Y, for example by reducing its variance.
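A minimal Python sketch of median substitution is given below, assuming that crashed runs are stored as NaN. The function name `median_substitution` and the toy responses are hypothetical; the last line only illustrates the variance reduction mentioned above.

```python
import numpy as np

def median_substitution(Y):
    """Replace every crashed run (NaN) with the median of the
    successful responses. Assumption: crashes are stored as NaN."""
    Y = np.asarray(Y, dtype=float)
    filled = Y.copy()
    filled[np.isnan(Y)] = np.nanmedian(Y)  # median over non-missing values only
    return filled

# Toy responses with two crashes and one outlier (made-up values)
Y = np.array([0.2, np.nan, 0.6, 0.9, np.nan, -3.0])
Y_filled = median_substitution(Y)

# The median is preserved, but the spread of the filled vector shrinks
variance_before = np.nanvar(Y)
variance_after = np.var(Y_filled)
```

Using `np.nanmean` instead of `np.nanmedian` would give the mean-substitution variant, which is more sensitive to outliers such as the last value in this toy vector.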

2.2.2 Nearest-neighbor substitution

The nearest-neighbor (NN) technique (also known as hot deck imputation, see, e.g., Beretta and Santaniello, 2016) uses observations in the neighborhood to fill in missing data. Let X_j∈X^s be an input vector for which a simulation model fails to return an outcome. Basically, in NN-based techniques, NAN_j is replaced by either a response value corresponding to a single nearest neighbor (single NN) or a weighted average of the response variables corresponding to k nearest neighbors (k-NN), where k>1. The underlying rationale behind NN-based techniques is that the sample points closer to X_j may provide better information for imputing NAN_j. In the k-NN techniques, weights are assigned based on the degree of similarity between X_j and the kth nearest neighbor X_k, where y(X_k)∈Y_a, characterized through kernel functions (Tutz and Ramazan, 2015).

In this study, we choose to use the single NN technique with a Euclidean distance measure. We do so because the single NN technique is very parsimonious and simple to understand and implement. To substitute the crashed simulations, the single NN algorithm reads through the whole dataset to find the nearest neighbor and then imputes the missing value with the model response of that nearest neighbor. It is noteworthy that some authors have asserted that covariances among Y variables are preserved in NN-based techniques when using small k values (Hudak et al., 2008; McRoberts et al., 2002; Tomppo et al., 2002). However, McRoberts (2009) showed that the variance and covariance of the Y variables tend to be preserved for k=1 but not for k>1. In general, compared to the single NN technique, the k-NN technique may provide a better fit to the data but at the expense of being more complex and requiring a careful (and subjective) selection of the kernel functions and variable k. As a more complex technique, we suggest directly using a model emulation technique as described in the section below.
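The single NN substitution can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes crashes are stored as NaN and that all factors have already been scaled to comparable ranges (e.g., [0, 1]) so that the Euclidean distance is meaningful. The function name `single_nn_substitution` is hypothetical.

```python
import numpy as np

def single_nn_substitution(X_s, Y):
    """Fill each crashed run with the response of its nearest
    successful sample point (Euclidean distance).
    Assumptions: crashes are NaN; factors are on comparable scales."""
    X_s = np.asarray(X_s, dtype=float)
    Y = np.asarray(Y, dtype=float)
    missing = np.isnan(Y)
    X_a, Y_a = X_s[~missing], Y[~missing]
    filled = Y.copy()
    for j in np.flatnonzero(missing):
        # Distances from the crashed point X_j to all successful points
        dist = np.linalg.norm(X_a - X_s[j], axis=1)
        filled[j] = Y_a[np.argmin(dist)]
    return filled

# Toy design: two successful runs and two crashes (made-up values)
X_s = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.1], [0.9, 1.0]])
Y = np.array([1.0, 2.0, np.nan, np.nan])
Y_filled = single_nn_substitution(X_s, Y)
```

The loop scans all successful points for every crash, which mirrors the "reads through the whole dataset" description; a spatial index could replace it for very large samples.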

2.2.3 Model emulation-based substitution

Model emulation is a strategy that develops statistical, cheap-to-run surrogates of response surfaces of complex, often computationally intensive models (Razavi et al., 2012a). Here we develop an emulator $\hat{y} (.)$ , which is a statistical approximation of the simulation model based on a response surface modeling concept. This strategy consists of finding an approximate and/or surrogate model with low computational cost that fits the non-missing response values Y_a to predict the fill-in values for the missing responses Y_m. There are various types of response surface surrogates, which have been extensively discussed in the literature (see, e.g., Razavi et al., 2012a). Examples are polynomial regression, radial basis functions (RBFs), neural networks, kriging, support vector machines, and regression splines. Here, we employ the RBF approximation as a well-established surrogate model. It has been shown that RBF can provide an accurate emulation for high-dimensional problems (Jin et al., 2001; Herrera et al., 2011), particularly when the computational budget is limited (Razavi et al., 2012b). An RBF model as a weighted summation of n_a basis functions (and a polynomial or constant value) can approximate the predictive response $\hat{y} (X)$ at a sample point X as follows:

\begin{matrix} (1) & \hat{y} (X) = \sum_{i = 1}^{n_{a}} ω_{i} f (∥X - X_{i}∥) = f (X) ω, \end{matrix}

where $f = \{f_{1}, f_{2}, \dots, f_{n_{a}}\}$ is the vector of the basis functions, ω_i is the ith component of the radial basis coefficient vector $ω = {\{ω_{1}, ω_{2}, \dots, ω_{n_{a}}\}}^{T}$ , and ∥X−X_i∥ is the Euclidean distance between two sample points.

There are various choices for the basis function, such as Gaussian, thin-plate spline, multi-quadric, and inverse multi-quadric (Jones, 2001). In the present study, we utilize the well-known Gaussian kernel function for RBF:

\begin{matrix} (2) & f (∥X - X_{i}∥) = \exp (- \frac{{∥X - X_{i}∥}^{2}}{c_{i}^{2}}), \end{matrix}

where c_i is the shape parameter that determines the spread of the ith kernel function f_i.

After choosing the form of the basis function, the coefficient vector ω can be obtained by enforcing the accurate interpolation condition, i.e.,

\begin{matrix} (3) & [\begin{array}{c} y (X_{1}) \\ y (X_{2}) \\ ⋮ \\ y (X_{n_{a}}) \end{array}] = [\begin{array}{cccc} f_{11} & f_{12} & \dots & f_{1 n_{a}} \\ f_{21} & f_{22} & \dots & f_{2 n_{a}} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ f_{n_{a} 1} & f_{n_{a} 2} & \dots & f_{n_{a} n_{a}} \end{array}] [\begin{array}{c} ω_{1} \\ ω_{2} \\ ⋮ \\ ω_{n_{a}} \end{array}], \end{matrix}

where $f_{u v} = f (∥X_{u} - X_{v}∥)$ . In a matrix form, Eq. (3) can be simply rewritten as Y_a=Fω. This equation has a unique solution $ω = F^{- 1} Y_{a}$ if and only if all the sample points are different from each other. Therefore, the fill-in values for remaining n_m locations, for which the model responses are missing due to simulation crashes, can be approximated by

\begin{matrix} (4) & \hat{y} (X_{j}) = f (X_{j}) F^{- 1} Y_{a} (j = 1, 2, \dots, n_{m}) . \end{matrix}

To reduce the computational cost and avoid overfitting when building RBF, for each failed simulation at X_j one can choose k non-missing nearest neighbors of that missing value (here we arbitrarily set k=100). Then, a function approximation can be built using these k sample points to approximate that missing value; i.e., in Eq. (3), we set n_a to 100. Moreover, the shape parameter c in the Gaussian kernel function, which is an important factor in the accuracy of the RBF, can be determined using an optimization approach. We use the Nelder–Mead simplex direct search optimization algorithm (Lagarias et al., 1998) to find an optimal value for c by minimizing the RBF fitting error (for more details, see Forrester and Keane, 2009, and Kitayama and Yamazaki, 2011).
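The local RBF substitution of Eqs. (1)–(4) can be sketched in Python as follows. This is a simplified illustration, not the implementation used in the experiments: it assumes crashes are stored as NaN, uses a single fixed shape parameter `c` for all kernels instead of tuning it with Nelder–Mead, and omits any polynomial or constant term. The names `gaussian_kernel` and `rbf_substitution` are hypothetical.

```python
import numpy as np

def gaussian_kernel(r, c):
    """Gaussian basis function: exp(-r^2 / c^2), with shape parameter c."""
    return np.exp(-(r / c) ** 2)

def rbf_substitution(X_s, Y, k=100, c=1.0):
    """Fill each crashed run (NaN) with a Gaussian RBF interpolant built
    from its k nearest successful sample points.
    Assumption: c is fixed here; the text tunes it by optimization."""
    X_s = np.asarray(X_s, dtype=float)
    Y = np.asarray(Y, dtype=float)
    missing = np.isnan(Y)
    X_a, Y_a = X_s[~missing], Y[~missing]
    k = min(k, Y_a.size)
    filled = Y.copy()
    for j in np.flatnonzero(missing):
        # k nearest successful neighbours of the crashed point X_j
        dist = np.linalg.norm(X_a - X_s[j], axis=1)
        idx = np.argsort(dist)[:k]
        X_k, Y_k = X_a[idx], Y_a[idx]
        # Interpolation matrix F with entries f(||X_u - X_v||)
        pairwise = np.linalg.norm(X_k[:, None, :] - X_k[None, :, :], axis=2)
        F = gaussian_kernel(pairwise, c)
        omega = np.linalg.solve(F, Y_k)            # enforce Y_k = F omega
        filled[j] = gaussian_kernel(dist[idx], c) @ omega
    return filled

# Toy 1-D usage: y(x) = x on an 11-point grid, with a crash at x = 0.5
X_s = np.linspace(0.0, 1.0, 11).reshape(-1, 1)
Y = X_s[:, 0].copy()
Y[5] = np.nan
Y_filled = rbf_substitution(X_s, Y, k=4, c=0.2)
```

Solving the linear system directly keeps the sketch close to Eq. (4); for a large shape parameter the matrix F can become ill-conditioned, which is one practical reason why the choice of c matters.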

Note that in general depending on the complexity and dimensionality of the model response surfaces, other types of emulations can be incorporated into our proposed framework. However, for the crash-handling problem, it is beneficial to utilize the function approximation techniques that exactly pass through all sample points (i.e., the response surface surrogates categorized as “exact emulators” in Razavi et al., 2012a) such as kriging and RBF. This is mainly because most DESMs are deterministic and therefore generate identical outputs and responses given the same set of input factors. In other words, an exact emulator at any successful sample point X_k (not crashed) reflects our knowledge about the true value of the model output at that point; i.e., it returns $\hat{y} (X_{k})$ without any error.

2.3 The utilized GSA frameworks

We illustrate the incorporation of the proposed crash-handling methodology into a variogram-based GSA approach called the variogram analysis of response surfaces (VARS; Razavi and Gupta, 2016a) and a variance-based GSA approach adopted from Saltelli et al. (2008). The VARS framework has successfully been applied to several real-world problems of varying dimensionality and complexity (Sheikholeslami et al., 2017; Yassin et al., 2017; Krogh et al., 2017; Leroux and Pomeroy, 2019). VARS is a general GSA framework that utilizes directional variograms and covariograms to quantify the full spectrum of sensitivity-related information, thereby providing a comprehensive set of sensitivity measures called IVARS (integrated variogram across a range of scales) at a range of different “perturbation scales” (Haghnegahdar and Razavi, 2017). Here, we use IVARS-50, referred to as “total-variogram effect”, as a comprehensive sensitivity measure since it contains sensitivity analysis information across a full range of perturbation scales.

We utilize the STAR-VARS implementation of the VARS framework (Razavi and Gupta, 2016b). STAR-VARS is a highly efficient and statistically robust algorithm that provides stable results with a minimal number of model runs compared with other GSA techniques, and thus it is suitable for high-dimensional problems (Razavi and Gupta, 2016b). This algorithm employs a star-based sampling scheme, which consists of two steps: (1) randomly selecting star centers in the parameter space and (2) using a structured sampling technique to identify sample points revolved around the star centers. Due to the structured nature of the generated samples in STAR-VARS, ignorance-based procedures (see Sect. 1.2) cannot be useful in dealing with simulation crashes because deleting sample points associated with crashed simulations will demolish the structure of the entire sample set. Moreover, to achieve a well-designed computer experiment and sequentially locate star centers in the parameter space, we use the progressive Latin hypercube sampling (PLHS) algorithm. It has been shown that PLHS can grasp the maximum amount of information from the output space with a minimum sample size, while outperforming traditional sampling algorithms (for more details, see Sheikholeslami and Razavi, 2017).

For the variance-based GSA, we calculate the total-effect index (Sobol-TO), which accounts for the impact of any individual parameter and its interaction with all other parameters, according to the widely used algorithm proposed by Saltelli et al. (2008). This algorithm follows a specific arrangement of randomly generated samples to calculate the sensitivity indices as follows: first, an n×2d matrix of independent random numbers is generated (hereafter called the “base sample”). Next, by splitting the base sample in half, two new sample matrices, X^A and X^B, are built (each of size n×d). Then, to calculate the ith sensitivity index Sobol-TO_i, an additional sample matrix of size n×d, $X^{C i} (i = 1, 2, \dots, d)$ , is constructed by recombining the columns of X^A and X^B such that X^Ci contains the columns of X^B except the ith column, which is taken from X^A. To build the base sample, we use the Sobol quasi-random sequence. Furthermore, to achieve maximum space-filling properties and to maximize uniformity in the parameter space, for the given sample size, the skip, leap, and scramble operations are applied (for more details, see Estrada, 2017).
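The column recombination described above can be sketched in Python. The sketch is illustrative only: it uses pseudo-random numbers from NumPy as a stand-in for the Sobol quasi-random sequence (and skips the skip, leap, and scramble operations), and the function name `build_sample_matrices` is hypothetical.

```python
import numpy as np

def build_sample_matrices(n, d, seed=0):
    """Build X^A, X^B and the d recombined matrices X^Ci.
    Assumption: uniform pseudo-random numbers replace the Sobol sequence."""
    rng = np.random.default_rng(seed)
    base = rng.random((n, 2 * d))          # n-by-2d base sample
    X_A, X_B = base[:, :d], base[:, d:]    # split the base sample in half
    X_C = []
    for i in range(d):
        X_Ci = X_B.copy()                  # all columns from X^B ...
        X_Ci[:, i] = X_A[:, i]             # ... except column i, from X^A
        X_C.append(X_Ci)
    return X_A, X_B, X_C

n, d = 8, 3
X_A, X_B, X_C = build_sample_matrices(n, d)

# Every row of X^A, X^B and of each X^Ci is one model run
total_runs = n * (d + 2)
```

Because each `X_Ci` shares all but one column with `X_B`, dropping a crashed row from only one matrix would break the pairing between rows, which is why crashed runs cannot simply be discarded in this design.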

3 Case studies

3.1 A conceptual rainfall–runoff model

As an illustrative example, we applied the HBV-SASK conceptual hydrologic model to assess the performance of the proposed crash-handling strategies. HBV-SASK is based on the Hydrologiska Byråns Vattenbalansavdelning model (Lindström et al., 1997) and was developed by the second author for educational purposes (see Razavi et al., 2019; Gupta and Razavi, 2018). Here, we used HBV-SASK to simulate daily streamflows in the Oldman River basin in western Canada (Fig. 1) with a watershed area of 1434.73 km². Historical data are available for the period 1979–2008, from which we estimate average annual precipitation to be 611 mm and average annual streamflow to be 11.7 m³ s⁻¹, with a runoff ratio of approximately 0.42. HBV-SASK has 12 parameters, 10 of which are perturbed in this study (Table 1).

Figure 1. The Oldman River basin (b), located in the Rocky Mountains in Alberta, Canada, flows into the Saskatchewan River basin (a).

Figure 2. The Nottawasaga River basin in southern Ontario, Canada (adapted from Sheikholeslami et al., 2019, with permission from Elsevier; license number: 4664891206213).

Table 1. HBV-SASK model parameters and their feasible ranges used in this study. For information on the full parameter set, refer to Razavi et al. (2019).

3.2 A land surface–hydrology model

In the second case study, we demonstrate the utility of imputation-based methods in crash handling via their application to the GSA of a high-dimensional and much more complex problem. We used the Modélisation Environmentale–Surface et Hydrologie (MESH; Pietroniro et al., 2007), which is a semi-distributed, highly parameterized land surface–hydrology modeling framework developed by Environment and Climate Change Canada (ECCC), mainly for large-scale watershed modeling with the consideration of cold region processes in Canada. MESH combines the vertical energy and water balance of the Canadian Land Surface Scheme (CLASS; Verseghy, 1991; Verseghy et al., 1993) with the horizontal routing scheme of the WATFLOOD (Kouwen et al., 1993). We encountered a series of simulation failures while assessing the impact of uncertainties in 111 model parameters (see Table A1 in Appendix A) on simulated daily streamflows in the Nottawasaga River basin, Ontario, Canada (Fig. 2). For this case study, the drainage basin of nearly 2700 km² was discretized into 20 grid cells with a spatial resolution of 0.1667° (~15 km). The dominant land cover in the area is cropland followed by deciduous forest and grassland. The dominant soil type in the area is sand followed by silt and clay loam (for more details, see Haghnegahdar et al., 2015).

3.3 Experimental setup

In the first case study, for STAR-VARS, we chose to sample 100 star centers (with a resolution of 0.1) from the feasible ranges of parameters (Table 1) using the PLHS algorithm, resulting in 9100 evaluations of the HBV-SASK model. For the variance-based method, the base sample size was chosen to be 5000, and thus the model was run 60 000 times. The larger base sample size was selected for the variance-based method to ensure the stability of the algorithm. The Nash–Sutcliffe (NS) efficiency criterion on streamflows was used as the model output for sensitivity analysis.

After calculating the NS values, we performed a series of experiments, each with a different assumed “ratio of failure” (from 1 % to 20 %), defined as the percentage of failed parameter sets to the total number of parameter sets. In each experiment, we randomly selected a number of sampled points based on the associated ratio of failure and considered them to be simulation failures. Then, we evaluated the performance of the crash-handling strategies in replacing simulation failures during GSA of the HBV-SASK model and compared the results with the case when there are no failures. In addition, we accounted for the randomness in the comparisons by carrying out 50 replicates of each experiment with different random seeds. This allowed us to see a range of possible performances for each strategy and to assess their robustness when crashes occurred at different locations in the parameter space.
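The synthetic-failure procedure above can be sketched in Python. This is a simplified illustration under stated assumptions: crashes are marked as NaN, and the quality of a strategy is summarized by the root-mean-square error of the filled responses, which is a stand-in for the comparison of GSA results actually used in the study. The function names `inject_failures` and `replicate_experiment` and the toy response vector are hypothetical.

```python
import numpy as np

def inject_failures(Y, ratio, rng):
    """Mark a random fraction `ratio` of the runs as crashed (NaN)."""
    Y = np.asarray(Y, dtype=float)
    n_fail = int(round(ratio * Y.size))
    crashed = rng.choice(Y.size, size=n_fail, replace=False)
    Y_crashed = Y.copy()
    Y_crashed[crashed] = np.nan
    return Y_crashed

def replicate_experiment(Y, ratio, fill, n_replicates=50, seed=0):
    """Repeat the failure injection with different seeds and record the
    RMSE between filled and true responses (a simplifying stand-in for
    comparing sensitivity indices)."""
    errors = []
    for r in range(n_replicates):
        rng = np.random.default_rng(seed + r)
        Y_filled = fill(inject_failures(Y, ratio, rng))
        errors.append(np.sqrt(np.mean((Y_filled - Y) ** 2)))
    return np.array(errors)

def median_fill(Y):
    """Median substitution, used here as the example strategy."""
    return np.where(np.isnan(Y), np.nanmedian(Y), Y)

# Toy "true" responses from a complete experiment (made-up values)
Y_true = np.linspace(0.0, 1.0, 200)
Y_crashed = inject_failures(Y_true, 0.10, np.random.default_rng(1))
errors = replicate_experiment(Y_true, 0.10, median_fill)
```

Passing a different `fill` function, with the sample matrix bound in where needed, would let the same loop compare several strategies across the range of failure ratios.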

In the second case study having 111 parameters, we only tested STAR-VARS with 100 star centers randomly generated using the PLHS algorithm (with a resolution of 0.1), resulting in 100 000 MESH runs. The NS performance metric was used to measure daily model streamflow performance, calculated for a period of 3 years (October 2003–September 2007) following a 1-year model warm-up period.
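As a consistency check on the run counts reported for the two case studies, the following worked arithmetic assumes the commonly cited cost expressions for the two designs, which are not spelled out in this section: a STAR-VARS experiment with $m$ star centers, $d$ parameters, and resolution $\Delta h$ needs $m [d (1 / \Delta h - 1) + 1]$ model runs, and the variance-based design with base sample size $n$ needs $n (d + 2)$ runs.

$$
\begin{aligned}
\text{HBV-SASK, STAR-VARS:}\quad & 100\,[10\,(1/0.1 - 1) + 1] = 100 \times 91 = 9100, \\
\text{HBV-SASK, variance-based:}\quad & 5000\,(10 + 2) = 60\,000, \\
\text{MESH, STAR-VARS:}\quad & 100\,[111\,(1/0.1 - 1) + 1] = 100 \times 1000 = 100\,000.
\end{aligned}
$$

Under these assumptions, the three totals agree with the numbers of model evaluations stated in the text.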

Due to various physical and/or numerical constraints inside MESH (or more precisely in CLASS), some combinations of the 111 parameters caused model crashes. Here, approximately 3 % of our simulations failed (3084 out of 100 000 runs). We applied the proposed crash-handling strategies to infill the missing model outcomes in the GSA of the MESH model. The entire set of 100 000 function evaluations of the MESH model would take more than 6 months if we used a single standard CPU core. However, we used the University of Saskatchewan's high-performance computing system to run the GSA experiment in parallel on 160 cores. Therefore, completing all model runs required approximately 32 h. For this case study, using an Intel® Core™ i7 CPU 4790 3.6 GHz desktop PC, the RBF technique took only 65 s to substitute 3084 crashed runs, while the single NN technique required about 97 s to complete the task.

4 Numerical results

4.1 Results for the HBV-SASK model

According to both the IVARS-50 and Sobol-TO sensitivity indices, the parameters of the HBV-SASK (when there were no model crashes) were ranked as follows from the most important to the least important one: {FRAC, FC, C0, TT, alpha, K1, LP, ETF, beta, K2}. We assume these rankings and respective sensitivity indices to be the “true” values. Based on the dendrogram (Fig. 3) generated by the factor-grouping algorithm introduced by Sheikholeslami et al. (2019), we categorized these parameters into three groups with respect to their importance; i.e., {FRAC, FC, and C0} are the strongly influential parameters, {TT, alpha, and K1} are moderately influential parameters, and {LP, ETF, beta, and K2} are weakly influential parameters.

Figure 3. Grouping of the 10 parameters of the HBV-SASK model when applied on the Oldman River basin. The parameters are sorted from the most influential (to the left) to the least influential (to the right).
