This paper develops a multivariable integrated evaluation (MVIE) method to measure the overall performance of climate models in simulating multiple fields. The general idea of MVIE is to group various scalar fields into a vector field and compare the constructed vector field against the observed one using the vector field evaluation (VFE) diagram. The VFE diagram was devised based on the cosine relationship among three statistical quantities: the root mean square length (RMSL) of a vector field, the vector field similarity coefficient, and the root mean square vector deviation (RMSVD). The three statistical quantities can reasonably represent the corresponding statistics between two multidimensional vector fields. Therefore, one can summarize the three statistics of multiple scalar fields using the VFE diagram, facilitating the intercomparison of model performance. The VFE diagram can illustrate how much of the overall root mean square deviation of various fields is attributable to differences in the root mean square values and how much is due to poor pattern similarity. The MVIE method can be flexibly applied to full fields (including both the mean and anomaly) or anomaly fields, depending on the application. We also propose a multivariable integrated evaluation index (MIEI), which takes the amplitude and pattern similarity of multiple scalar fields into account. The MIEI is expected to provide a more accurate evaluation of model performance in simulating multiple fields. The MIEI, the VFE diagram, and commonly used statistical metrics for individual variables constitute a hierarchical evaluation methodology, which can provide a more comprehensive evaluation of model performance.

Climate models play a crucial role in a variety of climate-related studies, including climate dynamics, the detection and attribution of climate change, the projection of future climates and environments, and adaptation to future climate change (IPCC, 2012, 2013). All of these studies rely strongly on the performance of climate models. Model evaluation and intercomparison have become increasingly important, especially given the large number of climate models now available. A total of 29 modelling groups and 60 climate models are involved in the Coupled Model Intercomparison Project Phase 5 (CMIP5), and more are expected to be included in its next phase (Eyring et al., 2016). In addition, more and more regional climate models have been used in regional model downscaling and intercomparison projects (e.g., Fu et al., 2005; van der Linden and Mitchell, 2009; Mearns et al., 2009; Giorgi and Gutowski, 2015). Thus, concisely summarizing and evaluating model performance is extremely important for climate model intercomparison, development, and application.

The Taylor diagram provides a very efficient way to summarize multiple aspects of model performance in simulating scalar fields (Taylor, 2001). Gleckler et al. (2008) introduced a suite of metrics, e.g., the decomposed mean square error and relative error metrics, to characterize model performance for various applications. Xu et al. (2016) devised a vector field evaluation (VFE) diagram, which can be regarded as a generalized Taylor diagram, to evaluate model performance in simulating vector fields, such as vector winds and temperature gradients. Most metrics, e.g., the root mean square error, correlation coefficient, and standard deviation, measure model performance in simulating an individual variable (Gleckler et al., 2008). It is a common view that no model performs better than all others in every aspect. For example, among various models, one model may show the best performance in simulating air temperature but perform poorly in simulating precipitation. In this case, how can researchers select the best model if both temperature and precipitation are of great concern in a study? A popular approach is to show the relative errors of various variables from different models using a portrait diagram (e.g., Gleckler et al., 2008; Pincus et al., 2008). The portrait diagram illustrates model errors for each individual variable and can provide an overview of model performance in simulating various variables. However, it cannot give a quantitative evaluation of the overall performance of climate models in simulating multiple fields. To measure overall model performance, Gleckler et al. (2008) proposed an exploratory index, termed the model climate performance index (MCPI), obtained by averaging each model's relative errors across multiple fields. Note that the MCPI only considers the root mean square errors (RMSEs) of the various fields.
The RMSE can be interpreted as a function of the correlation coefficient and standard deviation (Murphy, 1988; Taylor, 2001; Pincus et al., 2008; Pierce et al., 2009). Therefore, the RMSE takes both the correlation coefficient and standard deviation into account. However, the RMSE cannot explicitly measure the correlation coefficient and standard deviation. For example, the same RMSE can correspond to very different correlation coefficients and standard deviations, especially for large RMSE values.
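This ambiguity is easy to demonstrate numerically. The sketch below (a hypothetical illustration with invented numbers, using the centered form of the RMSE decomposition) shows two models with very different correlation coefficients and standard deviations that nonetheless produce exactly the same RMSE:

```python
import numpy as np

# Centered RMSE written in terms of the standard deviations and the
# correlation coefficient (cf. Murphy, 1988; Taylor, 2001):
# E'^2 = sigma_f^2 + sigma_o^2 - 2 * sigma_f * sigma_o * R
def centered_rmse(sigma_f, sigma_o, corr):
    return np.sqrt(sigma_f**2 + sigma_o**2 - 2.0 * sigma_f * sigma_o * corr)

sigma_o = 1.0  # observed standard deviation (normalized)

# Two hypothetical models: one with amplified variance and high pattern
# correlation, one with damped variance and much lower correlation ...
e1 = centered_rmse(1.5, sigma_o, 0.9)
e2 = centered_rmse(0.8, sigma_o, 0.68125)

# ... yet both yield the same RMSE (~0.742), so the RMSE alone
# cannot distinguish between them.
```

Both parameter pairs satisfy sigma_f^2 + 1 - 2*sigma_f*R = 0.55, which is why a single RMSE value can correspond to quite different error structures.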

In this paper, we propose a more comprehensive multivariable integrated evaluation (MVIE) method for climate model evaluation, which summarizes multiple statistics of model performance across multiple variables. The general idea is to group various scalar fields into a vector field and compare the constructed vector field against the observed one using the VFE diagram.

Xu et al. (2016) constructed the VFE diagram in terms of two-dimensional vector fields. There are three statistical quantities in the VFE diagram, i.e., root mean square length (RMSL) of a vector field, vector similarity coefficient (VSC), and root mean square vector deviation (RMSVD) between two vector fields. In this section, each quantity will be defined and interpreted from the viewpoint of MVIE. Thereafter, we will construct the VFE diagram for multidimensional vector fields.

Consider two vector fields

In the same way as for the vector similarity coefficient (VSC) for two-dimensional
vector fields (Xu et al., 2016), the VSC for

With the aid of Eqs. (1) and (2), Eq. (7) can be written as

To measure the difference in vector fields

With the aid of Eq. (7), the square of the RMSVD can be written as

To evaluate model performance in terms of the simulation of multiple
variables, one can group various scalar fields into a vector field and
compare the constructed vector field against the observed one using the VFE
diagram. For example, we can construct a vector field with temperature and
precipitation as its

VFE diagram for displaying multiple statistics of two vector fields.
The vector similarity coefficient between two vector fields is given by the
azimuthal position of the test field. The radial distance from the origin is
proportional to the RMSL of the vector field.
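The three statistics plotted in the diagram, and the law-of-cosines relationship that ties them together, can be sketched as follows (Python with synthetic fields; the data construction is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                                        # number of grid points
obs = rng.standard_normal((2, n))              # two stacked scalar fields (e.g., T and P)
mod = obs + 0.4 * rng.standard_normal((2, n))  # a synthetic "model" field

def rmsl(v):
    # Root mean square length of a vector field
    return np.sqrt(np.mean(np.sum(v**2, axis=0)))

def vsc(a, b):
    # Vector field similarity coefficient (uncentered)
    return np.sum(a * b) / np.sqrt(np.sum(a**2) * np.sum(b**2))

def rmsvd(a, b):
    # Root mean square vector deviation
    return np.sqrt(np.mean(np.sum((a - b)**2, axis=0)))

Lm, Lo, R = rmsl(mod), rmsl(obs), vsc(mod, obs)
# Cosine relationship underlying the VFE diagram:
# RMSVD^2 = RMSL_m^2 + RMSL_o^2 - 2 * RMSL_m * RMSL_o * VSC
lhs = rmsvd(mod, obs)**2
rhs = Lm**2 + Lo**2 - 2.0 * Lm * Lo * R
```

The identity holds exactly by construction, which is what allows the three quantities to share one two-dimensional diagram.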

Multiple statistics of CMIP5 models in simulating surface air temperature and precipitation in terms of the climatological mean state and interannual variability. Tm (Pm) is the climatological mean surface air temperature (precipitation) in summer (June–July–August). Ta (Pa) is the temporal standard deviation of summer surface air temperature (precipitation). CMIP5 simulations and three individual groups of observational datasets are compared with the ensemble mean of three groups of SAT and precipitation data observed during the period from 1961 to 2000. The rms is the ratio of the modeled to the observed root mean square value of the spatial pattern for each variable. CORR (RMSD) is the uncentered spatial correlation coefficient (root mean square deviation) between the model and observational fields. RMSL, Rv, and RMSVD are the statistics of the two vector fields, which represent the overall statistics of all fields (Eqs. 3, 13, 16). RMSL is shown as the ratio of the model-simulated RMSL to the observed RMSL. The rms_std is the standard deviation of the four rms values, which describes the dispersion of the rms values of Tm, Pm, Ta, and Pa (Eq. 23). MIEI is the multivariable integrated evaluation index (Eq. 24). Model performance is indicated by the color scale; lighter colors denote better model performance.

Without loss of generality, we choose the climatological mean SAT and
precipitation as well as the temporal standard deviation of the SAT and
precipitation as the variables to interpret the MVIE method. Four variables
derived from climate models are examined against the corresponding
observational estimates. The evaluation is based on the monthly mean datasets
derived from the first ensemble run of CMIP5 historical experiments during
the period from 1961 to 2000 (Taylor et al., 2012). Three pairs of observed SAT and precipitation datasets are
used in this study. The first pair of datasets is the Climatic Research Unit
(CRU) gridded SAT and precipitation (Harris, et al., 2014). The second pair
of datasets is the University of Delaware air temperature and precipitation
(Willmott and Matsuura, 2001). The third pair of datasets is composed of the
Global Historical Climatology Network (GHCN) temperature (Fan and van den
Dool, 2008) and Global Precipitation Climatology Centre (GPCC) precipitation
(Schneider et al., 2014). All observational data are available at
0.5° horizontal resolution.

Table 1 shows the various statistics of nine CMIP5 models in terms of the
climatological mean summer (June–July–August) SAT, precipitation, and the
temporal standard deviation of SAT and precipitation over the global land
area (60

VFE diagram describing the normalized climatological mean SAT,
precipitation, and interannual variabilities of SAT and precipitation over a
land area between 60

As shown in Fig. 2, the VSC varies from 0.90 to 0.94, indicating which models
can better reproduce the overall spatial pattern of various variables and
which cannot. For example, model 1 shows the maximum VSC, indicating that
model 1 can generally better reproduce the spatial pattern of the four
variables relative to other models. This can be confirmed by Table 1. The
uncentered pattern correlation coefficients for the four scalar fields are
generally higher in model 1 than in the other models. Figure 2 also clearly
shows which model overestimates or underestimates the overall rms values. For
example, models 5 and 7 overestimate the RMSLs of the four-dimensional vector
fields, suggesting that both models generally overestimate the rms values of
the four scalar fields. This can also be confirmed by Table 1, as model 5
clearly overestimates the rms values of Ta (1.43) and Pa (1.19) and slightly
underestimates the rms values of Tm (0.99) and Pm (0.94). Model 7
overestimates all rms values (1.06, 1.09, 1.14, and 1.07) of the four
variables. Thus, the RMSL of a constructed vector field can reasonably
represent the overall performance of a model in reproducing rms values of
multiple scalar fields. In contrast, model 9 clearly underestimates the RMSL
of the vector field (Fig. 2). Correspondingly, three out of the four rms
values of scalar fields are smaller than 1 for model 9 (Table 1). Similarly,
the RMSVD between two vector fields can also reasonably represent the overall
RMSDs of multiple scalar fields as shown in Fig. 2 and Table 1. Thus, one can
evaluate the model performance in simulating multiple variables with three
statistical quantities. The three statistical quantities represent different
aspects of model performance, the knowledge of which can provide a more
comprehensive model evaluation. The VFE diagram can clearly illustrate to
what extent the overall RMSDs of various scalar fields (represented by the
RMSVD) are attributable to the systematic difference in rms values
(represented by the RMSL) and how much is due to the poor pattern
similarities (represented by the VSC).

Note that model performance does not change monotonically with the increase
or decrease in rms values. Specifically, model performance improves as the
normalized rms values approach 1 but degrades as they approach either zero
or infinity. As defined in Eq. (3), the squared RMSL equals the sum of the
squared rms values of all components of a vector field. Thus, even if the
modeled RMSL equals the observed one, it does not necessarily follow that
the model reproduces the rms values of the various scalar fields well,
because overestimated and underestimated rms values can cancel each other
out. For
example, as shown in Table 1, model 3 overestimates the rms values of Tm
(1.05) and Ta (1.26) but underestimates the rms values of Pm (0.80) and Pa
(0.77). However, the RMSL (0.99) is almost consistent with the observational
estimate. Under such a circumstance, the RMSL misrepresents the model
performance in simulating rms values of various scalar fields. To mitigate
this shortcoming, one can add a line segment centered at each plotted point
along the azimuthal direction (Fig. 2). The length of the line segment is
equal to twice the standard deviation of rms values of multiple scalar
fields. Thus, the length of the line segment can measure the dispersion of
various rms values relative to their mean. A shorter line indicates that the
rms values are close to the mean. In contrast, a longer line segment
indicates that the rms values are spread out over a wider range. To measure
how closely the modeled rms values match the observed ones, one can use the
root mean square deviation of the rms values of the various variables:
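Numerically, the effect is easy to reproduce (a sketch in Python; the rms ratios loosely mimic model 3 in Table 1, and the deviation is taken relative to the observed, normalized value of 1):

```python
import numpy as np

# Normalized rms ratios (model/observed) of four fields, loosely
# mimicking model 3 in Table 1 (Tm, Ta, Pm, Pa)
rms = np.array([1.05, 1.26, 0.80, 0.77])

# Ratio of the modeled to the observed RMSL: the compensating over- and
# underestimates nearly cancel, so the ratio is deceptively close to 1
rmsl_ratio = np.sqrt(np.mean(rms**2))

# Standard deviation of the rms values (cf. Eq. 23): the dispersion
# about their mean exposes the poorly simulated relative amplitudes
rms_std = np.std(rms)

# Root mean square deviation of the rms values from the ideal value 1,
# which compensating errors cannot hide
rms_rmsd = np.sqrt(np.mean((rms - 1.0)**2))
```

With these numbers the RMSL ratio is about 0.99 while the dispersion of the rms values is about 0.20, matching the behavior described for model 3.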

In general, the model results get closer to the observational estimate as
the RMSVD decreases. It is noteworthy that, for a given and relatively low
VSC, the RMSVD does not decrease strictly monotonically as the simulated
RMSL approaches the observed one (Fig. 3). For example, model B shows the
same VSC as model A but a smaller bias in the RMSL, which suggests that
model B performs better than model A. However, the RMSVD is
greater in model B than in model A (Fig. 3). Thus, the decrease in the RMSVD
may not necessarily indicate an improvement in model performance. On the
other hand, given the drawback of the RMSL in measuring the accuracy of rms
values, the model skill score, defined based on the RMSL and VSC in Xu et
al. (2016), is also not well suited for measuring the model performance in
simulating multiple scalar fields. To better measure model performance, we
define a multivariable integrated evaluation index (MIEI) based on the VFE
diagram (Fig. 3):

Schematic diagram displaying the relationship between the RMSVD,
RMSD

As interpreted in Sect. 2, the RMSVD is determined by the sum of the quadratic RMSDs of the various scalar fields (Eq. 16). Thus, the RMSVD is equivalent to the model climate performance index used in previous studies (e.g., Gleckler et al., 2008; Radić and Clarke, 2011; Chen and Sun, 2015). In general, both the RMSVD and the MIEI can be used to measure model performance. However, the MIEI is expected to provide a more accurate evaluation of model performance than the RMSVD. For example, model 3 shows a smaller RMSVD but a larger MIEI than model 2 (Table 1, Fig. 2); that is, the RMSVD and the MIEI rank models 2 and 3 in opposite order. Note that model 3 shows a much greater standard deviation of rms values (0.20) than model 2 (0.04), suggesting that model 3 poorly simulates the relative amplitudes of the four variables. Such information is not considered by the RMSVD but is captured by the MIEI (Eqs. 18, 24). The values of the MIEI derived from the various models are also shown in Fig. 2. A smaller MIEI generally indicates better model performance. For example, models 1 and 6 show smaller MIEIs than the other models; they also show higher VSC values and rms values that correspond closely to the observed ones (Table 1, Fig. 2). The MIEI can thus serve as an index for ranking climate model performance in simulating multiple fields. In comparison with the MIEI, the VFE diagram provides a more detailed evaluation of model performance by explicitly showing multiple statistics, i.e., pattern similarity, rms values and their dispersion, and the RMSVD.
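The exact definition of the MIEI (Eq. 24) is not reproduced in this excerpt. As a purely illustrative stand-in, and explicitly not the paper's Eq. (24), an index with the required monotonicity property can penalize both the departure of each rms ratio from 1 and the departure of the overall pattern similarity from 1:

```python
import numpy as np

def illustrative_index(rms, vsc):
    """A hypothetical MIEI-like score (NOT the paper's Eq. 24): it
    increases monotonically as any rms ratio departs from 1 or as the
    overall pattern similarity (VSC) decreases."""
    rms = np.asarray(rms, dtype=float)
    return np.sqrt(np.mean((rms - 1.0)**2) + 2.0 * (1.0 - vsc))

# A perfect model scores exactly zero
perfect = illustrative_index([1.0, 1.0, 1.0, 1.0], 1.0)

# Compensating rms errors cannot cancel here, unlike in the RMSL
biased = illustrative_index([1.3, 0.7, 1.0, 1.0], 1.0)

# Worse pattern similarity strictly increases the score
worse_pattern = illustrative_index([1.3, 0.7, 1.0, 1.0], 0.9)
```

Because each rms ratio enters the score individually, over- and underestimation of different fields cannot offset each other, which is the property the text attributes to the MIEI but not to the RMSVD.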

The issue of how to take observational uncertainties into account is of particular importance in model evaluation and ranking, especially as more and more observational datasets provide estimates of observational uncertainty. The statistics derived from each group of observational estimates are also shown in Table 1, which can roughly quantify the observational uncertainties and their impact on model evaluation. Generally, the colors are clearly lighter for the statistics of the individual observed variables than for the modeled variables (Table 1). This indicates that the observational uncertainties are relatively small and should have little impact on the evaluation of model performance. To further quantify the impact of observational uncertainty on the ranking of model performance, we calculate the MIEIs of the various climate models by taking each group of observational estimates in turn as the reference data. The three groups of observational estimates generate three groups of MIEIs. We then calculate Spearman's rank correlation coefficient between each group of MIEIs and the MIEIs derived using the ensemble mean of the observational estimates as the reference. The Spearman's rank correlation coefficients are 0.996, 0.996, and 0.904, respectively, suggesting that the rankings are very close to each other no matter which group of observational estimates is used as the reference data. Thus, observational uncertainty should have little impact on the ranking of model performance in this case. When a number of observational estimates are available, one can use the average of the Spearman's rank correlation coefficients to quantify the consistency of the various rankings.
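The rank consistency check described above can be sketched as follows (Python; the MIEI values are invented for illustration and are not taken from Table 1):

```python
import numpy as np

def spearman(x, y):
    # Spearman's rank correlation: Pearson correlation of the ranks
    # (valid here because all values are distinct, i.e., no ties)
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

# Hypothetical MIEIs of nine models scored against two different
# reference (observational) datasets:
miei_ref1 = np.array([0.50, 0.90, 1.10, 0.70, 1.30, 0.60, 1.20, 1.00, 0.80])
miei_ref2 = np.array([0.52, 0.88, 1.15, 0.69, 1.28, 0.61, 1.19, 1.02, 0.79])

rho = spearman(miei_ref1, miei_ref2)
# rho equals 1 here: both references rank the nine models identically,
# so the choice of reference dataset does not affect the ranking.
```

For data with ties, a tie-aware implementation such as scipy.stats.spearmanr would be preferable to the rank trick used here.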

Pyramid chart showing the relationship between three levels of
metrics. The first level of metrics, i.e., correlation coefficient
(CORR), rms value, and RMSD, measures the
model performance in terms of individual variables. The second level of
metrics, i.e., VSC, RMSL, standard deviation of rms values (

The MVIE method proposed here provides a concise way of representing multiple statistics of multiple fields on a two-dimensional plot, i.e., the VFE diagram. The VFE diagram includes three statistical quantities, i.e., the RMSL, VSC, and RMSVD, representing different aspects of model performance. Specifically, the RMSL (RMSVD) represents the total mean value and variance (total RMSDs) of all scalar fields, while the VSC measures the overall pattern similarity across all scalar fields. As shown in the example, each of the three statistical quantities can reasonably represent the corresponding statistics of multiple scalar fields. Moreover, the VFE diagram can illustrate how much of the overall RMSD of the various fields is attributable to the difference in rms values and how much is due to poor pattern similarity. Thus, one can summarize multiple statistics of multiple variables for various models in a single diagram, facilitating the intercomparison of model performance in simulating multiple variables. The MVIE method can be applied to spatial and/or temporal fields. It can also simultaneously evaluate various temporal variabilities simulated by models, e.g., the climatological mean state and the amplitude of interannual variability, as shown in Sect. 3.2. Based on the VFE diagram, we also developed an MIEI, which takes the amplitude and pattern similarity of multiple fields into account. The MIEI satisfies the criterion that a model performance index should vary monotonically as the model performance improves, and it provides a more concise evaluation of model performance in simulating multiple fields than the VFE diagram does.

The statistical metrics presented in this paper can be divided into three different levels and their relationships are summarized in a pyramid chart (Fig. 4). The first level of metrics, i.e., correlation coefficient, rms value, and RMSD, measures model performance in terms of individual variables. These metrics can be illustrated by a table of metrics (Table 1), which can provide detailed information on model performance in simulating individual variables but cannot give a quantitative evaluation of the overall model performance in simulating multiple fields. The second level of metrics, i.e., the VSC, RMSL, standard deviation of rms values, and RMSVD, is derived from the first level of metrics and represents the overall statistics of multiple variables. The second level of metrics can be presented as a VFE diagram, which provides an integrated evaluation of model performance in terms of simulating multiple fields. The MIEI belongs to the third level of metrics, which is defined based on the VFE diagram. The MIEI further summarizes the three statistical quantities of the VFE diagram into a single index and can be used to rank the performance of various climate models. A higher level of metrics provides a more concise evaluation of model performance compared to a lower level of metrics, which facilitates model intercomparison. Unavoidably, the higher level of metrics loses detailed statistical information in contrast to the lower level of metrics. To provide a more comprehensive evaluation of model performance, one can show the VFE diagram together with a table of statistical metrics (Table 1) or other model performance metrics as needed.

As shown in Sect. 2, the VFE diagram can be constructed using uncentered statistics, which are computed from the full scalar fields, including both the mean and anomaly. The VFE diagram can also be constructed using centered statistics (Appendix A). The centered RMSL of a vector field represents the overall variance of all components of the vector field (Eq. A3). The centered VSC can be interpreted as a weighted average of Pearson's correlation coefficients, which measures the overall pattern similarity across all paired anomaly fields (Eq. A9). The centered RMSVD measures the sum of the centered RMSDs across all paired components of two vector fields (Eq. A12). Whether centered or uncentered statistics should be used depends on the application. The uncentered statistics should be used if both the mean and anomaly need to be evaluated, whereas the centered statistics should be used if the anomaly fields are the primary concern. The centered correlations alone are not sufficient for detection studies (Legates and Davis, 1997). It has been argued that the uncentered statistics are better suited for detection because they incorporate the response of the mean value, whereas the centered statistics are more appropriate for attribution because they better measure the similarity between spatial patterns (Hegerl et al., 2001). The VFE diagram thus provides flexibility in model evaluation: for model evaluation aimed at a detection study, one can compute the uncentered statistics from the full fields; conversely, if an attribution study is the major concern, one can compute centered statistics from the vector anomaly fields.
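The practical difference between the two choices can be sketched as follows (Python with synthetic fields): when the components have large, well-simulated means, the uncentered VSC is inflated by the mean part, while the centered VSC measures only the agreement of the anomaly patterns.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
means = np.array([[2.0], [1.0]])                  # nonzero component means
obs = rng.standard_normal((2, n)) + means
mod = obs + 0.3 * rng.standard_normal((2, n))     # synthetic "model" field

def vsc(a, b):
    # Vector field similarity coefficient
    return np.sum(a * b) / np.sqrt(np.sum(a**2) * np.sum(b**2))

# Uncentered statistic: computed from the full fields (mean + anomaly)
vsc_full = vsc(mod, obs)

# Centered statistic: subtract each component's mean first, so that
# only the anomaly patterns are compared
vsc_anom = vsc(mod - mod.mean(axis=1, keepdims=True),
               obs - obs.mean(axis=1, keepdims=True))

# The shared nonzero means inflate the uncentered similarity, so
# vsc_full exceeds vsc_anom for these fields
```

This is one way to see why uncentered statistics respond to mean-state agreement while centered statistics isolate pattern agreement.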

In practice, one may want to weight different fields based on their relative importance. If some variables to be evaluated depend on each other, e.g., skin temperature and surface air temperature, one may also want to weight these variables appropriately because the dependent variables contain redundant information; otherwise, an evaluation with equally weighted variables may overestimate the importance of the dependent variables. Determining the weight coefficients depends on the application and is therefore beyond the scope of this study. Here, we only discuss how weights can be incorporated into the multivariable integrated evaluation (Appendix B). The MVIE method presented in this study requires that each modeled and observed variable be normalized by dividing it by the rms value of the corresponding observed variable (Eqs. 19, 20). Therefore, one should weight the different variables after the normalization (Eqs. B1, B2); otherwise, the normalization process will remove the weight coefficients. Weighting each normalized field leads to a quadratic weighting of the quadratic rms values, quadratic RMSDs, and correlation coefficients (Eqs. B1, B5, B8, B11).
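The order of operations can be sketched as follows (Python; the weights are hypothetical): normalize by the observed rms first, then apply the weights, so that the squared statistics carry the squared weights as described above.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
scales = np.array([[5.0], [1.0], [0.2]])          # fields with different units
obs = scales * rng.standard_normal((3, n))
mod = obs + 0.3 * scales * rng.standard_normal((3, n))

# Step 1: normalize each field by the observed rms value, making fields
# with different units and magnitudes comparable (cf. Eqs. 19, 20)
obs_rms = np.sqrt(np.mean(obs**2, axis=1, keepdims=True))
obs_n, mod_n = obs / obs_rms, mod / obs_rms

# Step 2: apply the weights AFTER normalizing; weighting beforehand
# would simply be removed when dividing by the observed rms
w = np.array([[2.0], [1.0], [1.0]])               # hypothetical weights
obs_w, mod_w = w * obs_n, w * mod_n

# Multiplying each normalized field by w scales its squared rms and
# squared RMSD by w**2, i.e., a quadratic weighting of the quadratic
# statistics
sq_rmsd_w = np.mean((mod_w - obs_w)**2, axis=1)
sq_rmsd_n = np.mean((mod_n - obs_n)**2, axis=1)
```

The weighted squared RMSD of each component equals w**2 times the unweighted one, which is the quadratic weighting the text refers to.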

The VFE diagram and MIEI may also provide some guidance in weighting various climate models to constrain future climate projections. A recent study suggested that model weighting should take both model performance and model interdependence into account to improve climate projections (Knutti et al., 2017). On the one hand, the VFE diagram can summarize model performance in terms of multiple statistics of multiple fields; on the other hand, it can clearly show the differences between models and observations as well as the differences among the models themselves. This information provided by the VFE diagram may be useful for weighting climate models, which warrants further study.

The code used in the production of Fig. 2 and Table 1 is available in the Supplement.

CRU data are provided by the Climatic Research Unit from their website
at

To further interpret the RMSL, VSC, and RMSVD, we break down the full vector
fields

Equation (A1) can be written as

Similarly, we have

With the support of Eq. (13), the VSC can be written as

Equation (A8) can be rewritten as

The RMSVD between two vector fields can also be represented by the mean and
anomaly fields:

The statistics can be computed from either the full vector fields or the anomaly vector fields, depending on the focus of the evaluation. The statistical quantities, i.e., RMSL, VSC, and RMSVD, computed from the full vector fields represent the uncentered pattern statistics, which include contributions from both the mean and anomaly fields. Alternatively, the three statistics can be computed from the anomaly fields, yielding centered statistics, which measure only the anomaly fields. The full vector fields should be used if both the mean and anomaly need to be evaluated, whereas the anomaly vector fields should be used if the anomaly fields are the primary concern.
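The decomposition behind these two choices can be checked numerically (Python, synthetic data): the uncentered (full-field) squared RMSL splits exactly into a mean part plus the centered (anomaly) squared RMSL, consistent with the decomposition in Eq. (A3).

```python
import numpy as np

rng = np.random.default_rng(3)
# A full vector field with nonzero component means
v = rng.standard_normal((2, 400)) + np.array([[1.5], [-0.5]])

def rmsl(a):
    # Root mean square length of a vector field
    return np.sqrt(np.mean(np.sum(a**2, axis=0)))

means = v.mean(axis=1, keepdims=True)
anom = v - means                        # anomaly (centered) vector field

# Uncentered RMSL^2 = sum of squared component means + centered RMSL^2
full_sq = rmsl(v)**2
decomposed = np.sum(means**2) + rmsl(anom)**2
```

The identity holds exactly, so dropping the mean part is all that distinguishes the centered statistics from the uncentered ones.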

In terms of model evaluation, one may care about some variables more than
others, even though all are of concern. In such a circumstance, it would be
useful to weight different variables to make the VSC, RMSL, and RMSVD more
sensitive to some variables than to others.
Without loss of generality, the weighted- and normalized-vector fields

Based on Eq. (16), the square of the RMSVD between normalized vector fields

Based on Eq. (13), the VSC between normalized vector fields

ZX devised the evaluation method and wrote the paper. All of the authors discussed the results and commented on the paper.

The authors declare that they have no conflict of interest.

We acknowledge the World Climate Research Programme's Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups for producing and making available their model output. The study was supported jointly by the National Key Research and Development Program of China (2016YFA0600403), the Major Research Plan of the National Science Foundation of China (91637103), and the National Science Foundation of China (grants 41675080 and 41675105). This work was also supported by the Jiangsu Collaborative Innovation Center for Climate Change.

Edited by: Klaus Gierens
Reviewed by: two anonymous referees