Land surface Verification Toolkit (LVT) - A generalized framework for land surface model evaluation

Abstract

Model evaluation and verification are key to improving the usage and applicability of simulation models for real-world applications. In this article, the development and capabilities of a formal system for land surface model evaluation, called the Land surface Verification Toolkit (LVT), are described. LVT is designed to provide an integrated environment for systematic land model evaluation and facilitates a range of verification approaches and analysis capabilities. LVT operates across multiple temporal and spatial scales and employs a large suite of in-situ, remotely sensed and other model and reanalysis datasets in their native formats. In addition to the traditional accuracy-based measures, LVT also includes uncertainty and ensemble diagnostics, information theory measures, spatial similarity metrics and scale decomposition techniques that provide novel ways of performing diagnostic model evaluations. Though LVT was originally designed to support the land surface modeling and data assimilation framework known as the Land Information System (LIS), it also supports hydrological data products from other, non-LIS environments. In addition, the analysis of diagnostics from various computational subsystems of LIS, including data assimilation, optimization and uncertainty estimation, is supported within LVT. Together, LIS and LVT provide a robust end-to-end environment for enabling the concepts of model data fusion for hydrological applications.


Introduction
Verification and evaluation are essential processes in the development and application of simulation models. Land surface models (LSMs) are one such class of simulation models, specifically designed to represent terrestrial water, energy and biogeochemical processes. LSMs generate estimates of terrestrial biosphere exchanges by solving the governing equations of the soil-vegetation-snowpack medium, and can be run either in offline mode or coupled to an atmospheric model. An accurate representation of land surface processes is therefore critical for improving models of the boundary layer and land-atmosphere coupling as well as real-world applications such as ecosystem modeling, agricultural forecasting and water resources prediction and management (NRC (1996)). The process of systematic evaluation and verification helps in the characterization of accuracy and uncertainty in the model predictions, which can then be used as a benchmark for future model enhancements.
Further, quantitative measures of the fidelity of model simulations are essential for improving the usage and acceptability of LSM forecasts for real-world applications.

The Global Energy and Water Cycle Experiment (GEWEX) Global Land Atmosphere System
Study (GLASS) has identified that a general benchmarking framework capable of capturing useful modes of variability of LSMs through a range of performance metrics is necessary for further advancing the performance and predictability of the models (van den Hurk et al. (2011)). In their recommendation of the priorities for hydrologic research, Entekhabi et al. (1999) emphasize the need for defining formal evaluation procedures to improve the "observability" of many LSM processes.
For example, soil moisture in most LSMs represents an index of the moisture state (Koster et al. (2009)), and the estimates from different models vary significantly even when forced with the same meteorology (Dirmeyer et al. (2006)). Further, the soil profile representations in LSMs and assumptions about parameters such as soil hydraulic properties vary significantly across models. As a result, direct comparison of soil moisture estimates from these models against in-situ and remote sensing measurements becomes difficult. Given that a large suite of application models require soil moisture estimates as inputs (e.g. weather and climate forecasting (Fennessey and Shukla (1999); Koster et al. (2004)), agricultural models (Rosenzweig et al. (2002)), ecosystem models (Friend and Kiang

models and available datasets. The key aspect of the MDF philosophy consists of using information from data to help the formulation, characterization and evaluation of models in a structured manner.
The results of the evaluation step are then used to revise and improve model formulation and subsequent development. As part of the new structure formulated in 2009, the GLASS community has identified Benchmarking and MDF as two of its three core themes for research going forward. Here we describe the development of a formal evaluation system for land surface models that addresses both of these themes identified by the GLASS community. The evaluation framework is designed to supplement an existing modeling system, to enable end-to-end formulations of the MDF paradigm.
As described in Kumar et al. (2006), Peters-Lidard et al. (2007) and Kumar et al. (2008a), the NASA Land Information System (LIS) is a flexible land surface modeling framework that has been developed with the goal of integrating satellite- and ground-based observational data products and advanced land surface modeling techniques to produce optimal fields of land surface states and fluxes. The LIS infrastructure is designed as a land surface modeling and hydrological data assimilation system that generates estimates of water and energy states (e.g. soil moisture, snow) and fluxes (e.g. evaporation, transpiration, runoff) over a range of spatial (as finely resolved as 1km or finer) and temporal (up to 1 hour and finer) resolutions. LIS operates several community land surface models and supports their application over global, regional or point domains. LIS is designed with advanced software engineering principles and provides a flexible, extensible framework for the inclusion of models, computational tools and datasets.
As a land surface modeling component for earth system models, LIS has also been coupled to atmospheric models such as the Weather Research and Forecasting (WRF) model (Kumar et al. (2007); Santanello et al. (2009)). LIS includes a comprehensive data assimilation subsystem (Kumar et al. (2008b)) that enables the incorporation of several observational and satellite data sources for assimilation, in an interoperable manner. Additional computational tools to assist the utilization of data include parameter estimation and optimization (Santanello et al. (2007); Peters-Lidard et al. (2008)) and uncertainty modeling (Harrison et al. (2011)) subsystems. The uncertainty modeling components in LIS enable the explicit characterization of different sources of uncertainty in modeling using Bayesian inference techniques. In summary, LIS provides several key components of the MDF paradigm, including a suite of LSMs and computational tools such as data assimilation, optimization and uncertainty estimation.

In this article, we describe the development of a formal system for land surface model evaluation called the Land surface Verification Toolkit (LVT), designed to enable the systematic evaluation and intercomparison of various terrestrial hydrological datasets. LVT not only supports the diagnostic evaluation of the land model simulations from LIS and other land surface modeling systems, but also provides the capabilities for the analysis of outputs from various LIS subsystems such as data assimilation, optimization, uncertainty estimation, radiative transfer and emission models, and application models. A large suite of in-situ, remotely-sensed and other model and reanalysis datasets are supported in LVT, which captures a wide range of land surface and terrestrial hydrologic regimes across the globe. In addition, a wide range of analysis metrics and procedures are supported in LVT to facilitate a comprehensive evaluation of hydrological datasets. Figure 1 presents a schematic of the key functions of LVT and its interconnections with LIS and the observational datasets. The following sections describe the capabilities of LVT in detail.
Together, LIS and LVT encompass a comprehensive set of computational tools for fully enabling the MDF concept. The capabilities in LIS enable the estimation of model parameters with the use of the optimization subsystem and state estimation with the use of the data assimilation subsystem.

The uncertainty estimation tools enable the characterization of various sources of input uncertainty and their impacts on model prediction uncertainty. By providing the tools for model testing and diagnostic evaluation, LVT completes the requisite components of the MDF paradigm.
This article is structured as follows: Section 2 provides a review of land model evaluation and verification efforts. This is followed by the description of the LVT design (Section 3) and features (Section 4). A number of examples are presented in Section 5 that demonstrate how the LVT capabilities enable end-to-end MDF experiments.

Background
There have been a number of efforts to document and standardize land surface model evaluation. The model process development studies are typically focused on evaluating the model performance at point or local scales (e.g., Henderson-Sellers et al. (1995); Chen et al. (1996); Pitman and Henderson-Sellers (1998); Koren et al. (1999); Blyth et al. (2010); Barlage et al. (2010); Niu et al. (2011)).
Though they are instrumental in benchmarking the improvements to model physics, these reported enhancements do not necessarily translate to broader spatial scales. Blyth et al. (2011) stress that model evaluations must be performed separately at the scales of interest, to guarantee transferability of model processes to different scales.
There have been several community-wide efforts such as the Global Soil Wetness Project (GSWP; Dirmeyer et al. (2006)), African Monsoon Multidisciplinary Analysis (AMMA) Land surface Model Intercomparison Project (ALMIP; de Rosnay et al. (2006)) and Carbon-LAnd Model Intercomparison Project (C-LAMP; Randerson et al. (2009)) that were focused on evaluating and intercomparing a suite of land surface models when forced with a common suite of inputs. These studies documented the systematic improvements in land surface model development and provided benchmarks for the simulation of continental scale water and energy budgets. Similar multi-model efforts include the North American Land Data Assimilation System (NLDAS; Mitchell et al. (2004)) and the Global Land Data Assimilation System (GLDAS; Rodell et al. (2004b)) projects, which generate land surface model outputs in near real-time, forced with observation-based meteorology. A detailed evaluation of the NLDAS model products against available observations was conducted during phase-I and II of the project (Robock et al. (2003); Sheffield et al. (2003); Pan et al. (2003); Lohmann et al. (2004); Mo et al. (2011); Xia et al. (2011a,b)). Evaluation of the model simulations from GLDAS against in-situ and remote sensing measurements is presented in Rodell et al. (2004a) and Kato et al. (2007). The LandFlux-EVAL project, a more recent initiative, evaluated evapotranspiration estimates from a number of LSMs against in-situ data based estimates (Jiménez et al. (2011)). Approaches to define a minimum acceptable performance benchmark of LSMs by comparing them to calibrated noncausal (statistical/correlational) models are explored in Abramowitz et al. (2008).
Though these efforts cover a wide spectrum of model evaluation and benchmarking of model process advancements, the evaluation criteria and the performance metrics tend to be specific to each application. LVT consolidates the requirements identified in these efforts within a single framework.

Forecast Centers (RFCs). Protocol for the Analysis of Land Surface models (PALS) is a web-based application for evaluating land surface models against observed datasets and calibrated statistical models (Abramowitz et al. (2008)). LVT and PALS will continue to be developed concurrently.

LVT shares many features with these existing environments, but focuses on the native use of observational and model datasets, since the interpretation of data formats and reporting procedures is a critical and time-consuming step in the evaluation process. LVT is designed as a framework that can be directly used and extended by individual users and also includes a number of advanced features such as the evaluation of data assimilation diagnostics, standardized land surface diagnostics, and uncertainty and information theory based analysis features. The following sections describe the design and capabilities of LVT.

Design of the LVT framework
LVT is implemented using object oriented framework design principles as a modular, extensible and reusable system. The software architecture of the system follows a three layer structure, as shown in Figure 2. The LVT core, the top layer, encompasses generic modeling features such as the management of time, I/O, configuration, logging and geospatial transformations. The middle layer, called "Abstractions", represents the extensible interfaces defined for incorporating additional functionalities into LVT. These include plugin interfaces for implementing new observational data sources and analysis metrics. The Abstractions layer provides the entry points for the reuse of the existing generic capabilities of the LVT core. The top two layers thus represent the classic "semi-complete" nature of an object oriented framework, which is made fully functional by including specific implementations of the abstractions. As shown in Figure 2, support for reading and processing observations from a wide range of terrestrial hydrological datasets has been implemented using the "Observations" abstraction. Similarly, a large suite of analysis metrics has been implemented by extending the "Metrics" abstraction.
The LVT software is primarily written in the Fortran 90 programming language. Though Fortran 90 lacks direct support for object oriented programming concepts such as polymorphism and inheritance, these properties can be simulated in software (Decyk et al. (1997)) through the combined use of the Fortran 90 and C programming languages. The compile-time polymorphism in LVT is simulated through the use of virtual function tables, by employing the C language to interface with Fortran 90 functions and storing them in memory to be invoked at runtime.
A key advantage of this object oriented design is interoperability. The top two layers (LVT core and Abstractions) define the interactions between an Observation or a Metric implementation and the LVT core in a generic manner. Similarly, the required interconnections between an Observation implementation and a Metric implementation are also handled generically. As a result, the existing functionalities of the system are automatically available to a new addition in LVT implemented through the extension of an Abstraction. For example, a newly incorporated observation implementation can take advantage of all available analysis metrics without having to define any additional interconnections between each bottom layer component.
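The plugin interoperability described above can be sketched in a few lines. The sketch below is illustrative only: LVT itself implements this pattern in Fortran 90 with C-based function tables, and the dataset, metric and function names here are hypothetical, not actual LVT identifiers.

```python
# Sketch of LVT-style interoperability: observation readers and metrics
# register themselves in function tables; any (observation, metric) pair
# then works without extra glue code between bottom-layer components.

OBS_READERS = {}   # "Observations" abstraction: dataset name -> reader
METRICS = {}       # "Metrics" abstraction: metric name -> function

def register_obs(name):
    def wrap(fn):
        OBS_READERS[name] = fn
        return fn
    return wrap

def register_metric(name):
    def wrap(fn):
        METRICS[name] = fn
        return fn
    return wrap

@register_obs("SCAN_soil_moisture")
def read_scan():
    # A real reader would parse the dataset's native format.
    return [0.21, 0.25, 0.30, 0.28]

@register_metric("bias")
def bias(model, obs):
    return sum(m - o for m, o in zip(model, obs)) / len(obs)

@register_metric("rmse")
def rmse(model, obs):
    return (sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs)) ** 0.5

def evaluate(model, obs_name):
    """Every registered metric is automatically available to every
    registered observation source; the core never special-cases pairs."""
    obs = OBS_READERS[obs_name]()
    return {name: fn(model, obs) for name, fn in METRICS.items()}

results = evaluate([0.20, 0.27, 0.29, 0.30], "SCAN_soil_moisture")
```

Adding a new reader or metric here requires only one registration call, mirroring how a new LVT Abstraction implementation gains access to all existing functionality.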

Note that many of the model-independent capabilities within the LVT are enabled by the Earth System Modeling Framework (ESMF; Hill et al. (2004)). ESMF provides a structured collection of building blocks that can be customized to develop model components for Earth science applications. It provides an infrastructure of utilities and a superstructure for coupling different model components. LVT employs the ESMF infrastructure utilities to handle the management of clock/time, configuration, and logging. Further, LVT also employs the generic ESMF objects (called ESMF States) for sharing data and information between different components.

Capabilities of LVT
A critical part of an evaluation procedure is the processing of datasets, which normally consist of model outputs and measurements from in-situ, satellite and remote sensing platforms. These datasets typically have different file formats, spatial and temporal scales and reporting procedures. Further, the in-situ and remotely sensed measurements typically require extensive quality control before their use. The rectification of such differences between the datasets being compared is an essential, but routine and time-consuming, step in the evaluation process. The philosophy in LVT is to use the datasets in their native formats. The "plugin" style design of LVT enables the development of data processors corresponding to each dataset. Once developed, these data processors can be subsequently used to work with an ongoing data collection without additional reprocessing.
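The kind of per-dataset processing described above can be sketched as follows. The ASCII layout, station names and quality-control flags below are entirely made up for illustration; they do not correspond to any actual dataset format supported by LVT.

```python
# Illustrative sketch of an LVT-style data processor that reads in-situ
# records in their native (here, a hypothetical ASCII) format and applies
# basic quality control before use.

raw = """\
# station  yyyymmddhh  soil_moisture  qc_flag
SM01 2010070100 0.231 G
SM01 2010070200 -9999 M
SM01 2010070300 0.198 G
SM01 2010070400 0.202 S
"""

def parse_records(text, missing=-9999.0):
    """Keep only records flagged 'G' (good) with non-missing values."""
    out = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        station, stamp, value, flag = line.split()
        value = float(value)
        if flag != "G" or value == missing:
            continue  # QC: drop missing or suspect observations
        out.append((station, stamp, value))
    return out

records = parse_records(raw)
```

Once such a processor exists for a dataset, it can be reused for every subsequent analysis of that data collection, which is the point of the plugin design.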

Support for terrestrial hydrological datasets in LVT
The key processes that constitute the terrestrial hydrological cycle include precipitation, radiation, interception of precipitation by vegetation, infiltration of precipitation into the soil and the vertical transfer of soil moisture, evapotranspiration, formation of snow, snow melt, and river runoff, among others. In order to quantify the contribution of these individual processes to the overall variability of the terrestrial hydrological cycle, they must be evaluated against the full suite of available measurements. Motivated by this goal, the processing of a large set of measurements of different processes from a variety of sources is supported in LVT. As shown in Table 1, these datasets constitute the monitoring of different components of the terrestrial hydrological cycle from different observing platforms. The spatial and temporal scales of these measurements also vary significantly. By incorporating the processing of these datasets under a single, integrated framework, LVT enables an environment for performing a comprehensive evaluation of the terrestrial hydrological processes.
Note that the support of this large suite of products is enabled by the extensible nature of the LVT software design and is expected to further expedite the incorporation of other relevant datasets in the future.

Analysis Metrics
The need for a variety of performance evaluation metrics in the verification process is well recognized (Stanski et al. (1989)), as the robustness and sensitivity of each metric to measurement attributes vary (Entekhabi et al. (2010)). Further, the appropriateness of an analysis metric may also differ significantly based on the targeted application (Gupta et al. (2009)). Model evaluation studies quite often use accuracy-based metrics that quantify model performance using residual-based measures. These metrics, however, may not provide further insights on the robustness of the model under future or unobserved scenarios (Pachepsky et al. (2006)). They are also inadequate for capturing estimates of associated uncertainties (Gulden et al. (2008)), the relative importance and sensitivity of model parameters to the overall accuracy and uncertainty, tradeoffs in performance due to spatial scales, and the tradeoffs between actual information content and variability introduced by random noise. Gupta et al. (2008) emphasize the need for sophisticated diagnostic evaluation methods that help in isolating the limitations of model representations.

A number of analysis metric types are supported in LVT, including: (1) statistical accuracy measures that are conventionally used for model evaluation by comparing the model simulation against independent measurements and observations (e.g. RMSE, bias); (2) ensemble measures that provide assessments of the accuracy of probabilistic model outputs against observations; (3) metrics that help in quantifying the apportionment of uncertainty and the sensitivity of model simulations to model parameters; (4) information theory-based measures that provide estimates of the information content and complexity associated with model simulations and measurements; (5) spatial similarity and scale decomposition methods that assist in quantifying the impact of spatial scales on model improvements and errors; and (6) standard diagnostics to evaluate the efficiency of computational algorithms such as data assimilation. Table 2 presents a list of the supported metric implementations within LVT. The details of the metric implementations are discussed in Section 5 through a number of illustrative examples. The availability of this suite of metrics enables novel ways to quantify and translate model performance.

Miscellaneous features
LVT also supports a number of miscellaneous features to assist the verification procedures. To provide a measure of the statistical significance and the influence of sampling density on the results, confidence intervals based on Gaussian distributions are computed for each verification metric. LVT generates the results of the analyses in ASCII text, binary, GRIB and NetCDF output formats. The capabilities to generate probability density functions (PDFs) of the computed metrics by stratifying according to specified parameters are also included in LVT. Further, LVT also provides methods to impose user-defined masking to exclude selected grid points when analysis metrics are computed. These masks can be static, time-varying or based on a certain variable. For example, a downward shortwave radiation (SW↓) based mask can be defined that separates the analysis computations when the SW↓ values are above and below a specified threshold (say 5 W/m^2). This will enable a day-night stratification of the computed metrics, when SW↓ values are above and below 5 W/m^2, respectively.
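The variable-based masking and Gaussian confidence intervals described above can be sketched together. This is a minimal illustration with invented numbers, not LVT's actual implementation; the 5 W/m^2 day/night threshold follows the example in the text.

```python
import math

# Sketch of LVT-style stratified analysis: a downward-shortwave mask splits
# samples into "day" (SW > 5 W/m^2) and "night" strata, and a Gaussian 95%
# confidence interval is attached to the mean error of each stratum.

sw_down = [0.0, 0.0, 120.0, 480.0, 310.0, 2.0, 0.0, 250.0]  # W/m^2
errors  = [0.5, 0.4, 1.5, 2.1, 1.8, 0.6, 0.3, 1.6]          # model - obs

def mean_ci(values, z=1.96):
    """Mean and Gaussian 95% CI half-width, z * sigma / sqrt(n)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, z * math.sqrt(var / n)

day   = [e for sw, e in zip(sw_down, errors) if sw > 5.0]
night = [e for sw, e in zip(sw_down, errors) if sw <= 5.0]

day_mean, day_ci = mean_ci(day)
night_mean, night_ci = mean_ci(night)
```

The stratified means (larger daytime errors in this toy case) illustrate how a mask exposes conditional behavior that a single aggregate metric would hide.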

LVT also includes a number of land surface process diagnostics related to the partitioning of energy across the land-atmosphere interface, such as evaporative fraction, Bowen ratio and the overall energy, water and evaporation budgets at the land-atmosphere interface. These diagnostics are computed for both model and observational datasets. Quantifying these diagnostics is important for improving the understanding of the feedbacks between the land surface and the atmosphere.
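The two partitioning diagnostics named above have simple standard definitions, sketched below with invented flux values:

```python
# Land-atmosphere energy partitioning diagnostics: evaporative fraction
# EF = LE / (LE + H) and Bowen ratio B = H / LE, where LE is latent heat
# flux and H is sensible heat flux (both in W/m^2).

def evaporative_fraction(le, h):
    return le / (le + h)

def bowen_ratio(le, h):
    return h / le

LE, H = 300.0, 100.0                 # illustrative flux values
ef = evaporative_fraction(LE, H)     # fraction of available energy to LE
b = bowen_ratio(LE, H)               # sensible-to-latent flux ratio
```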

As mentioned earlier, LVT also supports the analysis of diagnostics generated by the LIS data assimilation subsystem. These include distribution statistics of the data assimilation innovations and analysis gain, which provide measures of the efficiency of data assimilation configurations. Similarly, LVT also handles the outputs of the optimization and uncertainty estimation subsystems of LIS. For example, checks to assess the convergence of these iterative algorithms can be performed by analyzing the optimization and uncertainty estimation outputs through LVT.
Though LVT was originally designed to support LIS outputs, it has since been extended to facilitate the evaluation of other "non-LIS" model products. LVT contains the features to convert a given non-LIS product to a LIS output style and format. It then uses the converted output for evaluation. Note that this process does not involve any spatial or temporal transformation of the data, rather the conversion to a different data format and convention.
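The conversion step described above amounts to renaming variables and harmonizing units without resampling. The sketch below illustrates the idea; the variable names, unit strings and conversion rules are hypothetical and are not actual LIS/LVT conventions.

```python
# Illustrative sketch of a non-LIS-to-LIS-style conversion: map a product's
# variable names and units onto a (hypothetical) LIS-style convention so
# downstream evaluation code sees a uniform layout. No spatial or temporal
# transformation is applied, matching the behavior described in the text.

LIS_STYLE = {                        # hypothetical target convention
    "soil_moist": ("SoilMoist_tavg", "m3/m3"),
    "swe_mm": ("SWE_tavg", "kg/m2"),
}

def to_lis_style(record):
    """Rename variables and convert units; no resampling of the data."""
    out = {}
    for name, (value, unit) in record.items():
        lis_name, lis_unit = LIS_STYLE[name]
        if name == "swe_mm" and unit == "mm":
            value = value * 1.0      # 1 mm water depth == 1 kg/m2
        out[lis_name] = (value, lis_unit)
    return out

converted = to_lis_style({"soil_moist": (0.27, "m3/m3"),
                          "swe_mm": (120.0, "mm")})
```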
Model evaluation examples using LVT

An end-to-end example of the MDF paradigm

As noted earlier, one of the key motivations behind LVT is to provide a system that can augment LIS' modeling capabilities with an evaluation framework. The joint use of both these systems enables an end-to-end environment for facilitating the steps of the MDF paradigm. In this section, we present an example of using the modeling and computational tools in LIS to refine the model performance and the verification features in LVT to quantitatively evaluate the simulations.
Model simulations using the Noah LSM (version 3.2) (Ek et al. (2003); Barlage et al. (2010)) forced with the NLDAS-II datasets are conducted over a 500x500 domain covering the U.S. Southern

Analysis of data assimilation diagnostics
The example in Section 5.1 presents an instance of the MDF paradigm that employs parameter estimation for model reformulation. As noted in Williams et al. (2009), similar MDF instances can be defined that employ data assimilation techniques to improve state estimation. This section presents an example of using data assimilation diagnostics to assess the performance of the system within an MDF context.
The difference between the observations being assimilated and the model forecasts, known as innovations, is typically computed during data assimilation. The statistics of the innovations are typically used to diagnose the performance of the assimilation algorithm. For example, when the Ensemble Kalman Filter (EnKF) is used as the assimilation algorithm, linear system dynamics is assumed, with Gaussian, mutually and serially uncorrelated errors in the model and observations (Reichle and Koster (2002)). Consequently, the distribution of normalized innovations (normalized with their expected covariance) is expected to follow a standard normal distribution N(0, 1) (Gelb (1974)). Deviations from the expected mean and standard deviation of the normalized innovation distribution are used as a measure of the suboptimality of the data assimilation configuration. A number of studies have confirmed that poor specification of model and observation error parameters can significantly degrade the quality of assimilation products (Reichle et al. (2008)). The assimilation diagnostics can be analyzed using LVT, and the model and observation error specifications can then be continually revised to ensure optimal data assimilation performance.
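The normalized-innovation check described above can be sketched numerically. The sketch below is a minimal, self-contained illustration with synthetic numbers, not LVT's diagnostic code: innovations are divided by the square root of their expected covariance, and the resulting sequence is tested for mean near 0 and variance near 1.

```python
import math
import random

# Sketch of the innovation diagnostic: d = y_obs - H(x_forecast), normalized
# by sqrt(expected innovation covariance). Under a well-tuned EnKF the
# normalized sequence should be approximately N(0, 1).

random.seed(42)

def normalized_innovations(obs, forecasts, expected_cov):
    return [(y - hx) / math.sqrt(s)
            for y, hx, s in zip(obs, forecasts, expected_cov)]

# Synthetic "well-calibrated" case: observations drawn with exactly the
# spread the filter assumes, so the diagnostic should pass.
s = 0.04                                   # expected innovation variance
obs = [random.gauss(0.25, math.sqrt(s)) for _ in range(5000)]
forecasts = [0.25] * 5000
norm = normalized_innovations(obs, forecasts, [s] * 5000)

mean = sum(norm) / len(norm)
var = sum((v - mean) ** 2 for v in norm) / (len(norm) - 1)
# mean should be near 0 and variance near 1 for a consistent filter
```

If the assumed covariance were too small, `var` would exceed 1, flagging overconfident error settings, which is exactly the revision signal described in the text.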
To demonstrate these capabilities, a synthetic data assimilation experiment is conducted over the Continental U.S. domain at 1° spatial resolution, for the time period 1 Jan 2000 to 1 Jan 2006. In this experiment, the observations to be assimilated are synthetically simulated (from an independent land model simulation using the Catchment LSM) and, as a result, the associated errors are perfectly known. The observations are assimilated using the Ensemble Kalman Filter (EnKF) algorithm. The details of the assimilation setup are provided in . Figure 5 shows the spatial distribution of the mean and variance of the normalized innovations over the domain generated by the assimilation system. In this instance, the mean values are close to zero and the variances are close to 1, indicating near-optimal performance. Additional analysis metrics, such as lag correlation coefficients to assess the "whiteness" of the innovation distribution, are also provided within LVT for more detailed evaluations of the efficiency of the data assimilation system.

Characterization of uncertainty diagnostics
It is well acknowledged that model simulations and observations are affected by different sources of uncertainty. Errors in model parameters, input forcing and structural deficiencies introduce uncertainties in the model simulations. Measurements from satellite and remote sensing platforms are subject to measurement noise and errors in retrieval models. Similarly, the in-situ measurements also have associated uncertainties due to environmental factors, data processing and instrument errors. Therefore, it is important to quantify the impact of these uncertainty sources

Information Theory metrics
A number of studies (Wackerbauer et al. (1994); Lange (1999); Selle and Huwe (2004)) describe the use of information theory-based metrics to discriminate time series data based on their information content (or randomness) and their complexity. Pachepsky et al. (2006) and Pan et al. (2011) describe the use of these measures for discriminating soil water models. LVT includes a number of information theory-based measures such as metric entropy, mean information gain, effective complexity and fluctuation complexity. These measures are computed by converting the time series of a given dataset into a binary symbol string (Lange (1999)). Within the symbol string, patterns of words (defined as a group of consecutive symbols of a certain length) are identified, each representing a state of the system of interest. For example, a word consisting of L consecutive symbols has 2^L possible states.
The information theory metrics are then defined by computing the probabilities associated with the patterns of words in the converted time series of the data. For example, the metric entropy (ME) and information gain (IG) metrics are defined in terms of these word probabilities.
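The word-probability construction can be sketched as follows. This is one common formulation following the symbol-string approach of Lange (1999); the binarization rule (about the mean) and normalizations here are assumptions, and LVT's exact definitions may differ.

```python
import math
from collections import Counter

# Sketch of symbol-string information measures: binarize the series, count
# words of L consecutive symbols, and compute entropies of the word
# distributions. Metric entropy ME = H_L / L (0 <= ME <= 1, since a word of
# L binary symbols has at most 2^L equally likely states, i.e. H_L <= L).
# Mean information gain is taken here as H_{L+1} - H_L, the average
# information gained on seeing one more symbol.

def binarize(series):
    mu = sum(series) / len(series)
    return "".join("1" if v > mu else "0" for v in series)

def word_entropy(symbols, L):
    """Shannon entropy (bits) of the distribution of words of length L."""
    counts = Counter(symbols[i:i + L] for i in range(len(symbols) - L + 1))
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def metric_entropy(series, L=3):
    return word_entropy(binarize(series), L) / L

def mean_information_gain(series, L=3):
    s = binarize(series)
    return word_entropy(s, L + 1) - word_entropy(s, L)

periodic = [0, 1] * 64              # perfectly predictable series
me_p = metric_entropy(periodic)     # low: only 2 of 8 possible words occur
mig_p = mean_information_gain(periodic)  # ~0: next symbol is determined
```

A perfectly periodic series yields a low metric entropy and near-zero information gain, which is the discrimination behavior these measures exploit for model time series.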
Characterization of the nature of spatial variability of different component processes over a range

Using the domain configuration at 1km spatial resolution over Afghanistan used in Section 5.1, two model simulations are conducted using the Noah LSM (version 2.7.1): one that employs a terrain-based correction of the shortwave radiation input to the LSM and one that does not include such adjustments. The terrain-based corrections adjust the incoming shortwave radiation based on terrain slope and aspect, and these changes in turn impact the evolution of snow over the terrain. The improvement in the snow cover simulation as a result of the terrain-based correction is computed as the difference in POD fields from the two simulations, generated by comparing against the MOD10A1 (version 4) fractional snow cover product. The scale-decomposition approach is then applied to this difference field to quantify how the improvements in snow cover estimates at 1km spatial resolution translate to coarser spatial scales.

Figure 8 shows the result of the scale decomposition of the total improvement field for POD using the two-dimensional discrete Haar wavelet transform. The algorithm computes successive decompositions of the original field by powers of 2. The percentage contribution to the total improvement at each coarse spatial scale is shown in Figure 8. The results indicate that most of the improvements in POD are obtained at fine spatial scales, and that the contribution decreases as the spatial scale coarsens. At scales coarser than 16km, the percentage contribution drops below 10%.
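The successive power-of-two decomposition described above can be sketched with block averaging, which is the smoothing step of the discrete Haar transform. This is an illustrative simplification on an invented 4x4 field, not LVT's wavelet code: the mean-square detail removed at each halving of resolution is reported as a percentage of the total.

```python
# Sketch of a two-dimensional Haar-style scale decomposition: smooth the
# field by 2x2 block averaging, measure the energy of the detail removed
# at each scale, and express each scale's share as a percentage.

def block_average(field):
    n = len(field)
    return [[(field[2*i][2*j] + field[2*i][2*j+1] +
              field[2*i+1][2*j] + field[2*i+1][2*j+1]) / 4.0
             for j in range(n // 2)] for i in range(n // 2)]

def upsample(field):
    return [[field[i // 2][j // 2] for j in range(2 * len(field))]
            for i in range(2 * len(field))]

def scale_contributions(field):
    """Mean-square detail removed at each halving of resolution,
    as a percentage of the total across all scales."""
    energies = []
    current = field
    while len(current) > 1:
        smooth = block_average(current)
        coarse = upsample(smooth)
        n = len(current)
        detail = sum((current[i][j] - coarse[i][j]) ** 2
                     for i in range(n) for j in range(n)) / (n * n)
        energies.append(detail)
        current = smooth
    total = sum(energies) or 1.0
    return [100.0 * e / total for e in energies]

# Invented 4x4 "POD improvement" field with purely fine-scale structure
field = [[0.2, -0.1, 0.0, 0.1],
         [-0.1, 0.2, 0.1, 0.0],
         [0.0, 0.1, 0.3, -0.2],
         [0.1, 0.0, -0.2, 0.3]]
percent = scale_contributions(field)  # one entry per factor-of-2 scale
```

In this toy field every 2x2 block averages to the same value, so all of the improvement energy sits at the finest scale, the qualitative pattern the text reports for the POD field.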

Similar analysis of scale effects can be performed on other metrics and variables of interest. This example demonstrates the use of LVT for another MDF experiment where the MODIS fractional snow cover data is used to assess the applicability of model formulations at different spatial scales.

Spatial similarity measures
With the increased availability of spatially distributed datasets from satellites and remote-sensing platforms, there is a need for techniques and metrics that evaluate models and observations based on their spatial patterns, in addition to the one-to-one correspondence comparisons that are typically used. The incorporation of spatial pattern comparisons will aid in further improving the reliability of LSMs for hydrological applications (Bloschl and Sivapalan (1995); Grayson and Bloschl (2000)).
A review of spatial similarity methods in hydrology is provided in Wealands et al. (2005), which includes techniques based on statistical identification as well as image processing techniques. In this section, an example of using a similarity metric through LVT to compare snow cover patterns from two different LSMs is presented.
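As a simple illustration of comparing two binary snow cover patterns, the sketch below uses the critical success index (intersection over union). This is a generic stand-in for the pattern-comparison measures discussed above, not necessarily the specific similarity metric used in the LVT example, and the masks are invented.

```python
# Illustrative spatial-pattern comparison of two binary snow cover masks
# using the critical success index: hits / (hits + misses + false alarms),
# i.e. the intersection over union of the two snow-covered areas.

def critical_success_index(mask_a, mask_b):
    hits = misses = false_alarms = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                hits += 1
            elif a and not b:
                misses += 1
            elif b and not a:
                false_alarms += 1
    return hits / (hits + misses + false_alarms)

# Invented 4x4 snow masks standing in for the two model patterns
noah = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 1]]
clm  = [[1, 1, 1, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 1]]
csi = critical_success_index(noah, clm)
```

Unlike a grid-cell-by-grid-cell error score, such a measure rewards overlapping spatial structure, which is the motivation for pattern-based comparison given in the text.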
Snow cover estimates using two LSMs, Noah (version 3.2) and CLM (version 2; Dai et al. (2003)), forced with GDAS and CMAP datasets, are generated over a 100x100 region near the Southern Great

LVT is an evolving framework and continues to be enhanced with the addition of new analysis capabilities and the incorporation of terrestrial hydrological datasets. In addition to the handling of LSM outputs, support for outputs from various application models coupled to LIS (e.g. crop, drought, flood, landslide models) is also being developed. Ensemble measures such as reliability, resolution and discrimination (Murphy and Winkler (1992)) and timing error measures (Liu et al. (2011b)) will also be incorporated into the current suite of analysis metrics. The use of a common environment for diagnostic evaluation will also help in quantifying the tradeoffs between different metrics and skill scores. For example, different organizations use different indices for quantifying the severity of drought (Heim (2002)). The availability of these drought indices through LVT will enable cross-comparisons of these measures and the assessment of their suitability for the intended application. In summary, the growing capabilities of LVT are expected to help in the definition and refinement of a formal benchmarking and evaluation process for LSMs and to assist in improving their use for real-world applications.