The GEWEX LandFlux project : evaluation of model evaporation using tower-based and globally-gridded forcing data

Introduction Conclusions References


Introduction
Characterizing the exchange of water between the land surface and the atmosphere is a topic of multi-disciplinary interest, as the processes that comprise this dynamic cycling of water determine the spatial and temporal variability of hydrological responses across local and global scales. In recent years, there has been significant progress in the development of regional and global datasets based largely on remote sensing 15 retrievals. These data have provided a wealth of spatially and temporally varying information across a range of Earth system processes, including soil moisture (Liu et al., 2011a), vegetation change (Tucker et al., 2005;Liu et al., 2011bLiu et al., , 2013, groundwater (Famiglietti et al., 2011;Richey et al., 2015) and precipitation (Huffman et al., 1995;Nesbitt et al., 2004), enabling a capacity to enhance our understanding and descrip-20 tion of regional-and global-scale water cycles and their spatial and temporal variability. While evaporation represents the key process for returning the Earth's surface water to the overlying atmosphere and represents the linking mechanism between the water and energy cycles, it is only in relatively recent times that effort has been directed towards the development of global products (Mu et al., 2007;Fisher et al., 2008; Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | To address this observation limitation, a number of evaporation modelling approaches have been developed over the past few years to enable estimation at scales beyond the field, using satellite remote sensing (Sheffield et al., 2010;Miralles et al., 2011a) and other data sources (Douville et al., 2013). The models tend to differ in their level of empiricism and in the desired scale of application, with some exclusively de-5 veloped for farm-scale operation and requiring local calibration (Bastiaanssen et al., 1998;Allen et al., 2007). Others have been developed for broader scale application and are built on physical relationships describing the water and energy transfer at the land surface (Norman et al., 1995;Su, 2002;Fisher et al., 2008;Miralles et al., 2011a). While traditional applications of evaporation estimates have been directed to-10 wards agricultural monitoring (Allen, 2000), catchment water budgets and basin-scale water management (Kustas et al., 1994;Granger, 2000), more recent applications of evaporation products have included detection and prediction of heatwaves Miralles et al., 2014a), droughts (Mu et al., 2012;Otkin et al., 2014) and in resolving the likely contribution of human-induced climate change on such events (Greve 15 et al., 2014).
Despite the importance of understanding the magnitude and spatial and temporal variability of evaporation, the availability of long-term products required to do this are rather limited. Characterizing the long-term trends and variability in independent observations of the Earth's coupled water and energy cycles is a key objective of range of diagnostic data sets, LSMs and reanalysis products and showed consistency in inter-annual variations of evaporation products that corresponded well with previous investigations (Jung et al., 2010).
These benchmarking studies provided a thorough (and much needed) assessment of available global evaporation products and the varying approaches used to derive 15 them. However, evaluation of the models for their predictive skill was challenging due to inconsistencies in the forcing data used to drive the models, as well as to the different parameterization schemes employed. That is, the analysis was performed on the published evaporation output, rather than re-running simulations from a common forcing dataset. In these benchmarking studies, the evaporation data sets were also aggre-JPL, the Advection-Aridity model of Brutsaert and Stricker (1979) and a single-source Penman-Monteith (PM) model (Monteith, 1965), using a set of twenty flux towers distributed across a range of biome types and climate zones to force the models with tower-based data directly. Based on common forcing and considering overall results, the study found that PT-JPL was the best performing model, followed by SEBS, PM 10 and Advection-Aridity. In a related contribution,  provided a more focused analysis on the influence of model structure and resistance parameterization on single, two-layer and three-source Penman-Monteith models. The authors identified considerable variability in the performance of models due to their structure and parameterization choices. A parallel effort to the LandFlux project is the European Space 15 Agency (ESA) funded WAter Cycle Multi-mission Observation Strategy for EvapoTranspiration (WACMOS-ET; see http://wacmoset.estellus.eu/). WACMOS-ET, which is focused on an analysis period covering 2005-2007, seeks to better understand the impacts of model structure on flux estimation, with an additional focus on developing a consistent forcing dataset. A key result from these early works and the preliminary out- 20 comes from WACMOS-ET support the finding that no single model or parameterization consistently outperformed any other across different biomes.
While establishing a baseline level of performance at the tower scale is important, understanding the impact of using the large-scale globally-gridded forcing that will ultimately drive the global products is key. The focus of the current investigation is to build 25 upon these past efforts and complement ongoing WACMOS-ET investigations, by simulating state-of-the-art evaporation models using a parallel assessment of tower-based meteorology and gridded data, and comparing results with available eddy-covariance flux observations. Understanding how application of gridded forcing data might influ-6814 ence the performance of the different models, relative to their performance when forced with (presumably) higher-quality tower data, is a motivating rationale for this work. Such evaluations are important in developing insight into the sensitivity of the models to input data uncertainties, provide a relative assessment of model quality and also inform upon issues of spatial scale and footprint mismatch (McCabe and Wood, 2006). Establishing 5 model suitability for large-scale operational application as part of the GEWEX Landflux project is a further motivating goal for this work. As such, a major objective is to evaluate the individual model responses across a range of biomes and climate zones. The models selected for assessment include SEBS, PT-JPL, the Penman-Monteith based Mu model (PM-Mu) (Mu et al., 2011) as well as the Global Land Evaporation: the Am-10 sterdam Methodology (GLEAM) (Miralles et al., 2011a). These models satisfy the key criteria considered for global model selection, which included reliance on a minimum number of forcing variables, capacity to use remote sensing based observations, as well as previous application at either the regional or global scale.

Data
For this analysis, model simulations cover the period from 1997 to 2007 and are performed at a 3-hourly temporal resolution. To examine model response and inter-product variability, a parallel tower-and grid-based analysis was performed. Data for the towerbased analysis are derived from a set of forty-five eddy-covariance towers (see Ta- 20 ble A1), while the gridded data are extracted from a compilation of available globally distributed satellite, meteorological and land surface characteristics products. Compared to the 0.5 degree and 3-hourly gridded data, the use of tower-based forcing is expected to minimize issues related to footprint uncertainties when evaluating simulations against the observed eddy-covariance based flux data. The primary purpose of the key attributes of the selected towers and Fig. A1 describes the varying temporal lengths of the tower records used in this study. The requirement that towers only be used if they are able to provide the input data required by all models (see Table 1) was a strong limiting criterion that significantly reduced the number of available study sites. In particular, the availability of land surface temperature data, which is required for SEBS, 10 drastically constrained the choice of towers. However, ensuring data consistency within the towers used for simulation and assessment was an important component of this work, as it removes the impact of tower bias in subsequent model assessment. Even with this reduced number, the selected towers represent a considerable spatial spread encompassing a variety of biome types and climate zones (see Fig. 1). 15 In terms of forcing data requirements, tower-based variables that were used for model simulations include air temperature, relative humidity, wind speed, net radiation, ground heat flux and precipitation. A summary of the forcing data requirements for each model is provided in Table 1. Land surface emissivity, leaf area index and fractional vegetation cover were estimated from Normalized Difference Vegetation Index (NDVI) data 20 obtained from the Global Inventory Monitoring and Modelling Study (GIMMS) dataset (Tucker et al., 2005), at 8 km spatial and bi-monthly temporal resolutions. Here, the emissivity was calculated using the approach of Sobrino et al. (2004), leaf area index was estimated following Fisher et al. (2008) and the fractional vegetation cover was estimated using the technique described in Jiménez-Muñoz et al. (2009). Land sur- 25 face temperature was calculated using tower-observed longwave upward radiation and by inverting the Stefan-Boltzmann equation (Brutsaert, 2005). Atmospheric pressure data, which are absent from many towers, were calculated based on ground elevation 6816 sis of the SEBS model , showed that h c was amongst the least sensitive variables in SEBS, reducing concerns on the impact of this assumption. Tower data were aggregated (i.e. summed for precipitation and averaged for other input variables) from their native resolution of half-hourly or hourly to 3-hourly, to match the temporal resolution of the gridded data.

Description of grid-based forcing data (LandFlux Version 0 forcing dataset)
Grid-based data were developed by Princeton University for the LandFlux Version 0 (V-0) dataset. The variables in the V-0 include air temperature, land surface temperature, wind speed, atmospheric pressure, specific humidity, precipitation, net radiation, 15 NDVI and leaf area index. Net radiation data derive from the GEWEX Surface Radiation Budget (SRB) Version-3 (Stackhouse et al., 2011), while land surface temperature is determined by employing a Bayesian post-processing procedure that merges High-Resolution Infrared Radiation Sounder (HIRS) retrievals with the land surface temperature data from the National Centers for Environment Prediction (NCEP) Climate

20
Forecast System Reanalysis (CFSR) (Saha et al., 2010), as described in Coccia et al. (2015). Precipitation data are also from the NCEP CFSR product and have been bias-corrected to the Global Precipitation Climatology Project (GPCP) V2.2 dataset (Adler et al., 2003). Atmospheric pressure, specific humidity and wind speed data were also based on the CFSR reanalysis data. For vegetation based parameters, NDVI data Introduction tive 5 arc-minute resolution for tower-based analysis, but were aggregated to 0.5 • for grid-based assessment. Vegetation optical depth data was from Liu et al. (2011b) using a merged product from multiple microwave based satellite data. The 0.25 • spatial and daily temporal resolutions VOD data were gap-filled as described by Miralles et al. (2011a). Soil moisture data assimilated in GLEAM comes from the CCI-WACMOS  Table A1) located in the southern hemisphere. Both GlobSnow data and the NSIDC product are at approximately 0.25 • spatial and daily temporal resolutions. Lightning frequency data is based on the Combined Global Lightning Flash  (Mach et al., 2007) and it is used to calculate a climatology of rainfall rates (Miralles et al., 2010). Finally, vegetation cover fractions are derived from the MODIS MOD44B product (Hansen et al., 2005). The MODIS continuous cover factions describe every pixel as a combination of its fractions of water, tall canopy, short vegetation and bare soil. The temporal average of fractions is used 5 here for the MODIS period, providing only a static cover fraction for the GLEAM simulations. The MOD44B product is available at 250 m and 0.25 • resolution. For tower-based analysis, cover fractions are at 250 m resolution, but for grid-based analysis the 0.25 • MOD44B product was aggregated to 0.5 • . Table A1 summarizes the different sources and spatiotemporal scales of the data 10 that were used for both the tower-and grid-based flux simulations. As noted earlier, the temporal analysis encompasses the period 1997-2007, although as defined in Fig. A1, the individual tower records do not necessarily provide uninterrupted observations during this time range. 15 The specific biomes examined in this work include wetland (WET), grassland (GRA), cropland (CRO), shrubland (SHR), evergreen needleleaf forest (ENF), evergreen broadleaf forest (EBF) and deciduous broadleaf forest (DBF). For simplicity, the shrubland biome is comprised of closed shrubland, woody savannah and mixed forest biomes. The number of towers for each biome type varies, with fourteen for evergreen 20 needleleaf forest, ten for grassland, seven for cropland, seven for deciduous broadleaf forest, four for shrubland, two for wetland and only one for evergreen broadleaf forest (see Table A1). The climate zones include boreal (BOR), sub-tropical (subTRO), temperate (TEMP), temperate-continental (TempCONT) and dry (DRY) for arid and semi arid regions. As with biome type, the towers are not evenly distributed across 25 climate zones, with fifteen for temperate, eleven for sub-tropical, eight for temperatecontinental, five for boreal and six for dry regions (see Table A1). Introduction

LandFlux Model Descriptions
Following are brief descriptions of the models employed in this analysis. For a more comprehensive explanation of the implementation of these different schemes, the reader is referred to the principal model references as well as the recent contributions of Ershadi et al. (2014.

SEBS
SEBS is a widely-employed process-based model used in the estimation of evaporation. The model uses a variety of land surface and atmospheric variables and parameters for simulating the transfer of heat and water vapor from the land surface to the atmosphere. To do so, the model first estimates the representative roughness of the land surface and then uses roughness parameters, temperature gradient and wind speed data to estimate sensible heat flux via a set of flux-gradient equations describing the transfer of heat from the land surface to the atmosphere. Depending on the atmospheric boundary layer height, the model uses either the Monin-Obukhov Similarity Theory or the Bulk Atmospheric Similarity Theory equations (Brutsaert, 2005). 15 The model estimates the sensible heat flux of hypothetically wet and dry conditions and uses these extreme-cases to calculate the evaporative fraction. Evaporation is then calculated as a fraction of the available energy. The model requires accurate values of net radiation, land surface temperature, air temperature, humidity, wind speed and vegetation phenology to calculate surface fluxes. SEBS relaxes the need for parameterization 20 of the surface resistance, but is sensitive to aerodynamic resistance parameterization (Ershadi et al., 2013). Further details on SEBS and its model formulation can be found in Su (2002). 6821

PT-JPL
The PT-JPL model of evaporation uses a minimum of meteorological and remote sensing data and has been employed in a number of studies to estimate regional and global scales flux response (Fisher et al., 2008;Sahoo et al., 2011;Vinukollu et al., 2011a, b;Badgley et al., 2015). A key characteristic of the model is the use of bio-physiological 5 properties of the land surface to reduce Priestley-Taylor potential evaporation to actual values. The PT-JPL is a three source model in which the total evaporation is partitioned into soil evaporation (λE s ), canopy transpiration (λE t ), and wet canopy evaporation (λE i ), i.e. λE = λE s + λE t + λE i . The model first partitions the total net radiation to soil and vegetation components and calculates potential evaporation for soil, for canopy and for the wet canopy. The model then determines a set of constraint multipliers to represent the impacts of green canopy fraction, relative wetness of the canopy, air temperature, plant water stress and soil water stress on the evaporative process. The model uses the constraint multipliers to reduce the potential evaporation to actual values for each component of the system. PT-JPL does not calibrate or tune parameter 15 values and does not use wind speed data or parameterizations of the aerodynamic and surface resistances. However, the model does require accurate estimates of optimum temperature (T opt ) (Potter et al., 1993) for canopy transpiration. The optimum temperature is the air temperature at the time of peak canopy activity, when the highest values of absorbed photosynthetically active radiation and minimum values of vapour pressure 20 deficit occur. Further details of the PT-JPL model can be found in Fisher et al. (2008).

PM-Mu
The PM-Mu was developed from a two-source Penman-Monteith implementation (Mu et al., 2007) to a three-source version (Mu et al., 2011) and forms the basis behind the near real-time estimation of global evaporation in the MOD16 product (Mu et al., 2013). 25 Evaporation in the PM-Mu model is the sum of soil evaporation, canopy transpiration and evaporation of the intercepted water in the canopy, i.e. (λE = λE s + λE t + λE i ). Esti-mation of evaporation for each component is based on the Penman-Monteith equation (Monteith, 1965), but weighted based on the fractional vegetation cover, relative surface wetness and available energy. Parameterization of aerodynamic and surface resistances for each source is based on extending biome specific conductance parameters from the stomata to the canopy scale, using vegetation phenology and meteorological 5 data. In contrast to the majority of Penman-Monteith type of models, the PM-Mu does not require wind speed and soil moisture data for parameterization of resistances. However, globally application of the model requires consideration of the fact that resistance parameters were calibrated against data from a set of eddy-covariance towers. Further details on PM-Mu can be found in Mu et al. (2011Mu et al. ( , 2013. Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | ysis for these gap-filled periods. Further details on GLEAM can be found in Miralles et al. (2011a, b).

Model Simulation and Analysis
The four selected models were forced with both tower-and grid-based data. The results were then filtered for daytime-only periods, defined as when the shortwave downward 5 radiation exceeds 20 W m −2 , to avoid issues associated with negative net radiation and nighttime condensation. The data were also filtered for rain events, for negative sensible and latent heat flux observations, for low quality or gap-filled tower records, for frozen land surfaces and for times in which air temperature was less than or equal to 0 • C. The performance of the models was evaluated for individual towers, for the collection of data from all towers, for towers classified across biome types and for towers classified across climate zones.
To evaluate the skill of the models, we used traditional scatterplots and common statistical metrics including the coefficient of determination (R 2 ), slope (m) and y intercept (b) of the linear regression, the root-mean-square difference (RMSD), relative 15 error [RE = RMSD/mean(λE obs )] and the Nash-Sutcliffe Efficiency (NSE) (Nash and Sutcliffe, 1970). In developing these performance metrics, simulated evaporation was compared with tower-observed evaporation (λE obs ) that were corrected for non-closure using the energy residual technique, as described in Ershadi et al. (2014). Scatterplots of matching percentiles (referred to hereafter as percentile plots) of observed evapora-20 tion versus simulated values from the 1st to 99th percentile increment were also used (Sect. 3.1). The 25th percentile (Q 25 ), median (Q 50 ) and 75th percentile (Q 75 ) were used for further model assessment. To establish the response of the models to water availability at individual tower sites, we calculated an aridity index as AI = P/E p , with P the annual precipitation (mm yr −1 ) and E p the annual evaporation ( ues and an average value was calculated to represent the state of water availability at specific tower locations. GLEAM seem to present as more robust candidate models for estimation of evaporation, at least in terms of their statistical response at the tower scale. All models show a large spread around the fitted linear regression line, due perhaps to variability in performance as a response to the land surface type or climate condition. The inter-tower variability of the models is an important element of this work and will be discussed 25 further in the following sections. The effect of using globally-gridded forcing data on the evaporation models is presented in Fig. 3. Apart from providing a direct evaluation on the accuracy of the global LandFlux product, assessing flux response to a change in forcing aids in diagnosing the model sensitivity to data uncertainties. Likewise, an indirect assessment of the issue of footprint mismatch between the gridded data (0.5 • ) and the eddy-covariance 5 tower (hundreds of meters) can also be inferred. Figure 4 clearly shows that use of the grid-based data reduces the performance of all models relative to the tower-based runs, with all statistics degrading with a change in forcing resolution. SEBS displayed the largest sensitivity to forcing data, with a 0.4 decrease in NSE and a 28 W m −2 increase in RMSD. The sensitivity of PT-JPL and GLEAM to the use of gridded data was  Overall, these results confirm that all models display a relatively high sensitivity to changes in the type and quality of input forcing data. While gridded forcing data are expected to have a mismatch with the tower-based forcing due to their larger pixel (and footprint) sizes, this spatial mismatch will impact all of the applied models, albeit to a lesser or greater extent, depending on forcing data requirements. While spatial 20 scale no doubt plays a major role in decreasing model efficiencies at grid-scales, the most likely reason for the differences in tower-versus grid-based results relates to internal inconsistencies within the gridded forcing data. For instance, SEBS is known to be particularly sensitive to the temperature gradient between the land surface and the atmosphere (van der Kwast et al., 2009;Ershadi et al., 2013). While the tempera-Introduction sensitivity to changes in the forcing. As such, any inconsistency between the tower and gridded data is likely to have less influence on the PT-JPL, GLEAM and PM-Mu models than it will on SEBS, which in addition to vegetation height, requires both land surface temperature and wind speed data: two variables with considerable spatial variability. The large spread of data in the scatterplots indicates that there is considerable vari-5 ability in the performance of the models at individual towers, irrespective of whether tower or gridded data are used. Of course, it may also be indicative of systematic biases in in-situ data, which vary from one tower to another and subsequently impact on model spread: however, this is non-trivial to determine. To investigate the nature of this variability, we extend the analysis by developing time series of R 2 , RE and NSE at 10 3-hourly resolution for individual tower locations, as shown in Fig. 4. To examine performance as a function of hydrological condition, the towers are arranged by degree of increasing aridity, as determined by calculation of an aridity index (see Sect. 2.3), with left-to-right representing the transition from wet-to-dry and describing an aridity index varying between 1 and 0.

15
From Fig. 4 it can be observed that there is a general downward trend in both R 2 and NSE as aridity increases, with a slight upward trend reflected in RE. In terms of R 2 , most of the models except for PM-Mu show some consistency in performance until an aridity index of around 0.7, wherein models start to diverge. Similar agreement is seen in the relative error plot, although the outlier here is SEBS, which shows vari-20 able performance unrelated to aridity changes. Examining the Nash-Sutcliffe efficiency allows for a clearer evaluation of model response to be obtained. For this metric, PT-JPL and GLEAM display relatively good correspondence for most of the towers, but start to diverge more regularly for aridity indices below 0.8. Overall, PT-JPL presents a marginally better response than GLEAM, with higher values of NSE and R 2 and lowest 25 values of RE produced across the majority of towers. From Fig. 2  parable to those of PT-JPL and GLEAM, or even higher for a majority of towers that have an aridity index less than 0.7. Inspection of individual tower-based scatterplots for each of the models (not shown) illustrated that while the SEBS evaporation has a strong linear relationship with observed values for a majority of towers, the linear regression line exhibits a large slope, indicating an overestimation in SEBS predictions.

5
Those towers that exhibit drops in NSE (and rise in RE) for the SEBS model (e.g. DE-Tha, NL-Loo, US-Wrc, FR-Pue; see Table A1) are located mainly in shrubland and forest biomes, suggesting a dependency of SEBS model performance that is tied to land surface vegetation characteristics. Although statistical variations are evident in all models, the greater response variability in SEBS is likely due to problems in simulat- 10 ing heat transfer within the roughness sub-layer (RSL), which often forms over tall and heterogonous land surfaces (Harman, 2012). We explore the issue of skill dependency of certain models to biome type and climate zone in Sect. 3.2 and 3.3. As noted, Fig. 4 shows a general decrease in the predictive skill in all models where towers have an aridity index less than 0.7, but particularly so for PM-Mu and SEBS. 15 These reductions may in part be due to data uncertainties in tower observations that originate from the advection of dry air into the tower footprint, or to a reduced capacity of the models to reproduce the evaporative response when evaporation represents a small fraction of the total available energy. Two towers at which all models display poor performance are IT-Noe and IL-Yat (see Fig. 1). It seems likely that IT-Noe is 20 influenced by strong advection of moist air from the Mediterranean Sea, while IL-Yat is influenced by advection of hot and dry air from surrounding desert regions. None of the models in this study are able to specifically account for advection and are thus prone to misrepresenting the observed evaporative response. 25 The variability in model performance across the tower sites observed in Fig. 4 indicates that a biome-specific assessment could be useful to determine whether the performance of the models is also correlated to the underlying land cover, in addition to 6828 the seven different biome classes. The analysis was conducted using the higher quality tower-based simulations for all available 3-hourly data. One immediate highlight from Fig. 5 is the relatively poor performance of all models over shrubland sites, where low values of NSE (i.e. NSE ≤ 0.05) and reduced R 2 can be observed. Ershadi et al. (2014) 5 observed a similarly poor response over shrublands in a separate tower-based analysis that employed some of the same models examined here. They attributed the result to difficulties in the parameterization of the models over such landscapes due to the strong heterogeneities present in these environments, as well as inherent water limitations. For instance, the capacity of the GIMMS NDVI data with 8 km spatial resolution is clearly insufficient in effectively parameterizing the roughness for SEBS, resistances for PM-Mu and constraint functions for the PT-JPL. Excluding shrublands from the analysis, the PT-JPL is one of the best performing models across the remaining biomes, having the highest values of NSE and R 2 and lowest relative errors. Consistency in the performance of PT-JPL across biome types 15 has been reported in earlier studies (Vinukollu et al., 2011a;Ershadi et al., 2014) and was variously ascribed to the formulation of its constraint functions (see Sect. 2.2.2) and the minimal forcing data requirements, which reduce its sensitivity to uncertainties in input data. GLEAM closely follows PT-JPL for evergreen needleleaf forest and grassland biomes, but shows marginally lower NSE values for other biomes. Figure 5 20 also indicates that while SEBS has relatively high values of R 2 over the majority of biome types, it fails to provide sufficient predictive skill for the estimation of evaporation over shrublands and forest biomes. These biome types are characterized by tall and heterogeneous canopies, within which the roughness sub-layer forms. The reduced capacity of the SEBS flux gradient functions in simulating heat transfer within 25 the roughness sub-layer has been highlighted previously (Weligepolage et al., 2012;Ershadi et al., 2014  , but the key reasons for low R 2 values of the model across other biomes is not immediately apparent and requires further investigation.

5
Percentile plots of the 3-hourly tower-based results were used to identify whether a model under-or over-estimates evaporation across its distribution function. From Fig. 6 it can be seen that SEBS clearly overestimates while PM-Mu underestimates evaporation across all biome types, reflecting those results presented in Fig. 2. The percentile plots for SEBS are close to the 1 : 1 line for grassland and cropland biomes 10 that have short canopy height, confirming the observations made for Figs. 4 and 5. PT-JPL shows good model reproduction of observed values over grassland and deciduous broadleaf forest biomes, with the percentile plots close to the 1 : 1 line. However, the model slightly underestimated evaporation for croplands and overestimated evaporation for wetlands, with the tails (percentiles greater than Q 75 ) reflecting greater 15 divergence than the bulk of the distribution. The rate of overestimation was higher for evergreen needleleaf forest, evergreen broadleaf forest and for shrubland biomes. Figure 6 also shows that GLEAM presents strong performance over grasslands, croplands and evergreen needleleaf forest sites, but underestimated evaporation across deciduous broadleaf forest sites and tended to overestimate evaporation across the remaining 20

biomes.
Overall, all models show a tendency towards reduced performance when applied over forest biomes, but improved performance over shorter canopies. These results may be reflecting the fundamental physical basis behind approaches such as the base Penman-Monteith (Penman, 1948), Priestley-Taylor (Priestley and Taylor, 1972) and 25 Monin-Obukhov flux gradient functions, which were developed for such surface types (Brutsaert, 1982), highlighting the challenges inherent in global scale application of such models, especially over diverse land cover types. To further evaluate the influence of biome type on evaporation estimation and to discriminate the role of individual forcing variables in impacting model efficiencies, the NSE and R 2 values between tower-and grid-based data were calculated for the flux response, as well as for key forcing variables such as net radiation, land surface temperature, air temperature, wind speed, specific humidity, fractional vegetation cover 5 and leaf area index. As can be seen in Fig. 7, agreement between tower-based and grid-based net radiation data is relatively high across all biomes, but especially so over forest biomes (NSE ≥ 0.67). Grid-based wind speed data have the most variable agreement with tower data, with R 2 and NSE values generally lower than other selected variables across all of the examined biomes. Air temperature shows good agreement, with 10 both high NSE values (NSE ≥ 0.7) and high R 2 values (R 2 ≥ 0.84). Specific humidity data are also well reproduced (NSE ≥ 0.72), as is land surface temperature with an NSE ≥ 0.80 for all biomes. In sharing a common GIMMS-NDVI based derivation, the agreement for fractional vegetation cover and leaf area index data is reasonable over the majority of biomes, except over evergreen broadleaf forest, where both the R 2 and 15 NSE are low. The lower panel of Fig. 7 show R 2 and NSE values for both the tower-and gridbased simulations against eddy-covariance observations for each of the models, discriminated by biome type. As can be seen, the performance of all models is reduced across all biomes when grid-based forcing data is used, a result reflected in all cases 20 by relatively lower NSE and R 2 values. PM-Mu had the smallest and SEBS had the largest decrease in performance over a majority of the biomes, in accordance with the findings of Sect. 3.1. PT-JPL and PM-Mu had a relatively constant decrease in NSE and R 2 for the grid-based simulations. Decreased modelling performance was also maintained for GLEAM, except over the single evergreen broadleaf forest tower, where 25 a more significant departure (relative to the other biome types), was observed. SEBS showed a much larger variability in performance reduction, with smaller variations due to forcing over forest biomes and larger reductions over biomes with shorter canopies. The significant decrease in NSE for SEBS over grassland, cropland and to some extent Introduction the wetland biome, cannot be immediately associated with NSE or R 2 changes in any of the forcing variables. It is interesting that the agreement over grassland and cropland biomes between tower-and grid-based variables is amongst the highest (especially for wind speed, fractional vegetation cover and for leaf area index data), yet the subsequent model performance is among the worst. The use of global statistics to evaluate 5 model response makes discriminating the cause of this variability difficult. It is possible that the statistics are biased low due to the influence of one or a few individual towers, by errors in the forcing fields driving model parameterizations (i.e. vegetation height) or in response to model sensitivities to particular forcing variables. Either way, these results highlight the difficulties in diagnosing the cause of performance response and related sensitivity to forcing data variables in complex process-based models, which often display a high degree of interactions between the variables. Indeed, diagnosing the forcing variables responsible for reducing the efficiency of particular models is not feasible with a simple correlation analysis of the input data fields, but requires a separate and focused sensitivity analysis.

Performance of the Models over Climate Zones
Similar to the biome-wise analyses, an evaluation of the models was conducted across a number of distinct climate zones, with R 2 , RE and NSE values for tower-based 3hourly evaporation estimations shown in Fig. 8 performance over the temperate-continental climate zone, with high NSE and R 2 and low RE, which was followed closely by the temperate climate zone. The lowest overall performance for all models corresponded to the dry climate zone, again reflecting the aridity based results in Fig. 4. As discussed in Sect. 3.1, data uncertainties due to the role of advection in dry regions and difficulties in the accurate estimation when 5 confronted with low evaporative fractions are likely reasons behind such performance reductions in dry regions. Figure 9 displays the corresponding percentile plots of model performance over the five different climate zones. As can be seen, PT-JPL and GLEAM provide generally good performance over all climate zones, although GLEAM slightly underestimates 10 evaporation for temperate-continental and boreal climate zones. SEBS overestimates relative to tower-based evaporation across all biomes, while PM-Mu generally underestimates, except over temperate and temperate-continental climate zones, for which the percentile plot of PM-Mu are relatively close to the 1 : 1 line. Similar to Fig. 7, Fig. 10 outlines the model response differentiated for the differ-15 ent climate zones when using grid-based forcing data. As can be seen from the lower panel, the simulation performance is reduced across all climate zones, relative to the tower data. In particular, SEBS is significantly impacted across the majority of climate zones, with both a reduction in NSE and R 2 , except over boreal forests. One possible reason for this smaller variation over boreal forests could be due to lower surface-to-air 20 temperature gradients over forests, which contributes to smaller sensible heat fluxes and consequently larger evaporative fraction values (in contrast to model performance over dry climates, where the temperature gradient is large). Nevertheless, the relationship between uncertainty in individual variables and the reduction of modelling performances is not able to be determined here. Further analysis examining the sensitivity 25 of individual models to their forcing is required.

Discussion
Understanding the role of model forcing in influencing simulation results, as well as examining the impacts of biome type and climate zone on flux response, are important elements in the development of robust globally-distributed evaporation products. The focus of this study was on evaluating a set of process-based models, to support 5 the development of globally distributed and long term observations of surface fluxes as part of the GEWEX LandFlux project. Overall, the PT-JPL and GLEAM models provided the most consistent performance, while PM-Mu tended to underestimate and SEBS overestimate evaporation relative to the forty-five eddy-covariance tower observations examined here. However, while statistical analysis allows a pseudo-ranking 10 of model performance, more detailed evaluation across towers, and biome and climate types highlighted the considerable within-model variability in performance. Results also demonstrated that changing the scale of input forcing data from tower-to grid-based reduced the quality of model estimates in all cases, but especially for SEBS, where a sensitivity to surface-air temperature gradients plays a strong role. In the following, we 15 examine these results and interpret any implications for large-scale global applications. With its relatively simple modelling structure, PT-JPL performed consistently well relative to the other models that have more complex structures and parameterization configurations. One possible reason for this response may relate to the constraint functions of PT-JPL serving a wide range of hydro-meteorological conditions, encompass-20 ing energy-limited (e.g. boreal climate) to water-limited (e.g. dry climate) conditions. The good performance of PT-JPL was also observed in a recent multi-model evaluation study, with a summary of the strengths and limitations of the model presented in Ershadi et al. (2014). GLEAM also performed well, both at the tower and at the gridscale (see Figs. 4 and 2). Previous studies have shown that the model is sensitive 25 to the accuracy of precipitation data (Miralles et al., 2011b), as this determines the partitioning of intercepted evaporation in the model and the root-zone soil moisture. Unfortunately, testing for such sensitivities was not possible here, as both tower-and Introduction grid-based records were filtered for rainfall events in post-processing steps, in response to the limitation of eddy-covariance observations during such events. In terms of the NSE, R 2 and RE, PM-Mu followed PT-JPL and GLEAM, with the model tending to underestimate evaporation when applied to most of the tower-and grid-based records. While reasons for this underestimation are not immediately clear, 5 a recent study examining the structure and parameterization of Penman-Monteith type models  showed that the PM-Mu, which has a three-source structure, underperformed relative to a single-source (Monteith, 1965) and a two-layer approach (Shuttleworth and Wallace, 1985) across all studied biome types, except croplands. An interesting aspect of  was that application of the canopy transpiration resistance scheme of the PM-Mu in those simpler models improved their prediction skills. As such, the reduced performance of the PM-Mu predictions might relate to underlying structural and parameterization issues in the model. As the operational model behind the generation of the current MOD16 global evaporation product (Mu et al., 2013), further studies to diagnose the cause of these responses are re-15 quired.
In terms of assessment against the tower-based eddy-covariance observations, SEBS performed relatively poorly in most statistical metrics when compared to the other models, as it overestimated evaporation across a majority of studied biomes and climate zones, except over grasslands and cropland sites with short canopies (e.g. 20 less than 3 m). Interestingly, even though generally over-predicting results, it had one of the highest R 2 values, indicating good correlation with the eddy-covariance observations. Findings from Ershadi et al. (2014) confirm the good performance of the model over short canopies and its lack of performance over shrublands and forests. In terms of performance against underlying biome type, it was observed that any performance 25 reduction was observed mainly across shrublands and forest biomes, where the roughness sub-layer forms above the canopy (Harman, 2012). Importantly, the flux-gradient functions of the SEBS model are not parameterized to effectively simulate the heat transfer process in the roughness sub-layer, and hence the model fails to perform well Introduction  (Weligepolage et al., 2012). The reliance of SEBS on an accurate representation of the surface-air temperature gradient also limits the effectiveness of the model for global application, demanding improvements in characterizing the spatial and temporal representativeness of such variables. It is apparent from Sect. 3.2 and 3.3 that the application of gridded data for modelling 5 evaporation inevitably reduces the predictive performance of all models, regardless of their complexity in the evaporation process or their economy in forcing data requirements. In fact, the footprint mismatch between the tower-and grid-based simulations is likely to increase uncertainties in the forcing data and cause discrepancies between the simulated and tower-based evaporation values. Importantly, comparing the models 10 for their relative performance (see Figs. 7 and 10) reveals that the performance decrease for grid-based analysis was not equal amongst all of the models. For instance, SEBS was observed to be more sensitive to the use of gridded forcing data, most likely as a result of inconsistencies in temperature gradient fields: an aspect that has been noted previously (van der Kwast et al., 2009;Ershadi et al., 2013). Although input un- 15 certainty also impacts the performance of PT-JPL, PM-Mu and GLEAM, the NSE and R 2 of gridded simulations for those models are closer to their tower-based counterparts. Apart from indicating a robust model structure, the reduced impact seen in these schemes may also be a consequence of avoiding the use of forcing data such land surface temperature and wind speed data, which are known to be uncertain at both the 20 grid and tower-scale. Regardless of the culprit behind the observed performance discrepancy between tower and grid-based simulations, it is clear that some models are better suited to global application than others -at least given the quality of currently available global forcing datasets. Importantly, the results presented in Sect. 3.2 and 3.3 showed that evaluating tower or grid-based statistical responses alone is not enough 25 to identify those forcing variables most impacting model performance. Diagnosing forcing sensitivity is not trivial given non-linearities in the models and the high level of interaction within model variables and parameters. Indeed, caution is warranted in any approaches seeking to evaluate evaporation models using gridded data in isolation, as this is likely to yield unreliable performance metrics of the models. It is important to perform a parallel tower-based data assessment to increase confidence in any single models performance  in any evaluation approach, particularly those occurring at global scale,.
Although the largest possible set of eddy-covariance towers and a common set of 5 forcing data was used to evaluate the different model simulations, there are still inevitable limitations in the evaluations. Identifying such limitations is important not only for the current evaluations, but also in guiding future contributions. One such example relates to the period of tower data used for evaluation in this study (see Fig. A1), as the data record length varies amongst the towers and the data are not uniformly 10 distributed across seasons. Moreover, the towers are not evenly distributed across the studied biomes and climate zones (see Fig. 1, Table A1), with only one tower covering the entire evergreen broadleaf forest biome and two towers covering the wetland biome. Further, no towers were available for use in arctic and tropical climate zones.
Although the tropical climate zone, especially Amazonian forests, is accounted as a 15 critical component in studies of the global water and energy cycles (Chahine, 1992;Wohl et al., 2012), relatively few towers in this zone provide land surface temperature and longwave upward radiation data needed for the SEBS model. An additional limitation is the coarse (8 km) spatial resolution of the GIMMS NDVI data used in the models for the tower-based analysis, as this resolution certainly does not correspond 20 with the footprint of eddy-covariance sensors at any of the towers. Developments towards improving the availability and access to long-term high-resolution Landsat images (e.g. via Google Earth Engine; https://earthengine.google.org) might be one way to improve model forcing and evaluation exercises, especially with the development of high-resolution vegetation products (Houborg et al., 2015). 25 While the accuracy of individual variables in the LandFlux dataset were enhanced by bias correction against independent data sources (see Sect. 2.1), diagnosing the internal consistency of the data fields (McCabe et al., 2008), especially for air temperature, land surface temperature, wind speed and humidity, is a concept that has not received 6837 Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | much attention to date and demands more considered investigations and analysis. Internal consistency is an extremely challenging objective, but is critically important for flux estimation, where so many different forcing data are required. Essentially it demands that all required model data are derived from a common set of forcing variables, rather than by the standard approach of compilation based on availability and acces-5 sibility. The most illustrative example would be in the development of radiation data, derived here from NASA-GEWEX SRB sources (Stackhouse et al., 2011). Calculation of radiation components requires air temperature, surface temperature, land surface and vegetation features, as well as numerous other elements. However, these underlying variables are rarely if ever retained to provide a consistent overall forcing data set (i.e. the meteorological variables used in producing the SRB data are not subsequently used to drive the models). Interdependencies in forcing affect many variables in the estimation of evaporation, yet products are not developed with this simple consistency principle in mind. Apart from introducing further biases and uncertainties into model simulations, until such consistency is attained, discriminating between the impact of 15 forcing versus the model sensitivity to that forcing will remain extremely challenging.
From one perspective, the performance of the evaporation models examined here seems relatively poor, even when they are forced with high-quality tower-based data. PT-JPL, which was identified as one of the most consistent and best performing models, still presented a relative error of 41 %, with errors for GLEAM, PM-Mu and SEBS 20 of 43, 52 and 72 %, respectively. However, it is important to recognise that tower-based evaluation is perhaps the strictest measure of model performance and comes with its own caveats. One question that remains outstanding is whether it is even appropriate to expect models run with large-scale gridded forcing to replicate the small-scale response observed by eddy-covariance towers. The alternative perspective, given in- 25 herent uncertainties in forcing, observations and specification of model parameters, is that these results are encouraging. Perhaps broader scale metrics such as hydrological consistency (McCabe et al., 2008), catchment based assessments or water budget closure approaches would provide a better guide (Sheffield et al., 2009) and indeed, 6838 such evaluations will need to be performed. These questions highlight the difficulties in not just producing global estimates, but perhaps more importantly, in evaluating their quality.
The observed variability of modelling performance across the studied biomes and climate zones implies that caution is required in advocating any single model for large-5 scale or global application. These results reflect previous findings that any one modelling approach is incapable of accurately reflecting the range of flux responses occurring across diverse landscapes (Ershadi et al., 2014. One possible solution to address this inherent model limitation is to assemble a mosaicked product based on the predictive skill of the model(s) over particular biomes or climate zones. Another ap-10 proach might be to develop an ensemble product using a suitable multi-model blending technique, such as a Bayesian Model Averaging approach (Hoeting et al., 1999;Yao et al., 2014). Either way, it is clear that further multi-model assessments are required for progressing global scale flux characterisation and to ensure a robust and representative product is developed.

Conclusions
It is something of a contradiction that the global-scale estimation of surface fluxes is both straightforward and extremely challenging at the same time. It is more straightforward than ever due to the availability of needed forcing data from various sources, such as numerical weather prediction or other operational products, as well as the 20 increased development of global satellite based datasets. However, the comparative ease with which products can be developed belies the difficulties in actually developing robust and coherent simulations. Uncertainties in the use of internally inconsistent forcing data, the influence of untested model parameterizations over different land surface and climate types, violation of model assumptions in their graduation from the local 25 scale to global scale and the perennial question on how to best evaluate model output all seek to confound global flux efforts.

6839
The evaluation of four process-based evaporation models as part of the GEWEX LandFlux project, undertaken here over a range of biome types and climate zones, highlighted the variable performance and verified the sentiment that no single model is able to consistently outperform any other. While individual model results at the tower scale allowed for a relative performance ranking, the overall model errors when con-5 sidered globally were high. Of those models assessed here and being considered as potential candidates for a GEWEX-LandFlux product, PT-JPL and GLEAM represent the most likely schemes for providing consistent simulation response over a range of biome and climate types. In a challenge for the development of more accurate global flux products, application of gridded data reduces the performance of all models, even 10 if the overall performance ranking does not change between simulation runs. Such a response has obvious implications when model simulations at the continental and global scales are increasingly required in many applications and where not only the forcing data have large uncertainties, but also the underlying assumptions of the models themselves are likely to be questioned. Further investigations on the reasons for such 15 variable performance and ways to offset the inherent uncertainties in global forcing are required. Additional research is also needed to improve the structure and parameterization of some of these candidate models, to understand model sensitivities to forcing by conducting a thorough sensitivity analysis and to develop and implement an appropriate ensemble modelling and merging technique that takes advantage of individual 20 model performance over defined regions. Further detailed comparisons against estimates from more complex modelling systems, such as reanalysis, are also necessary tasks for future investigations.

Variable
Tower-based Grid-based Model Air temperature Tower data aggregated to 3-hourly LandFlux data at 0.5 • and 3-hourly All models Humidity Tower-based relative humidity converted to specific humidity and aggregated to 3-hourly Specific humidity from LandFlux data at 0.5 • and 3-hourly All except GLEAM Pressure Calculated as a function of ground elevation LandFlux data at 0.5 • and 3-hourly All models Net radiation Tower data aggregated to 3-hourly LandFlux data from SRB v3 at 1 • and 3-hourly All models Ground heat flux Tower data aggregated to 3-hourly Calculated from net radiation and fractional vegetation cover data, 0.5 • and 3-hourly