Ecosystems are important and dynamic components of the global carbon cycle,
and terrestrial biospheric models (TBMs) are crucial tools for furthering our
understanding of how terrestrial carbon is stored and exchanged with the
atmosphere across a variety of spatial and temporal scales. Improving TBM
skill, and quantifying and reducing the uncertainties in TBM estimates,
pose significant challenges. The Multi-scale Synthesis and Terrestrial Model
Intercomparison Project (MsTMIP) is a formal multi-scale and multi-model
intercomparison effort set up to tackle these challenges. The MsTMIP
protocol prescribes standardized environmental driver data that are shared
among model teams to facilitate model–model and model–observation
comparisons. This paper describes the global and North American
environmental driver data sets prepared for the MsTMIP activity to both
support their use in MsTMIP and make these data, along with the processes
used in selecting/processing these data, accessible to a broader audience.
Based on project needs and lessons learned from past model intercomparison
activities, we compiled climate, atmospheric CO
The need to understand and quantify the role of terrestrial ecosystems in
the global carbon cycle and its climate change feedbacks has been driving
the development of global terrestrial biogeochemistry and biogeography
models since the late 1980s (Foley, 1995). Since that time, the carbon cycle
science modeling community has continued to improve understanding of
terrestrial ecosystems in global and regional carbon cycling (US CCSP,
2011).
One strategy for doing so has been through several multi-model
intercomparison projects (MIPs) conducted starting in the 1990s. The
Vegetation–Ecosystem Modeling and Analysis Project
Trends in net
land–atmosphere carbon exchange
Major challenges still remain, however, especially in developing approaches for evaluating model predictions and assessing the uncertainties associated with model estimates (e.g., Randerson et al., 2009; US CCSP, 2011; Schwalm et al., 2013). The challenges associated with representing terrestrial ecosystem fluxes of carbon dioxide are illustrated by the large variability in model predictions observed as part of the recent North American Carbon Program (NACP) regional and site interim synthesis activities (e.g., Huntzinger et al., 2012; Schaefer et al., 2012). The results from these activities confirmed the large uncertainties associated with our ability to represent terrestrial ecosystem carbon fluxes, but the reliance of the regional synthesis on off-the-shelf simulations without a prescribed protocol or standardized driver data sets limited the degree to which the observed variability could be attributed to specific sources of uncertainty.
Four types of uncertainty drive differences between predictions of terrestrial carbon flux (e.g., Enting et al., 2012): uncertainty associated with (1) the choice of driver data, (2) parameter values, (3) initial conditions, and (4) the choice of processes to include and how these processes are represented within the model (i.e., structural uncertainty). Estimating and reducing these uncertainties are both critical to improving model performance, and consequently to understanding the role of terrestrial ecosystems in the global carbon cycle.
In response to this need, the Multi-scale Synthesis and Terrestrial Model Intercomparison Project (MsTMIP) was established to build on previous and ongoing MIPs to provide a consistent and unified modeling framework to interpret and address structural and parameter uncertainties. Huntzinger et al. (2013) discuss the philosophy of MsTMIP and how past and ongoing MIP activities impacted and inspired its design. Similar to VEMAP, the Potsdam NPP MIP, and GCP-TRENDY, MsTMIP prescribes standardized environmental driver data and a consistent spin-up protocol for all model simulations. VEMAP, as a pioneer in model intercomparison activities, provided a valuable backdrop against which the approach for preparing modeling input data sets was developed (Kittel et al., 1995, 2004). Although focused only on the conterminous United States, VEMAP was one of the first MIP activities that applied a consistent set of input data and boundary conditions to multiple models in order to isolate the impact of model choice on across-model variability. Thus, providing standardized input for MsTMIP greatly reduces the inter-model variability caused by differences in environmental drivers, initial conditions and the process used for defining steady-state conditions, and helps to focus the analysis on the ways in which the structure of TBMs (i.e., their choice and formulation of ecosystem processes) and associated internal parameters impact a model's estimates of terrestrial ecosystem carbon dynamics.
This paper describes the driver data needs of MsTMIP and outlines the environmental driver data sets compiled and synthesized for the MsTMIP activity. In doing so, this paper aims to address the needs of multiple communities and audiences. First, it provides the detailed background about environmental driver data choices that are necessary for the scientific interpretation of modeling results coming out of the MsTMIP effort. As such, it addresses the needs of researchers focusing on the scientific interpretation of the MsTMIP results. Second, it provides the rationale for the choice of specific environmental driver data and the details associated with their processing. Thus, the paper also aims to address the needs of researchers who wish to leverage the work reported here by using the driver data for follow-on studies or related applications. Third, this paper reports on the decision-making and implementation process involved in putting together common driver data for large modeling studies and intercomparison efforts, including lessons learned that are independent of the specific applications addressed by MsTMIP. As such, this paper also aims to inform future efforts focused on assembling consistent data sets for use by multiple modeling teams.
The remainder of this paper is structured to address the needs of the three intended audiences described above. For each data category, we first provide a brief review of the data source chosen for MsTMIP and the rationale for the choice, along with a description of other similar data sources currently available and data products used in past and/or ongoing MIPs. We then describe the processing and analysis completed to convert the original data source into a form meeting the needs of the MsTMIP activity, and in some cases to improve the quality of the original data source. We also provide a brief evaluation of standardized MsTMIP data products, and suggestions on how the data should be used in terrestrial biosphere modeling. Finally, we introduce some lessons learned on data processing and management, to guide future data-intensive projects.
The overarching goal of the MsTMIP activity is to provide a unified
intercomparison framework that allows for the critical synthesis,
benchmarking, evaluation and feedback needed to improve TBMs (Huntzinger et
al., 2013). To meet this goal, the MsTMIP activity conducts a suite of
simulations that can be used to quantify (1) the impact of the scale and
spatial resolution of model simulations on model estimates and (2) the
additive influence of a suite of time-varying environmental drivers or
forcing factors on model estimates of carbon stocks and fluxes. As such,
MsTMIP includes simulations over two spatial domains and resolutions:
globally at 0.5
One source of variability in model estimates is the choice of (and
uncertainty associated with) environmental driver and input data sets. Most
uncoupled TBMs require, at a minimum, a land–water mask, climate forcing
data, soil characteristics and atmospheric CO
data sets must be compatible with over 20 different TBMs;
data sets must provide consistent spatial coverage for the land surface within the two simulation domains: (1) North American: 10–84
spatial resolutions must be compatible with the two sets of simulations: (1) North American (0.25
temporal resolution and extent must be compatible with the two sets of simulations: (1) North American (3-hourly, 1801–2010) and (2) global (6-hourly, 1801–2010);
data sets must provide smooth transitions in time, without any unrealistic spikes or discontinuities;
data sets must be physically consistent with one another. For example, climate, soil and land cover change history needed to represent the same land domain as indicated in the land–water mask, and the prescribed phenology data needed to be consistent with the time-varying land cover data for each time step.
The environmental driver and input data sets chosen for the MsTMIP activity are a reflection of these overall project needs and requirements.
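The consistency requirement between the land–water mask and the other drivers can be illustrated with a small check. This is only a minimal sketch, not the procedure used by MsTMIP: the function name `coverage_mismatches`, the use of `None` as a missing value, and the tiny example grids are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, int]


def coverage_mismatches(
    land_mask: List[List[bool]],
    driver: List[List[Optional[float]]],
) -> Tuple[List[Cell], List[Cell]]:
    """Compare a driver grid with a land-water mask of the same shape.

    Returns two lists of (row, col) cells:
    land cells with no driver value, and water cells that do have one.
    """
    missing_on_land: List[Cell] = []
    data_on_water: List[Cell] = []
    for i, (mask_row, data_row) in enumerate(zip(land_mask, driver)):
        for j, (is_land, value) in enumerate(zip(mask_row, data_row)):
            if is_land and value is None:
                missing_on_land.append((i, j))
            elif not is_land and value is not None:
                data_on_water.append((i, j))
    return missing_on_land, data_on_water


# Made-up 2 x 3 grids, for illustration only.
land_mask = [[True, True, False],
             [False, True, False]]
soil_clay = [[12.0, None, None],
             [None, 30.5, 4.0]]

missing, extra = coverage_mismatches(land_mask, soil_clay)
print(missing)  # [(0, 1)]
print(extra)    # [(1, 2)]
```

A driver that passes this check for every time step covers exactly the land domain defined by the mask, which is the property the requirement above asks for.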
The MsTMIP environmental driver data summary.
MsTMIP environmental driver and associated data products include data sets
describing climatology, time-varying atmospheric CO NetCDF Climate and Forecast (CF) Metadata
Conventions, version 1.4.
For most data categories, the North American data sets are based on the same data sources as the global products. We did, however, choose different climatology and soil data products for the two domains. This decision was driven primarily by the availability of these drivers at the spatial and temporal resolution needed for the regional simulations. However, by holding the source of other drivers constant between the global and North American simulations, we are also creating an opportunity to test the impact of the choice of climate and soil characteristics on model estimates.
Several reanalysis and observation-based gridded global climatology data
sets exist, including products produced by the Climate Research Unit (CRU)
(Harris et al., 2014), the National Centers for Environmental Prediction
(NCEP)/National Center for Atmospheric Research (NCAR) Reanalysis 1 (Kalnay
et al., 1996), and the European Centre for Medium-Range Weather Forecasts
(ECMWF) (Uppala et al., 2005; Dee et al., 2011). The NCEP/NCAR Reanalysis 1
data were adopted by the Inter-Sectoral Impact Model Intercomparison Project
(ISI–MIP) as one of its climate inputs (Warszawski et al., 2013) to assess
the influence of the choice of forcing data on the overall results. However,
none of these available climatology data sets fully met the spatial and
temporal requirements of MsTMIP. The CRU Time Series (TS) 3.2 product covers
the time period from 1901 to present at a 0.5
Thus, we combined the strengths of the CRU and NCEP/NCAR Reanalysis
products, fusing them to produce the CRU–NCEP global climate data set.
This new data set provides a globally gridded (0.5
Comparison of the mean of long-term mean downward shortwave radiation (1948–2010) on land surface for each 0.5 degree latitudinal band from NCEP/NCAR Reanalysis 1 and CRU–NCEP data sets.
Several climatology products are available for North America at finer
spatial and temporal resolutions than the new CRU–NCEP product. In addition
to better addressing the resolution needs of MsTMIP regional simulations
(0.25 PRISM Climate Group,
Oregon State University,
The NCEP North America Regional Reanalysis (NARR), on the other hand,
provides long-term high-resolution high-frequency atmospheric and land
surface meteorological data for the North American domain (Mesinger et al.,
2006). The NARR climatology begins in 1979 and extends to present at
3-hourly temporal and 32 km spatial resolutions. Although the temporal
coverage is shorter than desired, the NARR product was selected for the
MsTMIP activity because it best matched the needs of the North American
simulations, and the time covered by the data set was extended as described
in Sect. 4. The original NARR data were provided by the NOAA/OAR/ESRL
PSD (National Oceanic & Atmospheric Administration/Oceanic and Atmospheric
Research/Earth System Research Laboratory Physical Sciences Division,
Boulder, Colorado, USA).
The NARR variables were regridded to a spatial resolution of 0.25
In a study of rain gauge and NARR data, Sun and Barros (2010) found that,
although NARR reproduces the spatial patterns of precipitation, it
underestimates the frequency and magnitude of large rainfall events. In
addition, Xie et al. (2003) found that the Global Precipitation Climatology
Project (GPCP) monthly gridded (2.5
Difference map between the long-term mean (1979–2010) annual total precipitation from rescaled NARR and original NARR (rescaled NARR precipitation – original NARR precipitation).
As mentioned previously, biases in shortwave radiation can have a strong
impact on model estimates of GPP. Kennedy et al. (2010) showed that between
1999 and 2001 the NARR product overestimates downward shortwave radiation flux
relative to the Atmospheric Radiation Measurement (ARM) southern Great
Plains (SGP) site observations by about 10 % under clear sky and by about
30 % under all-sky conditions. We also compared NARR downward shortwave
radiation flux with observations from 23 FLUXNET FLUXNET, a “network of regional networks”,
coordinates regional and global analysis of observations from micrometeorological tower sites. MTCLIM, a mountain microclimate simulation model,
Comparison of shortwave radiation from original and reanalyzed NARR against observations averaged over 23 FLUXNET sites across North America.
Comparison of the latitudinal zonal (0.25
One of the goals of MsTMIP is to test the influence of both spatial resolution and changing driver data on model estimates (Huntzinger et al., 2013). A comparison between the MsTMIP global (CRU–NCEP) and North American (NARR) climate data over the years 1979–2010 reveals that MsTMIP's MTCLIM-calibrated NARR downward shortwave radiation has much higher seasonal variability than CRU–NCEP in North America. MsTMIP's MTCLIM-calibrated NARR downward shortwave radiation also has a decreasing trend in the 1980s and an increasing trend after 1990, which is consistent with the findings reported in Wild et al. (2005) and Pinker et al. (2005). However, this decreasing–increasing trend was not observed in the CRU–NCEP data. MsTMIP's NARR and CRU–NCEP downward longwave radiation products share similar seasonal variability and spatial distribution patterns, while NARR has much finer spatial details due to higher spatial resolution. Though sharing similar seasonal variability and spatial distribution patterns, MsTMIP's GPCP-rescaled NARR precipitation was higher than that of CRU–NCEP, especially before 2003, and had a decreasing trend between 1979 and 2010. This decrease in the rescaled NARR precipitation had a significant impact on MsTMIP regional-scale sensitivity simulations. MsTMIP's NARR and CRU–NCEP generally share similar seasonal variability, trend and spatial distribution patterns for other climate variables. Details of this comparison can be found in Supplement 2.
The land–water mask specifies the land grid cells on which MsTMIP global and
regional simulations are run, and needs to be consistent with the climate
driver data. We therefore based the global land–water mask on the CRU–NCEP
land–water mask, and the North American land–water mask on the original NARR
mask regridded to a spatial resolution of 0.25
Atmospheric CO
The atmospheric CO
Comparison of MsTMIP driver data atmospheric CO
Nitrogen enrichment, increasing atmospheric nitrogen deposition in
particular, has been recognized as one of the most significant global
changes since it could stimulate plant growth, enhance terrestrial carbon
sequestration capacity and thus mitigate global climate warming (e.g.,
Holland et al., 1997; Pregitzer et al., 2008; Reay et al., 2008; De Vries et
al., 2009). Models failing to capture nitrogen input and nitrogen cycling
may overestimate ecosystem carbon uptake (Hungate et al., 2003). An
increasing number of TBMs now include nitrogen deposition as an important driving
force. However, few global and North American nitrogen deposition products
are available over the full period required by MsTMIP. Monitoring networks
of nitrogen deposition in the United States and Europe were launched in the
late 1970s, while other countries began such nationwide observations later
(Holland et al., 2005; Lu and Tian, 2007). The Dentener global nitrogen
deposition data product was generated using a three-dimensional chemistry
transport model that estimated atmospheric deposition of total inorganic
nitrogen (N), NH
To address the above issue, we used a different approach as described in
Tian et al. (2010) and Lu et al. (2012) to create a time-varying annual
nitrogen deposition data set for both global (0.5
LULCC has considerable influence on the biogeochemical cycling of carbon
(e.g., Friedlingstein et al., 2010; Pielke Sr. et al., 2011; Sohl et al.,
2012). Activities such as afforestation (Potter et al., 2007) or
deforestation (Ramankutty et al., 2007) can alter carbon stocks. Similarly,
biomass burning used in land clearing results in direct carbon emissions
(Giglio et al., 2010). Despite its importance in carbon cycle dynamics,
LULCC-caused CO
Many global data products describing historical LULCC are available (e.g.,
Hurtt et al., 2011; Klein Goldewijk et al., 2011). In an effort to hold as
many of the environmental drivers constant as possible in the MsTMIP
activity, we chose to prescribe LULCC by merging a static satellite-based
land cover product, SYNergetic land cover MAP (SYNMAP) (Jung et al., 2006), with the time-varying land
use harmonization (LUH) data for the fifth Assessment Report (AR5) of the
Intergovernmental Panel on Climate Change (IPCC) (Hurtt et al., 2011). We
chose the LUH product based on its global coverage, inclusion of land use
change fractions (required for a subset of participating models), overlap
with the time horizon of MsTMIP simulations, and use in the IPCC process.
The LUH product was derived using a bookkeeping approach based on historical
time series of crop and pasture data, national wood harvest, shifting
cultivation, and population (Hurtt et al., 2011). The LUH product provides
mapped fractional coverages and underlying annual land use transitions for
six land use classes (primary land, secondary land, cropland, pasture,
urban, and barren) at 0.5
As TBMs require a different land use/cover scheme than the six classes associated with the LUH, we merged the 1801–2010 LUH with the static 2000/2001 SYNMAP land cover product (Jung et al., 2006). Although numerous land cover products exist, we chose SYNMAP due to its (1) reconciliation of multiple global land cover products, i.e., Global Land Cover Characterization Database (GLCC) (Hansen et al., 2000; Loveland et al., 2000), GLC2000 (2003) and the 2001 MODIS land cover product (Friedl et al., 2002); (2) global coverage at 1 km resolution; and (3) general definition of classes based on life form, leaf type and leaf longevity which allowed for simple crosswalks to plant functional types (PFTs) used in different TBMs. Generality was a key concern as PFT schemes used in TBMs vary widely. The SYNMAP product contains 47 land cover classes such that a PFT scheme for a given TBM is a subset of SYNMAP classes based on a crosswalk between the two different schemes.
To provide annual maps of LULCC, LUH and SYNMAP were merged using a set of one-to-one and one-to-many mapping rules based on map intersection during their period of overlap (both products exist for 2000–2001). These invariant grid cell-specific mappings were then used to translate the six LUH classes to the 47 SYNMAP classes (Jung et al., 2006) for each annual LUH map. For example, assume a grid cell with LUH pasture at a fractional coverage of 0.5 for 2000–2001. In that same grid cell, the SYNMAP product has only two eligible target classes: the shrubs and the grasses classes, with fractional coverages of 0.2 and 0.4, respectively. This map intersection forms the basis of a one-to-many mapping: 0.5 LUH pasture is equivalent to 0.17 SYNMAP shrubs plus 0.33 SYNMAP grasses, which preserves the original shrubs/grasses ratio in SYNMAP for that grid cell. This scalable mapping rule is used for all other time steps for this grid cell and reflects the legacy of grid cell-specific changes in land use/cover through time.
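The pasture example above can be reproduced in a few lines. The sketch below is an illustration of the proportional one-to-many rule as described in the text, not the code used to build the MsTMIP product; the function names `mapping_weights` and `translate` are hypothetical, and the 0.3 pasture fraction for "another year" is a made-up value.

```python
from typing import Dict


def mapping_weights(synmap_fractions: Dict[str, float]) -> Dict[str, float]:
    """Derive invariant per-cell weights from the SYNMAP fractions of the
    eligible target classes during the overlap period."""
    total = sum(synmap_fractions.values())
    if total <= 0:
        raise ValueError("no eligible SYNMAP class in this grid cell")
    return {cls: frac / total for cls, frac in synmap_fractions.items()}


def translate(luh_fraction: float, weights: Dict[str, float]) -> Dict[str, float]:
    """Split one LUH class fraction across SYNMAP classes using fixed weights."""
    return {cls: luh_fraction * w for cls, w in weights.items()}


# Overlap period: shrubs 0.2 and grasses 0.4 are the eligible SYNMAP classes.
weights = mapping_weights({"shrubs": 0.2, "grasses": 0.4})

# LUH pasture at 0.5 in the overlap period.
overlap = translate(0.5, weights)
print({cls: round(frac, 2) for cls, frac in overlap.items()})
# {'shrubs': 0.17, 'grasses': 0.33}

# The same weights are reused for any other year (0.3 is a made-up value).
other_year = translate(0.3, weights)
```

Because the weights are fixed per grid cell, the shrubs-to-grasses ratio of 1:2 is preserved in every year while the total follows the time-varying LUH pasture fraction.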
Few models use these 47 SYNMAP classes directly in their simulations. For
example, the Simple Biosphere (SiB) model uses 12 biome classes (Sellers et
al., 1996b). In such instances, model teams developed crosswalks from the
47-class SYNMAP classification scheme to their internal schemes. Given that many
SYNMAP classes are mixed classes, e.g., “shrubs and crops” and “trees and
crops”, which cannot be accommodated by some models, we created maps of
pure biome classes by assuming each component in a mixed class was half
the total area. Finally, as several models require information on
the photosynthetic pathway in grasslands as well as crop types, we also
provided invariant maps for C3
Because photosynthesis can vary significantly between species using the C3
and C4 photosynthetic pathways (Ehleringer and Cerling, 2002), most TBMs use
separate algorithms for estimating the GPP of C3 and C4 plant types. In order to
provide the required spatial distribution of ecosystems dominating each of
these pathways, we used an approach described in Still et al. (2003) based
on growing season temperature. Since the C4 pathway is largely found in warm
season grass species, we created a global gridded (0.5
SYNMAP contains 13 land cover classes that include grasses, with 12 of these
mixtures of grasses with trees, shrubs, crops or barren land. For the mixed
classes, we assumed that grasses account for 50 % of the area of these
mixed classes contained in each cell. The SYNMAP grass fraction in each cell
was calculated as the sum of the grass fraction of all different classes,
including both pure and mixed classes, in the cell. Figure 6 shows the
relative fraction of C3 (top) and C4 (bottom) grassland globally
(0.5
Relative fractions of C3 (top) and C4 (bottom) grassland on global
0.5
The North American (0.25
The SYNMAP land cover map indicates which areas are predominantly crop but
does not provide additional information about the crop types contained
within each grid cell. This can be important when, for example, a C4 crop
like maize dominates a grid cell that would normally be covered by C3 vegetation,
and vice versa. Some models make use of such additional information to
implement crop specific algorithms that capture some aspects of crop
physiology and management including planting and harvesting phenology,
fertilizer applications, irrigation or tillage practices. We therefore
identified and extracted the four globally significant crop types (maize,
rice, soybean and wheat) from the Monfreda et al. (2008) global crop
database. The original Monfreda global crop product is a detailed database
of global agricultural practices and describes the areas and yields of 175
different individual crops in 2000 at a 5 min
Some models do not have prognostic canopies and use remote-sensing products
to prescribe plant phenology to calculate GPP or NPP. Consequently, we
constructed monthly maps of normalized difference vegetation index (NDVI),
leaf area index (LAI) and absorbed fraction of photosynthetically active
radiation (fPAR) consistent with the MsTMIP LULCC data on both global and
North American grids for 1801–2010. For NDVI data, we chose the Global
Inventory Monitoring and Modeling System version g (GIMMSg) data set (Tucker
et al., 2005), because it provides the longest global observation-based
product. The Potsdam NPP MIP also used the GIMMS product to define NDVI;
however, their protocol did not mandate consistent driver data across all its
participating models (Cramer et al., 1999). GIMMSg consists of 15-day
maximum value composites at about 8 km spatial resolution from 1982 to 2010
adjusted for missing data, satellite orbit drift, sensor degradation and
volcanic aerosols (Tucker et al., 2005). We used the average seasonal cycle
in NDVI for the entire time period from 1801 to 2010, since switching to
observed values in 1982 would create abrupt changes in model output that
would be difficult to interpret. The 15-day GIMMSg NDVI was first regridded
to 0.5
To harmonize phenology data with the LULCC used in MsTMIP, we assumed that a pixel would consist of tiles, each corresponding to a different land use/cover class with fractional areas set by the MsTMIP LULCC coverage maps as a function of year from 1801 to 2010. We first calculated maps of LAI and fPAR assuming the entire land surface was one of the 12 SiB biome classes (Sellers et al., 1986) resulting in 12 sets of LAI and fPAR maps corresponding to the 12 SiB biome classes, all calculated from the same NDVI values but using different parameter values unique to each biome (Sellers et al., 1996b). We then mapped the 12 SiB biomes to the 47 SYNMAP land use/cover classes using one-to-one or one-to-many mapping, resulting in 47 sets of LAI and fPAR maps corresponding to the 47 SYNMAP classes. This two-step process was required because the parameters used to calculate LAI and fPAR are not available for each of the 47 SYNMAP types. By combining these 47 sets of LAI and fPAR maps and the yearly MsTMIP LULCC data, the time-evolving and land use/cover class explicit LAI and fPAR data products were created. If a grid cell did not contain a particular SYNMAP class in a specific year, a standard missing value was inserted into the corresponding LAI and fPAR maps. A model would then extract the LAI and fPAR values for a particular SYNMAP class in each year and use it for the corresponding tile.
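The last two steps of this procedure (inserting a missing value where a class is absent, and letting a model pick up values per tile) might look as follows for a single grid cell. This is a sketch under stated assumptions: the fill value `-9999.0`, the function names, and all LAI and fraction numbers are made up for illustration and are not taken from the MsTMIP files.

```python
from typing import Dict, List, Tuple

MISSING = -9999.0  # hypothetical fill value; the text only says "a standard missing value"


def class_explicit_values(
    value_by_class: Dict[str, float],
    fraction_by_class: Dict[str, float],
) -> Dict[str, float]:
    """For one grid cell and year, keep the LAI (or fPAR) of each class
    that is present and insert MISSING for classes with no area."""
    return {
        cls: value if fraction_by_class.get(cls, 0.0) > 0.0 else MISSING
        for cls, value in value_by_class.items()
    }


def tiles_for_model(
    class_values: Dict[str, float],
    fraction_by_class: Dict[str, float],
) -> List[Tuple[str, float, float]]:
    """List (class, fractional area, value) for the tiles a model would run."""
    return [
        (cls, fraction_by_class[cls], value)
        for cls, value in class_values.items()
        if value != MISSING
    ]


# Made-up per-class LAI for one month and made-up class fractions for two years.
lai_by_class = {"grasses": 1.8, "shrubs": 2.4, "crops": 3.0}
fractions_1801 = {"grasses": 0.7, "shrubs": 0.3}
fractions_2010 = {"grasses": 0.4, "shrubs": 0.1, "crops": 0.5}

lai_1801 = class_explicit_values(lai_by_class, fractions_1801)
lai_2010 = class_explicit_values(lai_by_class, fractions_2010)

print(lai_1801["crops"])  # -9999.0
print(tiles_for_model(lai_2010, fractions_2010))
```

Here the per-class LAI values do not change between years; only the set of classes present in the cell, and hence the set of tiles, follows the LULCC data.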
The Food and Agriculture Organization – United Nations Educational, Scientific and Cultural Organization (FAO-UNESCO) digitized soil map of the world (FAO, 1971–1981, 1995, 2003), originally published in 1974, is commonly used in terrestrial biosphere modeling. Recently, however, significant improvements in soil mapping and databases of soil properties have led to a new generation of regional and global scale soil maps, such as the International Soil Reference and Information Centre (ISRIC) World Inventory of Soil Emission Potentials (ISRIC-WISE) (Batjes, 2008) and the harmonized world soil database (HWSD) (FAO/IIASA/ISRIC/ISS-CAS/JRC, 2011). This new generation of soil products provides greater detail in the spatial distribution of soil types and more accurate characterizations of soil physical and chemical properties.
For MsTMIP, we selected and synthesized the HWSD v1.1 for global simulations because it was the most recent global soil database that incorporates updated soil data from Europe, Africa, and China. However, in both the ISRIC-WISE and HWSD databases, soil information for North America is based on an outdated FAO-UNESCO soil map from the 1970s. Thus, even in the most updated global soil databases, soil information for North America is less reliable than that for other regions (Batjes, 2005; FAO/IIASA/ISRIC/ISS-CAS/JRC, 2011). We therefore developed the Unified North American Soil Map (UNASM) by fusing the United States Department of Agriculture Natural Resources Conservation Service (USDA-NRCS) State Soil Geographic (STATSGO2) data set with both the soil landscapes of Canada (SLC) version 3.2 and 2.2 products, and the HWSD v1.1 (Liu et al., 2013).
Both data prepared for MsTMIP, the gridded 0.5
The HWSD has been widely used as input for global-scale carbon cycle
modeling and MIP activities (e.g., ISI-MIP; Warszawski et al., 2013), and
therefore was used to define MsTMIP global soil data. The original HWSD is a
30 arcsec raster database with over 16 000 different soil mapping units that
combines existing regional and national updates of the soil information
worldwide, including the Soil and Terrain database (SOTER), European Soil
Database (ESD), Soil Map of China, and WISE, with the information contained
within the
Each soil mapping unit in the HWSD is composed of several different soil
units (or soil types) defined by major soil group code following a combined
FAO-74/FAO-85/FAO-90 soil classification system. For the global simulations,
the original HWSD was regridded to a spatial resolution of 0.5
The reference bulk density values provided in HWSD v1.1 were calculated following the method developed by Saxton et al. (1986) that relates bulk density to soil texture. This method, although generally reliable, tends to overestimate the bulk density in soils that have a high porosity (e.g., Andosols) or that are high in organic matter content (e.g., Histosols). Therefore, the bulk density values of these two soil types were corrected using the corresponding depth-weighted average values from ISRIC-WISE, version 1.0. Figure 7 shows the globally gridded HWSD topsoil reference bulk density before and after correction. The correction mainly impacts the North American boreal region and a few places of southeastern Asia where Andosols and Histosols dominate.
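The correction rule can be sketched as a depth-weighted average over source layers, substituted only for the two affected soil groups. This is a hedged illustration rather than the actual processing code: the function names, the layer boundaries, and the bulk density numbers are all made up, and the 0–30 cm topsoil interval is the one used for the standardized soil layers in this paper.

```python
from typing import List, Tuple

Layer = Tuple[float, float, float]  # (upper depth cm, lower depth cm, value)

CORRECTED_GROUPS = {"Andosols", "Histosols"}


def depth_weighted_mean(layers: List[Layer], top: float, bottom: float) -> float:
    """Average a layered property over [top, bottom] cm, weighting each
    layer by the thickness that overlaps the target interval."""
    weighted = 0.0
    thickness = 0.0
    for upper, lower, value in layers:
        overlap = min(lower, bottom) - max(upper, top)
        if overlap > 0:
            weighted += overlap * value
            thickness += overlap
    if thickness == 0:
        raise ValueError("no layer overlaps the target depth interval")
    return weighted / thickness


def corrected_bulk_density(
    soil_group: str,
    texture_based_value: float,
    reference_layers: List[Layer],
    top: float = 0.0,
    bottom: float = 30.0,
) -> float:
    """Keep the texture-based value except for Andosols and Histosols,
    where the depth-weighted reference value is substituted."""
    if soil_group in CORRECTED_GROUPS:
        return depth_weighted_mean(reference_layers, top, bottom)
    return texture_based_value


# Made-up reference layers (g cm-3) for illustration only.
reference_layers = [(0.0, 20.0, 0.2), (20.0, 40.0, 0.4)]

histosol = corrected_bulk_density("Histosols", 1.3, reference_layers)
cambisol = corrected_bulk_density("Cambisols", 1.3, reference_layers)
print(round(histosol, 3), cambisol)  # 0.267 1.3
```

In the example, the first layer contributes 20 cm and the second only the 10 cm that fall inside the topsoil interval, so the corrected value is (20 × 0.2 + 10 × 0.4) / 30.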
HWSD topsoil reference bulk density before (top) and after
(bottom) correction at 0.5
A new gridded database of harmonized soil physical and chemical properties for North America was created for MsTMIP by fusing the most recent regional soil information from US STATSGO2, SLC version 3.2 and 2.2, and the HWSD v1.1. The fused database was then harmonized into two standardized soil layers as for the HWSD. The top soil layer ranges from 0 to 30 cm and the sub-soil layer ranges from 30 to 100 cm. The comparison with the subset of HWSD demonstrates the pronounced difference in the spatial distributions of soil properties and soil organic carbon mass between the UNASM and HWSD, but overall the UNASM provides more accurate and detailed information particularly in Alaska and central Canada. The methods used to develop the UNASM and the comparisons with HWSD are described in detail in Liu et al. (2013).
The MsTMIP spin-up environmental driver data summary.
A consistent spin-up data package shared among models eliminates any
differences in prediction due to spin-up data choices. We created the
spin-up data package using the standardized environmental driver data sets
described above. MsTMIP requires that all simulations assume steady-state
initial conditions in 1801. The spin-up driver data package contains a
100-year time series for each required environmental driver data product (Table 2)
that can be recycled until steady state is reached. For climatology, the
100-year spin-up time series was created by randomly selecting whole years from
the first 30 (1901–1930, global) or 15 (1979–1993, North America) years of climate
driver data. Using the first 30 or 15 years of climate
driver data ensures a smooth transition from the spin-up to transient
simulations, while preserving the seasonal cycle of the meteorological
variables. A 100-year period for the spin-up package was chosen to minimize
any long-term trend in spin-up climate data, thereby minimizing drift in
reference simulations, which use constant driver data (Huntzinger et al.,
2013). Nitrogen deposition was held constant at 1860 values and
atmospheric CO
All transient simulations defined by MsTMIP require driver data sets covering the period of 1801–2010 (Huntzinger et al., 2013). However, several of the environmental driver data sets, including climate, nitrogen deposition, and soil, do not cover the full period. The spin-up data package was thus recycled to fill these temporal gaps. For global climate data, the spin-up data were used directly to fill the gap between 1801 and 1900. For the NARR climate (North American) data, the full 100-year time series plus the first 78 years of the North American spin-up climate data were used to fill the gap between 1801 and 1978. The nitrogen deposition data in 1860 were repeated to fill the gap between 1801 and 1859 for nitrogen deposition driver data. Finally, constant soil data were used throughout the simulation period of 1801–2010.
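The spin-up construction and the gap filling described above can be sketched as year bookkeeping. The code below is a minimal illustration, not the MsTMIP processing code: it assumes whole years are drawn with replacement, and the function names and the fixed random seed are choices made only for this example.

```python
import random
from typing import Dict, List


def spinup_years(first_year: int, n_source_years: int,
                 length: int = 100, seed: int = 0) -> List[int]:
    """Draw `length` source years at random (with replacement) from the
    first `n_source_years` years of a climate record."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    source = range(first_year, first_year + n_source_years)
    return [rng.choice(source) for _ in range(length)]


def fill_gap(spinup: List[int], gap_start: int, gap_end: int) -> Dict[int, int]:
    """Map each simulation year in the gap to a source year by recycling
    the spin-up sequence from its beginning."""
    n_years = gap_end - gap_start + 1
    return {gap_start + k: spinup[k % len(spinup)] for k in range(n_years)}


global_spinup = spinup_years(1901, 30)  # drawn from 1901-1930
narr_spinup = spinup_years(1979, 15)    # drawn from 1979-1993

# Global: the 100-year series fills 1801-1900 exactly.
global_fill = fill_gap(global_spinup, 1801, 1900)

# North America: 178 gap years = full 100-year series + its first 78 years.
narr_fill = fill_gap(narr_spinup, 1801, 1978)

# Nitrogen deposition: the 1860 field is repeated for 1801-1859.
ndep_fill = {year: 1860 for year in range(1801, 1860)}

print(len(global_fill), len(narr_fill), len(ndep_fill))  # 100 178 59
```

Each dictionary maps a simulation year to the year of the record whose data would be used for it, so the same bookkeeping applies to every climate variable.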
Some of the lessons learned in the process of data preparation and distribution for MsTMIP have implications beyond the MsTMIP project. These are described here in order to provide some guidance for future data-intensive activities, especially those that involve assembling consistent data sets for use by multiple modeling teams. Some of these lessons have previously been described in the context of other MIPs (e.g., Kittel et al., 1995, 2004), but they continue to present challenges and should therefore be taken into account in the design of future efforts.
Study the past
Scientific discoveries rely heavily on findings from past activities. This
is especially true for data-intensive, multi-partner MIP activities like
MsTMIP. Beginning with VEMAP in the 1990s, there have been several MIPs
conducted that have advanced our understanding of ecosystem dynamics and
supported model development. The preparation of environmental driver data
sets for MsTMIP was inspired by past and current MIPs, such as VEMAP,
GCP-TRENDY and the NACP interim synthesis activities. Studying the lessons
learned from these activities benefited the design of the MsTMIP
environmental driver data sets: it helped us to avoid pitfalls (e.g., biases
in some reanalysis climate variables) and unnecessary duplication of work
(e.g., by leveraging climate data prepared for GCP-TRENDY), and thus reduced
data preparation time.

Resources for data planning, preparation and management
Dedicated funding and expertise are needed to develop a plan with the
modeling teams and to conduct the driver data compilation. The preparation
of standardized model input driver data sets, especially for a project with
many different collaborators, takes a significant amount of time and effort.
Besides data processing, detailed documentation has to be compiled to
capture all the processing steps and trace the origin of each data file. A
long-term data management plan is needed to preserve and share the data
after a project ends and maximize the value of the data products whenever
they are used. Data centers should be identified for long-term data
preservation, and the data center's requirements for metadata and
documentation should be established at the beginning of the project.

Collaboration between informatics and science researchers
For a project like MsTMIP, informatics personnel and modeling teams need to
work closely together to develop a shared set of requirements for the data products
and to ensure that useful data products suitable for long-term preservation
are produced. Close collaboration is required for acquisition,
harmonization and organization of the scientific data products both for the
project and for future use.

Proper data formats and standards
Non-proprietary and standard data and metadata formats (e.g., netCDF,
comma-separated values (CSV), GeoTIFF, the CF metadata convention, or the
Federal Geographic Data Committee (FGDC) geospatial metadata standard)
should be used. Standards also help with the long-term preservation and usability of data
(Hook et al., 2010). In addition, a data management effort should consider
both current and future needs when choosing appropriate data and metadata
formats.

Version control of data files
A controlled repository and versioning system should be used to control data
files, not only for final data products to be released to modeling teams and
the community but also for intermediate data to be shared between different
processing steps and among project collaborators. When working with a large
volume of data files with complicated data processing steps, version control
is critical for ensuring that intermediate data files are self-consistent,
that the provenance of data is correctly captured, and that final data
products are properly distributed to data users.

Workflow systems to improve reproducibility and collaboration among team members
Data processing is an error-prone activity. Even if every processing step is
performed correctly, the processing algorithms themselves usually need
adjustments to create better quality data products. Requirements on final
data products sometimes change unexpectedly. In practice, therefore, similar
data processing activities will usually be done multiple times before data
products are finalized. In MsTMIP, a workflow system (e.g.,
VisTrails, Kepler,

QA/QC
Quality assurance and quality control (QA/QC) is necessary not only for the
final data products, but also for any intermediate data product produced.
Depending on the characteristics of data products, different manual and
automatic QA/QC approaches (e.g., visualization, statistics and long-term
trend analysis) can be used to identify potential errors. The best way to
QA/QC data products is always to collaborate with domain researchers and
test data with real science applications.

On-demand approach to distribute data
For a project such as MsTMIP that involves over 20 modeling teams, it is not
possible to prepare a single set of data that meets the requirements of all
models. TBMs have different native temporal resolutions, for example, and
modelers may therefore need to regrid data. Similarly, if the products are
used for future applications (outside the projects for which they were
created), they may need to be subset to a smaller geographic region,
rescaled to a different spatial resolution, or translated to a different
geographic projection. On-demand data distribution systems, like the
Thematic Realtime Environmental Distributed Data Services (THREDDS),

“Better is the enemy of good enough”
There is constant pressure to create the best data sets possible, but
this must be balanced against the overall priority of completing the
simulations. If too much time is spent improving the driver data, the time
available for model simulations and the evaluation of modeling results is
compromised. Therefore, in order to maintain momentum, there comes a time
when a decision has to be made to freeze data improvement activities and
release a specific version of data products to modeling teams.
This paper presents the reasoning for, and a description of, driver data and
spin-up procedures used in the setup of the global and North American
simulations that are part of the MsTMIP activity. These data sets include
climate, atmospheric CO
In addition to serving the needs of the MsTMIP activity, this work is intended to support researchers wishing to leverage the data products generated by MsTMIP for follow-on studies or related applications. Finally, we offer our experience with MsTMIP as a case study in the development of data sets for collaborative scientific use. The lessons learned from the work reported here, including the need for dedicated support for data development and sharing, for iterative product development, and for the generation of easily accessible and traceable products, among others, are broadly applicable. As such, we aim for this work to inform future efforts focused on assembling consistent data sets for use by multiple modeling teams.
All standardized model input driver data sets are archived in the ORNL DAAC to provide long-term data management, preservation, and distribution to the community (Wei et al., 2014).
Funding for the Multi-scale Synthesis and Terrestrial Model
Intercomparison Project (MsTMIP) was provided through NASA ROSES grant no.
NNX11AO08A. Data management support for preparing, documenting, and
distributing model driver data was provided by the Modeling and Synthesis
Thematic Data Center (MAST-DC) at Oak Ridge National Laboratory, with
funding through NASA ROSES grant no. NNH10AN68I. MsTMIP environmental driver
data can be obtained from the ORNL DAAC (