Data-mining analysis of factors affecting the global distribution of soil carbon in observational databases and Earth system models

Soil carbon dynamics are a key process in the terrestrial carbon cycle. Future climate change will dramatically change the carbon balance in soil, and this change will affect the terrestrial carbon stock and the climate itself. Earth system 10 models (ESMs) are used to understand the current climate and to produce future climate projections, but the soil organic carbon (SOC) stock simulated by ESMs and those of observational databases are not well correlated when the two are compared at fine grid scales. However, the specific key processes and factors, as well as the relationships among factors, that govern the SOC stock, remain unclear, and the inclusion of such missing information would improve the agreement between modelled and observational data. In this study, we aimed to identify the influential factors that govern global SOC 15 distribution in observational databases, as well as those simulated by ESMs. We used a data-mining (machine-learning) scheme (boosted regression trees: BRT) to reveal the factors affecting the SOC stock. We applied BRT to three observational databases and 15 ESM outputs from the fifth phase of the Coupled Model Intercomparison Project (CMIP5) and examined the effects of 13 variables/factors categorized into five groups (climate, soil property, topography, vegetation, and land-use history). These analyses revealed the influential variables and their correlations with SOC. Globally, the 20 contributions of mean annual temperature, clay content, CN ratio, wetland ratio, and land cover were high in observational databases, whereas the contribution of mean annual temperature, land cover, and NPP governed the SOC distribution in ESMs. A comparison of the influential factors in observational databases and ESMs, at the global scale, revealed that the CN ratio and clay content were key processes to include in ESMs to reproduce the distribution of SOC in observational databases. The results of this study will help elucidate the nature of both observational SOC databases and ESM outputs and 25 improve the modelling of terrestrial carbon dynamics with ESMs. This study shows that a data-mining algorithm can be used to assess model outputs.


Introduction
Soil is the largest organic carbon stock in terrestrial ecosystems (Batjes, 1996;IPCC, 2013;Köchy et al., 2015).The soil organic carbon (SOC) stock is the result of the balance between carbon input into soil and decomposition, and the soil 30 carbon influx and efflux are controlled directly and indirectly by environmental conditions (Carvalhais et al., 2014;Schimel et al., 1994).Future climate change will dramatically affect the carbon balance in the soil cycle (Bond-Lamberty and Thomson, 2010;Friedlingstein et al., 2006;Hashimoto et al., 2011Hashimoto et al., , 2015)), and this change will affect the terrestrial carbon and, consequently, the climate itself (Cox et al., 2000;Zaehle, 2013).
In the last two decades, several global soil databases have been developed, and some are under further improvement 35 (Scharlemann et al., 2014).These databases describe the global distribution of soil physiochemical properties, enabling us to calculate the global distribution of the SOC stock (e.g., Harmonized World Soil Database), and some databases provide the SOC stock by default (e.g., IGBP-DIS).These databases are based on observed data points with global coverage, although there are biases in the spatial distribution or densities of the data points.
Earth system models (ESMs), which have been developed to understand the current climate and to provide future climate projections, incorporate the terrestrial carbon cycle, including SOC.The above-mentioned observational global soil databases are often used as benchmarks and to examine whether the ESMs successfully describe the global distribution of the soil carbon stock (Hararuk et al., 2014;Todd-Brown et al., 2013;Wieder et al., 2014).However, a recent study observed that the results of ESMs were moderately consistent at the biome levels, whereas the correlation between the distribution of 5 soil carbon stock simulated by ESMs and that of observational databases was poor when the two were compared at fine scales (e.g., a 1° scale).In addition, estimates of SOC by ESMs and terrestrial biosphere models exhibit high uncertainty (Nishina et al., 2015;Tian et al., 2015).Although some potential factors (e.g., net primary production or temperature) have been suggested (Exbrayat et al., 2013;Nishina et al., 2014;Todd-Brown et al., 2013;Wieder et al., 2013), the key processes and factors, as well as the relationships among factors, that govern the SOC stock, and whose inclusion would improve the 10 agreement between model and observational data, remain unclear.
In this study, we aimed to identify the key factors that govern the global SOC distribution in observational databases as well as those simulated by ESMs.We applied a data-mining (machine-learning) scheme (boosted regression trees: BRT) to identify the influential factors and how these factors relate to SOC stock (Elith et al., 2008).BRT is a method based on regression trees and boosting.We examined the influential factors on the distribution of SOC and their relationships between 15 these factors and SOC stock.We assessed how the ESMs could match the influential factors and their relationships with factors from observational databases.By comparing the influential factors in observational databases with those in ESMs, we clarified the model-data discrepancies and the areas in which ESMs can be improved.

Observational global SOC database 20
We used SOC data from two global and one northern observational databases.The first global database is the Harmonized World Soil Database (HWSD) (FAO/IIASA/ISRIC/ISSCAS/JRC, 2012).The HWSD is a global database of soil physiochemical properties.We used SOC stock database obtained using HWSD by Joint Research Centre (JRC) (Hiederer and Köchy, 2011) (Fig. 1a).The second database was the global gridded surfaces of selected soil characteristics (IGBP-DIS) (Global Soil Data Task Group, 2000) (Fig. 1b), which contains gridded soil physiochemical properties.The third database 25 was the Northern Circumpolar Soil Carbon Database version 2 (NCSCD) (Hugelius et al., 2013;Tarnocai et al., 2009) (Fig. 1c).This database is a spatial database of SOC stock of the northern circumpolar permafrost region.We used the HWSD and IGBP-DIS to analyse the global distribution of SOC stock, and then we extracted the database of northern circumpolar regions from the three above databases and analysed the SOC stocks in the northern region.The relationships between the databases are shown in Fig. S1.The SOC in the upper 100 cm in each database was used.30

Global SOC estimated using Earth system models
The global distribution of SOC stock estimated by ESMs was obtained from the fifth phase of the Coupled Model Intercomparison Project (CMIP5).We examined the results of 15 ESMs (Fig. 2) (Appendix Table A1).When more than one result was obtained by the same model family (e.g., MIROC-ESM and MIROC-ESM-CHEM), we generated an ensemble average database for each family (e.g.average of MIROC-ESM and MIROC-ESM-CHEM).The mean values for 1980-2004 35 were calculated.The results of the historical and ensemble member r1i1p1 were used in this study.The notation "r1i1p1" is an identifier of model simulation and is an ensemble member that is often used for analyses (e.g.Todd-Brown et al., 2013).
The overviews of SOC submodels in the ESMs have been previously described (Exbrayat et al., 2014;Todd-Brown et al., 2013).In general, each soil submodel consisted of 1 to 9 pools and incorporated the effects of temperature and moisture.
Geosci.Model Dev. Discuss., doi:10.5194/gmd-2016-138, 2016 Manuscript under review for journal Geosci.Model Dev.Published: 30 June 2016 c Author(s) 2016.CC-BY 3.0 License.Some ESMs have litter carbon pools, and those were excluded from this study.The comparisons between the mean of ESMs and global observational databases in a 1° grid are shown in Fig. S2.

Other databases
We used five groups of variables/factors to examine their effects on global SOC: climate, soil property, topography, vegetation, and land-use history.Detailed data sources for the databases are described in Table 1.For climate and soil 5 property, the mean annual temperature and annual precipitation, and the clay content, CN ratio, and texture (Appendix Table A2) were used (0-30 cm), respectively.As topographical indices, the compound topographic index, elevation, slope, and wetland ratio were used.The CN ratio was calculated by dividing the carbon density by the nitrogen density.The wetland ratio was calculated by dividing the number of wetland grids with the total grids at 1°. Lake, reservoir, and river were not quantified as wetlands and were excluded from the total grids.The land cover type (Appendix Table A3) and net primary 10 production (NPP) were adopted for vegetation indices, and the cropland ratio and human appropriation of net primary production percentage, which is a percentage of human consumption of NPP to local NPP (Imhoff and Bounoua, 2006), were used as indices of land-use history.The average human appropriation of NPP percentage was calculated at 1°. Histograms of the variables are shown in Fig. S3.

Database handling 15
All global databases, except for the databases of a spatial resolution of 1° by default, including observational and ESM model outputs, were regridded to a spatial resolution of 1° for the analyses.Regridding of data in the NetCDF format was performed using the Climate Data Operators (CDO) software provided by the Max Plank Institute for Meteorology (https://code.zmaw.de/projects/cdo).

Boosted regression trees (BRT) 20
To identify the influential factors and their relationships with SOC stock, BRT were used in this study (Elith et al., 2008).
This technique involves a type of data-mining (machine-learning) algorithm that combines the advantages of a regression tree (decision tree) algorithm and boosting.Regression trees are a classification algorithm that classifies data by recursive binary splits, and boosting is a machine-learning algorithm that generates many rough models and combines them to improve their predictive capability.The main advantages of the method are that BRT can analyse different types of variables and 25 interaction effects between variables and are applicable to nonlinear relationships.In recent years, the BRT technique has been used to examine the distribution of soil characteristics at the regional scale (Aertsen et al., 2011;Cools et al., 2014;Martin et al., 2011).Major outputs from the BRT analyses identify (1) the relative importance (percentage of influence or contribution) of predictor variables (explanatory variables), which is based on the weighted and scaled number of times a variable is selected for splitting (Elith et al., 2008), and (2) relationships between variables and the explained variable shown 30 in partial dependence plots.
We used the open-source BRT package in R software (R Core team, 2013) developed by Elith et al. (2008).In practice, three parameters in the BRT package-learning rate (lr), tree complexity (tc), and bag fraction (bg)-control the BRT performance.The lr determines the contribution of each tree, the tc controls the number of splits, and the bg is the proportion of data selected in each step.The number of trees was determined using the cross-validation method in the R package.The 35 maximum number of trees was set to 15,000.The tc value was set to 5. We tested different lr (0.001, 0.005, 0.01, 0.05, 0.1) and bg values (0.5, 0.6, 0.7) and used the best parameter set for each database, but the changes in parameter values had little effect on the model performance.

Model performance
The goodness of fit between the BRT model and data was assessed using the linear relationship between the predicted and observed values, the coefficient of determination (R 2 ), and the root mean square error (RMSE) and is shown in Tables S1   and S2.For both the observational databases and ESMs databases, the BRT models showed good performance with high R 2 values in most of the databases, but the performance was relatively lower for NCSCD and CMCC (northern soils).5

Global soil
The relative contributions of variables in the BRT model of global SOC stock to the observational databases are shown in Fig. 3a.In HWSD, the contributions of land cover and mean annual temperature, CN ratio, and wetland ratio were high.For 10 IGBP-DIS, the mean annual temperature, followed by clay content, CN ratio, land cover, also greatly contributed.In particular, the mean annual temperature was very important.The contribution of elevation for each HWSD and IGBP-DIS was 6% and 7%, respectively.The NPP contributed 5% in both databases.
The relationships between the influential variables and SOC are shown in Fig. 4. In general, the two databases showed similar relationships.For example, SOC decreased with an increasing mean annual temperature, particularly for sites with a 15 mean annual temperature > 0 °C, but increased with increasing clay content and CN ratio.The SOC increased rapidly with an increasing CN ratio.Relationships with a mean annual temperature were relatively close to each other.The relationship with clay was steeper in IGBP-DIS than in HWSD, but the opposite was true for the CN ratio.With respect to land cover, evergreen needleleaf forest and permanent wetlands had higher SOC.

Northern soils 20
In the northern region, the dominant contributors differed among northern soil databases and from those identified in the global database analyses described above (Fig. 3b).For HWSD, the CN ratio was the dominant contributor, followed by the wetland ratio, clay content, and mean annual precipitation.In IGBP-DIS, clay content, CN ratio, and elevation were the most important contributors.For NSCD, elevation contributed the most (~25%), but all variables except for cropland ratio and HANPPpct contributed 5-15%.The mean annual temperature was not as influential as the global databases.25 The relationships between variables and SOC stock varied more among the databases for northern soils than those of global databases (Fig. 5).In addition, because the northern regions were extracted, the ranges of variables were narrower than the global databases.In NCSCD, the SOC decreased with increasing temperature and increased with increasing precipitation.
The SOC increased with increasing clay content and CN ratio in HWSD and IGBP-DIS, which was consistent with the findings obtained from the global databases.The increasing trend with increasing CN ratio was also observed in NCSCD.30 The SOC decreased with increasing elevation in all databases but showed considerable variability at low elevations.

Global soil
The contributions of each variable among ESMs highly varied, but the mean of the results of the ESMs showed that the mean annual temperature, land cover, and NPP distinctively contributed to SOC distribution, and large inconsistencies 35 between the observational databases and ESMs demonstrated low contributions of clay content and CN ratio and high contributions of NPP in ESMs (Fig. 6a).The contribution of NPP in ESMs was greater than in the observational databases.did not contribute to the ESMs (Fig. 6a), with respect to land cover, permanent wetlands had higher SOC (Fig. 7d).

Northern soils
The mean of the ESMs showed that for northern soils, the main contributors (mean annual temperature, land cover, and NPP) were mainly the same as in the ESMs' global outputs (Fig. 6b).The contribution of the mean annual temperature was lower than that for the global results of ESMs (mean of 14% for the northern and 29% for the global temperatures).The relatively large discrepancy between observational databases and ESMs included the lower contribution of clay content, CN ratio, and elevation and the higher contribution of mean annual temperature, land cover, and NPP in the ESMs.
The relationships between SOC and variables in ESMs as well as the results of the observational databases are shown in Fig. 8.The mean of ESMs showed that the SOC in the northern region increased with increasing NPP, and the relationship was similar to that in HWSD, although the contribution of NPP in ESMs differed from those of the observational database.The decreasing trend with elevation was not replicated in the ESMs.

Discussion and concluding remarks
Using the data-mining technique, our BRT analyses revealed influential variables for global and northern SOC in the observational databases and the output of ESMs.The influential factors differed between observational databases and between the global and northern databases.Analyses of the ESMs' output showed large variability, but the influential factors were predominantly similar among ESMs (Fig. 6).This similarity most likely indicates that the structures of the models that describe SOC dynamics in the ESMs are similar.By comparing the results from the ESMs' output to those of observational databases, we identified differences between the observational databases and the output of ESM, which suggest directions for the future improvement of SOC sub-models of ESMs.
Our analyses revealed that the most distinct differences between the observational databases and the outputs of ESMs were the effects of the CN ratio and clay contents (Fig. 6).For both the global observational databases, the CN ratio had substantial contributions (Fig. 3a).The important contribution of the CN ratio was the same in the northern databases (Fig. 3b).The SOC increased with increasing CN ratio in the observational databases (Fig. 4), whereas the outputs of the ESM were insensitive to the CN ratio.The relationships between the CN ratio and SOC may show "causal-resultant" relationship.
The decomposability of organic matter generally decreases with increasing CN ratio (Berg et al. 2001;Zhang et al. 2008).In addition, when undecomposed organic matter accumulates, the CN ratio increases, as supported by the fact that soils with higher SOC have a higher CN ratio (Batjes, 1996).The high contribution of the CN ratio may suggest the importance of soil fertility (or nutrient availability) and plant litter quality in the carbon cycle and SOC accumulation (Cotrufo et al., 2013;Fernández-Martínez et al., 2014;Liski et al., 2005;Tuomi et al., 2009;Ťupek et al., 2016).All of the ESMs except for the BCC, CESM1, and NorESM in CMIP5 do not have terrestrial nitrogen processes (Todd-Brown et al., 2013).Including the nitrogen process has been suggested as an important improvement for the next model intercomparison (CMIP6) (Hajima et al., 2014;Zaehle et al., 2015).The results derived from our analysis support the importance of the appropriate inclusion of the N cycle in ESM models.Clay content is also often used as a regulator of the decomposability of organic matter in soil (e.g., CENTURY and RothC).
Generally, high clay content inhibits organic matter decomposition in the soil.In addition, high clay content often results in low drainage and anaerobic soil conditions, which also inhibits organic matter decomposition.For IGBP-DIS, clay content had as high a contribution as the CN ratio.The control of decomposability by clay content has been previously incorporated in site-scale process-based models (Parton et al., 1987).Including the influence of clay on decomposition will be needed in ESMs.
The mean annual temperature was identified as an influential factor in global databases (Fig. 3a) but not in northern soils (Fig. 3b).Temperature is a main controller of both plant production (source of carbon input to soil) and the decomposition of soil organic matter, which are already incorporated in ESMs.The temperature sensitivities, Q10 values, of soil organic matter decomposition in ESMs were reported to be 1.4 to 2.2 (Todd-Brown et al., 2014), and our analyses showed the diverse relationships between the mean annual temperature and SOC.The lower contribution of mean annual temperature in northern soils most likely exists because temperature sensitivity is an exponential process and the magnitude of changes with changing temperature is relatively small in a low temperature range.The use of temperature sensitivity that captures the observational data will improve the performance of ESMs.The relationships between SOC and temperature obtained in this study are the integration of temperature sensitivity of both plant production and soil organic decomposition and thus do not provide the sensitivity of individual processes for ESMs.However, the results of this study can be used to examine the consistency between the ESM output and observational databases.
In ESMs, NPP was selected as an influential factor in ESM analyses for global and northern SOC (Fig. 6ab) but not in observational databases (Fig. 3ab), which is consistent with findings obtained in a previous study (Todd-Brown et al., 2013).This high NPP contribution in ESMs is understandable because in the terrestrial carbon balance modelled in ESMs, the SOC stock is calculated through NPP or plant litter input to soil and soil organic matter decomposition.Plant litter input is proportional to NPP.However, our analyses suggest that the influence of NPP on soil organic matter in observational soil databases was obscured by other factors.When ESMs incorporate the effects of other factors, for example, the CN ratio and clay contents, the effect of NPP may be diluted in ESMs.Furthermore, SOC storage results from organic matter accumulation over decades and even millennia.Thus, past NPP, land fires, and land-use change may still have an effect on current SOC (Carvalhais et al., 2008;Wutzler and Reichstein, 2007).Land cover was also an important factor.Wetland is one of influential land cover types with high carbon contents.In general, wetland soil stores more carbon per unit area than upland soils.Incorporating hydrology and the resulting carbon dynamics in wetlands would be an important improvement for ESMs.
Elevation was revealed as an influential factor particularly in northern observational databases (Fig. 3b).We speculate that elevation may serve as a comprehensive index of SOC in a limited area because other variables, such as temperature, NPP, soil texture and other factors, change with increasing elevation.The effect of elevation in ESMs was not as high as in observational databases (Fig. 6).We estimated that the effect of elevation might automatically increase if the aforementioned other processes are properly adjusted/included in ESMs.
We examined key factors from a wide variety of candidate properties, but some potentially important mechanisms that would improve the reproducibility of SOC by ESMs and process-based ecosystem models may be missing.For example, it has been suggested that including microbial dynamics in SOC models improves projections of global soil carbon by ESMs (Wieder et al., 2013).Mycorrhizae have been reported to play an important role in soil carbon storage (Averill et al., 2014).
Because soil carbon accumulation and decomposition are slow processes and land cover is an important factor of SOC, as shown in our study, taking land-use history into consideration may be essential.In addition, because soil has depth and SOC and soil environments vary according to depth (Davidson and Trumbore, 1995;Hashimoto and Komatsu, 2006;Jobbágy and Jackson, 2000), vertical soil heterogeneity/processes are important (Braakhekke et al., 2013;Wieder et al., 2013).The importance of mineral reactivity has also been suggested (Doetterl et al., 2015).However, our results may suggest that the Another approach would be model-data fusion (assimilation) (Hararuk et al., 2014).Constraining model parameters by observational databases through data assimilation such as a Bayesian approach would improve the performance of ESMs.
Another uncertainty of this analysis is the issue of scale: if the analysis is applied at a much finer resolution, such as 1 km, then the influential factors may differ.5 In this study, the same data-mining BRT algorithm was applied to observational databases of SOC stock and ESM outputs.
By comparing the outputs from both analyses, we revealed the similarities and differences between the observational databases and ESMs.On the global scale, incorporating the influence of the CN ratio and clay content in ESMs was identified as a potential means to improve the ability of these models to reproduce the distribution of SOC in observational databases.The results of this study will help elucidate the nature of both observational SOC databases and ESM outputs and 10 improve the terrestrial carbon dynamics modelled in ESMs.This study demonstrates that the data-mining scheme can be used to compare results from observational databases and ESMs in detail and to determine the key factors involved in the mismatches.
Geosci.Model Dev. Discuss., doi:10.5194/gmd-2016-138, 2016 Manuscript under review for journal Geosci.Model Dev.Published: 30 June 2016 c Author(s) 2016.CC-BY 3.0 License.The fitted functions were centred by subtracting their mean.See Table A3 for land cover classifications.Because of the small number of data points, the results for "15, Permanent Snow and Ice" were not shown (e). 5 Geosci . Model Dev. Discuss., doi:10.5194/gmd-2016-138, 2016 Manuscript under review for journal Geosci.Model Dev.Published: 30 June 2016 c Author(s) 2016.CC-BY 3.0 License.The fitted functions were centred by subtracting their mean.See Table A3 for land cover classifications.Because of the small number of data points, the results for "15, Permanent Snow and Ice" were not shown (d). 5 Geosci . Model Dev. Discuss., doi:10.5194/gmd-2016-138, 2016 Manuscript under review for journal Geosci.Model Dev.Published: 30 June 2016 c Author(s) 2016.CC-BY 3.0 License.A3 for land cover classifications.Because of the small number of data points, the results for "15, Permanent Snow and Ice" were not shown (c).5 Geosci . Model Dev. Discuss., doi:10.5194/gmd-2016-138, 2016 Manuscript under review for journal Geosci.Model Dev.Published: 30 June 2016 c Author(s) 2016.CC-BY 3.0 License.
Fig. 7.The relationships between SOC and certain variables largely varied among the ESMs databases, particularly for the mean annual temperature.The SOC decreased with increasing mean annual temperature but increased with increasing precipitation and NPP.The mean of the relationship with mean annual temperature for ESMs was very consistent with that in the HWSD and IGBP-DIS databases at temperature range −5-15 °C.The increasing trend with increasing NPP in ESMs was consistent with that of the HWSD, particularly below approximately 500 g C m −2 of NPP.Although the wetland ratio Geosci.ModelDev.Discuss., doi:10.5194/gmd-2016-138,2016   Manuscript under review for journal Geosci.Model Dev.Published: 30 June 2016 c Author(s) 2016.CC-BY 3.0 License.

Figure 2 .
Figure 2. Soil carbon stock (kg C m −2 ) from Earth system models (CMIP5)."ensemble" indicates the result of an ensemble of family members.

Figure 3 .
Figure 3. Relative contribution (influence) of predictive variables for the model of soil carbon stock in the global observational databases (left) and northern observational databases (right).

Figure 4 .
Figure 4. Effect of the most influential variables in the model of the soil carbon stock for each global observational database.

Figure 5 .
Figure 5.Effect of the most influential variables in the model of the soil carbon stock for each northern observational database.The fitted functions were centred by subtracting their mean.Note that the y-axis scales for clay and the CN ratio are different from other factors.

Figure 6 .Figure 7 .
Figure 6.Relative contribution (influence) of predictive variables for the model of the soil carbon stock from ESMs and the comparison with those of observational databases.Grey lines show the results of each ESM, and the purple line indicates the mean of the ESMs. 5

Figure 8 .
Figure 8.Effect of the most influential variables in the model for northern outputs from ESMs and the comparison with those of observational databases.Grey lines show results of each ESM, and the purple line indicates the mean of the ESMs.The fitted functions were centred by subtracting their mean.See Table A3 for land cover classifications.Because of the small

TableTable 1 .
Variables used in the analyses and their sources.1Theoriginaldatabaseprovidesmonthlydata.Annual means were calculated by the authors.5 * 2 The CN ratio was calculated by dividing the carbon density by nitrogen density.3Thenativedatabase is hydro1k, and its resolution is 1 km.The mean value of 1 km values was used in this study.Geosci.Model Dev.Discuss., doi:10.5194/gmd-2016-138,2016Manuscriptunder review for journal Geosci.Model Dev.Published: 30 June 2016 c Author(s) 2016.CC-BY 3.0 License.

Table A3 :
Classification of land cover.