Accurate model representation of land–atmosphere carbon fluxes is essential for climate projections. However, the exact responses of carbon cycle processes to climatic drivers often remain uncertain. Presently, knowledge derived from experiments, complemented by a steadily evolving body of mechanistic theory, provides the main basis for developing such models. The rapidly increasing availability of measurements may facilitate new ways of identifying suitable model structures using machine learning. Here, we explore the potential of gene expression programming (GEP) to derive relevant model formulations based solely on the signals present in data, by automatically applying various mathematical transformations to potential predictors and repeatedly evolving the resulting model structures. In contrast to most other machine learning regression techniques, the GEP approach generates “readable” models that allow for prediction and possibly for interpretation. Our study is based on two cases: artificially generated data and real observations. Simulations based on artificial data show that GEP is successful in identifying prescribed functions, with the prediction capacity of the models comparable to four state-of-the-art machine learning methods (random forests, support vector machines, artificial neural networks, and kernel ridge regressions). Based on real observations, we explore the responses of the different components of terrestrial respiration at an oak forest in south-eastern England. We find that the GEP-retrieved models often predict better than some established respiration models. Based on their structures, we find previously unconsidered exponential dependencies of respiration on seasonal ecosystem carbon assimilation and water dynamics. We note that the GEP models are only partly portable across respiration components; the identification of a “general” terrestrial respiration model is possibly prevented by equifinality issues.
Overall, GEP is a promising tool for uncovering new model structures for terrestrial ecology in the data-rich era, complementing more traditional modelling approaches.

One prerequisite to understand and anticipate the global consequences of
anthropogenic climate change is an accurate quantitative description of the
terrestrial carbon cycle

Traditionally, respiration models have been based on some theoretical
considerations, but largely remain empirical in nature

Direct approach and reverse engineering in model development for describing dynamical systems. Existing and possible steps needed in the process of building a model. In the direct approach, the process starts with building a hypothesis from existing knowledge. The hypothesis is then subject to abstraction and is summarized in a mathematical model that has two components: the structure and the parameters. The mathematical model can be translated into a computational form that generates predictions. Depending on how well the predicted values recreate the available observations, the model's parameters are calibrated or, if the general trends are missed, a structural reformulation may be needed. In the reverse engineering approach, by contrast, a machine learning method is used to generate a set of candidate models that are compared with the available observations and that, depending on their prediction capacity, may go through structural changes by automatic evolution or through a final parameter adaptation. From the set of evolved models, the best model in terms of prediction capacity is chosen, and its structure becomes the basis for hypothesis building: an expert would try to explain why a specific structure was automatically evolved and whether it can be explained from the processes intrinsic to the studied system. If that is the case, and the structure has not emerged randomly, the conclusions can be compared with existing knowledge, which can be reconfirmed, or new aspects of the studied system might be brought to light.

We explore the possibility of reverse engineering offering an automated
alternative to model development for predicting terrestrial carbon fluxes
(Fig.

Of course, expert knowledge still has a large influence on the modelling process: only a certain set of variables can be measured, and an even smaller subset is actually available for model development, which also restricts the analysis to a certain plausible number of time lags; hence, full objectivity of automatic model development cannot truly be achieved. Furthermore, expert knowledge comes into play when the algorithm is configured, by tuning the set of parameters according to the problem to be solved, as well as during observation collection and in the final decision on whether the solution returned by the algorithm makes sense at all and can be used further. Nevertheless, we believe that shifting the moment at which the analyst makes the decision regarding the selected model achieves a larger degree of objectivity in modelling.

Reverse engineering is close to machine learning based regression techniques,
where various candidate model formulations and specifications are explored in
order to minimize the prediction error. The fundamental difference from
typical model building is that reverse engineering typically provides a
symbolic regression, that is, the resulting structures are ideally directly
readable as mathematical functions (i.e. response functions) and can be
interpreted. The readable character of the returned solutions allows us to
consider the applicability of the derived structures in other system domains

Here, we focus on the gene expression programming

We also seek to understand whether automating model development can
provide new insights into understanding the dynamics of terrestrial
respiration processes. We base our study on data from a long-term monitoring
experiment of

The fundamental question addressed in this paper is whether regression models can be constructed more objectively by leaving the task of proposing a final regression model to an algorithm rather than directly to an analyst. The need for human intuition during the actual process of constructing a regression model becomes reduced, and the input of expert knowledge shifts towards identifying input variables, parameters, a suitable cost function and model plausibility.

With the current study we also investigate whether automatically derived
model structures differ substantially from models conventionally used in the
study of

The work flow used in solving symbolic regression problems with GEP. The process of evolving an optimal solution from observations starts with randomly generating a set number of evolution individuals called chromosomes. The chromosomes are composed of genes, i.e. strings encoding expression trees that are translated into mathematical expressions in the subsequent step. Each emerging individual (model) is then evaluated against the target variable values and assigned a fitness value. If the stopping criterion has not been reached (e.g. best fitness possible, highest number of generations allowed, convergence), the best individual in terms of fitness is saved and the remaining set of chromosomes is selected for genetic manipulation. When the stopping criterion is reached, the parameters of the best chromosome are calibrated against the training data with an optimization approach, the CMA-ES, and the best solution is returned.
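The loop described in this caption can be sketched generically. In the sketch below, the toy chromosome encoding (a pair of numbers), the mutation scheme, and all settings are illustrative stand-ins, not the implementation used in this study.

```python
import random

random.seed(42)

# Toy stand-in: "chromosomes" are (a, b) pairs encoding the model y = a*x + b,
# evolved to fit a prescribed target line. Only the generic
# generate -> translate -> evaluate -> select -> reproduce loop is illustrated.
xs = [x / 10.0 for x in range(20)]
target = [2.0 * x + 1.0 for x in xs]

def translate(chrom):
    a, b = chrom
    return [a * x + b for x in xs]

def fitness(chrom):
    # Higher is better: negative sum of squared errors against the target.
    return -sum((p - t) ** 2 for p, t in zip(translate(chrom), target))

def mutate(chrom, rate=0.5):
    return [c + random.gauss(0.0, 0.1) if random.random() < rate else c
            for c in chrom]

population = [[random.uniform(-3, 3), random.uniform(-3, 3)] for _ in range(30)]
for generation in range(200):
    population.sort(key=fitness, reverse=True)
    best = population[0]
    if fitness(best) > -1e-4:            # stopping criterion: fitness reached
        break
    # Elitism: carry the best chromosome over unchanged; refill the rest
    # with mutated copies of chromosomes drawn from the fittest ten.
    population = [best] + [mutate(random.choice(population[:10]))
                           for _ in range(len(population) - 1)]
```

Real GEP differs in that chromosomes encode expression trees, so evolution searches over model structures rather than parameters, and several genetic operators beyond point mutation are applied.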

First, we introduce the GEP methodology and explore its performance for
symbolic regression types of problems using an artificial experiment under
varying degrees of noise contamination designed to resemble

The observational record provided by

For both the artificial experiment and real-world observations, we systematically confront the prediction error of GEP with other state-of-the-art machine learning regression approaches. In addition, we adjust the modelling approach such that the objective function (or fitness function) not only accounts for absolute or relative error, but also reduces structure in the residuals. The discussion focuses on the comparison of the various GEP-derived models, their equifinality, and performance compared to widely used literature models.

We rely on the GEP method

The process of finding the most suitable model structure based on the signal
present in data in GEP starts with an initial generation of

GEP evolution process components.

The choice of input functions used for applying mathematical transformations to the predictors depends on the type of problem we try to solve with GEP. When the problem is a symbolic regression type of problem, as here, most often a set of primitive functions is proposed, such as addition, multiplication, or exponential. More complex functions could increase model complexity too much and risk overfitting. However, if there are already known functional transformations of certain predictors that could be part of the final desired solution, the user can define a new function and introduce it in the set of input functions.

All genes are made up of a “gene head”, containing a combination of
characters mapping to both predictors and functional transformations, and a
“gene tail”, with characters that map only to predictors. The gene length
is given by

As in biological evolution, regardless of the actual length, the GEP genes have
active sections of variable length called “open reading frames” (ORFs) that
can encode various expression trees which can be evaluated into mathematical
expressions
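The translation of a gene into its expression tree follows Karva notation: symbols are read left to right and fill the tree level by level, so only a prefix of the gene (the ORF) is actually used. A minimal sketch, with a hypothetical symbol set and example gene:

```python
import math

ARITY = {'+': 2, '*': 2, 'Q': 1}      # 'Q' denotes square root (illustrative set)
TERMINALS = {'a', 'b'}

def decode(gene):
    """Translate a Karva-notation gene into its ORF expression tree
    (level-order filling; symbols past the ORF remain unused)."""
    nodes = [[sym, []] for sym in gene]
    next_child, i = 1, 0
    while i < next_child and i < len(gene):
        for _ in range(ARITY.get(nodes[i][0], 0)):
            nodes[i][1].append(nodes[next_child])
            next_child += 1
        i += 1
    def to_tuple(node):
        sym, children = node
        return sym if sym in TERMINALS else (sym, *(to_tuple(c) for c in children))
    return to_tuple(nodes[0])

FUNCS = {'+': lambda x, y: x + y, '*': lambda x, y: x * y, 'Q': math.sqrt}

def evaluate(expr, env):
    """Evaluate a decoded expression tree for given terminal values."""
    if isinstance(expr, str):
        return env[expr]
    sym, *args = expr
    return FUNCS[sym](*(evaluate(a, env) for a in args))
```

For the example gene `+Qa*bab`, the ORF covers the first six symbols and decodes to sqrt(b*a) + a; the trailing `b` is inactive but can become active after mutation.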

The total set of chromosomes generated at each evolution step makes up the GEP population. The evolution steps are also known as “generations”. The maximum number of generations allowed in the search for a solution is often used as a stopping criterion.

One of the crucial components of model development within an evolutionary algorithm is the selection process. In GEP, the chromosomes can be translated into mathematical expressions that can be evaluated, and a distance between the predictions of the current structure and the original target is computed. These measures are known as “fitness values” and are assigned to all chromosomes in the population at each generation by means of a predefined fitness function. The evolution of the final solution with GEP is based on optimizing the fitness function values after each generation, usually by minimizing the prediction error, but more complex criteria can be taken into account as well.

Once all the fitness values have been computed and assigned, the chromosomes in a generation are sorted from best to worst fit.

If no stop criteria have been met, preparations for the reproduction of new
chromosomes for the next generation are made. The chromosome with the best
fitness value is reproduced unchanged in the first position of the new
generation. To fill the remaining

In tournament selection, two chromosomes are randomly selected from the entire population and the individual with the better fitness value goes through.
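A minimal sketch of this selection step, assuming fitness values are stored in a mapping and higher values are better (all names are illustrative):

```python
import random

def tournament_select(population, fitness_of, rng=random):
    """Draw two chromosomes at random from the whole population and
    return the one with the better (higher) fitness value."""
    a, b = rng.sample(population, 2)
    return a if fitness_of[a] >= fitness_of[b] else b
```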

To ensure that novel material is introduced in the pool of possible model
structures,

Once the population of chromosomes is ready for the new generation, the evolution procedure is repeated until a stop criterion is reached, such as best fitness achieved, maximum number of unimproved generations reached, or time limit.

The hyper-parameter set needed for a GEP run, i.e. the set of all parameters
that need to be fixed before a run is performed, has either components with
recommended default values, especially for the genetic operator rates
considered when applying the available genetic operators

This is the case for the gene-head length and the number of genes per chromosome: smaller values favour more compact solutions, whereas larger values can lead to a fast expansion of solution length that easily overfits the training target. When the chromosome lengths are kept too low, however, the structures in the population can converge prematurely to a unique solution that might lack the ability to capture meaningful signals present in the training data, owing to the low diversity of the encoded expression trees.

Another important component of the hyper-parameter set to fix is the mutation rate, one of the genetic variation operators. When the mutation rate is too large, it can become disruptive and lead to a loss of information acquired along the previous evolutionary time steps, reducing the overall convergence of the GEP run. Conversely, if the rate is too low, relevant structures may not be constructed within the given time limit.

The current implementation of the GEP approach does not contain an explicit
population diversity management component which could increase the confidence
that a certain solution did not just appear by chance but was actually
selected over a larger pool of possible model structure types. In order to
reduce stochastic bias and avoid getting stuck in local optima that would
produce overfitted results, we chose the practical approach of multi-start
(multiple runs with the same settings) as proposed by

The version of the GEP method presented in this paper was implemented by the
first author in the C

In our study, the fitness measure is reported in terms of the Nash–Sutcliffe
modelling efficiency (MEF) coefficient

During the GEP learning process, however, we use the (1

Although the MEF metric offers a straightforward interpretation, it does not take the number of parameters of the models into account. In real-world applications, it might be desirable to derive models with fewer parameters if those are not (much) worse in terms of prediction capacity than models with a higher number of free terms. Thus, we include in our cost (fitness) function a normalized term related to the number of parameters (ratio of the current number of parameters to the maximum number of possible parameters given the GEP run settings).
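The resulting cost can be sketched as the (1 − MEF) error term plus the normalized parameter count; the equal weighting of the two terms shown below is an illustrative assumption, not the study's exact formulation.

```python
def mef(obs, pred):
    """Nash-Sutcliffe modelling efficiency: 1 is a perfect fit,
    0 means the model predicts no better than the observation mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sstot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sstot

def cost(obs, pred, n_params, max_params):
    """(1 - MEF) plus a normalized parameter-count penalty
    (equal weights assumed for illustration); lower is better."""
    return (1.0 - mef(obs, pred)) + n_params / max_params
```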

Moreover, any systematic pattern in the model residuals needs to be reduced
as the latter should ideally only represent uncorrelated noise. To meet this
criterion, we complement the fitness function with a term related to the
information content (entropy) in the residual time series. Entropy values
would be maximized for data without structure (i.e. white noise), and lower
entropy values would be obtained for structured data, e.g. correlated
stochastic or deterministic processes

In short, the calculation of an entropy as a measure for randomness from a
time series (e.g. Shannon's entropy) requires us to determine a probability
distribution that underlies the time series (or dynamical system), which is
usually done by a partitioning step (also called phase space reconstruction
in other contexts). This is a fundamental step in the methodology, and
various methods have been used to arrive at this probability distribution,
for instance frequency or histogram-based measures, procedures based on
amplitude statistics, or symbolic dynamics (see e.g

As our aim is to minimize structure in the residuals, the temporal order
becomes important. In recent years, the Bandt–Pompe approach has become
popular, because it directly takes time sequences into account: the technique
hence divides the time series into ordinal sequences (i.e. ordinal patterns,
or symbolic sequences), and then computes entropy measures directly from the
probability distribution of these ordinal patterns

This approach has a number of advantages: it is robust to noise (no
sensitivity to numeric outliers) and to trends or drift in the data; it is an
(almost) non-parametric method requiring no prior assumptions about the data
(the only parameter to be specified is the embedding dimension, i.e. the
window length); and it allows us to disentangle the various possible states
of the system that are then encoded in the probability distribution (see e.g.

The single parameter that needs specification is the window length. This
parameter is fixed to
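The Bandt–Pompe entropy of the residual series can be sketched as follows; treating ties via a stable sort and normalizing by the maximum entropy log(m!) are simplifying assumptions of this sketch.

```python
import math
from itertools import permutations

def permutation_entropy(series, m=3):
    """Normalized Bandt-Pompe permutation entropy for embedding dimension m.
    Values near 1 indicate unstructured (white-noise-like) data; lower
    values indicate structured, e.g. correlated or deterministic, data."""
    counts = {p: 0 for p in permutations(range(m))}
    n = len(series) - m + 1
    for i in range(n):
        window = series[i:i + m]
        # Ordinal pattern: the argsort of the window (stable for ties).
        pattern = tuple(sorted(range(m), key=lambda k: window[k]))
        counts[pattern] += 1
    probs = [c / n for c in counts.values() if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(math.factorial(m))
```

A monotone series yields a single ordinal pattern and hence entropy 0, whereas an unstructured series spreads probability over all m! patterns.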

The final normalized form of the fitness function further used in our work is

To assess the effect of adding the entropy component for the residuals in the
CEM fitness function, we also introduce a fitness measure containing only the
MEF and the number-of-parameters terms.

The GEP algorithm does not have a specific treatment of constants in the building of model formulations, but mutations can change both the model structure and constants. However, the scaling of constant values (model parameters) might be a decisive factor in adequately determining the fitness of a formulation. Without this, a model structure might be discarded regardless of potentially being a very powerful candidate. Furthermore, model parameters are often very informative regarding a system's sensitivity to some modifications of the drivers. These aspects have led to the addition of a final parameter optimization step at the end of each GEP run.

In order to obtain an optimal set of parameters for the GEP-extracted model
structures, an approach that would be applicable in a large set of generated
search spaces was necessary. Here we use the covariance matrix adaptation
evolution strategy (CMA-ES,

The CMA-ES version used for the final step of optimization is the Hansen
Python implementation found at
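For illustration only, the calibration step can be mimicked with a greatly simplified elitist evolution strategy: the sketch below samples isotropic Gaussian offspring with a decaying step size and keeps improvements, omitting the covariance adaptation that distinguishes the actual CMA-ES implementation used here. The model structure and data are synthetic stand-ins.

```python
import math
import random

random.seed(0)

def calibrate(model, params, obs_x, obs_y, sigma=0.5, lam=20, iters=300):
    """Elitist (1+lambda) evolution strategy: a crude stand-in for CMA-ES."""
    def sse(p):
        return sum((model(x, p) - y) ** 2 for x, y in zip(obs_x, obs_y))
    best, best_err = list(params), sse(params)
    for _ in range(iters):
        for _ in range(lam):
            cand = [v + random.gauss(0.0, sigma) for v in best]
            err = sse(cand)
            if err < best_err:           # elitism: keep only improvements
                best, best_err = cand, err
        sigma *= 0.99                    # simple step-size decay
    return best, best_err

# Calibrate the parameters of a fixed structure y = p0 * exp(p1 * x)
# against synthetic observations (illustrative data, not the study's).
xs = [i / 10.0 for i in range(30)]
ys = [1.5 * math.exp(0.8 * x) for x in xs]
model = lambda x, p: p[0] * math.exp(p[1] * x)
params, err = calibrate(model, [1.0, 0.0], xs, ys)
```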

To explore the possibility of using GEP in developing relevant model
structures for describing the terrestrial carbon fluxes, two case studies
were designed: first, an experiment based on artificially generated data to
better understand and present the general properties and capacities of GEP;
second, an exploration of GEP on real measurements of various
respiratory flux components monitored continuously over 2 years in an oak
forest

These experiments were designed to explore whether our implementation of the
GEP method is suitable for symbolic regression types of problems, and how
robust/vulnerable it is across various signal-to-noise ratios. We explored a
set of functions with increasing levels of non-linearity to generate data
points.

To investigate the capacity of GEP to also reconstruct a simple model used in
the ecology field, we introduced an additional artificial test for the
“

In order to investigate the response of the GEP approach to
noise-contaminated data, we simulated Gaussian noise that scales with signal
amplitude as often observed in the case of terrestrial ecosystem

For each of these functions and SNR levels, we sampled 100 validation data points 10 times; 20 GEP runs were performed on the 1000 training data points and the GEP model structure with the highest mean MEF value over the 10 validation sets was chosen.
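The noise contamination used in these experiments can be sketched as follows; the assumption that the noise standard deviation equals the local signal amplitude divided by the SNR is an illustrative choice that may differ from the study's exact definition.

```python
import random

random.seed(1)

def add_scaled_noise(signal, snr):
    """Add zero-mean Gaussian noise whose standard deviation scales with
    the local signal amplitude, keeping the signal-to-noise ratio near snr."""
    return [s + random.gauss(0.0, abs(s) / snr) for s in signal]

clean = [1.0, 2.0, 4.0, 8.0]
noisy = add_scaled_noise(clean, snr=10.0)
```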

GEP settings.

As the choice of fitness function was crucial for the construction of
structures in a GEP type of approach, we also investigated in one experiment
the effects of minimizing the CEM values (Eq.

The prediction performance of the best GEP-derived models based on the data
in Sect.

The toolboxes and settings used for generating the predictions by the ANN and
KRR methods are described by

All the machine learning approaches presented here were applied to the same training data sets as those used for building the GEP models, and their predicted values were compared with the validation sets used for determining the best GEP solution.

In the second experiment we assessed the possibility of reverse-engineering
model structures

The Alice Holt data set contains observations of

A multiplexed chamber system was used for separately measuring soil
respiration (

The above-ground respiration (

We used the following candidate driver variables: soil volumetric moisture
measurements, air temperature (from micro-meteorological stations),
temperatures at different soil depths, and GPP. A number of recent studies
have shown a tight linkage between GPP and

To reduce the skewness and the search space that the GEP evolution would have
to cover in order to construct valuable solutions

As such, if the log model is
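The back-transformation from log space requires a bias correction, since taking exp() of an unbiased log-space prediction underestimates the mean on the original scale. We sketch this with Duan's smearing estimate, a common choice; the study's exact smear term may differ.

```python
import math

def smearing_backtransform(log_preds, log_obs):
    """Back-transform log-space predictions to the original scale using
    Duan's smearing estimate: exp(prediction) times the mean of
    exp(log-space residuals). (Assumed form of the bias correction.)"""
    residuals = [o - p for o, p in zip(log_obs, log_preds)]
    smear = sum(math.exp(r) for r in residuals) / len(residuals)
    return [math.exp(p) * smear for p in log_preds]
```

With zero residuals the smear factor is 1 and the correction vanishes; positive residual spread inflates the back-transformed values, as intended.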

For each combination of respiration target and possible drivers, 50 subsets
of 500 target time steps each were randomly selected and used for the
training of GEP models using the settings found in Table

We were particularly interested in determining the general character of each
extracted model with respect to the different respiration fractions. We
therefore re-optimized the parameters of all extracted model structures when
applying one extracted model as the candidate function for a different
respiration term. For example, the model formulation extracted for

Respiration model formulations commonly used in the environmental science community.

Effect of adding noise to the original signal on the prediction
capacity for GEP, KRR, RF, SVM and ANN. The first panel contains the
evolution of mean modelling efficiency (MEF) values from 20 independent runs
for each increasing level of noise. MEF is computed after learning from a
data set of 200 data points and validating against 1000 data points
containing noise. The second panel shows the evolution of mean MEF values
from 20 independent runs for each increasing level of noise where MEF is
computed after learning from a data set of 200 data points and validating
against noise-free 1000 data points generated from
Eq. (

As in the artificial example, we compared the returned GEP solution
prediction performance with that of other common MLMs such as SVM, KRR, ANN,
and RF. All methods were used to generate 50 subsets of 113 prediction
values, after training on the 50 subsets of 500 time steps of observations
presented at the start of Sect.

A comparison was done between the GEP-built models and some common literature
respiration models with different structures and driving variables that were
also optimized using CMA-ES. The optimization was performed for each
respiration data set and its candidate drivers and parameters
(Table

In the first artificial experiment the GEP approach is used to verify whether
it can reconstruct prescribed functions. Following the training of the
20 independent GEP runs, the initial functions were successfully
reconstructed for all 10 equations defined in
Sect.

For the

Effects on modelling performance and parameter number caused by
choice of fitness function during GEP training for artificial noisy data
generated by Eq. (

MEF values for the GEP-extracted models and for the predictions generated by
ANN, RF, KRR and SVM are illustrated in Fig.

Figure

In order to verify the effects of changing the fitness function from MEF to
CEM, we compare the distributions of MEF values for all runs for all studied
SNRs. Figure

Applying GEP to the Alice Holt data set yielded a series of model structures
for each respiration type. The returned model structures after bias corrected
back-transformation are illustrated in
Eqs. (

Whilst GEP-derived models may differ between respiration types, there are a
number of equivalent models for different respiration components.

Observed and predicted outgoing CO

Observed and predicted outgoing CO

Modelling performance for all extracted model structures after cross-validation over 90 cases.

The highest performance in terms of MEF value was recorded for

In order to explore the capacity of the GEP models generated for the

The residuals depict some remaining patterns (Figs.

Average validation MEF performance for all extracted model structures
when re-optimized against all other respiration CO

Observed and predicted outgoing CO

We investigated the capacity of each extracted model structure
(Eqs.

After optimization, none of the structures show an overall best MEF for all
the

The prediction capacity of the GEP-generated models in the context of other
commonly utilized MLMs was assessed as well. KRR, ANN, SVM and RF were used
for generating 113 predicted data points as described in Sect. 3.2
(Fig.

Average validation MEF performance for CMA-ES optimized selected
literature model formulations when compared with respiration CO

Observed and predicted outgoing CO

Observed versus predicted

Residuals computed for smear term bias corrected back-transformed
GEP models for various types of CO

Observed CO

Machine learning methods (MLM) prediction performance for all
respiration components

MEF validation values for literature models and for the best GEP
model in terms of MEF at each respiration level. Each

Daily

Lastly, the GEP-generated models were compared with some of the most commonly
used literature models for describing respiration. The resulting MEF values
obtained after individual parameter optimization using the CMA-ES procedure
for each literature model are given in Table

As the studied literature models performed best in modelling

In this work, the primary reason for the artificial experiments was to obtain
a better understanding of the capacity of GEP to solve symbolic regression
types of problems. We put an emphasis on GEP performance in the presence of
noise. This aspect was important, given that monitoring data from terrestrial
ecosystem CO

Our findings illustrate that the selection of CEM over MEF as a fitness
function for optimization has a minor effect on the global mean MEF
(Fig.

One of the critical aspects in our work is that GEP, as implemented here, can
only represent and derive “

Lagged responses can only be detected if the number of lags from a driver is correctly included in the input, which already implies sufficient knowledge of their existence and behaviour. Whilst in the current implementation of the GEP algorithm, shifts in conditions and responses cannot be encoded or detected, these could be addressed with the inclusion of a conditional operator in the set of functions encoded in the GEP evolution individuals.

Nevertheless, it is fair to mention that the same limitations can affect the
results of the other MLM and empirical models presented in this paper. A
clear advantage that ANN, RF and SVM have over the GEP symbolic regression
construction, though, is that when the target variable presents a
skewed distribution,

We automatically generated a series of model structures to describe
terrestrial CO

Interestingly, the models derived for

When we compared the GEP-derived models with the community established
semi-empirical models from a structural point of view, we found that they
shared some key features for temperature dependencies of CO

A major difference was in the response of the respiration components to SWC, where the GEP models often chose SWC as one of the drivers. Moreover, the GEP models often contained an exponential dependency, i.e. there are only certain parts of the signal that are strongly sensitive to varying SWC. We believe that the exponential dependency of terrestrial ecosystem respiration components on SWC is a very intuitive pattern that has not yet been reported in the literature and requires further exploration.

Another difference we found was the strongly seasonal response of the respiration components to GPP, possibly as a proxy to light and vegetation availability which were not included in the set of candidate predictors.

Considering that GEP identified plausible models that are structurally very different from previously reported semi-empirical models while still yielding equivalent or better modelling performance, the validity of the conventional semi-empirical models can be questioned. Nevertheless, we do believe that a more in-depth analysis is needed to determine whether the GEP-described processes make actual biological sense and whether the selected drivers and their interactions represent true processes and responses.

During our study, it was apparent that, for all the studied methods, the highest MEF values were obtained for the respiration types that were directly measured rather than derived. It might be that when fluxes are obtained by derivation, the measurement error also increases, and the fraction of clear signal remaining in the observations is not sufficient for constructing a good model with GEP.

All GEP-generated models underestimated the high respiration fluxes
(Fig.

A more in-depth comparison of all the GEP and conventional respiration
models, based on a timescale-dependent assessment of model–data mismatch

The question is whether the GEP method lacks the ability to build models that
correctly represent the processes and their fast dynamic responses, or
whether the candidate drivers and the observations used for their
representation are simply not sufficient for generating representative
models. In the end, the response of

We believe that the consistent underestimation of fast responses was partly
due to surface moisture affecting litter decomposition and fungal activity,
since soil moisture was monitored only as an average over the top 8 cm while
the top few centimetres most likely present the highest activity, and partly
due to potential processes/drivers such as lags between GPP and
respiration

Another explanation for missing some of the (high-flux) variability could lie in our choice of fitness function. As we decided to penalize structures with many parameters during the learning process, it is likely that some structures were eliminated early on, even though they may have been well suited for describing a given process from a modelling-efficiency point of view. However, this is a trade-off between goodness of fit and structural simplicity, and in our approach we decided that simplicity of structure, i.e. the possibility of interpretation, is a very important asset.

We also explored the possibility that the underestimation of the carbon flux
variability was caused by the

Table

A critical question for the applicability of any ecosystem model is whether the model structure is more important than the parameterization of a given “best” model. For this question to be addressed, however, we need a larger sample of ecosystem types representative of different types of responses where we can explore the importance of the obtained structures and their parameter sets.

The comparison of GEP-generated models and machine learning methods showed a
narrow range of predicted fluxes (Fig.

Overall, our results suggest that the GEP approach is a potentially powerful
tool of reverse engineering, particularly helpful for building ecological
models when there is a minimum of a priori system understanding. We
exemplified this conceptually using artificial data, but also show that GEP
always yields results as good as or better than conventionally used
models in the case of ecosystem respiration. Based on data from a long-term
monitoring site of different respiratory fluxes, and using GEP as a reverse
engineering tool, we found new structures for modelling

The current study has also revealed methodological aspects that could be
improved. In particular, we found the inclusion of a parameter optimization
step very helpful to further test the transferability of model structures.
This approach could potentially be integrated into the GEP evolution.
More specifically, we think that the next development of GEP could include
the parameter optimization as an intermediate step before selection during
each evolution generation

All code and data used to produce the results of this paper can be provided upon request by contacting Iulia Ilie or Miguel D. Mahecha.

The authors declare that they have no conflict of interest.

We thank Markus Reichstein for all the useful comments and suggestions.

This work was supported by the International Max Planck Research School for Global Biogeochemical Cycles (IMPRS-gBGC), Jena, by the European Union's H2020 research and innovation programme project BACI, grant agreement 640176, and by NOVA grant UID/AMB/04085/2013. The Alice Holt Forest GHG Flux site is funded by the UK Forestry Commission. The article processing charges for this open-access publication were covered by the Max Planck Society. Edited by: Sandra Arndt Reviewed by: two anonymous referees