Error assessment of biogeochemical models by lower bound methods (NOMMA-1.0)

Biogeochemical models, capturing the major feedbacks of the pelagic ecosystem of the world ocean, are today often embedded into Earth System models, which are increasingly used for decision making regarding climate policies. These models contain poorly constrained parameters (e.g., maximum phytoplankton growth rate), which are typically adjusted until the model shows a reasonable behavior. Systematic approaches determine these parameters by minimizing the misfit between the model and observational data. In most common model approaches, however, the underlying functions mimicking the biogeochemical processes are non-linear and non-convex. Thus, systematic optimization algorithms are likely to get trapped in local minima and might lead to non-optimal results. To judge the quality of an obtained parameter estimate, we propose to determine a preferably large lower bound for the global optimum that is relatively easy to obtain and that helps to assess the quality of an optimum generated by an optimization algorithm. Due to the unavoidable noise component in all observations, such a lower bound is typically larger than zero. We suggest deriving such lower bounds based on typical properties of biogeochemical models (e.g., a limited number of extremes and a bounded time derivative). We illustrate the applicability of the method with two real-world examples. The first example uses real-world observations of the Baltic Sea in a box-model setup. The second example considers a three-dimensional coupled ocean circulation model in combination with satellite chlorophyll a data.


Introduction
Earth system models are widely used to assess the consequences of climate change and to explore climate engineering options (e.g., Brovkin et al., 2009; Keller et al., 2014; Mengis et al., 2015; Cao and Caldeira, 2008, 2010; and many more). In order to capture the development of climate-relevant greenhouse gases such as CO2 and N2O, a pelagic biogeochemical component, embedded into a numerical ocean model, is essential.
In contrast to ocean physics, which is derived from first principles, current biogeochemical modules are based on empirical relationships. Thus, several studies compare models of different complexities (e.g., Friedrichs et al., 2006). Still, there is no consensus yet on which complexity is needed to capture the major processes and how exactly the model should be formulated (e.g., Anderson, 2005; Löptien, 2011). Today, various model formulations exist. Popular examples are the BLING model with only four prognostic variables (Galbraith et al., 2010) versus the PISCES model, containing 24 prognostic variables (Aumont et al., 2015). Another related major problem, besides model complexity, is generally the multitude of poorly known model parameters, which exert crucial control on the model behavior (e.g., Kriest et al., 2010; Löptien and Dietze, 2017). To assess and compare the quality of the different model formulations, it is crucial to choose these parameters such that the fit to observations is as good as possible. Due to the large computational expense of three-dimensional coupled biogeochemical ocean models, it is common practice to adjust a few parameters "by hand" until the model shows "reasonable" agreement with some observations.

V. Sauerland et al.: Model error lower bounds

More advanced approaches use automated optimization techniques to estimate the optimal model parameters. These techniques require an objective metric (e.g., Evans, 2003) that measures the model-data misfit and can be minimized automatically. Due to computational limitations, various studies estimate the model parameters at single stations and adopt these for the full model (e.g., Kane et al., 2011; Kaufman et al., 2017; Matear, 1995; Schartau and Oschlies, 2003). For this approach, it can be problematic to determine a single parameter set for several sites (e.g., Kidston et al., 2011). Thus, other studies use fast approximations (Kennedy et al., 2006; Khatiwala, 2007) to be able to optimize certain parameters of full three-dimensional models. A drawback is that the latter approaches are generally restricted to the estimation of few parameters only (e.g., Mattern et al., 2012; Kriest et al., 2017; Prieß et al., 2013a, b; Piwonski and Slawig, 2016; Rückelt et al., 2010). In addition, limited data availability (e.g., Lawson et al., 1996) and a deficient representation of certain processes in the underlying ocean circulation model (e.g., Dietze and Löptien, 2013) encumber the optimization process. In summary, the systematic optimization of a 3-D coupled biogeochemical ocean model remains a difficult task and requires the advancement of existing methods (Schartau et al., 2017).
Biogeochemical processes are nonlinear, non-convex, and complexly entangled. Therefore, as stressed by several foregoing studies, associated model-data misfit measures comprise an unknown number of local optima, and the results of an optimization provide no proof of whether an obtained parameter set is globally optimal or not (e.g., Faugeras et al., 2003; Hurtt and Armstrong, 1996). Many parameter optimization studies invoke deterministic methods that use gradient information about the objective function (the model-data misfit measure) to iteratively approach a locally optimal set of parameters in an efficient way, starting from some initial guess. Most of these studies calculate gradient information by the adjoint method (introduced for biogeochemical models by Lawson et al., 1995, since it is efficient if there are more parameters than model states) and use the gradient to determine a direction and an efficient step size for changing the parameters, often by applying a quasi-Newton method (e.g., Fennel et al., 2001; Friedrichs, 2001, 2002; Spitz et al., 1998; Tjiputra et al., 2007; Xiao and Friedrichs, 2014). Other attempts focus on stochastic search algorithms, which rely on random decisions. Examples of stochastic search algorithms that have been applied to optimize parameters of biogeochemical models are simulated annealing (e.g., Hurtt and Armstrong, 1996, 1999; Matear, 1995; Kidston et al., 2011), genetic algorithms (e.g., Hemmings and Challenor, 2012; Kaufman et al., 2017; Schartau and Oschlies, 2003), and estimation-of-distribution algorithms (Kriest et al., 2017). Vallino (2000) compares the performance of a couple of optimization algorithms of both types, tuning the parameters of an ecosystem model against mesocosm data. Stochastic search algorithms require more model simulations (computation time) than gradient-based methods to converge but are less likely to get trapped in a "first available" local optimum (see Vallino, 2000), which might possibly be far off the global optimum. On the other hand, several contributions that focus on gradient-based methods aim to increase confidence in the quality of an obtained parameter set by repeating the optimization procedure many times (20-600) using various random starting points (e.g., Garcia-Gorriz et al., 2003; Hemmings et al., 2004; Schartau et al., 2001). This approach also increases the number of required simulations considerably.
Still, it is crucial to find a global optimum to assess the quality of a certain model formulation. Lacking a proof of the global optimality of chosen parameters, it is difficult to determine whether a model-data misfit is mainly caused by the parameter choice or attributed to other sources of uncertainty, like those concerning model equations or observational data (see, e.g., Faugeras et al., 2003; Spitz et al., 1998; Schartau et al., 2001). Facing this situation, we have a strong interest in estimating the deviation of a model-data misfit for a given parameter set relative to the unknown global optimum. As the minimal accomplishable model-data misfit (i.e., the global optimum) is unknown, a good (i.e., preferably large) and easy-to-obtain lower bound on that value would help to judge the quality of a minimum obtained by an automated optimization algorithm. Provided that such a lower bound is close to the obtained model-data misfit, a continuation of the parameter optimization process would not be necessary. In the present study, we introduce an approach to determine such lower bounds. We suggest considering a surrogate formulation that is easier to solve and determining the global optimum of this "relaxed" problem. Our approach is based on certain properties of typical biogeochemical models which are likewise fulfilled by non-parametric functions. We propose searching for the best fit to the observations among these functions, which is a much easier and faster optimization problem than minimizing the model-data misfit based on the full biogeochemical model. Optimizing these non-parametric functions provides the desired bounds on the lowest possible misfit of the actual model, since the properties we choose to constrain the generalized optimization problems are satisfied by each solution of the original problem.
The following section focuses on some typical properties of biogeochemical models which lead to the relaxed problems described above. The choice of the respective model properties is also based on the fact that efficient tailored algorithms for solving the associated relaxed problems are readily available. In Sect. 3, we examine the proposed method with regard to two characteristics of observational data: their noise level and coverage. For this purpose, we generate synthetic observations by adding random Gaussian noise to samples of a parameterized exemplary model trajectory. In the next step, we compare the results with our lower bound approaches, i.e., with the global optima of the corresponding easier optimization problems. We systematically examine the relation between both values depending on the sparseness of the observational data and the noise level. We further consider two real-world applications. The first application is based on a box model and investigates a common nutrients-phytoplankton-zooplankton-detritus (NPZD) biogeochemical model in combination with phytoplankton observations in the Baltic Sea. Our second example is based on global satellite observations of chlorophyll a and a coupled biogeochemical ocean general circulation model.

Methods
Comparing model output to observational data requires a criterion to measure the misfit between both data sets. To apply an automated optimization algorithm, such a measure needs to be reduced to a single real number. We provide commonly used measures in the following subsection. In Sect. 2.2, we introduce mathematical notation for the optimization problem based on the given measure for the model-data misfit. Additionally, we provide a mathematical formulation for the (simplified) non-parametric approach. We then give specifications of the non-parametric data-fit problem based on frequency limits on the parameterized models (Sect. 2.3 and 2.4), bounds on their derivatives (Sect. 2.5), and the combination of both (Sect. 2.6). These non-parametric relaxations will be used to calculate lower misfit bounds as outlined above.
For the sake of simplicity, we will consider scalar data in the following, assuming that both the observations o and the model output p are univariate. Actually, comprehensive global ocean models and observational data sets both comprise multiple quantities of interest on spatial grids. The presented lower bound methods can be utilized for that multi-variate case by applying them chunk-wise, for each quantity, and summing up the obtained results, optionally using weights for the single terms.
Objective judgment about the differences between observational data and model output requires an associated measure f err that assigns a real number to the model-data misfit.Furthermore, such an objective model-data misfit measure f err has the advantage that it allows applying mathematical optimization algorithms to parametric models, where otherwise only manual parameter tuning can be done until the model output shows reasonable behavior.
There are several possible measures for the model-data misfit that have been used to evaluate biogeochemical models (see, e.g., Evans, 2003; Gregg et al., 2009; Stow et al., 2009). Common measures are the mean absolute error (MAE),

MAE = (1/N) sum_{i=1}^{N} |o_i - p_i|,

and the root mean square error (RMSE),

RMSE = ( (1/N) sum_{i=1}^{N} (o_i - p_i)^2 )^(1/2).

It is sufficient to consider the sum of squared errors,

f_err(p) = sum_{i=1}^{N} (o_i - p_i)^2,

instead of the RMSE, as this transformation does not change the ranking of considered model outputs p. We will exemplarily work with the RMSE, which is the most commonly used misfit measure for biogeochemical models. However, our approach and the corresponding algorithms are transferable to other misfit measures like the MAE.
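To make these measures concrete, here is a minimal sketch in Python (the function and variable names are our own, not from the paper's code):

```python
import math

def mae(o, p):
    """Mean absolute error between observations o and model output p."""
    return sum(abs(oi - pi) for oi, pi in zip(o, p)) / len(o)

def sse(o, p):
    """Sum of squared errors; a monotone transform of the RMSE."""
    return sum((oi - pi) ** 2 for oi, pi in zip(o, p))

def rmse(o, p):
    """Root mean square error."""
    return math.sqrt(sse(o, p) / len(o))

obs = [0.2, 0.5, 0.9, 0.4]
model_a = [0.1, 0.6, 0.8, 0.5]
model_b = [0.3, 0.3, 1.1, 0.2]

# Ranking two candidate model outputs by SSE or by RMSE gives the same order,
# since RMSE = sqrt(SSE / N) is strictly increasing in SSE for fixed N.
assert (sse(obs, model_a) < sse(obs, model_b)) == (rmse(obs, model_a) < rmse(obs, model_b))
```

Because RMSE is a strictly increasing function of the SSE for a fixed number of observations, minimizing either quantity selects the same model output.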

The optimization problem
As mentioned above, we consider scalar observations o = (o_1, ..., o_N) taken at times t_1 < t_2 < ... < t_N and a parametric model function ϕ whose output at time t_i for a parameter set s from a domain S is ϕ(s; t_i). The associated data-fit problem reads

minimize sum_{i=1}^{N} (o_i - ϕ(s; t_i))^2 subject to s ∈ S;  (1)

that is, we want to determine the minimal sum of squared errors over all possible parameter values. As discussed in the introduction, for global biogeochemical ocean models a full scan of the parameter space is hampered by computationally expensive models that would have to be evaluated several times for differing parameter sets during the optimization. Moreover, we usually know neither whether selected parameters s ∈ S correspond to a global optimum of the associated data-fit problem (Eq. 1) nor how good (or bad) these parameters are in relation to a global optimum of Eq. (1). Our objective is to find a value as large as possible which we know to be smaller than the minimum of the optimization problem (Eq. 1), i.e., a lower bound. Then, if the minimum obtained by the optimization is close to this value, we may terminate the procedure. In mathematical terms, we seek a number α ∈ R_{>0} which is as large as possible while satisfying

α ≤ sum_{i=1}^{N} (o_i - ϕ(s; t_i))^2 for all s ∈ S.  (2)

Now, if α satisfies Inequality (2) and it holds for some model parameters s ∈ S that the corresponding model-data misfit is close to α, then s is a good parameter set with respect to the observational data (and α is a good lower bound on the unknown optimal model-data misfit).
In order to find such a lower bound α, our approach is to replace the parametric optimization problem (1) by a formulation that can be solved more easily. Therefore, we specify a number of properties of the original model that hold for all parameter sets s ∈ S and which we require the alternative formulation to fulfill. The global optimal value of such a relaxed problem is a lower bound α on the best possible model-data misfit of the original model. If the relaxed problem is convex, in contrast to the original optimization task, its global optimum can be calculated efficiently (see, e.g., Boyd and Vandenberghe, 2004). We also refer to the information box below. Mathematically, our relaxations are modifications of the original optimization task (1) in the sense that the parametric model function ϕ is replaced by a non-parametric function from a class F of all functions that satisfy the considered property. In particular, F contains ϕ(s; •) for all s from the parameter domain S of the actual model. The associated non-parametric optimization problem on the "extended search space" reads

minimize sum_{i=1}^{N} (o_i - ψ(t_i))^2 subject to ψ ∈ F.  (3)

The model-data misfit of a global optimum of the relaxed problem (3) satisfies Inequality (2), meaning that it is a lower bound on the model-data misfit for all allowed parameters s of the original problem (1). We refer to Sect. 3 for thoughts on how the lower bound is employed in applications to judge the quality of the optimization outcome. In short, the main idea of the lower bound method can be summarized as follows:

- Pick some properties that the model comprises for all parameters s ∈ S.
- Solve the optimization problem detached from the parametric model. Precisely, we minimize the sum of squared errors over all functions ψ ∈ F that fulfill the selected properties.
- The procedure yields a lower bound for the original optimization problem, as the set of possible solutions is larger for the relaxed problem and contains the original model output.
In the following sections, we give examples of the properties of the model that we choose.

Terms and background information
Given a function f : R^n → R and a subset X of R^n, a general mathematical optimization problem is

(MP) minimize f(x), subject to x ∈ X.

An example is the parameter optimization problem (1).
Convex optimization problem. If f is a convex function and X is a convex set, then (MP) is called a convex optimization problem (CP). All relaxed problem formulations considered below are CPs. An important property of a CP is that every local optimum of f over X is already a global optimum (see, e.g., Boyd and Vandenberghe, 2004).

Quadratic program.
A special CP is a convex quadratic program (QP). A QP has a convex quadratic objective function f, and its function domain X is described in terms of some k linear constraints, i.e.,

X = {x ∈ R^n | g_i(x) ≥ 0 for i ∈ {1, ..., k}},

where g_i, i = 1, ..., k, are linear functions. The surrogate model formulations (4)-(7) are QPs. Tools to calculate global optima of arbitrary QPs exist, but for most of our surrogates we can apply tailored algorithms which are more efficient.

Bounds for monotonic models
We start with an example that is not directly related to biogeochemical models but which serves as a basis for the approaches in Sect. 2.4 and 2.6, respectively. The task here is to fit observations with a monotonically increasing data set. Measuring the model-data misfit by the sum of squared errors, the associated non-parametric optimization problem can, for example, be stated as a convex quadratic program:

minimize sum_{i=1}^{N} (o_i - p_i)^2 subject to p_i ≤ p_{i+1} for i = 1, ..., N-1.  (4)

This yields a vector p ∈ R^N with monotonically increasing entries, where p_i is the data point that corresponds to time t_i. These entries are selected such that the sum of the squared deviations from the observations is minimized. Note that Eq. (4) corresponds to the general non-parametric optimization problem (3) if F is the class of all monotonically increasing functions. If we want to work with monotonically decreasing functions instead, we just need to replace "≤" with "≥" in the monotonicity constraints, or we can apply Eq. (4) to -o instead of o and negate the result.
The optimization problem (4) can be solved efficiently. The pool adjacent violators (PAV) algorithm (Barlow et al., 1972) solves it with linear effort, i.e., in less than c · N computer operations for some constant c and all N. Another possibility is to use a general optimization tool for convex quadratic programs like CPLEX or MATLAB quadprog. Clearly, solutions of Eq. (4) provide a lower bound on the optimal model-data misfit for every parametric model that is monotonically increasing.

Figure 1. As the model frequencies are low, both the model function and its samples take only two local extremes. The segments before, after, and between the extremes (times t_j and t_k) are monotonically decreasing/increasing. The respective monotonic fits to the data (drawn in red) are therefore "better" than the model output.
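The pooling idea behind PAV can be sketched in a few lines of Python (an illustrative implementation, not the paper's code): scan the data left to right and merge ("pool") neighboring blocks into their mean whenever their means violate monotonicity.

```python
def pav_increasing(o):
    """Least-squares monotonically increasing fit to the data o,
    computed with the pool adjacent violators (PAV) idea."""
    blocks = []  # each block: [mean of its members, number of members]
    for x in o:
        mean, size = float(x), 1
        # Merge backwards while the monotonicity constraint is violated.
        while blocks and blocks[-1][0] > mean:
            prev_mean, prev_size = blocks.pop()
            mean = (prev_mean * prev_size + mean * size) / (prev_size + size)
            size += prev_size
        blocks.append([mean, size])
    # Expand the blocks back to a full-length fit vector.
    fit = []
    for mean, size in blocks:
        fit.extend([mean] * size)
    return fit

def pav_decreasing(o):
    """Monotonically decreasing case: apply PAV to -o and negate."""
    return [-v for v in pav_increasing([-x for x in o])]
```

For example, `pav_increasing([1, 3, 2])` pools the violating last pair and returns `[1.0, 2.5, 2.5]`; as noted above, the decreasing case follows by negation.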

Bounds for periodic models
When simulating periodic systems, the model might (intentionally or unintentionally) not resolve all frequencies that occur in the corresponding observational data. Models that resolve only low frequencies with respect to the data frequency (e.g., NPZD models that aim to capture the main characteristics of an annual cycle) take a correspondingly limited number of extreme values within a given time interval, e.g., a seasonal cycle. This situation is sketched in Fig. 1.
The fact that each segment between two subsequent extreme values is monotonically increasing/decreasing allows us to apply the methods introduced in Sect. 2.3. A corresponding series p_1, ..., p_N of discrete samples has (at most) the same number of local extremes as the model. For illustration, suppose that the series has exactly two extreme values p_j and p_k with j < k ∈ [N], as sketched in the example in Fig. 1. These must be one minimum and one maximum. Assume that the time points j and k are known in advance and the minimum appears at position j. Then, an optimal data fit is a solution of a convex quadratic program similar to Eq. (4),

minimize sum_{i=1}^{N} (o_i - p_i)^2
subject to p_i ≥ p_{i+1} for i = 1, ..., j-1,
p_i ≤ p_{i+1} for i = j, ..., k-1,
p_i ≥ p_{i+1} for i = k, ..., N-1,
(p_N ≥ p_1),  (5)

where the optional last constraint appears if the considered interval represents a full cycle of a periodic model. This yields a vector p ∈ R^N with entries that decrease up to entry j, then start to increase, and fall after entry k. At the same time, this vector minimizes the deviation from the observational data. The negated solution of Eq. (5) applied to -o instead of o is an optimal data fit to observations o that has a maximum at position j and a minimum at position k. Now, if the positions j and k of the extremes are unknown, we repeat the optimizations with o and -o for every j < k ∈ [N]; the best of all results is then an optimal data fit subject to the property that there are (at most) two local extremes in the series. Similar to the case of two extremes, we can consider more than two, say m, extremes. Dealing with all possible combinations of the positions of m extremes would imply a computational effort of c_1 · N^m operations (c_1 constant, N arbitrary), but using a tailored algorithm (Demetriou and Powell, 1991) we can calculate a best piece-wise monotonic fit in only c_2 · m · N^2 computer operations.
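The enumeration over extreme positions can be made concrete with a simple brute-force Python sketch (O(N^3), unlike the faster tailored algorithm cited above, but exact; names are illustrative). The key observation is that any concatenation of a decreasing, an increasing, and again a decreasing least-squares segment fit with free junctions has at most two local extremes, and every admissible series arises from some split, so the best result over all splits is the relaxed optimum.

```python
def pav_increasing(o):
    """Least-squares monotonically increasing fit (pool adjacent violators)."""
    blocks = []  # each block: [mean, size]
    for x in o:
        mean, size = float(x), 1
        while blocks and blocks[-1][0] > mean:
            m, s = blocks.pop()
            mean = (m * s + mean * size) / (s + size)
            size += s
        blocks.append([mean, size])
    fit = []
    for mean, size in blocks:
        fit.extend([mean] * size)
    return fit

def pav_decreasing(o):
    return [-v for v in pav_increasing([-x for x in o])]

def sse(o, p):
    return sum((a - b) ** 2 for a, b in zip(o, p))

def best_two_extreme_fit(o):
    """Best least-squares fit with at most two local extremes
    (decreasing, then increasing, then decreasing), found by brute
    force over the two split points; each segment is fitted
    independently with PAV."""
    n = len(o)
    best = None
    for a in range(n + 1):          # decreasing part: o[0:a]
        for b in range(a, n + 1):   # increasing part: o[a:b]
            p = pav_decreasing(o[:a]) + pav_increasing(o[a:b]) + pav_decreasing(o[b:])
            if best is None or sse(o, p) < sse(o, best):
                best = p
    return best
```

As in the text, the mirrored case (maximum first) is obtained by applying the same routine to -o and negating the result.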

Bounds for models with bounded derivatives
The change rates of biogeochemical processes like growth and decay have natural limits. In the presence of noise, observational data are very likely to exhibit higher variations than a model that is devoted to comparatively slow interactions. In other words, noise (or unresolved periodic processes with high frequencies and high amplitudes) cannot be well approximated by models that mimic processes of lower variation, i.e., models with small changes in a given time step. These processes are characterized by a small absolute derivative. If we are able to postulate general bounds on the derivatives of a parametric model function ϕ with respect to time, we can try to utilize this property in order to calculate lower bounds on the optimal misfit of ϕ. General bounds on the first time derivative (steepness) of ϕ are given as real numbers D_min < D_max such that

D_min ≤ ∂ϕ/∂t (s, t) ≤ D_max

holds for all allowed parameter sets s and time points t. Using the class F of all functions that satisfy these steepness bounds in the non-parametric problem (3), we obtain a relaxation of the parametric problem (1) that can be expressed as the convex quadratic program

minimize sum_{i=1}^{N} (o_i - p_i)^2 subject to D_min ≤ (p_{i+1} - p_i)/(t_{i+1} - t_i) ≤ D_max for i = 1, ..., N-1,  (6)

where we approximated the time derivative by finite differences. A solution of this problem yields a lower model-data misfit bound for all parameter sets s such that ϕ(s, •) satisfies the steepness bounds D_min ≤ ∂ϕ/∂t (s, t) ≤ D_max. It is also possible to add linear constraints to the QP which consider bounds on higher-order derivatives of ϕ in terms of higher-order finite differences. For example, a general bound on the second time derivative of ϕ can be accounted for by posing analogous constraints on second-order finite differences. The knowledge of tight bounds on derivatives of increasing order allows obtaining increasingly tight lower bounds on the model-data misfit. However, since bounds on higher-order derivatives are more difficult to derive in practice, we restrict our studies to steepness bounds.
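As an illustration of how such a steepness-constrained fit can be computed without a general QP solver, note that the minimizer is exactly the Euclidean projection of o onto the intersection of the N-1 slab constraints. The sketch below approximates this projection with Dykstra's alternating-projection algorithm (the function name and iteration count are our own choices; the LPAV algorithm mentioned in Sect. 2.6 solves the same QP directly):

```python
def steepness_bounded_fit(o, t, d_min, d_max, iters=2000):
    """Approximate least-squares fit p to observations o subject to
    d_min <= (p[i+1] - p[i]) / (t[i+1] - t[i]) <= d_max,
    via Dykstra's alternating projections onto the slab constraints."""
    n = len(o)
    p = [float(x) for x in o]
    # One correction pair per constraint (Dykstra's increments).
    incs = [[0.0, 0.0] for _ in range(n - 1)]
    for _ in range(iters):
        for i in range(n - 1):
            lo = d_min * (t[i + 1] - t[i])
            hi = d_max * (t[i + 1] - t[i])
            # Add back this constraint's previous correction.
            a = p[i] + incs[i][0]
            b = p[i + 1] + incs[i][1]
            d = b - a
            # Project (a, b) onto the slab lo <= b - a <= hi.
            if d > hi:
                shift = (d - hi) / 2.0
            elif d < lo:
                shift = (d - lo) / 2.0
            else:
                shift = 0.0
            qa, qb = a + shift, b - shift
            incs[i][0] = a - qa
            incs[i][1] = b - qb
            p[i], p[i + 1] = qa, qb
    return p
```

For instance, with observations [0, 10, 0] at times [0, 1, 2] and steepness bounds of ±1, the iteration converges to the constrained optimum [3, 4, 3].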

Bounds for models with combined properties
Clearly, we can combine model properties into a joint QP, e.g., if the model has two local extremes within a window of interest and bounded steepness. Combining Eqs. (5) and (6), we obtain the joint QP

minimize sum_{i=1}^{N} (o_i - p_i)^2
subject to p_i ≥ p_{i+1} for i = 1, ..., j-1,
p_i ≤ p_{i+1} for i = j, ..., k-1,
p_i ≥ p_{i+1} for i = k, ..., N-1,
D_min (t_{i+1} - t_i) ≤ p_{i+1} - p_i ≤ D_max (t_{i+1} - t_i) for i = 1, ..., N-1,
(optionally, p_N ≥ p_1 and D_min ≤ (p_1 - p_N)/(t_1 + T - t_N) ≤ D_max).  (7)

Here, again, j < k are the indices of the unique minimum and the unique maximum, respectively, D_min < 0 and D_max > 0 are the universal lower and upper bounds on the model's first derivative, and T is the optional period of the model.
Similar to the approach in Sect. 2.4, the best of the optimal solutions of Eq. (7) applied to o and -o for all j < k ∈ [N] provides the lower bound on the model-data misfit of the parametric model. As an alternative to a QP solver, we can use an extension of the PAV algorithm that additionally considers steepness bounds within monotonic regression, called Lipschitz pool adjacent violators (LPAV; Yeganova and Wilbur, 2009), in order to solve Eq. (7).

Method evaluation
We first aim to examine the extent to which the minimum model-data misfit of a parameterized model can deviate from the corresponding minimum misfit of a proposed non-parametric relaxation. Clearly, the difference between both misfits also depends on the characteristics of the observational data, that is, noise level and data density. We therefore derive statistics about that dependency using synthetic observations.

Test statistics
We generate the synthetic observations by adding white noise to N discrete samples ψ(t_i) of a model function ψ : R → R, varying both the noise level and the number of samples. Our noise levels will be relative to the range of the function values. As a first example, we sample a cubic polynomial and compute the best piece-wise monotonic fit with at most two local extremes (Sect. 2.4) to the noisy samples. The related RMSE between the synthetic data and this piece-wise monotonic fit is 0.445. We know that this error cannot be larger than the corresponding error between the fixed polynomial and the data, since any cubic polynomial also takes at most two extremes. Indeed, the latter error is 0.501, which is the RMS of the white noise we added. By solving a convex optimization problem we can efficiently identify the coefficients s* = (s*_0, ..., s*_3) of the best fit to the data among all cubic polynomials. Unsurprisingly, the re-optimized polynomial differs only slightly from the original one and yields an RMSE of 0.497.
For our statistics about the proposed error assessment methods, we are interested in the ratio between the lower error bound given by the optimal output of a non-parametric model relaxation p_rel and the corresponding data fit with the original parametric model p_par. In the above example this ratio is q_a = 0.445/0.501 = 0.888 ≈ 89 %. We repeat the calculation of a lower error bound and the corresponding error ratio with two other relaxations, assuming only a bounded model steepness (see Sect. 2.5) and a combination of both properties, bounded steepness and the existence of at most two (local) extremes (see Sect. 2.6), respectively. The results are depicted in Fig. 3.
Here, for both relaxations we assume a maximum model steepness of 0.05, which is approximately 28 % more than the maximum steepness of the original polynomial in the interval [0, 365]. The resulting RMSEs of the property-based optimal data fits are 0.442 if only the steepness bound is assumed (data fit b) and 0.464 if both properties are assumed (data fit c). The corresponding error ratios are q_b = 0.883 and q_c = 0.927.
To derive robust statistics, we repeat the experiment 100 times using different zero-mean white noise with the same standard deviation σ. We then do the same for all 6×6 combinations of N and σ; i.e., we apply the three model relaxations (a), (b), and (c) with regard to each of the 36 data property assumptions to 100 data sets of corresponding synthetic observations. The results are shown in Table 1. The approach to calculate lower bounds on the model-data misfit by using property-based model relaxations stems from the intuition that the overall shape of the optimized parametric model and that of the non-parametric relaxation should be similar if the relaxation describes the main properties of the original model well. The amount of similarity is reflected by the ratios stated in Table 1. Values that are close to 100 % provide evidence that the parametric model is suitably shaped with regard to the corresponding general model property assumptions. Here, by construction of the synthetic data, we already know that the original polynomials are "correctly shaped". Therefore, the numbers in the table actually reflect the tightness of the property-based relaxations and characterize the circumstances under which the lower bound approach can succeed.
We observe that the data must be rather dense in order to reach good error ratios, especially with low levels of noise. This dependence is plausible because small numbers of observations and low levels of noise cause small difference quotients (o_{i+1} - o_i)/(t_{i+1} - t_i) of the observations. However, the explicit steepness bounds, property (b), or implicit steepness bounds, property (a), which we use for the model relaxation must be considerably smaller than the difference quotients in order to provide a lower bound that is close to the model-data misfit of the optimized parametric model.
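This plausibility condition can be checked directly on the data; in the sketch below, the factor deciding what counts as "considerably smaller" is an arbitrary illustrative choice, and the function names are our own:

```python
def difference_quotients(o, t):
    """Empirical difference quotients (o_{i+1} - o_i) / (t_{i+1} - t_i)."""
    return [(o[i + 1] - o[i]) / (t[i + 1] - t[i]) for i in range(len(o) - 1)]

def steepness_bound_informative(o, t, d_max, margin=2.0):
    """True if the steepness bound d_max is considerably smaller (here: by
    the illustrative factor `margin`) than the largest empirical difference
    quotient, so the relaxation is actually constrained by the data."""
    return d_max * margin < max(abs(q) for q in difference_quotients(o, t))
```

A bound that fails this check barely restricts the relaxed fit, so the resulting lower bound will typically be loose.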
For example, consider a target ratio of 85 % to be reached for all 100 sets of random observations, i.e., the left (worst case) number in a cell of Table 1 should be greater than 85. For up to N = 50 observations, none of our experiments reach the 85 % in the worst case. For N = 100 it is reached with property (b) and noise level 1.0, and with property (c) and noise levels 0.5, 0.7, 1.0 (multiplied by the range of the underlying true process). Regarding the lowest applied noise level of 0.1 and property (a), the 85 % ratio is never reached in the worst case but only in the average case and only with N = 300 observations.

Table 1. Ratios (times 100) between the misfit of the parametric model (cubic polynomial) to synthetic observations (the model output plus white noise) and the misfit of the corresponding non-parametric regression model. We state the ratios for different noise levels σ and numbers of samples N. The values in each cell are the range of the ratios over 100 trials followed by their average and standard deviation. Non-parametric regression was done by (a) only assuming that at most two local extremes exist, (b) only assuming a steepness bound of 0.05, and (c) assuming both properties.

A countercheck
Having evidence that the lower bounds on the model-data misfit become tight with sufficiently dense observations, we want to countercheck whether an optimized parametric model that slightly differs from the actual process behind the observational data has a significantly worse model-data misfit in comparison with its non-parametric relaxation. This time, we generate 300 synthetic observations by disturbing the sum of two sine waves, starting with a noise level of 10 % relative to the range of the function values (σ = 0.1 · r). As the data might be mistaken for noisy measurements of a single sine process at first glance, we use a general sine model to fit the observations. From the data, we estimate that both the frequency and the amplitude of the sine are at most 1.2. This implies a maximum steepness of 1.44 and that the sine model takes no more than two extremes in [0, 2π]; that is, according to the above notation, we use a type (c) model relaxation. Optimizing the sine model and comparing its misfit with the corresponding lower bound yields the ratio q_c = 0.727, which, with regard to the data density, is not completely convincing; indeed, one can recognize a "failure in the model shape". Now, we suppose a more precise model that resolves the second sine wave of higher frequency. We further suppose knowledge of general bounds on both amplitudes (e.g., |s_2| ≤ 1). From the data, we can expect that the new model with optimized parameters also only takes two extremes in the interval [0, 2π]. However, for the given bounds on the frequency and arbitrary parameters, the model can take up to four extremes. Consequently, in addition to the steepness bound on the model, at most four extremes must be assumed to calculate the lower error bound on the best possible model-data misfit. Applying both assumptions (i.e., using model property (c) from above), the exact optimum value of the model relaxation is ~0.217, while the optimized new parametric model comes down to ~0.193, providing a clearly better model-data misfit ratio q_c = 0.891 than the pure sine model. The optimized parametric model curve is shown in Fig. 5.
We repeat the experiment with noise levels of σ = 5 % and σ = 20 % for different numbers of equidistant observations N ∈ {500, 300, 200, 100} and for all three property-based model-relaxation types (a), (b), and (c) used in Sect. 3.1.1. Again, we generate 100 different random sets of observations for each combination of σ and N. The results are depicted in Table 2.
The experiments help to identify conditions under which we may distinguish the "truth" from "distortions of the truth". Sufficient conditions are given if the misfit ratio for the true parametric model, say q_1, is not too small, e.g., q_1 ≥ 0.5, but the ratio for a moderate distortion of the true parametric model, say q_2, is essentially smaller. Depending on how close q_1 is to 1, we may say that q_2 is essentially smaller than q_1 if either of the fractions q_2/q_1 and (1 − q_1)/(1 − q_2) is convincingly less than 1, say q_2/q_1 ≤ 0.75 or (1 − q_1)/(1 − q_2) ≤ 0.5. We find that a rather low noise level is necessary to satisfy these conditions. As already observed in Sect. 3.1, high noise levels σ provide rather tight lower bounds on the minimum-attainable model-data misfit of the "correct model type" if sufficient observations are available. Unfortunately, the corresponding lower bounds for a less accurate model become similarly close in this case. For properties (b) and (c) and fewer observations, they can even exceed the lower misfit bounds for the "correct model" since we apply different uniform steepness bounds.
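For illustration, the ratio test above can be written as a small sketch (the function names are ours, and the worked numbers follow the two-sine experiment; this is not code from the paper):

```python
def misfit_ratio(lower_bound, model_misfit):
    """q = (lower bound on the best-attainable misfit) / (achieved misfit).

    A q close to 1 indicates the parametric model is (nearly) optimally
    calibrated; a small q leaves room for improvement or hints at a
    structural model error.
    """
    return lower_bound / model_misfit


def essentially_smaller(q1, q2):
    """Decision rule from the text: the ratio q2 of a distorted model is
    'essentially smaller' than the ratio q1 of the true model if either
    fraction is convincingly less than 1."""
    return q2 / q1 <= 0.75 or (1.0 - q1) / (1.0 - q2) <= 0.5
```

With the values of the two-sine experiment, `essentially_smaller(0.891, 0.727)` holds via the second condition, since (1 − 0.891)/(1 − 0.727) ≈ 0.40 ≤ 0.5.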

Application to real-world observations
We now consider two real-world examples with the aim of fitting chlorophyll a observations.

Baltic Sea observations
Our first example considers observations from the Bornholm Basin in the Baltic Sea at 55.15° N, 15.59° E, dubbed station BY5. The data were provided by the Swedish Oceanographical Data Center (SHARK) at the Swedish Meteorological and Hydrological Institute (SMHI). BY5 was repeatedly sampled during 1962-2009. As there are relatively long periods with only sparse data, we merge all data into a climatological seasonal cycle. To derive phytoplankton (in nitrate units) from chlorophyll a, we use a constant ratio of chlorophyll a to nitrate of 1.59 g Chl a (mol N)^−1. The considered seasonally adjusted time series comprises 175 observations of phytoplankton.
We fit an NPZD box model to the data. It is based on a model of Oschlies and Garcon (1999). The original version was set up and tuned for the global ocean, but we consider a simplified version which is described in detail by Löptien and Dietze (2015). Its model equations are given in the Appendix, and its free parameters and their assumed limits can be found in Table A1. As there is no temperature dependence in this model version, an average temperature of 10 °C is assumed for the growth period. Further, the "assimilation efficiency of herbivores" parameter is omitted (implicitly set to 1). Figure 6 shows the Baltic Sea phytoplankton data set and simulations of the NPZD model (red curve) using the parameter values derived for the open ocean by Oschlies and Garcon (1999). As expected, this model fit appears poor, so we optimize the parameters with regard to the Baltic Sea observations (the optimized parameter values are also depicted in Table A1). The result is a more adequate model output (blue curve), lowering the associated RMSE model-data misfit from 0.896 to 0.717.

Table 2. Ratios (times 100) between the misfit of the parametric model to synthetic observations (the model output plus white noise) and the misfit of the corresponding non-parametric regression model. The ratios are given for different noise levels σ and numbers of samples N. The four entries in each cell are the mean ratio of 100 trials, the standard deviation for the pure sine model, and the corresponding values for the "true" model that uses a sum of two sine waves. Non-parametric regression was done by (a) only assuming that at most two (four) local extremes exist, (b) only assuming a model-specific steepness bound (see text), and (c) assuming both properties.

In a next step, we assess our result by examining the lower bounds. Following the procedure outlined at the end of Sect. 2.2, we need to identify properties of the NPZD model that are satisfied for every credible (allowed) parameter choice and lead to an easily solvable surrogate problem. Ideally, we could give mathematical proof of such properties. However, postulating a model property which is based on sound biological experience is justified, even if this property is not satisfied for all feasible combinations of the parameter values. In this context it is important to note that the relatively simple structure of our NPZD model with fixed (non-temperature-dependent) rates does not suffice to describe the seasonal cycle after the spring bloom (Fennel and Neumann, 2004, p. 35ff). Generally, model versions which fit the spring bloom satisfactorily do not capture the observed chlorophyll increase in autumn. We thus assume only two extremes to determine a compatible lower bound. A practical approach for more complex systems would be to iteratively increase the number of extremes of the non-parametric model relaxation until the obtained lower bound hardly increases anymore (this approach would also require quite dense observational data). Using the algorithm of Yeganova and Wilbur (2009), we find that the best-attainable RMSE misfit between a time series with two extremes and our data is σ_a = 0.557, a first lower error bound for the applied NPZD model. The corresponding error ratio between this bound and the error of the optimized model is q_a = 0.777. In order to tighten our lower error bound, we additionally postulate a model steepness limit of 0.14, which we justify with the fact that the optimized model curve has two extremes, a plausible position of its maximum, and a maximal steepness of ∼ 0.1. The associated best possible data fit, which we can calculate using the "piecewise monotonic regression" algorithm
(Yeganova and Wilbur, 2009), in combination with the LPAV algorithm (Demetriou and Powell, 1991) instead of the classical PAV algorithm (black curve in Fig. 6), has an RMSE of σ_c = 0.66, which yields a quite high ratio of q_c = σ_c/0.717 = 0.921. Thus, it is confirmed that the main portion of the model-data misfit of the optimized NPZD model is not caused by a sub-optimal choice of the parameter set but by other sources of uncertainty. For the sake of completeness, we calculate the best data fit with regard solely to a limited model steepness of 0.14 (disregarding the number of extremes) using CPLEX to solve the corresponding formulation in terms of a quadratic program (6). In this case, the RMSE is σ_b = 0.619 and the corresponding error ratio is q_b = 0.864. Some indication of an even smaller gap between the attained model-data misfit and the globally optimal misfit of the NPZD model is given by the following additional step. The RMSE of the calibrated NPZD model is the empirical standard deviation σ between model simulations and observations. The lower bounds with regard to the general model properties (a), (b), and (c), i.e., σ_a = 0.557, σ_b = 0.619, and σ_c = 0.66, are approximately 20, 22, and 23.5 % of the range of the model output, respectively. Experiments with random noise might help to further assess the quality of our parametric solution. Similar to the statistics in Sect. 3.1.1, Table 1, we generate 100 sets of synthetic observations for each of the three standard deviations σ_a, σ_b, and σ_c by simply adding white noise to the model output and calculate the average error of the corresponding optimal non-parametric data fit. The obtained average ratios are q_a,emp = 0.719, q_b,emp = 0.891, and q_c,emp = 0.919, which are encouragingly close to the respective ratios for the true observations. We have to note, however, that the assumed normal distribution property is not actually satisfied by the errors between phytoplankton observations and the optimized NPZD
model.
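The building block behind the property (a) bounds is monotonic (isotonic) least-squares regression. A minimal pure-Python sketch of the classical pool-adjacent-violators (PAV) step, combined with a split-point search for the best fit with a single interior maximum, might look as follows; this is a simplified, non-periodic stand-in for the piecewise monotonic regression algorithm of Yeganova and Wilbur (2009), and all function names are ours:

```python
def pav_increasing(y):
    """Best (L2) non-decreasing fit to y via pool adjacent violators."""
    blocks = []  # each block: [mean, count]
    for v in y:
        blocks.append([float(v), 1])
        # merge blocks while the monotonicity constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    fit = []
    for m, c in blocks:
        fit.extend([m] * c)
    return fit


def pav_decreasing(y):
    """Best non-increasing fit (negate, fit increasing, negate back)."""
    return [-v for v in pav_increasing([-v for v in y])]


def sse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_unimodal_fit(y):
    """Best fit that rises to a peak and then falls: try every split into
    an increasing prefix and a decreasing suffix."""
    best = None
    for s in range(len(y) + 1):
        fit = pav_increasing(y[:s]) + pav_decreasing(y[s:])
        if best is None or sse(fit, y) < sse(best, y):
            best = fit
    return best
```

The RMSE of `best_unimodal_fit(observations)` against the observations is then a lower bound on the misfit of any model output that rises to a single bloom maximum and decays afterwards.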

Global satellite observations
In a second example, we consider global observations of maritime chlorophyll a. We use annual mean chlorophyll a observations in units of mg m^−3, which were derived from SeaWiFS satellite data from http://seadas.gsfc.nasa.gov (NASA Goddard Space Flight Center, 2011) using 8-day composites binned on a 1/12° spatial grid (4320 × 2160 boxes). Note that these annual averages might be seasonally biased in regions with sparse data. As we consider annual mean values, we apply our method in space instead of time. We do not consider coastal areas, since these cannot be well represented in the coarse-resolution model and are also likely to contain a considerable degree of observational uncertainty. We thus mask out all grid boxes with chlorophyll a concentrations above 1 mg m^−3. We compare the observed chlorophyll a concentrations to simulated values. The simulation is based on the CM2Mc configuration described by Galbraith et al. (2010). Spinup procedure and boundary conditions follow Dietze et al. (2017) (see their FMCD configuration in Table 1). The resolution of the model comprises 120 × 80 boxes (3° for longitudes and 2-3° for latitudes) and is coarse compared to the spatial resolution of the observational data. The annual mean model simulations are interpolated onto the observation grid in order to compute the corresponding RMSE for the model-data misfit. Figure 7 visualizes the observed chlorophyll a and the corresponding model simulations.
The observational data are quite rugged over large regions of the ocean, while the simulations are comparatively smooth everywhere due to the resolution of the model. Therefore, we can expect positive lower bounds on the model-data misfit.
The RMSE model-data misfit is 0.138 mg m^−3. Dealing with two-dimensional data, our one-dimensional lower bound methods can be applied chunk-wise. Here, we traverse each longitude of our spatial grid in chunks of 200 consecutive boxes (where the last chunk for each longitude consists of its ≤ 200 remaining boxes). This provides us with a lower bound on the sum of squared errors between observations and simulations with regard to each chunk, say α_{i,k} for the kth of n_i chunks of the ith longitude, i ∈ {1, ..., 4320}. A lower bound on the (unweighted) RMSE model-data misfit is then given by

( (1/N_obs) Σ_i Σ_k α_{i,k} )^{1/2},

where N_obs is the total number of considered observation values. Note that the proposed method works equally well on weighted RMSEs. As a general model property for our bound approach, we use a slightly higher maximum chlorophyll a variation per distance than the maximum variation of our model simulation results (similar to the Baltic Sea example, we multiply the maximum simulated variation by 1.4 for each chunk). The result is a lower bound of 0.049 mg m^−3, which is 35 % of the misfit we achieved with our model simulations. Thus, the lower bound is of the same order of magnitude, but still considerably lower than the actual model-data misfit. One might conclude that there is room for model improvement when it comes to chlorophyll a. Still, one needs to keep in mind that the model was presumably never optimized to simulate chlorophyll a as well as possible; its calibration had to balance many other factors as well.
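The chunk-wise aggregation is simple enough to write down compactly. Given lower bounds α_{i,k} on the sum of squared errors within each chunk, the overall RMSE bound is obtained as sketched below (an illustration under our own variable names, not code from the paper):

```python
import math


def rmse_lower_bound(chunk_sse_bounds, n_obs):
    """Combine per-chunk lower bounds on the sum of squared errors into a
    lower bound on the overall (unweighted) RMSE model-data misfit.

    chunk_sse_bounds: iterable of alpha_{i,k} values, one per chunk
    n_obs: total number of observation values covered by all chunks
    """
    return math.sqrt(sum(chunk_sse_bounds) / n_obs)


def chunks(values, size=200):
    """Traverse a 1-D series in chunks of at most `size` consecutive boxes."""
    for start in range(0, len(values), size):
        yield values[start:start + size]
```

Each chunk bound can come from any of the one-dimensional methods above; the square root of their summed total, divided by the number of observations, is a valid global bound because squared errors add across disjoint chunks.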
We repeated our experiment restricted to the Southern Ocean (south of 60° S). Here, the RMSE model-data misfit is 0.170 mg m^−3 and the calculated lower bound is 0.108 mg m^−3 (63 %).

Discussion
Our aim is to complement research on the calibration of biogeochemical models by calculating lower bounds on their best-attainable model-data misfit. We utilize two general model properties for this purpose: a limited number of extremes and a bounded model steepness. We also consider the combination of both properties. The reason to consider such non-parametric model properties is that they yield efficiently solvable (relaxed) optimization problems, whereas optimizing the original parametric model is computationally demanding.

Applicability
In our experiments (Sect. 3.1.1), the solitary assumption of a bounded model steepness leads to tight model relaxations (tight lower error bounds) if enough observational data are available and the steepness bound is chosen to be close to the maximum steepness of a calibrated model output. The task of deriving a maximal bound for the steepness of a respective model output can be difficult in practice and relies on (1) the model equations and (2) observational data. A rigorous mathematical model analysis, e.g., considering single model parameters like the maximum growth rate of phytoplankton, provides maximal limits which are valid for the entire parameter domain. However, relying on observation-based experience with the modeled processes, it might be justified to assume a smaller, empirical steepness bound, irrespective of that bound being valid for all permitted parameters. In Sect. 3.2.1, we assumed a steepness bound that is ∼ 40 % larger than the maximum steepness of the NPZD model with optimized parameters. In the future, we aim to develop iterative procedures that derive tight universal (likely time-varying) model steepness bounds, e.g., using some kind of branch-and-bound approach.
Our second constraint, a limited number of extremes, is generally relatively easy to determine for common, rather smooth biogeochemical models. An applicable number of extremes can be identified when a regression with more extremes only barely reduces the misfit any further. However, one should also keep the model structure in mind: simple models can be limited in reproducing specific shapes of the seasonal cycle. Based on the model structure, we assumed only two extremes for our NPZD real-world example in Sect. 3.2.1. Note, however, that assuming four extremes yields better fits in this case: the RMSE decreases from 0.619 to 0.559 without bounding the steepness (from 0.66 to 0.62 with the steepness bound). Note that the "low number of extremes" condition indirectly implies a bounded (average) model steepness, too. In our Baltic Sea experiments, the assumption about the number of extremes resulted in better bounds than the sole assumption of a bounded steepness. In our experiments with global ocean data, we observed the opposite.
Unsurprisingly, the combination of tight steepness bounds with a limited number of extremes yields even better lower bounds on the minimum-attainable model-data misfit than either property separately. Finally, all our model relaxations require a rather large number of observations (per chunk) in order to yield convincingly tight bounds (see Table 1).

Generalizations
Our contribution considers the root mean square error (RMSE) as an objective measure of the model-data misfit because it eases the task of formulating certain model properties in terms of convex optimization problems and of resorting to corresponding tailor-made efficient algorithms. However, the suggested model properties also allow us to deduce optimization problems which are efficiently solvable if other misfit measures are used. For example, the sum of absolute errors can be dealt with in terms of linear programs (LPs) by including auxiliary variables and auxiliary linear constraints to express absolute values. Also, the efficient methods of Demetriou (1995) and Yeganova and Wilbur (2009) (we provide RMSE implementations in the Supplement) can be realized with misfit measures other than RMSE.
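As an illustration of the LP route, the steepness-bounded fit under a sum-of-absolute-errors objective can be posed with auxiliary variables e_i ≥ |x_i − y_i|. The following sketch uses SciPy's `linprog`; the formulation and names are ours, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import linprog


def l1_fit_with_steepness(y, max_step):
    """Best fit x to data y minimizing sum |x_i - y_i| subject to
    |x_{i+1} - x_i| <= max_step, posed as a linear program.

    Variables: z = (x_1..x_n, e_1..e_n) with e_i >= |x_i - y_i|.
    """
    n = len(y)
    c = np.concatenate([np.zeros(n), np.ones(n)])  # minimize sum of e_i
    rows, b = [], []
    for i in range(n):
        r = np.zeros(2 * n); r[i] = 1.0; r[n + i] = -1.0   # x_i - e_i <= y_i
        rows.append(r); b.append(y[i])
        r = np.zeros(2 * n); r[i] = -1.0; r[n + i] = -1.0  # -x_i - e_i <= -y_i
        rows.append(r); b.append(-y[i])
    for i in range(n - 1):
        r = np.zeros(2 * n); r[i + 1] = 1.0; r[i] = -1.0   # x_{i+1} - x_i <= s
        rows.append(r); b.append(max_step)
        r = np.zeros(2 * n); r[i + 1] = -1.0; r[i] = 1.0   # x_i - x_{i+1} <= s
        rows.append(r); b.append(max_step)
    bounds = [(None, None)] * n + [(0, None)] * n  # x free, e nonnegative
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(b),
                  bounds=bounds, method="highs")
    return res.x[:n], res.fun  # fitted values and minimum total error
```

The number-of-extremes property can be handled analogously by adding the corresponding sign-pattern constraints on consecutive differences, at the cost of one LP per assumed extreme configuration.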
Concerning the number of local extremes, our proof-of-concept experiments are restricted to a maximum of two (four) extremes, according to the properties of the respective parametric models. However, solutions can be calculated efficiently even if the model output is assumed to take a large maximum number of extremes (Demetriou and Powell, 1991; Demetriou, 1995). As mentioned above, a suitable approach to work with this property is to increase the maximum assumed number of extremes until the corresponding lower bound on the minimum-attainable model-data misfit hardly increases anymore, indicating that further extremes fit noise rather than the processes of interest.
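The "increase the number of extremes until the bound plateaus" idea can be sketched for small series by brute force over segment boundaries, fitting alternating monotone pieces with PAV. Exact piecewise monotonic regression as in Demetriou and Powell (1991) is far more efficient; this self-contained toy version only illustrates the procedure:

```python
import itertools
import math


def pav(y, increasing=True):
    """Best (L2) monotone fit to y by pool adjacent violators."""
    s = 1.0 if increasing else -1.0
    blocks = []  # each block: [mean, count] (sign-flipped for decreasing)
    for v in y:
        blocks.append([s * float(v), 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    fit = []
    for m, c in blocks:
        fit.extend([s * m] * c)
    return fit


def best_piecewise_monotone_sse(y, n_segments):
    """Minimum SSE over fits composed of n_segments alternating monotone
    pieces (brute force over segment boundaries; exponential, toy only)."""
    n, best = len(y), float("inf")
    for first_up in (True, False):
        for cuts in itertools.combinations(range(1, n), n_segments - 1):
            bounds = [0, *cuts, n]
            fit, up = [], first_up
            for a, b in zip(bounds, bounds[1:]):
                fit.extend(pav(y[a:b], up))
                up = not up
            best = min(best, sum((f - v) ** 2 for f, v in zip(fit, y)))
    return best


def lower_bounds_by_extremes(y, max_segments=4):
    """RMSE lower bounds when allowing 1, 2, ... monotone segments
    (m segments correspond to at most m - 1 interior extremes)."""
    n = len(y)
    return [math.sqrt(best_piecewise_monotone_sse(y, m) / n)
            for m in range(1, max_segments + 1)]
```

The returned sequence of bounds is non-increasing in the number of segments; one would stop as soon as consecutive bounds hardly differ.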

Cautionary notes
While a small gap between the misfit of some property-based model relaxation and the misfit of the optimized original model proves that further parameter calibration is not required, a large gap between both misfits does not necessarily mean that the calibration of the chosen model is bad, nor does it mean that the model is an incorrect representation of the processes of interest. Our experiments indicate that a large gap tends to prove the inadequacy of a model (calibration) only if enough observations are available. Otherwise, the chosen property-based relaxations might fit the observations too well.
On the other hand, a small gap between the optimal misfit of a property-based non-parametric relaxation and the misfit of the original parametric model can even be reached with an inappropriate parametric model structure if there is too much noise in the data. The experiments in Sect. 3.1.2 are set up to estimate conditions that allow us to distinguish the "truth" from a "moderate distortion of the truth". With regard to the experimental results in Table 2, a rather low noise level is necessary to satisfy these conditions.

Conclusions
We presented an approach for proving that a parametric model is well calibrated, i.e., that changes of its free parameters can no longer lead to a much better model-data misfit. This work is motivated by the fact that calibrating global biogeochemical ocean models is important but computationally expensive.
Generally, the aim is to determine an optimal parameter set such that a predefined metric of the model-data misfit is minimal. To keep the number of required expensive model simulations as small as possible, we suggest calculating "tight" lower bounds on the lowest achievable model-data misfit. Our objective is to utilize properties of the original model that are satisfied for all permitted parameters and lead to easily solvable optimization problems. Here, we focus on two such model properties to derive our lower bounds on the model-data misfit: a maximum time derivative and a maximum number of extremes per time unit.
Indeed, our experiments show that the achieved bounds can come quite close to the optimized misfit of the original model if many observations are available. However, a problem with global observational data (e.g., World Ocean Atlas data) is that they are often sparse in time. For example, if we examine annual cycles of periodic processes with monthly observations, our lower bound approach will only succeed if we overlay (seasonally adjust) measurement data of several years in order to reach the required data coverage. Long-term time series from observing platforms like BATS (Steinberg et al., 2001) provide enough data on the temporal dimension but are limited in space and are only available for certain sites. However, we can also apply our method with data that are dense in space. A suitable global application of our method to biogeochemical models relates to dense satellite observations of chlorophyll a (Volpe et al., 2007; Dogliotti et al., 2009). Section 3.2.2 illustrates how our methods can be applied in order to cope with such data.
Assuming the error between model output and observations to be Gaussian distributed noise, an obtained lower bound on the RMSE is also a lower bound on the empirical standard deviation σ of the noise. We suggest the following rule-of-thumb procedure, which is illustrated for a real-world example in Sect. 3.2.1:

1. Optimize the model parameters with regard to the corresponding model-data misfit.
2. Calculate lower error bounds on the model-data misfit by using appropriate assumptions about the model properties.
3. Accept the calibration if the ratio q between the lower bound from step 2 and the misfit from step 1 is close to 1; otherwise, interpret the lower bound from step 2 as the standard deviation σ of the noise in the observations and check whether q corresponds to the empirical ratio q_emp that is obtained by adding random noise of level σ to the output of the optimized parametric model and fitting the resulting synthetic observations with the non-parametric relaxation.
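Step 3 above can be automated along the following lines. This is a schematic sketch with the non-parametric relaxation passed in as a callable; all names are ours:

```python
import math
import random


def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def empirical_ratio(model_output, sigma, nonparametric_fit,
                    n_trials=100, seed=0):
    """Estimate q_emp: add white noise of level sigma to the optimized
    model output, fit the synthetic observations with the non-parametric
    relaxation, and average the ratio (relaxation misfit) / sigma over
    the trials."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_trials):
        synthetic = [v + rng.gauss(0.0, sigma) for v in model_output]
        fit = nonparametric_fit(synthetic)
        ratios.append(rmse(fit, synthetic) / sigma)
    return sum(ratios) / n_trials
```

If the ratio q obtained from the real observations matches `empirical_ratio(...)` computed this way, the residual misfit is consistent with pure observational noise rather than a sub-optimal parameter set.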

Figure 1. Synthetic data (blue dots) and corresponding output (blue crosses) of a periodic model function (blue curve). As the model frequencies are low, both the model function and its samples take only two local extremes. The segments before, after, and between the extremes (times t_j and t_k) are monotonically decreasing/increasing. The respective monotonic fits to the data (drawn in red) are therefore "better" than the model output.

Figure 3. The synthetic observational data of Fig. 2 and minimum RMSE data fits with regard to a steepness bound (data fit, (b)) as well as regarding both properties, bounded steepness and the existence of at most two extremes (data fit, (c)).

Figure 4. Synthetic data obtained by adding noise to the function sin(t) + 0.3 · sin(2t), and the optimized data fit by a clean sine wave (parametric model) and its property-based non-parametric relaxation (steepness ≤ 1.44, at most two extremes), respectively.

Figure 5.

Figure 6. Bornholm seasonally adjusted observation time series of phytoplankton, and data fits of the considered NPZD model using the parameters adjusted for the global model configuration (red) and parameters optimized for the local model version (blue). The (black) reference plot is the minimum-error data fit with regard to the properties that no more than two extremes are taken and the steepness is at most 0.14.

Figure 2. A cubic polynomial, synthetic observational data generated by adding white noise to 300 equidistant samples of the polynomial, and a minimum RMSE data fit with regard to the property that no more than two extremes are taken.