Introduction
Regional climate models (RCMs) are powerful tools to produce regional climate
projections
().
These models take climate states produced by global climate models (GCMs) as
boundary conditions, and solve equations of motion for the atmosphere on a
regional grid to produce regional climate projections. The main advantages of
RCMs over GCMs are increased resolution, more parsimony in terms of
representing sub-grid-scale processes, and often improved modelling of
spatial patterns, particularly in regions with coastlines and considerable
topographic features (e.g. ). Current
computing power now allows ensembles of regional climate models to be
run, enabling the sampling of model structural uncertainty
().
Along with these ensemble modelling studies, methods for extracting
probabilistic projections have followed
(). While these studies
all take a Bayesian approach, the implementations differ. For example,
and model both the RCM output and the
observations as a function of time. However, this implementation uses too
many parameters to be applicable to short (e.g. 20-year) time series common
in regional climate modelling. Furthermore, the results are affected by
climate model convergence: the output from the outlier models is pulled
towards clusters of converging models. The method is
applicable to relatively short time series; however, convergence still
influences model predictions.
introduced Bayesian model averaging to the RCM model
processing. In their framework, model clustering does not affect the results,
incorporating their belief that clustering can occur due to common model
errors. Furthermore, they provide model weights – a useful diagnostic of
model performance. The weights depend on model performance in terms of trend,
bias, and internal variability. However, their approach still suffers from
shortcomings. Specifically, the observations are modelled as a function of
smoothed model output. However, the smoothing requires subjective choices,
and the uncertainty in the smoothing choice is not explicitly considered.
Second, in the projection stage the implementation does not
fully account for the uncertainty in model biases and in standard deviation
of the model–data residuals.
Several authors have shown that in many regions, future changes are
positively correlated with present-day internal variability in the models:
see and . This means that knowing internal
variability may provide important information and potentially improve future
projections. While previous works have included information from internal
variability in their statistical model, the information was not used to
directly penalise the models for getting the internal variability wrong: see
for example and . was the
first attempt to incorporate this information via penalising model priors.
However, the priors were chosen ad hoc. A fundamental improvement of the
present work is that it weights the models not just by their performance in
terms of the mean, but also in terms of the internal variability, in a
principled way.
In this article, we propose a new method to obtain model weights using raw
model output, so the method better accounts for model output uncertainty. Our
framework allows us to compute weights efficiently, simultaneously penalising
for model bias, deviations in trend and model internal variability. One of
the main advantages of the current approach is that improper and vague
priors for the model parameters can be used, which makes implementation of
the method much more straightforward. In the framework,
subjective and informative parameter choices are required. Such choices
strongly affect the resulting weights and inference. In addition, their
framework cannot accommodate improper priors since they need to be able to
sample directly from the prior.
Below, we describe the Bayesian methodology we have developed, followed by a
Markov chain Monte Carlo (MCMC) method for obtaining the posterior
distributions. The technique is then applied to a regional climate model
ensemble and compared with results found in previous work
().
Pictorial representation of the weight distribution on μ and
σ.
Posterior predictive weighting
In this section, we introduce the Bayesian methodology for weighting model
output based on current day observations. The framework we describe below is
not limited to any particular distributional form, although the analysis
presented is based on the univariate normal distribution. We have also
implemented the same procedure using the asymmetric Laplace distribution for
median regression to obtain robust estimators for our analyses, but we have
excluded them from presentation as the procedure produced similar results to
that of the normal error assumption (indicating no major violations from
normality). We suppose that current day observations are
denoted as yt, where t=1,…,T is a set of indices for time. We
assume that the present-day observations over time can be described by
$$y_t = a_p + b_p (t - t_1) + \epsilon_t,$$
where $\epsilon_t \sim N(0, \sigma_p)$, $t = t_0, \ldots, t_0 + T$, and $t_0$ is
the first year for which observations are available, with $t_1 = t_0 + T/2$.
Formulating the equation in terms of t1 allows us to interpret ap as
the mean value of the observations. This model is reasonable for the type of
short time series temperature data that we consider. We assume that the data
yt are independent between observations. Let xtm,t=1,…,T denote
data generated by the mth model over the same time period, where
m=1,…,M, and we assume that each set of model outputs can be
adequately modelled by
$$x_t^m = a_m + b_m (t - t_1) + \epsilon_t,$$
with $\epsilon_t \sim N(0, \sigma_m)$. Again, the $x_t^m$ are assumed independent.
The parameters am,bm,σm can be obtained under the Bayesian
paradigm by first specifying a prior distribution p(am,bm,σm),
and the posterior distribution given data xm is subsequently obtained via
the Bayes rule,
$$p(a_m, b_m, \sigma_m \mid x^m) \propto L(x^m \mid a_m, b_m, \sigma_m)\, p(a_m, b_m, \sigma_m),$$
where L(xm|⋅) denotes the likelihood of obtaining data xm from
model m. In this work, vague priors are used throughout. The use of a
vague prior allows the data to discriminate amongst models, whereas
informative priors reflect the scientist's personal knowledge, and can lead
to more subjective analyses. Vague priors are sometimes considered
preferable when data contain sufficient information or when subjective
knowledge is uncertain. Conjugate analyses for certain classes of models,
including Gaussian error models, are often possible, leading to analytical
forms for the posterior distributions. In this work, we choose to present the
results with non-standard priors, and use MCMC for computation. This approach
is much easier when extending to more complex modelling scenarios.
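To make the fitting step concrete, here is a minimal Python sketch of posterior sampling for the linear model above. It assumes the improper reference prior p(a, b, sigma) proportional to 1/sigma, under which the posterior has a closed conjugate form, so direct sampling can replace MCMC; the analysis in this paper used R's MCMCpack instead, and the function name below is our own.

```python
import numpy as np

def posterior_samples(x, t, n_samples=4500, rng=None):
    """Draw joint posterior samples of (a, b, sigma) for the linear model
    x_t = a + b*(t - t1) + eps,  eps ~ N(0, sigma),
    under the improper reference prior p(a, b, sigma) proportional to 1/sigma."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(x)
    tc = t - t.mean()                        # centre time so that a is the mean level
    X = np.column_stack([np.ones(T), tc])    # design matrix [1, t - t1]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ x)           # least-squares estimates (a_hat, b_hat)
    resid = x - X @ beta_hat
    s2 = resid @ resid                       # residual sum of squares
    # sigma^2 | x  ~  s2 / chi^2_{T-2}   (scaled inverse chi-square)
    sigma2 = s2 / rng.chisquare(T - 2, size=n_samples)
    # (a, b) | sigma^2, x  ~  N(beta_hat, sigma^2 * (X'X)^{-1})
    L = np.linalg.cholesky(XtX_inv)
    z = rng.standard_normal((n_samples, 2))
    ab = beta_hat + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return ab[:, 0], ab[:, 1], np.sqrt(sigma2)
```

For non-conjugate extensions (e.g. the asymmetric Laplace model mentioned above) the same samples would instead come from an MCMC run.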
New South Wales planning regions, the ACT and the state of
Victoria.
Results for the CC region of south-eastern Australia, in the DJF
season. Top row: weights wm of the 12 models based on Eq. () (left),
weights wm,I based on Eq. () (middle) and weights wm,T based on
Eq. () (right). Each triplet represents a
GCM (MIROC3.2, ECHAM5, CCCMA3.1, and CSIRO-Mk3.0). Middle row and first plot
of last row: fitted observations according to Eq. () (red dashed
line) and fitted model output according to Eq. () for the 12 models.
Last row: weighted fit based on wm (solid black line, middle plot), weighted
fit based on wm,I (solid green line) and weighted fit based on
wm,T (solid blue line, right plot).
Results for the FW region of south-eastern Australia, in the DJF
season. Top row: weights wm of the 12 models based on Eq. () (left),
weights wm,I based on Eq. () (middle) and weights wm,T based on
Eq. () (right). Each triplet represents a
GCM (MIROC3.2, ECHAM5, CCCMA3.1, and CSIRO-Mk3.0). Middle row and first plot
of last row: fitted observations according to Eq. () (red dashed
line) and fitted model output according to Eq. () for the 12 models.
Last row: weighted fit based on wm (solid black line, middle plot), weighted
fit based on wm,I (solid green line) and weighted fit based on
wm,T (solid blue line, right plot).
Results for the CWO region of south-eastern Australia, in the MAM
season. Top row: weights wm of the 12 models based on Eq. () (left),
weights wm,I based on Eq. () (middle) and weights wm,T based on
Eq. () (right). Each triplet represents a
GCM (MIROC3.2, ECHAM5, CCCMA3.1, and CSIRO-Mk3.0). Middle row and first plot
of last row: fitted observations according to Eq. () (red dashed
line) and fitted model output according to Eq. () for the 12 models.
Last row: weighted fit based on wm (solid black line, middle plot), weighted
fit based on wm,I (solid green line) and weighted fit based on
wm,T (solid blue line, right plot).
Posterior predictive projections of DJF temperature change in
2060–2079 compared to 1990–2009 for regions in south-eastern Australia.
Black lines correspond to wm weights, green lines to wm,I
weights and blue lines to wm,T weights. Red lines are results
from . Black vertical lines represent 95 % credible
intervals, and red vertical lines represent the 95 % credible intervals
obtained by . Circles represent the difference between the
changes in temperature using the individual models. Black crosses indicate
the simple ensemble mean of the changes in temperature.
Bootstrapped weighted projections of DJF temperature change in
2060–2079 compared to 1990–2009 for regions in south-eastern Australia.
Black lines correspond to wm weights, green lines to wm,I
weights and blue lines to wm,T weights. Red lines are results
from . Black vertical lines represent 95 % credible
intervals, and red vertical lines represent the 95 % credible intervals
obtained by . Circles represent the difference between the
changes in temperature using the individual models. Black crosses indicate
the simple ensemble mean of the changes in temperature.
We would like to weight the models based on the similarity of output xtm
to the observation data. We note that good performance of a model under
recent conditions does not guarantee that it will perform well under future
climate conditions, but we assume that good performance under recent
conditions is an indication of reliable performance in future climates.
This translates to preferring models whose parameters am,bm,σm
are similar to ap,bp,σp. In practice σp has additional
terms, due to instrumental and gridding error associated with collecting
observational data. This additional error is not reflected in the model
output. performed error analyses for 2001–2007 for
Australian climate data, and found that the root mean squared error for
monthly temperature data ranges between 0.5 and 1 K. For our analyses of
seasonally averaged temperature data in Sect. , we set the
additional error to be δ=0.5 K. Resulting weights were largely
insensitive to values of δ between 0.5 and 1.
Finally, we define the weight for each model m to be of the form
$$w_m = \int L\left(y \mid a_m, b_m, \sqrt{\sigma_m^2 + \delta^2}\right) p(a_m, b_m, \sigma_m \mid x^m)\, \mathrm{d}a_m\, \mathrm{d}b_m\, \mathrm{d}\sigma_m,$$
where $L(y \mid a_m, b_m, \sqrt{\sigma_m^2 + \delta^2})$ denotes the
likelihood of the observational data $y$ given the parameters of the $m$th
model. The weight $w_m$ fully accounts for the uncertainty in the estimates
of $a_m$, $b_m$ and $\sigma_m$ by averaging over the posterior distribution
$p(a_m, b_m, \sigma_m \mid x^m)$. Clearly, the right-hand side of Eq. () will be
larger if $a_m$, $b_m$ and $\sqrt{\sigma_m^2 + \delta^2}$ are similar to
$a_p$, $b_p$ and $\sigma_p$, i.e. if the distributions of $y$ and $x^m$ are
similar (up to the observational error $\delta$). We term these
weights the posterior predictive weights. Note that Eq. () is
simply the marginal likelihood p(y|xm), i.e. the probability of observing
data y given xm, averaging over any model parameter uncertainties. The
term am and its deviation from ap in the observation model can be
considered as penalising bias between model output and observation, the
deviation between bm and bp can be thought of as a penalty for trend,
and the terms σm and σp account for the differences of model
and observation internal variability.
The ensemble models can now be combined into a single posterior model, using
the weights
$$p(a_{\mathrm{BMA}}, b_{\mathrm{BMA}}, \sigma_{\mathrm{BMA}} \mid x^1, \ldots, x^M) = \sum_{m=1}^{M} w_m\, p(a_m, b_m, \sigma_m \mid x^m).$$
The above expression gives an ensemble estimate for the posterior
distribution of $a$, $b$ and $\sigma$ from the $M$ model
outputs, which we denote $a_{\mathrm{BMA}}$, $b_{\mathrm{BMA}}$ and
$\sigma_{\mathrm{BMA}}$. Note that the weights must be normalised so that
$\sum_{m=1}^{M} w_m = 1$.
In order to understand this weight, suppose for the moment that the data
y come from a N(0,1) distribution, and that xm comes from
N(μ,σ). If the posterior distributions of μ and σ
are centred around 0 and 1, then xm should be assigned a higher weight; as
the values of μ and σ diverge from 0 and 1, we should see a
decrease in the respective weights. Figure plots the likelihood of
50 values of y simulated from the N(0,1) distribution: the left panel shows
the weights for σ fixed at 1 as μ varies from -2 to 2, and the
right panel shows the weights for μ fixed at 0 as σ varies from
0.01 to 5. The figure corresponds to a single term inside the weight in
Eq. (), where $a_{m,i}$ and $b_{m,i}$ correspond to
μ and $\sqrt{\sigma_{m,i}^2 + \delta^2}$ corresponds to σ.
See also Eq. () below. The figure shows how the weight
changes as the parameter values move away from the true values of 0 and 1. In
the case of single fixed values of μ and σ, the weights simply
correspond to the likelihood at these values. In practice, the weights in
Eq. () average over the set of posterior values of μ and
σ.
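This behaviour is easy to reproduce numerically. The sketch below (illustrative Python, not the paper's code) evaluates the likelihood of 50 simulated N(0,1) observations at the true parameters and at perturbed ones:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=50)   # "observations" drawn from N(0, 1)

def loglik(y, mu, sigma):
    """Gaussian log-likelihood of y under N(mu, sigma)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - 0.5 * ((y - mu) / sigma) ** 2)

# For a single fixed parameter pair, the (unnormalised) weight is just the
# likelihood: it peaks near the true mu = 0, sigma = 1 and decays away from them.
at_truth = loglik(y, 0.0, 1.0)
off_mean = loglik(y, 2.0, 1.0)    # biased mean
off_sd   = loglik(y, 0.0, 3.0)    # inflated internal variability
```

Both perturbed values give a markedly lower log-likelihood than the true parameters, mirroring the two panels of the figure.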
It is worth noting that even if we specify non-informative priors in
Eq. () for all models, the implied priors used in our approach are
not uninformative. As pointed out by H. R. Künsch, some form of
informative priors must be used because the data available simply do not
contain information for certain parameters of the model for the future (see
for an alternative formulation, which also requires some
form of informative prior specification). In the current case, our modelling
approach assumes that the relationship between future climate and future
model output behaves in a similar way to the relationship between present-day
climate and present-day model output. We consider that there is a perfect
model that has the same parameters (intercept, slope and standard deviation)
in both the present and the future. We then compute the probability that any
model m is this perfect model, based on present-day data. These assumptions
can be seen as an informative prior on the parameters governing future
observations, although these parameters are not explicitly modelled.
Computation
The procedure for the calculation of weights is designed to be applicable
regardless of the distributional forms chosen to model the data. In most
cases, the posterior distributions p(am,bm,σm|xm) in
Eq. () will be analytically intractable; however, samples from
this distribution can easily be obtained via MCMC. Many software packages
performing MCMC are available. For the analysis in this paper, we used the
MCMCpack library of the R statistical package . MCMC is an iterative algorithm, and it is
necessary to check for convergence and throw away an initial burn-in period
of the chain. For our simulations, we used 5000 chain iterations, throwing
away the initial 500 iterations as burn in, retaining N=4500 MCMC samples
to work with. Default priors from MCMCpack were used throughout this
paper. For the model and data used in this paper, only a routine application
of MCMC was required. However, more complex models and data typically require
advanced knowledge of MCMC; see for more on MCMC.
In addition to obtaining simulations from the posteriors of the M ensemble
models, the weight calculation in Eq. () also involves an
intractable integral, which we can approximate using standard Monte Carlo
$$w_m \approx \frac{1}{N} \sum_{i=1}^{N} L\left(y \mid a_{m,i}, b_{m,i}, \sqrt{\sigma_{m,i}^2 + \delta^2}\right),$$
where $L(y \mid a_{m,i}, b_{m,i}, \sqrt{\sigma_{m,i}^2 + \delta^2})$ denotes
the likelihood of $y$ under the $i$th sample $(a_{m,i}, b_{m,i}, \sigma_{m,i})$
from the posterior distribution $p(a_m, b_m, \sigma_m \mid x^m)$. Thus, the
$N = 4500$ MCMC samples obtained for each model are used to compute the Monte
Carlo sum in Eq. (). Again, the weights should be normalised so that
$\sum_{m=1}^{M} w_m = 1$.
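The Monte Carlo approximation can be sketched as follows (in Python rather than the R used for the paper's analysis; the function name and array layout are our own). Working on the log scale with a log-mean-exp guards against numerical underflow, since a product of many Gaussian densities can be astronomically small:

```python
import numpy as np

def log_weight(y, t, a_s, b_s, sigma_s, delta=0.5):
    """Log of the unnormalised posterior predictive weight for one model:
    the average, over posterior samples (a_i, b_i, sigma_i), of the likelihood
    of observations y under N(a_i + b_i*(t - t1), sqrt(sigma_i^2 + delta^2))."""
    tc = t - t.mean()                                   # centre time at t1
    sd = np.sqrt(sigma_s**2 + delta**2)[:, None]        # (N, 1)
    mu = a_s[:, None] + b_s[:, None] * tc[None, :]      # (N, T)
    ll = np.sum(-0.5 * np.log(2 * np.pi * sd**2)
                - 0.5 * ((y[None, :] - mu) / sd) ** 2, axis=1)
    m = ll.max()                                        # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(ll - m)))
```

Normalised weights across the M models then follow by exponentiating the log-weights (after subtracting their maximum) and dividing by their sum.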
Finally, the predictive distribution for the future climate $y_t^f$,
$t = 1, \ldots, T'$, given future model output denoted as
$x^{f,1}, \ldots, x^{f,M}$, is defined as
$$p(y_1^f, \ldots, y_{T'}^f \mid x^{f,1}, \ldots, x^{f,M}) = \int p(y_1^f, \ldots, y_{T'}^f \mid a_{\mathrm{BMA}}^f, b_{\mathrm{BMA}}^f, \sigma_{\mathrm{BMA}}^f)\, p(a_{\mathrm{BMA}}^f, b_{\mathrm{BMA}}^f, \sigma_{\mathrm{BMA}}^f \mid x^{f,1}, \ldots, x^{f,M})\, \mathrm{d}a_{\mathrm{BMA}}^f\, \mathrm{d}b_{\mathrm{BMA}}^f\, \mathrm{d}\sigma_{\mathrm{BMA}}^f.$$
Application
Here we consider the same data as – temperature output from
NARCliM (New South Wales/ACT Regional Climate Modeling Project,
). This project is the most comprehensive regional
modelling project for south-eastern
Australia, and the first to systematically explore climate model structural
uncertainties. The NARCliM ensemble downscales four GCMs (MIROC3.2, ECHAM5,
CCCMA3.1, and CSIRO-Mk3.0) with three versions of the WRF modelling framework
(which we call R1, R2, and R3, )
that differ in parameterisations of radiation, cumulus physics, surface
physics, and planetary boundary layer physics. NARCliM output has been
evaluated in terms of its ability to reproduce the observed mean climate
(), climate extremes
(), and important
regional climate phenomena (). These studies
demonstrate that while the downscaling has provided added value
(), a range of model errors are present within the
ensemble. For the analysis, we focus on seasonal–mean temperature
differences as modelled by the inner NARCliM domain RCMs between years
1990–2009 (present) and 2060–2079 (far future). We discard partial seasons
from the analysis.
Here we average the temperatures over south-eastern Australian regions that
include New South Wales (NSW) planning regions, ACT, and Victoria; see
Fig. . Corresponding temperature observations are derived from
the AWAP project . The models are generally cooler than the
observations; however, in many cases the observations span the mean model
climate.
In addition to computing weights of the form in Eq. (), we also
compute two variants of the weight: one based on penalising only the
intercept am and internal variability σm, and an alternative
weight based on penalising only the slope term bm and internal variability
σm. This is achieved by modifying Eq. () to
$$w_{m,I} = \int L\left(y \mid a_m, b_p, \sqrt{\sigma_m^2 + \delta^2}\right) p(a_m, \sigma_m \mid x^m)\, \mathrm{d}a_m\, \mathrm{d}\sigma_m$$
or
$$w_{m,T} = \int L\left(y \mid a_p, b_m, \sqrt{\sigma_m^2 + \delta^2}\right) p(b_m, \sigma_m \mid x^m)\, \mathrm{d}b_m\, \mathrm{d}\sigma_m,$$
where wm,I penalises models with large biases and wrong
internal variability, and wm,T penalises models with the wrong
trend and internal variability. Note that our proposed weight wm penalises
bias, trend and internal variability simultaneously. The weights
wm,I and wm,T can be computed by fitting the
observation data to the model in Eq. () to obtain estimates for
ap and bp, and using only the posterior samples of am,bm and
σm to complete the calculation.
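A hedged sketch of how the two variants differ from the full weight, expressed as a modification of the Monte Carlo computation above (illustrative Python; the `fix` argument is our own naming, and the plug-in estimates of a_p and b_p come from a least-squares fit of the observations, as described):

```python
import numpy as np

def log_weight_variant(y, t, a_s, b_s, sigma_s, fix=None, delta=0.5):
    """Variant posterior predictive weights:
    fix='slope'     plugs the observational slope b_p into every sample (w_{m,I}),
    fix='intercept' plugs the observational intercept a_p in instead (w_{m,T}),
    fix=None        reproduces the full weight w_m."""
    tc = t - t.mean()
    # least-squares fit of the observations gives plug-in estimates a_p, b_p
    a_p, b_p = y.mean(), (tc @ y) / (tc @ tc)
    if fix == 'slope':        # w_{m,I}: penalise bias + internal variability only
        b_s = np.full_like(a_s, b_p)
    elif fix == 'intercept':  # w_{m,T}: penalise trend + internal variability only
        a_s = np.full_like(b_s, a_p)
    sd = np.sqrt(sigma_s**2 + delta**2)[:, None]
    mu = a_s[:, None] + b_s[:, None] * tc[None, :]
    ll = np.sum(-0.5 * np.log(2 * np.pi * sd**2)
                - 0.5 * ((y[None, :] - mu) / sd) ** 2, axis=1)
    m = ll.max()
    return m + np.log(np.mean(np.exp(ll - m)))
```

A model with a large bias but the correct trend will score much higher under fix='intercept' (w_{m,T}) than under the full weight, which is exactly the behaviour discussed for models 1 and 2 below.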
Cross validation of weighted projections of DJF temperature change
in 2060–2079 compared to 1990–2009 for region CC in south-eastern
Australia. Black lines correspond to wm weights; green lines to
wm,I weights and blue lines to wm,T weights. Each plot
represents the weighted posterior predictive distribution of temperature
change using the ith model's current output as the observations, with the
remaining 11 models weighted. Vertical lines represent 95 % credible intervals.
Crosses indicate the actual changes between the future model output and the
current model output of the ith model.
Mean squared error and 95 % coverage probabilities for the three sets of weights.

           DJF              MAM              JJA              SON
           MSE     Cov      MSE     Cov      MSE     Cov      MSE     Cov
wm         48.43   0.938    14.44   0.958    14.15   0.910    41.86   0.917
wm,I       52.06   0.965    21.34   0.979    17.89   0.944    43.55   0.951
wm,T       56.93   0.993    30.74   0.979    20.45   0.972    39.79   1.000
Figure shows the weight calculation of each model based on
Eq. (), for the CC region in season DJF. We used the observed
data and the corresponding model output for the years 1990–2009. One can see
how the three different types of weights behave relative to the bias and
slope of the model output. For example, in Fig. , models 1, 2 and 3
(left figure, middle row) and 10, 11 and 12 (left figure, bottom row) have
large biases compared to the other models; consequently, wm and wm,I
give these models almost no weight. On the other hand, these models simulate
the trend well and are preferred by wm,T.
The weighted fits are shown in the last two plots in the bottom row of
Fig. . The black line is computed using wm, according to
$$\hat{y}_t = \sum_{m=1}^{M} w_m (a_m + b_m \cdot t),$$
where am and bm are taken as the posterior means of the MCMC samples,
and $t = -9.5, \ldots, 9.5$. Similar calculations based on
wm,I and wm,T are shown in green and blue,
respectively. The plots here suggest that the weights wm are similar to
wm,I, and better than wm,T in this case. We note
that there are dependencies between the RCMs driven by the same GCM. Our
weight calculation does not model this dependence. So if different GCMs drive
a different number of RCMs, the weights will over-represent some models but
not others. While for most cases, the weights given by wm,I
provide similar weighted fits to wm, Fig. (showing the FW
region for season DJF) demonstrates the instances where the weighted fit
produced by wm,I is clearly worse than wm. The green line in
the final plot shows that wm,I produces a fit which is very
close to the observation at the intercept but fails to capture the trend.
This is unsurprising since this weight penalises deviations of am from
ap. Similarly, the blue line (wm,T) appears to better capture
the trend, but clearly retains a large bias, since this weight fails to
penalise for bias. The weight wm is a compromise between the two. From the
weight plots in the first row, the models that have non-negligible weights
under wm,I are 6, 7, 11 and 12, corresponding to models whose
intercepts are closest to the intercept of the observation model. The weights
wm,T are more spread out, giving high weights to models 1 and
2, which have large biases but capture the trend well. The weights wm
allocate most weight to models 6 and 7. Both models closely follow the shape
of the observed data. In fact, the weights
wm,T can sometimes capture the increase in trend better than
wm; this was the case in some of the regions in the SON season. A more
formal evaluation of the three different weights will be carried out later in
this section.
For seasons JJA and MAM, weights wm and wm,I were quite
similar in all regions. These weights gave very close fits to the observation
model, while wm,T captured the trend well but gave biased fits
to the observation. Generally for these two seasons, fewer models had
non-negligible weights compared with DJF and SON. In DJF and SON, the weights
were distributed more evenly across the models. This suggests that some of
the individual models in JJA and MAM were performing strongly. Interestingly
for MAM, the two models that dominated most regions are models 8 and 9; see
for example the results for region CWO in Fig. . We can see the
goodness of fit of these two models individually (see second row, right
plot), and clearly they were markedly better than the other competing models.
The corresponding posterior predictive distributions of the projected change
in temperature for season DJF over the different regions of south-eastern
Australia are plotted in Fig. . The pdfs show the mean temperature
change in the period 2060–2079 compared to 1990–2009. To obtain
the posterior predictive projection pdf, we begin by fitting, via MCMC, each
future model output for the period 2060–2079, obtaining the posterior
distribution $p(a_m^f, b_m^f, \sigma_m^f \mid x^{f,m})$.
Here we obtained 5000 posterior samples of $a_m^f$, $b_m^f$
and $\sigma_m^f$. We then obtain 10 000 random samples for each
pdf. Each sample is obtained as follows.
1. With probability $w_m$, randomly select a sample from the posteriors of
$a_m^f$, $b_m^f$ and $\sigma_m^f$, say $a_{m,i}^f$, $b_{m,i}^f$ and
$\sigma_{m,i}^f$.
2. Simulate a predictive temperature series $y_t^f$ according to
$$y_t^f \sim N\left(a_{m,i}^f + b_{m,i}^f (t - t_1),\ \sigma_{m,i}^f\right), \quad t = 2060, \ldots, 2079,$$
with $t_1 = 2069.5$. This process produces the posterior predictive samples
$y_t^f$ according to Eq. ().
3. Compute the current model estimate $\hat{y}_t^m = a_m + b_m (t - t_1)$,
for $t = 1990, \ldots, 2009$ and $t_1 = 1999.5$, where $a_m$ and $b_m$ are
the posterior means based on model $m$ and current model output $x^m$.
4. Compute the mean of the differences between the future prediction $y_t^f$
and $\hat{y}_t^m$.
This process produces the posterior predictive distributions for the mean
difference between the posterior predictive samples ytf and the
current estimate of climate.
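The sampling procedure above can be sketched as follows (illustrative Python; the names and array layouts are ours, and the posterior means for the present-day fits are assumed precomputed):

```python
import numpy as np

def projection_samples(w, post_future, post_current_means, n_draws=10000, rng=None):
    """Draw posterior predictive samples of the mean temperature change.
    w: normalised weights, shape (M,).
    post_future[m]: arrays (a_f, b_f, sigma_f) of posterior samples for model m
                    fitted to its future output.
    post_current_means[m]: posterior means (a_m, b_m) from the present-day fit."""
    rng = np.random.default_rng() if rng is None else rng
    t_f = np.arange(2060, 2080) - 2069.5   # future period, centred at t1
    t_c = np.arange(1990, 2010) - 1999.5   # present period, centred at t1
    out = np.empty(n_draws)
    models = rng.choice(len(w), size=n_draws, p=w)  # pick model m w.p. w_m
    for k, m in enumerate(models):
        a_s, b_s, s_s = post_future[m]
        i = rng.integers(len(a_s))                       # one joint posterior draw
        y_f = rng.normal(a_s[i] + b_s[i] * t_f, s_s[i])  # simulated future series
        a_c, b_c = post_current_means[m]
        y_c = a_c + b_c * t_c                            # present-day fitted series
        out[k] = y_f.mean() - y_c.mean()                 # mean temperature change
    return out
```

The returned samples form the mixture pdf of the mean change; credible intervals are read off as its 0.025 and 0.975 quantiles.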
We present the results for season DJF in Fig. . The black lines in
Fig. correspond to the pdf given by wm, the green lines
correspond to wm,I and the blue lines correspond to
wm,T. The red circles indicate the difference between the means
of y^t and y^tf from each of the 12 models; the
cross indicates the mean of these differences. Black vertical lines indicate
the 95 % credible interval for predictions made with wm (black line).
We can see that the pdfs based on wm and wm,I are similar to
each other, while the ones given by wm,T deviate substantially
from the other two. We also superimposed the pdf obtained in
in red for comparison. The corresponding 95 % credible interval is shown in
red vertical lines. It can be seen that our method generally provides a more
precise prediction interval. To compare the two predictive
distributions fairly, we also compute the posterior predictive distribution
using the method described by . Unlike our posterior predictive pdf, the
pdf in was obtained by bootstrapping the errors, and does not
account for the uncertainty in the parameter estimates of am, bm and
σm. To isolate the effect of the different weights between
our method and that of , we also show in Fig.
the bootstrapped pdf. Here the red line indicates the pdf using the
weights of , with the 95 % credible interval shown as red
vertical lines; we can see that these weights generally produce
significantly larger credible intervals than our approach.
The incidence of bimodality or multimodality is reduced in our approach
compared to , suggesting a smoother mixing of models induced
by our weighting. Our approach generally produced sharper, better-defined peaks
in the posterior pdf. This could be due to the fact that our penalisation is
done simultaneously, whereas consider the penalty for bias
and internal variability separately.
In order to assess the ensemble pdf, we performed a series of
cross-validation checks. For each region at a given season, we have 12
current model outputs and 12 future model outputs. We select 1 of the models,
mi, treat the current model output for mi as the truth, and weight
the remaining 11 models. We then cycle through all 12 models, setting
mi=1,…,12. Figure shows the weighted projections for
region CC in season DJF, each plot corresponding to using 1 of the 12 models
as truth.
Table shows the empirical coverage probabilities based on 144
cross-validation datasets for each of the seasons DJF, MAM, JJA and SON. The
coverage probabilities are computed by counting the number of times the true
mean change in temperature falls inside the 95 % credible interval,
taken as the 0.025 and 0.975 quantiles of the posterior predictive
samples. Each weighting method produces a different set of credible
intervals. We see from the table that both wm and wm,I
perform quite close to the nominal 95 % level, but the intervals given by the
weight wm,T are a little too wide.
the mean squared error for each season: this is calculated as the average
squared difference between the posterior predictive sample and the true
value. The sums over all regions and all cross-validation sets are reported
in Table . Overall, the weights wm performed consistently well
in this respect. wm outperforms wm,I in all seasons. The
poorer performance of wm,T is largely due to the large biases
in the models favoured by wm,T. One possibility for making
the wm,T weighting more useful is to apply some form of post hoc
bias correction to the weighted estimates.
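The cross-validation metrics reported in the table can be computed along the following lines (illustrative Python; function and variable names are ours):

```python
import numpy as np

def coverage_and_mse(samples_list, truths):
    """Empirical 95 % coverage and summed MSE over cross-validation folds.
    samples_list[k]: posterior predictive draws for fold k,
    truths[k]: held-out 'true' mean temperature change for that fold."""
    hits, per_fold_mse = [], []
    for s, truth in zip(samples_list, truths):
        lo, hi = np.quantile(s, [0.025, 0.975])   # 95 % credible interval
        hits.append(lo <= truth <= hi)
        # average squared difference between predictive samples and the truth
        per_fold_mse.append(np.mean((s - truth) ** 2))
    return np.mean(hits), np.sum(per_fold_mse)
```

Coverage close to 0.95 indicates well-calibrated intervals; values near 1.0, as seen for wm,T, indicate intervals that are too wide.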