An alternative way to evaluate chemistry-transport model variability

. A simple and complementary model evaluation technique for regional chemistry transport is discussed. The methodology is based on the concept that we can learn about model performance by comparing the simulation results with observational data available for time periods other than the period originally targeted. First, the statistical indicators selected in this study (spatial and temporal correlations) are computed for a given time period, using colocated observation and simulation data in time and space. Second, the same indicators are used to calculate scores for several other years while conserving the spatial locations and Julian days of the year. The difference between the results provides useful in-sights on the model capability to reproduce the observed day-to-day and spatial variability. In order to synthesize the large amount of results, a new indicator


Introduction
Chemistry-transport models (CTMs) aim at simulating the atmospheric composition where humans and the environment can be affected by air pollution. Air pollution results from the presence of chemical compounds emitted into the atmosphere due to anthropogenic activities and natural sources (biogenic emissions from vegetation, soil erosion, sea salt, volcanic activity and wildfires). CTMs are used to represent the dynamical and chemical processes that drive spatial and temporal features of the atmospheric composition.
To estimate the quality of CTMs, model output results are usually compared with available observations. These comparisons have been performed for as long as the models have existed; they are crucial for quantifying the ability of models to reproduce particular events or a general behavior. The quantification of the model quality is performed in every research work. It depends on the case being studied, the modeled variables and the spatial and temporal resolutions. The comparison between observations and model outputs is a complex task and has to take into account numerous factors such as the spatial representativeness of the monitoring stations (Valari and Menut, 2008;Solazzo and Galmarini, 2015). For many years, the best approach to evaluate a model's results has been discussed and, in the field of atmospheric composition, numerous methods were proposed. It is not possible to give an exhaustive list of all validation studies and we present some examples here. Baldridge and Cox (1986) and Cox and Tikvart (1990) proposed the use of error statistics like correlation, bias and root mean squared error (RMSE) in the specific framework of air quality, i.e., the atmospheric composition when crite-ria pollutant concentrations exceed predefined limit values. Chang and Hanna (2004) also proposed an evaluation framework dedicated to air quality model performance and explained that there is not "a single best evaluation methodology" and how important it is to use as many evaluation criteria as possible to really understand model results well. Later, and in order to ensure the use of systematic procedures in the evaluation process, dedicated tools were developed for the model evaluation. For example, Appel et al. (2011) and Galmarini et al. (2012) proposed complex statistical modules to extract all possible information related to the capability of a model to reproduce an observed event. In parallel, some studies were dedicated to revisit the way to evaluate models such as Thunis et al. (2012), dedicated to air quality in a policy framework. In this study, the authors proposed the "target diagram" to have the bias and the RMSE on the same plot. Complementary to the definition of performance indicators to be used, Simon et al. (2012) used these indicators to compile photochemical model performance for a large set of data over several years of simulation. This kind of evaluation may also be done in dedicated projects, such as the recent AQMEII (Air Quality Model Evaluation International Initiative), comparing chemistry-transport models running both in Europe and North America (Vautard et al., 2012;Campbell et al., 2015); the EURODELTA project ; and the EMEP (European Monitoring and Evaluation Programme) context in the framework of the United Nations Convention on Long-range Transboundary Air Pollution (Prank et al., 2016). Using comparisons between observations and model outputs, some studies proposed methodologies to decompose the statistical scores in order to estimate the main source of errors (Solazzo and Galmarini, 2016). Finally, other studies also use observations to adjust the result by implementing methods to unbias simulation without changing the model, as in Porter et al. (2015) for ozone over the United States. The common point of all these studies is that they are always using, as best as possible, the observations corresponding in time and location to the model grid cell.
In the present study, a simple method is proposed to add information about the model performance with a focus on its spatial and temporal variability. To reach this objective, we propose to use observations corresponding to the modeled period and geographical domain but also to use observations for the same domain but other periods. In this way, we want to extract the information about the model variability and to answer the following question: is the performance of the model satisfactory because the model is accurate or just because the model is able to reproduce a situation which is recurrent from year to year? The issue to be solved and the tools developed are presented in Sect. 2. The new methodology with the presentation of the indicator developed for this study are presented in Sect. 3. The results and discussions to point out the drivers of model errors are presented in Sects. 4 and 5 for the new indicator.

Methodology
In the present study, a simple method is developed to improve the evaluation of model variability and to identify the processes responsible for discrepancies of model outputs vs. observations. The methodology is general and could be applied to all types of models. In this study, the methodology is presented for the specific case of the regional atmospheric composition modeling: a topic mixing meteorology and chemistry, with a high spatial and temporal variability, thus having a good potential to test the relevance of our methodology.

Regional chemistry-transport modeling
In chemistry-transport modeling, several processes are involved, some of them directly influencing the others. When studying both meteorological and chemical variables, the dependencies between all variables are helpful to better interpret the model results.
The boundary conditions prescribe the concentrations of chemical species which may enter the simulation domain. Usually for large domains, they are issued from global models as monthly climatologies. They correspond to averaged values suitable to characterize the background concentrations of long-lived species such as ozone, carbon monoxide and mineral dust. Anthropogenic emissions are prescribed from databases and the influence of meteorology is limited in the model. Vegetation, fire and mineral dust emissions depend both on land-use data and meteorology. These emissions are not measurable; it is almost impossible to directly assess their quality.
The meteorological variables influence transport and mixing processes, with a direct effect on gas and aerosol plume locations and their vertical distribution. Cloudiness and temperature impact the photolysis efficiency; the boundary layer height impacts the surface mixing of pollutants; rainfall impacts the wet deposition. Moreover, meteorology also has an impact on emissions: wind variability is the prevalent driver for dust emissions, and it has also a major impact on wildfire emissions. Both temperature and solar irradiance influence the magnitude of biogenic emissions from vegetation. The spatial variability of land-use data also has a strong impact on all these natural emissions.
The chemistry-transport model is a numerical integration tool of all forcings and processes. The chemical mechanism handles the life cycle of chemical species (production and loss) when the deposition processes are the only net sink of species. In the model, the spatial (horizontal and vertical) and temporal resolutions are also prescribed, directly impacting the simulation representativeness and thus the quality of the modeled air pollutant concentrations when they are compared to available observations.

The studied case
The study focuses on the summer 2013 period (1 May to 31 August) over the European Mediterranean region. This period is called the "reference period" in this paper. This case has already been modeled (using the same models, WRF and CHIMERE) and the results were discussed in Menut et al. (2015). The same simulation is used in this study; all parameters are identical.
The observational data come from different sources depending on the variables (see Table 1). In this region, where the monitoring networks are dense enough, comparisons are performed with observations from surface stations that provide hourly O 3 , NO 2 surface concentrations for gases and PM 2.5 and PM 10 (particulate matter with mean mass median diameter lower than 2.5 and 10 µm, respectively) for particles. Complementary to surface concentration data, evaluated using the EBAS database (Tørseth et al., 2012), the meteorology is also evaluated for 2 m temperature (T 2 m ), 10 m wind speed (U 10 m ) and precipitation rates (in mm day −1 ) from the BADC (British Atmospheric Data Centre). In order to quantify the transport of aerosols in dense plumes aloft, observations from the AERONET (AErosol RObotic NETwork) program are used for the aerosol optical depth (AOD) and the Ångström exponent. In this study, all variables are used as the daily mean (except for precipitation corresponding to daily cumulated values) in order to (i) have homogeneous scores between the variables and (ii) be able to separate the systematic and the day-to-day variabilities. The use of an hourly time frequency was ruled out to avoid a too strong weight of the diurnal cycle in the temporal variability.

Proposed methodology
As discussed in the introduction, many statistical indicators (SIs) exist to quantify the model ability to simulate observed pollution events. The correlations (temporal and spatial), the RMSE, its normalized expression nRMSE and the bias (the difference between observations and modeled values) are widely used in regional air pollution modeling. The correlations are able to split the relative contributions of systematic meteorology or source-related variability and day-to-day variability. The RMSE and the bias are a direct quantification of the model error.
The main goal of this study is to separate the contributions due to systematic and sporadic events. The systematic events correspond to yearly phenomena, while the sporadic events correspond to the events observed during one year but not the others. In addition, complementary to the model variability quantification, the model error is also important to estimate. The key points of this study are to (i) study the model variability which is statistically represented by the correlations and (ii) add complementary information on the model  (I mv ) calculation, using one modeled year and several years of observations. SI stands for "statistical indicator" and is related to spatial and temporal correlation.
errors, which could be represented here by the RMSE (or the nRMSE). First, as presented in Fig. 1, the SIs are calculated between observation data and model outputs for the simulation year (i.e., the reference year). Second, the SIs are calculated between the observation data for other years and the model output for the reference year. Logically, the scores calculated for the reference year for observations and model outputs would give the better results. By examining the difference with the scores calculated for other years (with the observations only), we expect to conclude whether the model is able to catch the observed variability for the correct reasons. Using this approach, the goal is to give complementary information to those usually obtained when using only SIs calculated for a single year (the studied year).
We apply this methodology for the simulation of the year 2013 and using observation data for years ranging from 2008 to 2013. In order to give some synthetic answers, the different SI scores are aggregated into a single indicator called I mv and presented in detail in the next section. Of course, it seems awkward to evaluate a model day by day with observational data from another year. For a given station, at a given day of the reference year, air concentrations will be affected by a different local meteorology, emissions and long-range transport of chemical species. However, we can consider that to take the same date for another year is strictly the same as randomly choosing a date in the same season. This trivial method can emphasize how a model is affected by large-scale patterns and long-term temporal cycles.

Calculation of correlations and nRMSE
In this study, we focus on three statistical indicators: the spatial correlation, the temporal correlation and the normalized RMSE. For these three indicators, it is important that, for all years of validation, the same list of stations with valid measurements is used.
The correlation used in this study is Pearson's correlation. Each correlation provides specific information on the quality of the simulation. The temporal correlation, noted R t , is Table 1. List of measurement data used for the statistical comparison with the model results. All data used are issued from surface stations representative of their own environment. Originally provided hourly or 3-hourly, they are used as daily averages in this work. The abbreviation "ad." is used to indicate dimensionless units.
The temporal correlation R t,i for each station i is calculated as The mean temporal correlation, R t , used in this study is thus with I the total number of stations. The spatial correlation, noted R s , uses the same formula type except it is calculated from the temporal averaged values of observations and model for each location where observations are available. A good correlation shows that the model correctly locates the largest horizontal gradients as known sources and long-range transport plumes. The spatiotemporal averaged value is estimated as and the spatial correlation is thus expressed as The normalized RMSE is expressed as for all stations i and all times t.

Definition of the I mv indicator
For the specific purpose of the model variability (and not the model error), we define an indicator, I mv , dedicated to express in one value the results obtained with the temporal and spatial correlations. The goal of this indicator is to quantify how the correlation between measurement data (for different years) and model outputs (for the reference year) evolves from one year to another. This indicator does not replace the usual statistical indicators but aims at providing complementary information about the variability between years. We first define the differences, D, between all years as with s N the score of the indicator for the reference year being modeled and s i the score of the indicator computed using observations corresponding to other meteorological years (from 1 to N − 1 if there are N − 1 other available years for the observations). We now aim to develop a simple indicator, called I mv , which is a combination of the statistical indicator for the reference year and the differences between years. This I mv corresponds, in fact, to the SI itself weighted by the differences between the SI scores of all years. We expect that I mv follows the following rules: -I mv has the same evolution as the studied SI. If the correlation increases, I mv also increases.
-I mv is bounded between 0 and 1, like the correlation. This enables us to compare the results for different variables (with different metrics).
-In the case of a high correlation value found for the studied year, the obtained s N value is close to 1. This value may be lower for the following reasons: -If the differences between the other years are low (D tends to 0), it means that the model is correct for Geosci. Model Dev., 10, 1199-1208, 2017 www.geosci-model-dev.net/10/1199/2017/ the studied year, but possibly because it reproduces a recurrent phenomena. In this case, we want I mv to decrease and tend to 0.
-If the differences between the other years are high (D tends to 1), it means the model gives good results for the studied year, but it is not because it simulates a systematic event. In this case, we want I mv to remain close to the indicator value. With s N ≈ 1 and I mv ≈ 1, we can conclude that the model is very good for the studied year and this is not due to a recurrent process.
-In the case of a low correlation value, and whatever the magnitude of differences between years, the model is not correct. I mv must be low, as it is the indicator value.
These constraints allow us to define an indicator having this kind of formulation: This means that I mv always has, as a maximum, the value of the indicator itself. The power of 4 is here defined to have a specific shape for I mv , respecting the rules presented below. Finally, this expression gives an indicator variability presented in Fig. 2. Considering the state of the art of chemistrytransport modeling, the model is considered accurate, having an acceptable variability for I mv > 0.4: this means that the correlation is at least 0.5 and the differences are also at least greater than 0.5.
Finally, this indicator is not calculated for nRMSE and bias. Two reasons explain this choice: the first reason is that, contrarily to correlations, RMSE and bias are not bounded between 0 and 1. This leads to indicator values possibly varying a lot between several years and thus being difficult to compare between years. The second reason is that the goal of the indicators is to extract a message from the model variability of the studied year compared to the other years. In this case, the correlations constitute a statistical indicator which is more appropriate for this evaluation.

Time series of statistical indicators
The calculations of differences are performed for the correlations and the nRMSE. These values are calculated for all variables described in Table 1 for the years 2008 to 2013. For each year, it is noted that only the May to August period is considered. Results are presented as time series in Fig. 3 and discussed in the following sections. Note also that some values discussed in these sections are also reported in the synthetic Table 4.

Meteorological variables
The meteorological variables are T 2 m , u 10 m and the precipitation rate. The values of the statistical scores are provided, year by year, in Fig. 3. As an example, the same values are reported for T 2 m in Table 2. T 2 m is a meteorological variable, constraining processes both for meteorology and chemistry. Its diurnal cycle is strong, as well as its latitudinal variability (for large model domains), often ensuring a good spatial correlation. In general, this variable is the least uncertain of all modeled meteorological parameters. The spatial correlation is good for all years, ranging from 0.57 (2009) to 0.62 (2011). For the studied year (2013), the score is 0.60, slightly lower than for 2011. Even if the correlation for the selected year is good, it is not significantly better than for the other year, with D = 0.02. This means that the model reproduces fairly well a spatial pattern that is observed every year. Indeed, the simulation domain is large and the temperature has a latitudinal variability larger than between each measurement station. The temporal correlation ranges from 0.25 to 0.91 (2013).  The variability of nRMSE is lower than for the correlations, with values ranging from 0.22 (2013) to 0.34 (2010). The lowest value is found for 2013, highlighting the fact that the model error is the lowest for the reference year. The model is thus performing well in capturing the day-to-day variability for T 2 m for the correct reasons. From Fig. 3, the calculation of u 10 m also gives satisfactory results with R t = 0.60. The spatial correlation, R s = 0.09, is poor and very variable from one year to another. As for T 2 m , we also have an effect of the model resolution and the representativeness of the variable.
Scores for the precipitation are correct, with a very good spatial correlation that is always exceeding 0.6. As for the temperature, the latitudinal effect plays a major role in the variability. Both the spatial and temporal correlations increase significantly for the reference year. The nRMSE is not on the plot, with the values being larger than 1.2. The model is biased in absolute values and overestimates the amount of daily precipitation. However, the day-to-day variability is correct and such variability is the most important feature for atmospheric composition modeling (the lower atmosphere is scavenged when a precipitation occurs, whatever its value).
For the meteorological variables, these scores showed that the meteorological forcing is well captured, and always better for the year being considered compared to other years.

Optical properties
The optical properties are directly linked to the atmospheric composition of aerosol and may be quantified using the AOD and the Ångström exponent (ANG).
For the AOD, the spatial correlation is very good for 2013, with R s = 0.97, but it is as good or better for other years. This means that we model a rather recurring phenomenon: every year, the same stations are, on average, exposed to aerosol plumes. The temporal correlation is lower with R t = 0.45 but much better than for other years. This indicates that the model partly reproduces the observed temporal variability but the events are changing from one year to another and the model captures these changes well. In the studied region, the AOD is sensitive to desert dust outbreaks in summer. This means that large-scale systems are driving the aerosol plumes; they are spatially recurrent and temporally better captured for the year being considered than for other years. For the ANG, the spatial correlation is very good, with R s = 0.91, but also persistent in time. The temporal correlation is much better for 2013 than for other years. This is probably due to a size distribution that is not necessarily well simulated from one day to another (shown by AOD and explained in Menut et al., 2016) but the relative contributions of fine and coarse aerosol atmospheric load are fairly reproduced. This feature highlights the high sensitivity of the AOD calculation to the modeled aerosol size distribution, although the overall mass emitted and transported is realistic.
Globally, the AOD and ANG reflect the model's ability to retrieve the long-range transport of long-lived aerosols, which depends on several processes (emissions, transport and deposition). These scores show that the model is able to retrieve these yearly recurrent plumes but the model size distribution of particles clearly requires improvement.

Surface concentrations
For the surface concentrations of gaseous and aerosol species, the variability is much more related to local effects. As an example, the detailed values of the statistical indicators and the differences between years are extensively presented for NO 2 .
NO 2 is both primary and secondary in origin. Mostly emitted in urbanized areas, the diurnal cycle of this species is well constrained. Depending on meteorological conditions, its lifetime may vary significantly from hours to days. Modeling this species with CTMs is challenging because several uncertainties are acting at the same time, including the spatial representativeness of the model cell. The scores show whether the sources are properly located and whether the photochemistry and transport processes have been well simulated. In general, at coarse model resolution, the model results for this species are worse than for ozone. The spatial correlation gives a score of R s = 0.88 for 2013. This corresponds to the best correlation compared to the other years. The anthropogenic emissions are strongly related to industrial activities and road traffic, and since these activity sectors are fixed in space, the good spatial correlation is more due to anthropogenic sources that vary in space, such as biogenic and vegetation fires. The temporal correlation is low for 2013, R t = 0.22, but is closer to 0 for other years and therefore significantly better for the reference year compared to the others. These two correlation values show that the model certainly captures the right location of emission sources (low variability of R s ). The nRMSE is large and shows that the concentrations are overestimated by the model. However, this overestimation appears for all years and can be due to the representativeness of the surface measurements compared to the size of model cells.
The spatial correlation is good for O 3 , NO 2 and PM 10 , with R s = 0.69, 0.88 and 0.81, respectively. For PM 2.5 , this correlation is low, with R s = 0.16. The PM 10 shows that the largest particles are well modeled over the whole domain, and this was also the conclusion for the AOD and ANG. The low score for PM 2.5 indicates that for the aerosol distribution the fine mode is not as well modeled as the coarse mode. This is confirmed by the scores of the aerosol inorganic species, ammonium, sulfate and nitrate, which contribute to a large part of the fine fraction of particles. Except for sulfate (with R s = 0.51), the spatial correlations are 0.15 for nitrate and 0.20 for ammonium. Thus, the fine part of the aerosol is not well modeled mainly due to a deficiency in the modeling of nitrates.
The temporal correlations have a completely different behavior than the spatial correlations. The values are generally low, from R t = 0.09 for nitrate to R t = 0.32 for O 3 . Surprisingly, the PM 10 concentrations display a good spatial correlation but a poor temporal correlation. This is due to the long lifetime in the atmosphere of nonreactive species such as mineral dust: plumes are correctly modeled over large areas but the day-to-day variability needs improvement. Another point is the good spatial correlation for NO 2 but its low temporal correlation with R t = 0.22. In this case, this means we have a correctly spatialized anthropogenic emission inventory (mainly for NO 2 sources), but difficulties to model the day-to-day chemistry still exist.
For the surface concentrations, we can conclude that O 3 , NO 2 and PM 10 concentrations are spatially well modeled, and this is not due to a recurrent behavior. For particles, the problem is more related to the fine mode, where PM 2.5 concentrations are not well located. This modeling problem is highlighted by the low correlations and I mv values for the inorganic species. For the temporal correlations, the scores are always lower than for the spatial correlation but also always higher for the reference year than for the other years.

Estimation of the I mv indicator for all variables
To summarize the results obtained for each statistical indicator and the values of differences between all years, we apply the I mv formulation. This enables us to have one value for   Table 4 and are also displayed on single plots in Fig. 4. In Table 4, the I mv values larger than 0.4 are highlighted. This threshold is clearly subjective but mentioned here to better highlight the variables being well modeled and with a correct variability from one year to another. As discussed in detail, the best scores are obtained for the meteorological variables and are better for the temporal variability than for the spatial variability.
In Fig. 4, the x axis represents the correlation (spatial or temporal) and the y axis represents the differences between all years D. For each studied variable, their values are reported on the figure, where the colors represent the values of I mv . The interpretation of these results follows the quality criteria presented in the academic schematic of Fig. 2. This presentation shows an important spread for the spatial correlation results. If the relative differences D range from 0 to 0.6, the correlations range from 0.09 (for the 10 m wind speed) to 0.97 (for AOD). The common point is that there is no variable with differences above 0.5. This means that, spatially, the studied problem shows systematic patterns from year to year. The low values of correlations show that some variables are systematically poorly estimated. This means that some meteorological structures (for u 10 m ) or emission sources (contributing to the PM 2.5 surface concentrations) are systematically mislocated.
The representation of temporal correlations shows a specific linear pattern. The largest correlation values are positively correlated with differences. This temporal correlation represents the day-to-day variability at each location. This means that the studied problem is based on high day-to-day variability without similar consecutive days (in this case, one would have high correlations but low differences). This illustrates the fact that the studied problem is primarily an issue of sporadic events and the model is able to correctly find this variability from one day to another.

Conclusions
At first glance, using a different year than the simulated one for the day-to-day evaluation seems awkward. However, we can learn more about the performance of chemistry-transport models than by using a single year for the usual statistical indicators. Of course, this approach will never replace a strict evaluation of a pollution case analysis using time series, vertical profiles and usual error statistics. However, it offers a Geosci. Model Dev., 10, 1199-1208, 2017 www.geosci-model-dev.net/10/1199/2017/ very fast and integrated vision of the strengths and weaknesses of a model with very little calculation. This methodology can also be deployed in intercomparison exercises.
To answer the questions presented in the introduction, for this particular model and simulated period, the following conclusions can be drawn. The model always simulates the studied year better than any other meteorological year and it is able to reproduce the day-to-day variability for high concentrations of pollutants.
The spatial correlation is good for 2 m temperature and precipitation rate but not for wind speed: this highlights the fact that the modeled domain is large and the resolution is not optimized for small-scale processes. The spatial correlation is also very good for the long-range transport of particles, as demonstrated with R s = 0.97 and 0.90 for AOD and ANG. However, since this feature occurs every year, this leads to low I mv values. This means that, for a large domain, the main spatial patterns of particle concentrations are recurrent and well modeled. The chemical species that are best modeled are either species with a long atmospheric lifetime (PM 10 ) or species spatially well constrained on the domain (such as NO 2 , mainly due to anthropogenic emissions). For particles, the results depend on the size distribution: the coarse particles are better simulated than the fine ones.
The conclusions are different for the temporal correlation. The scores are calculated using daily observations and modeled outputs. Thus, these scores reflect the ability of the model to retrieve the day-to-day variability. As for the spatial correlation, scores are good for the meteorological variables. For the aerosol, and mainly for the long-lived species (such as mineral dust), the temporal correlation is also correct as the I mv values: I mv = 0.33 and 0.49 for AOD and ANG, respectively. However, for the short-lived species, the temporal correlation and the I mv values are low. This means that improvements are required in priority for the day-to-day variability compared to the locations of emissions. This may probably be due to the atmospheric transport, with the spatial variability of 10 m wind speed being poorly simulated. However, overall, the temporal correlation is better for the studied year than for the others, showing that the problem is highly variable from year to year, but the model is able to capture the evolution of atmospheric composition.
Code and data availability. This study presents a methodology using existing data and models; all required information is already included in this article.
Competing interests. The authors declare that they have no conflict of interest.