Spatiotemporal evaluation of EMEP4UK-WRF v4.3 atmospheric chemistry transport simulations of health-related metrics for NO 2 , O 3 , PM 10 , and PM 2.5 for 2001–2010

This study was motivated by the use in air pollution epidemiology and health burden assessment of data simulated at 5 km × 5 km horizontal resolution by the EMEP4UK-WRF v4.3 atmospheric chemistry transport model. Thus the focus of the model–measurement comparison statistics presented here was on the health-relevant metrics of annual and daily means of NO 2 , O 3 , PM 2.5 , and PM 10 (daily maximum 8 h running mean for O 3 ). The comparison was temporally and spatially comprehensive, covering a 10-year period (2 years for PM 2.5 ) and all non-roadside measurement data from the UK national reference monitor network, which applies consistent operational and QA/QC

Whilst policies and legislation have been put in place to limit and mitigate the impacts of air pollution (Heal et al., 2012), there is increasing recognition that more effective protection of human health may be achieved by not focusing on individual pollutants, but by taking a multi-pollutant approach (Dominici et al., 2010). Compared with the traditional single pollutant focus (WHO, 2006), an approach based on pollution mixtures has the advantage of enabling the complexity of exposures and health effects to be characterized more fully: it can help identify harmful emission sources, and it has the potential to provide a more effective framework for air-quality regulation, for example by focusing on sources and pathways that influence several pollutants at once. There are analytical complexities in assessing the potential interactions between combinations of pollutants (Kim et al., 2007; Mauderly and Samet, 2009), including the paucity of measured exposure data, which are typically derived from relatively sparse monitoring sites that may measure different combinations of pollutants at different locations. Furthermore, monitor networks are usually established for compliance with legislation (e.g. deliberately sited close to, or away from, pollution sources), and so may lack representativeness for characterizing population exposure (Duyzer et al., 2015), leading to bias in air pollution epidemiology (Sheppard et al., 2012).
Modelling can increase the availability of air pollution data (Jerrett et al., 2005). The current gold standard for air-quality modelling is the process-based, deterministic atmospheric chemistry model (Colette et al., 2014). Such models seek to simulate the multitude of complex factors that govern the spatial and temporal variability in air pollutant concentrations, including the distributions of different emissions sources, local and long-range dispersion processes, in situ photochemistry, and dry and wet deposition processes.
As part of a multi-institution project on the health impacts of exposure to multiple pollutants, we have derived UK-wide distributions of surface air pollution at hourly temporal resolution over multiple years (2001–2010), at 5 km × 5 km horizontal resolution, using the EMEP4UK-WRF atmospheric chemistry transport model (ACTM) (Butland et al., 2016). This represents a unique dataset of ACTM simulations at this spatial and temporal resolution over this geographical coverage and time duration. The EMEP4UK-WRF model (Vieno et al., 2010, 2014) is a regional application of the European Monitoring and Evaluation Programme (EMEP) MSC-W model. The EMEP model framework has been evaluated and used for many years in scientific support (Fagerli et al., 2015), for example in the evaluation of emissions regulations within the UNECE framework (e.g. the Gothenburg Protocol) and the European Commission's Clean Air for Europe (CAFE) programme (http://www.emep.int/).
The high temporal and spatial resolution output from the EMEP4UK-WRF model has many advantages for air pollution studies, including (i) provision of data at times and locations where monitoring data are not available; this has the dual benefit of increasing effective sample size in multipollutant health epidemiology and of reducing reliance on the assumption that a single monitor is representative of species concentrations over a large area; (ii) provision of data on individual particle chemical components in addition to the aggregated mass concentration of PM that is measured; (iii) the facility to explore many related aspects such as geographical or demographic differences in exposures to air pollutant mixtures (and related issues of environmental justice); and (iv) the facility to assess the impacts of potential future emissions scenarios.
It is important to have an understanding of the performance capabilities of any model, relevant to the use to which the model output is to be put. Much has been written on air quality model evaluation (see, for example, Vautard et al., 2007; Dennis et al., 2010; Derwent et al., 2010; Rao et al., 2011; Thunis et al., 2012, 2013; Pernigotti et al., 2013), including publications arising out of international collaborative programmes such as AQMEII (Air quality modelling evaluation international initiative, http://aqmeii-eu.wikidot.com) and FAIRMODE (Forum for air quality modelling in Europe, http://fairmode.jrc.ec.europa.eu). The literature ranges from discussion of epistemological categories of evaluation to development of specific metrics and criteria for comparison between modelled and measured concentrations. Detail is not repeated here, other than to note that there are fundamental limitations to agreement between model and measurements, which include: uncertainties intrinsic to the measurements; limitations in model input data (e.g. emissions) and in other aspects of model descriptions of physical processes; and that models simulate a volume-average concentration whilst monitors measure at a specific location.
The objective of this paper is to record a detailed assessment of the modelled surface concentrations of O 3 , NO 2 , PM 2.5 , and PM 10 using metrics of these pollutants relevant to air pollution epidemiology and health burden assessment, namely the daily (i.e. 24 h) mean for PM and NO 2 and the maximum daily 8 h running mean for O 3 . The measurements are taken from the UK's Automatic Urban and Rural Network (AURN) of "real-time" reference monitors. The key emphasis in this work is comprehensiveness and consistency: the model-measurement evaluation is UK wide, over an extended time period (10 years), and based on measurements subject to a single set of operational and QA/QC procedures for each pollutant. Two important statistics for evaluation of air quality for health studies, correlation and bias (see Discussion), together with root mean square error, were evaluated by type of monitor location, year, month, and day-of-week.

Model data
The EMEP MSC-W regional Eulerian ACTM is described in Simpson et al. (2012) and at http://www.emep.int/. The EMEP4UK model providing data in this work (Vieno et al., 2014 Vieno et al. (2014). Both WRF and EMEP4UK models use 20 vertical layers, with terrain following coordinates, and resolution increasing towards the surface (centre of the surface layer ∼ 45 m). The vertical column extends up to 100 hPa (∼ 16 km). The boundary conditions for the inner domain were taken from 3-hourly output from the European domain in a one-way nested set-up, whilst for the European domain they were measurement derived and adjusted monthly (Vieno et al., 2010). Ground-level modelled species concentrations were calculated hourly at 3 m above the surface vegetation or other canopies by making use of the constant-flux assumption and definition of aerodynamic resistance .
Anthropogenic emissions of NO x , NH 3 , SO 2 , primary PM 2.5 , primary PM coarse (where PM coarse is the difference between PM 10 and PM 2.5 ), CO, and non-methane VOC for the UK for each modelled year were taken from the National Atmospheric Emission Inventory (NAEI, http://naei.defra.gov.uk) at 1 km 2 resolution and aggregated to 5 km × 5 km resolution. For the outer domain, the model used the EMEP 50 km × 50 km resolution emission estimates provided by the Centre for Emission Inventories and Projections (CEIP, http://www.ceip.at/). The annual total emissions were temporally split using prescribed monthly, day-of-week, and diurnal hourly emissions factors (the latter differing between weekdays, Saturdays, and Sundays) for each pollutant and for each of the SNAP (Selected Nomenclature for Sources of Air Pollution) sectors. Methane concentration was prescribed. Emissions estimates for international shipping were those from ENTEC UK Ltd. (now Amec Foster Wheeler) (ENTEC UK Limited, 2010). Daily emissions from biomass burning were derived from the Fire INventory from NCAR version 1.0 (FINNv1) (Wiedinmyer et al., 2011). Natural emissions of isoprene, monoterpenes, dimethylsulfide (DMS), wind-induced sea salt, and NO x from soils and lightning were as described in . Natural emissions of dust included Saharan dust uplift, but not wind-blown dust within the model domain.
The default EMEP MSC-W photochemical scheme was used, which contains 72 gas-phase species and 137 reactions; the gas/aerosol partitioning formulation was the Model for an Aerosol Reacting System (MARS) (Binkowski and Shankar, 1995). Simulation of secondary organic aerosol (SOA) formation, ageing, and partitioning was via the 1-D volatility basis set (Donahue et al., 2006) with its implementation in the model as described by Bergström et al. (2012). The EMEP4UK model output for PM 2.5 comprised the sum of the PM 2.5 fractions of elemental carbon (EC), "other" primary PM in the emissions inventories (encompasses material such as flyash, and brake and tyre wear), sea salt, mineral dust, primary and secondary organic matter (OM), ammonium (NH + 4 ), sulfate (SO 2− 4 ), and nitrate (NO − 3 ). PM 10 is the sum of PM 2.5 plus the PM coarse fractions of EC, "other" primary PM (as above), sea salt, dust, OM, and NO − 3 . The split of NO − 3 into PM coarse and PM 2.5 uses a parameterized approach dependent on relative humidity, as described by Simpson et al. (2012). It is acknowledged that this split is somewhat uncertain, as discussed in Vieno et al. (2014). Despite the comprehensiveness of PM composition simulation, some known contributions are missing, in particular windblown dust. Traffic-induced road dust resuspension is likely underestimated. Also, as described in the next section, different measurement techniques and conditions incorporate different proportions of the ambient PM water content. Because of uncertainty in what measurements measure, and variability in measurement techniques employed through the time period of interest, we chose to use as model output the dry mass of PM. This contributes some unquantifiable variable negative model bias for PM 2.5 and PM 10 .
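The composition sums described above can be illustrated with a minimal sketch. The component labels and the concentrations (in µg m −3 ) are hypothetical values chosen only for illustration; they are not model output, and the dictionary layout is an assumption rather than the EMEP4UK output format.

```python
# Hypothetical component labels; concentrations are illustrative dry masses in ug m-3.
FINE_COMPONENTS = ("EC", "other_primary", "sea_salt", "dust", "OM", "NH4", "SO4", "NO3")
COARSE_COMPONENTS = ("EC", "other_primary", "sea_salt", "dust", "OM", "NO3")


def pm25_dry_mass(fine):
    """PM2.5 as the sum of the fine fractions of all listed components."""
    return sum(fine[name] for name in FINE_COMPONENTS)


def pm10_dry_mass(fine, coarse):
    """PM10 as PM2.5 plus the coarse fractions (no coarse NH4 or SO4 term)."""
    return pm25_dry_mass(fine) + sum(coarse[name] for name in COARSE_COMPONENTS)


fine = {"EC": 0.5, "other_primary": 1.0, "sea_salt": 0.5, "dust": 0.25,
        "OM": 2.0, "NH4": 1.0, "SO4": 1.5, "NO3": 2.25}
coarse = {"EC": 0.0, "other_primary": 0.5, "sea_salt": 2.0, "dust": 0.5,
          "OM": 0.25, "NO3": 0.75}

pm25 = pm25_dry_mass(fine)
pm10 = pm10_dry_mass(fine, coarse)
```

Because water is excluded from both sums, any particle-bound water retained by a measurement technique appears as a negative model bias in this kind of comparison.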

Measurement data
Hourly measurements of the concentrations of NO 2 , O 3 , PM 10 , and PM 2.5 at the AURN stations during 2001-2010 were downloaded and processed using R package "openair" (Carslaw and Ropkins, 2012) from the R workspaces provided and updated daily by Ricardo-AEA. Because of the emphasis in this study on data for health-related applications, the model-measurement comparisons were principally based on the daily pollutant metrics recommended by the World Health Organisation (WHO, 2006), i.e. daily mean concentrations for NO 2 , PM 2.5 , and PM 10 (NO 2 _daymean, PM 2.5 _daymean, and PM 10 _daymean), and daily maximum running 8 h mean for O 3 (O 3 _max8hmean).
A data capture threshold of 75 % was applied throughout the process of calculating statistics from the hourly measurements, as is standard protocol for EU data reporting (http://acm.eionet.europa.eu/databases/airbase/aggregation_statistics.html). For example, daily mean concentrations of NO 2 , PM 2.5 , and PM 10 were only calculated when there were at least 18 hourly measurements in a day. For O 3 , there had to be at least six hourly measurements in any 8 h window for an 8 h rolling mean to be calculated, and at least 18 8 h rolling means for a daily maximum 8 h mean to be valid.
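The thresholds above can be sketched in a few lines. This is an illustration under stated assumptions, not the processing code used in this work (which relied on the R package "openair"): missing hours are represented as NaN, and each 8 h running mean is assumed to be assigned to the hour at which its window ends, so the input to the O 3 function carries the last 7 h of the previous day followed by the 24 h of the day itself. All function names are hypothetical.

```python
from math import isnan, nan


def mean_if_captured(values, min_valid):
    """Mean of the non-missing values, or NaN if fewer than min_valid are present."""
    valid = [v for v in values if not isnan(v)]
    return sum(valid) / len(valid) if len(valid) >= min_valid else nan


def daily_mean(hourly):
    """Daily mean from 24 hourly values; requires at least 18 valid hours (75 %)."""
    return mean_if_captured(hourly, 18)


def daily_max_8h_mean(prev7_plus_day):
    """Daily maximum 8 h running mean.

    Input: 31 hourly values (7 h from the previous day, then 24 h of the day).
    Each running mean needs at least 6 valid hours out of 8, and the day needs
    at least 18 valid running means out of 24.
    """
    if len(prev7_plus_day) != 31:
        raise ValueError("expected 7 + 24 hourly values")
    running = [mean_if_captured(prev7_plus_day[h:h + 8], 6) for h in range(24)]
    valid = [m for m in running if not isnan(m)]
    return max(valid) if len(valid) >= 18 else nan
```

For example, `daily_mean([10.0] * 18 + [nan] * 6)` returns a value, whereas the same call with only 17 valid hours returns NaN.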
Comparison with model output was only undertaken for AURN sites with a ≥ 75 % data capture rate over the whole 10-year period. This means that at least 2739 out of 3652 pairs of daily measured and modelled values were required for inclusion. For PM 2.5 , there were only four sites meeting the 75 % data capture requirement over the 10 years, so comparisons for PM 2.5 were restricted to the period 2009-2010.
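The figures quoted above follow directly from the calendar, as the short check below shows. The helper `site_included` is a hypothetical name introduced only for this illustration.

```python
from datetime import date
from math import ceil

# Number of days from 1 January 2001 to 31 December 2010 inclusive.
n_days = (date(2011, 1, 1) - date(2001, 1, 1)).days

# Minimum number of valid daily model-measurement pairs at 75 % data capture.
min_pairs = ceil(0.75 * n_days)


def site_included(n_valid_pairs, n_possible=n_days, threshold=0.75):
    """True if a site meets the data capture requirement over the period."""
    return n_valid_pairs >= threshold * n_possible
```

The 10-year period contains two leap years (2004 and 2008), which gives 3652 days and hence a minimum of 2739 pairs.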
AURN monitoring sites are classified according to their general location and proximity to particular sources of air pollution (https://uk-air.defra.gov.uk/networks/site-types). Sites classified as suburban background (only one or two sites per pollutant), suburban industrial (one site), and urban industrial (four sites or fewer depending on pollutant) were excluded from the model-measurement comparison as being insufficient in number to provide meaningful comparison for these site classifications. Model-measurement comparison therefore focused on potential differences between rural background (RB) and urban background (UB). The numbers of each type of AURN site contributing data to this model-measurement comparison are summarized in Table 1. The names, coordinates, classifications, and pollutant data captures of all sites supplying data for this work are given in Supplement Table S1. Measurements at urban traffic sites were not included in the comparisons reported in the main paper because these are deliberately located close to strong sources of NO x and PM and not at all representative of air in the wider area simulated in a model grid.
The coordinates of each AURN station with valid measurements during the period 2001-2010 were used to locate the 5 km × 5 km grid of the EMEP4UK domain whose centroid was closest to the station. The WRF-modelled hourly 2 m surface temperature data at each AURN site were also extracted and converted to daily means.
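A minimal sketch of the nearest-centroid matching is given below. It assumes that station and grid-cell centroid coordinates are available in the same projected coordinate system in metres, which is an assumption made for illustration; the grid shown is a hypothetical 3 × 3 block of 5 km cells, not the EMEP4UK domain.

```python
def nearest_cell(station_xy, centroids):
    """Index of the grid-cell centroid closest to the station (Euclidean distance)."""
    sx, sy = station_xy
    return min(
        range(len(centroids)),
        key=lambda k: (centroids[k][0] - sx) ** 2 + (centroids[k][1] - sy) ** 2,
    )


# Hypothetical 3 x 3 block of 5 km x 5 km cells with its origin at (0, 0) m.
centroids = [(2500.0 + 5000.0 * i, 2500.0 + 5000.0 * j)
             for j in range(3) for i in range(3)]

cell = nearest_cell((6000.0, 11000.0), centroids)
```

A brute-force search is adequate for a few tens of stations; for a regular grid the same result could be obtained by integer division of the coordinates by the cell size.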
Measurements from the UK AURN adhere to EU Directives on reference instrumentation and QA/QC procedures. Concentrations of NO 2 and O 3 are derived from chemiluminescence and UV-absorption analysers, respectively. The "real time" measurement of PM mass concentrations is technically more challenging than for O 3 and NO 2 , and the instrumentation used in the UK varied during the 2001-2010 period. After about 2008, the majority of measurements of PM 10 and PM 2.5 have been made by TEOM-FDMS (Tapered Element Oscillating Microbalance Filter Dynamics Measurement System) which has been demonstrated as equivalent to the EU reference method (Harrison, 2010). The TEOM-FDMS system records a value for both "volatile" and "nonvolatile" PM and it is the sum of these values that is used in this work. All the 2009-2010 PM 2.5 measurement data in this study are derived from TEOM-FDMS instruments. However, for PM 10 , prior to the introduction of the auxiliary FDMS unit, measurements were derived using the TEOM instrument alone. The inlet and element of these instruments were held at 50 °C to limit condensation of water, but this caused loss of some volatile components of PM 10 . All TEOM values were therefore multiplied by 1.3 before archiving to provide an estimate of the average loss of volatile components, as recommended by the EC Working Group on Particulate Matter (EC, 2001). PM 10 values from the few TEOM-only instruments remaining in the AURN after the general introduction of FDMS units in 2008 have been scaled using the more sophisticated Volatile Correction Model (Green et al., 2009), rather than the single 1.3 scaling factor, to account for the loss of volatile components. PM 10 data from the few Beta-Attenuation Monitor (BAM) instruments present in the AURN have been scaled by 1.3 if they had a heated inlet and 0.83 if they did not have a heated inlet.
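The fixed scaling factors described above can be summarized in a short sketch. The function names and instrument labels are hypothetical, the scaling is applied before archiving by the network rather than by the authors, and the Volatile Correction Model is deliberately not reproduced here.

```python
def scale_pm10(raw, instrument, heated_inlet=True):
    """Apply the fixed scaling factors quoted in the text to a raw PM10 value.

    "TEOM": multiplied by 1.3 (pre-FDMS instruments).
    "BAM":  multiplied by 1.3 with a heated inlet, 0.83 without.
    """
    if instrument == "TEOM":
        return raw * 1.3
    if instrument == "BAM":
        return raw * (1.3 if heated_inlet else 0.83)
    raise ValueError(f"no fixed scaling factor for instrument {instrument!r}")


def teom_fdms_total(nonvolatile, volatile):
    """TEOM-FDMS mass concentration as the sum of the two recorded fractions."""
    return nonvolatile + volatile
```

A single factor of this kind is an average in time and space, which is one reason why the scaled PM 10 series carries more uncertainty than the NO 2 and O 3 measurements.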
The objective of all these external scaling processes for these PM measurements has been to provide the best practical measure of "reference equivalent" PM 10 (and PM 2.5 ) mass concentrations spatially and temporally across the AURN. Nevertheless, these instrumental issues introduce considerable additional uncertainty to the PM measurement data: first, scaling factors, where applied, are an average scaling in time and space, whereas the real scaling that would have been required would have varied between sites and for different times at an individual site; secondly, there may be a discontinuity in the PM 10 time series associated with instrument change at a particular site, and dates of instrument change varied across the network. Uncertainty in measurement-model comparison is also introduced by the use of dry mass PM as the model output.
Irrespective of these changes to PM 10 instrumentation, all PM, NO 2 , and O 3 instruments in the AURN are maintained and calibrated in accordance with the QA/QC protocol for the UK ambient air quality monitoring network (http://uk-air.defra.gov.uk/networks/network-info?view=aurn), and all data are subject to the network data review and ratification process before "ratified" archiving.

Evaluation of spatial aspects of model performance
The coherence between long-term spatial patterns of modelled and measured concentrations was investigated through the correlation across sites of the 10-year (2-year for PM 2.5 ) means of the daily pollutant metrics at each site.

Evaluation of temporal aspects of model performance
The daily pollutant metrics were grouped by day-of-week, month-of-year, and year of the 10-year period. Statistics were then calculated on the grouped pairs of daily model simulations and measurements for each pollutant at each site, and summarized by site type. Of the various statistics proposed for quantifying the performance of air-quality models, correlation, bias, and RMSE are consistently cited for evaluation against policy-relevant metrics of pollutant concentration (USEPA, 2007;Derwent et al., 2010;Thunis et al., 2012). The first two statistics in particular are important for application to health studies (see the Discussion).
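The grouping step can be sketched as follows for a single site. The record layout (date, modelled value, measured value), the function names, and the three example records are assumptions made for illustration; mean bias is used here only as an example of a statistic computed per group.

```python
from collections import defaultdict
from datetime import date

KEY_FUNCS = {
    "year": lambda d: d.year,
    "month": lambda d: d.month,
    "weekday": lambda d: d.weekday(),  # 0 = Monday ... 6 = Sunday
}


def group_pairs(records, period):
    """Group (date, modelled, measured) records by year, month, or weekday."""
    groups = defaultdict(list)
    for day, modelled, measured in records:
        groups[KEY_FUNCS[period](day)].append((modelled, measured))
    return dict(groups)


def mean_bias(pairs):
    """Mean of (modelled - measured) over the pairs in one group."""
    return sum(m - o for m, o in pairs) / len(pairs)


# Illustrative daily values only (ug m-3).
records = [
    (date(2001, 1, 1), 10.0, 12.0),
    (date(2001, 1, 2), 14.0, 12.0),
    (date(2002, 1, 1), 8.0, 10.0),
]
bias_by_year = {k: mean_bias(v) for k, v in group_pairs(records, "year").items()}
bias_by_weekday = {k: mean_bias(v) for k, v in group_pairs(records, "weekday").items()}
```

Repeating this per site and then summarizing the per-site statistics by site type gives the kind of distribution that a box plot can display.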
In each of the following, the index i runs over the n pairs of model (M i ) and observation (O i ) concentrations per time series at each site. The term "observation" is used, in this section only, synonymously with the term "measurement" used elsewhere in this paper, to avoid the ambiguity of an M label for model and for measurement.
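Pearson's correlation coefficient:
$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{M_i - \bar{M}}{s_M} \right) \left( \frac{O_i - \bar{O}}{s_O} \right), $$
where $\bar{M}$ and $\bar{O}$ are the means of the modelled and observed concentrations, respectively, and $s_M$ and $s_O$ are their respective sample standard deviations.
Mean bias (MB), normalized mean bias (NMB), and root mean square error (RMSE), in their standard forms:
$$ \mathrm{MB} = \frac{1}{n} \sum_{i=1}^{n} (M_i - O_i), \qquad \mathrm{NMB} = \frac{\sum_{i=1}^{n} (M_i - O_i)}{\sum_{i=1}^{n} O_i}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (M_i - O_i)^2}. $$
The FAC2 statistic, the proportion of all pairs of modelled and observed concentrations that are within a factor of 2 of each other, was also calculated. This statistic provides an additional general indication of overall model skill.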
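A minimal sketch of these statistics for one paired time series is given below. It assumes the standard definitions: Pearson's r with sample standard deviations, MB as the mean of M i − O i , NMB as the sum of M i − O i divided by the sum of O i , RMSE as the square root of the mean squared difference, and FAC2 as the fraction of pairs with 0.5 ≤ M i /O i ≤ 2. The function name and return layout are hypothetical.

```python
from math import sqrt


def evaluation_stats(model, obs):
    """Return r, MB, NMB, RMSE, and FAC2 for paired modelled and observed values."""
    n = len(model)
    if n != len(obs) or n < 2:
        raise ValueError("need at least two paired values")
    m_bar = sum(model) / n
    o_bar = sum(obs) / n
    # Sample standard deviations (n - 1 in the denominator).
    s_m = sqrt(sum((m - m_bar) ** 2 for m in model) / (n - 1))
    s_o = sqrt(sum((o - o_bar) ** 2 for o in obs) / (n - 1))
    diffs = [m - o for m, o in zip(model, obs)]
    r = sum((m - m_bar) * (o - o_bar) for m, o in zip(model, obs)) / ((n - 1) * s_m * s_o)
    # Written without division so that a zero observation does not raise an error.
    within_factor_2 = sum(1 for m, o in zip(model, obs) if o > 0 and 0.5 * o <= m <= 2.0 * o)
    return {
        "r": r,
        "MB": sum(diffs) / n,
        "NMB": sum(diffs) / sum(obs),
        "RMSE": sqrt(sum(d ** 2 for d in diffs) / n),
        "FAC2": within_factor_2 / n,
    }


stats = evaluation_stats([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

In this illustrative case the model is exactly twice the observation at every point, so the correlation is perfect and FAC2 is 1 even though NMB is 1 (a 100 % overestimate), which shows why correlation and bias need to be reported together.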

Evaluation of spatial aspects of model-measurement statistics
Scatter plots of the individual-site model versus measurement 10-year means of NO 2 _daymean, O 3 _max8hmean, PM 10 _daymean, and 2-year means for PM 2.5 _daymean, by site type, are shown in Fig. 1 and illustrate the extent of model-measurement spatial correlation across the UK. The data in these plots are additionally categorized according to the latitude of the monitor site. The numerical values of model-measurement correlation, FAC2, NMB, MB, and RMSE associated with each plot in Fig. 1 are presented in Table 1. The correlation between the normalized bias and the latitude across all sites in a given panel of Fig. 1 is given in Table 2. This table also presents the correlation between normalized bias and modelled 10-year mean temperature by site  type and pollutant. The equivalent of Fig. 1 with data categorized by mean temperature is shown in Supplement Fig. S1.

NO 2
Figure 1a shows excellent model-measurement agreement in 10-year mean NO 2 across RB sites (spatial correlation coefficient of 0.98, regression slope and intercept of 1.10 and 0.0045 µg m −3 , n = 7). This is further emphasized by the low bias for 10-year mean NO 2 at these seven RB sites: MB = 0.7 µg m −3 , NMB = 0.06; and low scatter: RMSE = 1.05 µg m −3 , FAC2 = 1.00 (Table 1). Spatial correlation between modelled and measured 10-year mean NO 2 was also high at UB sites (r = 0.68, n = 37) (Fig. 1a), although modelled NO 2 concentrations were, on average, lower than measured concentrations at urban sites (MB = −9.5 µg m −3 , NMB = −0.31, FAC2 = 0.84, RMSE = 11.9 µg m −3 ) ( Table 1). The negative model bias at urban sites can be attributed to either or both underestimation of NO x emissions and the instantaneous dilution of NO x emissions into a 5 km × 5 km model grid cell irrespective of where the monitor is positioned with respect to emissions of NO x in reality. If air at the urban monitor is more influenced by NO x emissions than represented by the model grid average, then the model value will underestimate the contributions at the monitor from both primary emitted NO 2 and secondary NO 2 formed by reaction between primary NO and O 3 . This model grid dilution effect will be more pronounced the closer the monitor is sited to strong sources of NO x .
For urban sites, model-measurement agreement was generally better at lower latitude sites, i.e. for sites in the south of the UK compared with sites in the north (Fig. 1a). The slight increase in model negative bias for NO 2 in the north does not appear to be related to the absolute concentration of NO 2 since the differential is similar across a range of NO 2 concentrations at sites in the south and north. Normalized bias was significantly positively correlated with temperature (Table 2, Fig. S1b), i.e. less negative at higher temperature, which is consistent with the smaller negative bias for the southern UK, since average temperature decreases with increasing latitude in the UK.
Table 2. Correlation of the normalized bias between model and measurement 10-year means of pollutant daily metrics (2-year mean for PM 2.5 ) at a site with the latitude or with the 10-year mean modelled temperature at that site. Correlations significant at p<0.05 are highlighted in bold. RB, rural background; UB, urban background. No data for PM 2.5 (RB) since only n = 2 sites. Also presented in the penultimate column is the maximum bias between model and measurement that forms the model quality objective (MQO) for long-term averaged values of the pollutant concentration at the given site type, as calculated using the formulae published in FAIRMODE project WG 1 documents at http://fairmode.jrc.ec.europa.eu (as at March 2017) and allowing for variable measurement uncertainty using the measured values from this study. These values define the positions of the green lines in each panel of Fig. 1. The final column gives the number (and %) of sites that satisfy this MQO, i.e. which lie within the green lines of Fig. 1.

O 3
Figure 1b shows that the modelled 10-year mean of daily max 8 h mean O 3 concentration was greater than measured at all sites except the coastal RB site at Weybourne. As for NO 2 , the model-measurement statistics for the 10-year mean O 3 at RB sites were very good (NMB = 0.08, MB = 5.8 µg m −3 , FAC2 = 1.00, RMSE = 8.7 µg m −3 , n = 17) and better than at the UB sites (NMB = 0.27, MB = 15.1 µg m −3 , FAC2 = 1.00, RMSE = 15.9 µg m −3 , n = 30) (Table 1). The positive model bias for O 3 at UB sites is presumably driven by the same issue as the negative model bias for NO 2 at the UB sites: the dilution of model NO x emissions in urban areas into the 5 km × 5 km model grid means that the model insufficiently simulates the reactive removal of O 3 by NO close to the urban monitor. The lack of model-measurement spatial correlation in 10-year mean O 3 concentration across all RB sites (r = 0.21, p = 0.428, n = 17) (Fig. 1b) is driven solely by the outlying model-measurement comparison at the Weybourne site, the cause of which is unknown.
When this site is excluded, there is highly significant spatial correlation between model and measurement across all remaining RB sites (r = 0.81, p<0.001, n = 16) (Table 1). There was also highly significant spatial correlation between modelled and measured O 3 concentration at UB sites (r = 0.73, p<0.001, n = 30) (Fig. 1b, Table 1), although the lower than unity gradient indicates a trend for a less positive bias at higher O 3 concentrations. This is again a reflection of the NO + O 3 reaction: higher O 3 at a UB monitor is likely because the monitor is sited further from immediate sources of primary NO and so is less susceptible to the localized (sub-model-grid) effect. Normalized bias in 10-year mean O 3 was not correlated with latitude or long-term temperature at either RB or UB sites (Table 2, Fig. 1b, and Supplement Fig. S1b).
In general there were no strong associations between model-measurement bias for 10-year mean PM 10 and latitude, although there was a statistically significant tendency for smaller bias at UB sites at higher latitude (r = −0.48, p = 0.031) (Fig. 1c, Table 2) and, correspondingly, a tendency for smaller bias in cooler areas (r = 0.40, p = 0.078) (Supplement Fig. S1c, Table 2). Figure 1d shows that all 2-year mean modelled PM 2.5 concentrations were within a factor of 2 of the corresponding site measurements, but that at nearly all sites the model yielded lower PM 2.5 concentrations than were measured. (Even for the shorter time period used for PM 2.5 comparisons there were only two RB sites with PM 2.5 monitors, so no further comment is made on these data.) Although the mean bias at UB sites was negative (NMB = −0.27, MB = −3.5 µg m −3 , FAC2 = 1.00, n = 28) (Table 1), there was a trend for model underestimation to be greater at sites with higher PM 2.5 concentrations (Fig. 1d). This trend is likely for the same reason as given above: that the regional model cannot fully capture the localization of urban emissions. The lower biases in model simulations of PM 10 compared with PM 2.5 are, at least in part, due to a positive model bias in the simulation of the sea salt component of PM coarse , which is an important component of background PM coarse in the UK (AQEG, 2005). In contrast to the other sites, there was a positive model bias at the RB site at Auchencorth Moss in Scotland. However, the long-term average concentration of PM 2.5 at this site is very low (∼ 5 µg m −3 ) and only about half the next lowest measured PM 2.5 concentration. Accurate measurement of these very low concentrations of PM 2.5 is a considerable challenge (AQEG, 2012). Model-measurement spatial correlation of PM 2.5 across UB sites was moderate but statistically significant (r = 0.58, p = 0.001, n = 28).
As with PM 10 , there was no strong association between model bias for PM 2.5 and geographical location (Table 2, Fig. 1d, and Supplement Fig. S1d), although there was a tendency for smaller bias with higher latitude (r = −0.28, p = 0.141) and in cooler areas (r = 0.43, p = 0.022). This may indicate a negative bias in simulating secondary PM components that have smaller concentrations in the north of the UK compared with the south, which is more influenced by transport of these components and of their precursors from continental Europe (Vieno et al., 2014).

Evaluation of temporal aspects of model-measurement statistics

Statistics for daily metrics across the full simulation period
The temporal variability in daily NO 2 and O 3 over the 10 years was well captured by the model at both RB and UB sites. The median (25th percentile, 75th percentile, no. of sites) model-measurement correlation coefficients for NO 2 _daymean across RB and UB sites were 0.75 (0.73, 0.78, n = 7) and 0.70 (0.63, 0.77, n = 37), respectively, whilst for O 3 _max8hmean they were 0.73 (0.72, 0.76, n = 17) and 0.76 (0.74, 0.78, n = 30), respectively. Model-measurement NMB for NO 2 and O 3 at RB sites was also small. The median (25th percentile, 75th percentile) NMB across RB sites for the 10 years of NO 2 _daymean and O 3 _max8hmean was 0.08 (0.02, 0.12) and 0.11 (0.08, 0.12), respectively. The corresponding NMB data across UB sites were larger, −0.29 (−0.40, −0.12) and 0.26 (0.18, 0.32) for NO 2 _daymean and O 3 _max8hmean, respectively, with the explanations for the negative and positive bias values for NO 2 and O 3 , respectively, at urban locations as described above. Table 3 shows that the agreement between modelled and measured temporal variability in daily PM 2.5 over the 2 years of available data was also reasonable. The median (25th percentile, 75th percentile, no. of sites) model-measurement temporal correlation coefficients for PM 2.5 _daymean across RB and UB sites were 0.65 (0.64, 0.65, n = 2) and 0.69 (0.67, 0.73, n = 28), respectively. The correlations for PM 10 _daymean were poorer, with corresponding data for correlation coefficients across RB and UB sites for the 10 years of available data of 0.47 (0.46, 0.48, n = 4) and 0.50 (0.45, 0.55, n = 20). However, although temporal correlation was acceptable for PM 2.5 _daymean there was substantial bias, with median (25th percentile, 75th percentile) NMB values at RB and UB sites of 0.38 (0.18, 0.59) and −0.26 (−0.33, −0.22), respectively (but note that only two sites featured in the RB comparison).
Figure 2 shows box-whisker plots summarizing the individual site model-measurement r, FAC2, NMB, and RMSE statistics for daily mean NO 2 , with the daily data grouped by year, by month, and by day-of-week. All box plots indicate substantial inter-site variability in model-measurement statistics, but also differences in these statistics between site type and, in some instances, between the individual blocks of time over which the data are averaged.

NO 2 _daymean grouped by different periods of time
By year. Figure 2a shows there were no long-term trends in the model-measurement correlations of daily mean NO 2 across the years, for rural or for urban sites. At RB sites, a high fraction of modelled daily mean NO 2 was within a factor of 2 of the measurements, without an inter-annual trend (10-year mean of the median FAC2 each year = 0.85) (Fig. 2b). There was some inter-year variation in the model-measurement NMB at RB sites which, although near zero on average for years 2001-2003 and 2007-2010 (mean of median NMB = 0.03), was positive in years 2004-2006 (mean of median NMB = 0.18) (Fig. 2c). The model accuracy at urban sites showed a slight trend to lower FAC2 (Fig. 2b) and greater negative NMB (Fig. 2c) in years 2008-2010. The larger model-measurement bias in the latter, whilst similar values of correlation are retained, is potentially indicative of shortcomings in emissions totals in these latter years of the study. Data for RMSE (Fig. 2d) suggest slightly greater imprecision in these latter years also. RMSE was consistently greater at UB sites than at RB sites.
By month. The model-measurement statistics for daily mean NO 2 exhibited some seasonal variability (Fig. 2e-h). Figure 2e shows that there was a similar small seasonal variation in model-measurement correlation at both site types, with higher correlation coefficients on average in autumn and winter, and lower correlation coefficients in spring and summer. Correlation was fairly similar across RB and UB sites. The RMSE values were smallest in spring and summer when correlation was lower (Fig. 2h) and largest in winter months when correlation was greatest. Model bias was smallest at RB sites, and whilst FAC2 at RB sites was fairly constant between months (Fig. 2f), the median NMB at RB sites varied between −0.07 in March and 0.21 in October (Fig. 2g). In contrast, in urban areas, model-measurement difference was least in winter months, December-January-February (mean of median FAC2 = 0.72, mean of median NMB = −0.28, for UB sites), and largest in late spring and early summer (mean of median FAC2 = 0.67, mean of median NMB = −0.33, over May, June, and July for UB sites) (Fig. 2f and g).
These seasonal variations may have a variety of causes. In terms of chemical and meteorological effects, the NO + O 3 titration effect already described will be greater in summer than in winter, and the model grid dilution effect will be exacerbated in summer by greater convective boundary-layer mixing. Some part of the explanation for poorer model-measurement accuracy in summer may also be due to shortcomings in the values of the monthly emissions factors used in the model to disaggregate the annual emissions totals of NO x (and VOC). The more consistent temporal correlations across site types compared with bias are again consistent with issues with the specification of amount and dilution of local emissions into the 5 km model grids rather than issues with describing the meteorology.
By day-of-week. Model-measurement correlation for daily mean NO 2 was similar for all days of the week at both site types (Fig. 2i). On the other hand, there were pronounced differences in NMB between weekday and weekend for both RB and UB sites (Fig. 2k). NMB was more positive at weekends than during weekdays at RB sites, and NMB was similarly less negative at weekends compared with weekdays at UB sites. There was less weekday-weekend contrast in RMSE (Fig. 2l). The invariant day-of-week correlation but weekday-weekend differences in NMB again indicate that general meteorology is captured well by the model but that there may be shortcomings in the day-of-week factors applied in the model to disaggregate the annual local NO x (and VOC) emissions totals.
[Figure caption fragment: statistics (n = 7 and 37 for RB and UB sites, respectively), which are summarized in the superimposed box plot whose shading demarcates the interquartile range (IQR) and whose whiskers extend to the largest and smallest values within 1.58 × IQR from the box hinges.]

O 3 _max8hmean grouped by different periods of time
As with daily mean NO 2 , Fig. 3 reveals some trends in model-measurement statistics for daily maximum 8 h mean O 3 for data grouped by year, month, and day-of-week.
By year. Figure 3a-d show that there were no long-term trends in the O 3 _max8hmean model-measurement statistics at RB and UB sites over the years 2001-2010. Model-measurement correlations were similar at both types of sites (mean of median r = 0.76 and 0.77 for RB and UB sites, respectively) (Fig. 3a), but bias was less at RB sites than at UB sites (mean of median FAC2 = 0.98 and 0.87, mean of median NMB = 0.10 and 0.33, respectively) (Fig. 3b and c). Error was likewise less at RB than at UB sites (mean of median RMSE = 16.7 and 23.0 µg m −3 , respectively) (Fig. 3d).
By month. Model-measurement correlation exhibited a pronounced seasonal variation (but which was similar for both RB and UB sites), with much better correlation in winter and summer than in spring and autumn (Fig. 3e). On the other hand, model bias was generally lower in spring and summer than in autumn and winter, with the smallest bias in June and the greatest in October (Fig. 3g). This seasonal variation in bias was more pronounced at UB sites than at RB sites. There was smaller seasonal variation in RMSE (Fig. 3h) than for other model-measurement statistics. As discussed above for NO 2 , the seasonal trends in O 3 model biases may be due to shortcomings in assigning seasonal trends to emissions of NO x and reactive VOC that together impact on regional O 3 concentrations. However, many factors influence surface concentrations of O 3 , acting on different temporal and spatial scales (Royal Society, 2008), so the seasonal patterns in correlation and bias are likely the net consequence of a number of drivers.
By day-of-week. Model-measurement correlation at both types of background sites did not show variation with day-of-week (mean of median r = 0.74 and 0.76 for RB and UB sites, respectively) (Fig. 3i). Correlation was much poorer at the Weybourne RB site (r ≈ 0.29), but, as noted above, the Weybourne comparison (which is only for O 3 ) is clearly anomalous. Model-measurement bias at RB sites was largely similar across day-of-week (mean of median FAC2 = 0.97, mean of median NMB = 0.11), with slightly reduced positive bias on weekend days (Fig. 3j and k). At UB sites, bias was greater during Tuesday-Friday (mean of median NMB = 0.30 and mean of median FAC2 = 0.87), but mean NMB decreased to 0.15 on Sundays and mean FAC2 increased to 0.95 (Fig. 3j and k). The RMSE was also lower at weekends than weekdays (Fig. 3l). The positive model bias at the urban sites, plus the improved model bias over the weekend, both indicate the issue of dilution into the 5 km × 5 km model grid of urban NO x emissions and the consequent lack of capture of the NO reaction with O 3 at sites influenced by traffic emissions (which are lower in the model at weekends).

PM 10 _daymean grouped by different periods of time
By year. Model-measurement correlations of daily mean PM 10 , grouped by year, did not show any inter-annual trend across the 10-year evaluation period or across the site types (Fig. 4a), except for enhanced correlations, on average, in 2003. Annual averages of model-measurement accuracy in daily PM 10 showed some inter-annual variabilities (Fig. 4b and c for FAC2 and NMB) but no trends across the 10 years. Annual averages of RMSE decreased slightly across the 10 years, although inter-site variability in RMSE was somewhat greater in 2010 (Fig. 4d).
By month. Model-measurement comparison statistics for daily mean PM 10 displayed strong seasonality at both types of sites (Fig. 4e-h). Correlations were similar for the RB and UB sites, with the best correlation in summer and the worst in late autumn and winter (Fig. 4e). In terms of bias, at RB sites PM 10 concentration was best simulated in late summer (mean of median NMB = 0.04 for July and August), and most overestimated in late autumn (NMB = 0.69 for October) (Fig. 4g). A similar seasonal pattern was apparent at the UB sites, but superimposed on a lower bias on average: PM 10 concentration was underestimated in late summer, but overestimated in late autumn and winter, with better accuracy on average in the summer half of the year. The RMSE values were similar at both RB and UB sites, but at both site types there was strong seasonality with substantially lower RMSE values during spring and summer (Fig. 4h), when correlation was also better (Fig. 4e), than during autumn and winter.
By day-of-week. Patterns in day-of-week model-measurement statistics for daily mean PM 10 (Fig. 4i-l) showed some similarity to those for daily mean NO 2 (Fig. 2i-l). Model-measurement correlations were fairly consistent throughout the week and similar at both site types (Fig. 4i) (a small reduction in correlation on Wednesdays at RB sites is likely simply a statistical artefact, as observed also for RMSE values on a Wednesday and a Tuesday, Fig. 4l). There was no significant variation in model accuracy at RB with day-of-week (Fig. 4j and k), although there are only four sites for this comparison. At UB sites, PM 10 concentration was simulated most accurately on weekdays (mean of median NMB = 0.01, mean of median FAC2 = 0.87) (Fig. 4j and k), but was overestimated at RB sites (mean of median NMB = 0.41). The positive bias at RB sites was probably due to the overestimation of sea salt, as mentioned above. At weekends, positive bias in PM 10 concentrations increased at UB sites (Fig. 4k), yet RMSE did not change (Fig. 4l), suggesting that the day-of-week emissions factors used in the model might not adequately reflect actual weekday-weekend differences in emissions.
Again, the general consistency in temporal correlation with site type and time period, compared with the variation in bias, is consistent with the main driver of model shortcoming being in accuracy of emissions (totals and temporal disaggregation) rather than in simulation of atmospheric chemistry and transport processes.
[Figure caption fragment: … 20 for RB and UB sites, respectively), which are summarized in the superimposed box plot whose shading demarcates the interquartile range (IQR) and whose whiskers extend to the largest and smallest values within 1.58 × IQR from the box hinges.]

PM 2.5 _daymean grouped by different periods of time
By year. Figure 5a-d summarize the model evaluation statistics for PM 2.5 daily means for the 2-year period of available monitor data (2009-2010). The PM 2.5 model-measurement comparison statistics are generally poorer in 2010, but 2 years is insufficient to draw any conclusion on inter-annual trends. As for PM 10 daily mean comparisons, there was a positive bias for daily mean at RB sites (mean of median NMB = 0.39) and a negative bias at UB sites (mean of median NMB = −0.26) (Fig. 5c). However, PM 2.5 was measured at only two RB sites, and at one of these, Auchencorth Moss in Scotland, the PM 2.5 concentrations were substantially lower than at any of the other measurement sites. At least half of the modelled PM 2.5 daily mean concentrations were within a factor of 2 of the measurements at all sites, except the RB site of Auchencorth Moss (Fig. 5b). Of the two RB sites, the model accurately simulated daily mean PM 2.5 concentration at Harwell (mean NMB = −0.02, mean FAC2 = 0.90), but there was substantially positive bias at Auchencorth Moss (mean NMB = 0.81, FAC2 = 0.43). As noted above for PM 10 , RMSE was, for unknown reasons, greater in 2010 (Fig. 5d).
By month. Model-measurement correlation was generally better in the summer half of the year than in the winter half (e.g. mean of median r = 0.76 and 0.68, respectively, at UB sites) (Fig. 5e). Similarly, there were greater values of FAC2 in spring and summer than in autumn and winter, particularly at UB sites (mean of median FAC2 = 0.86 and 0.78, respectively) (Fig. 5f). Although model-measurement bias did not vary substantially with season (Fig. 5g), as for PM 10 there was a seasonal correspondence of lower RMSE values (Fig. 5h) and higher correlation (Fig. 5e) during spring and summer, and vice versa during autumn and winter.
By day-of-week. In contrast to the other three pollutants, there were no obvious differences in model-measurement statistics between weekdays and weekend at either site type (Fig. 5i-l), but there are substantially fewer comparison data for PM 2.5 than for the other three pollutants (2 years rather than 10 years).

Hourly model-measurement statistics
The focus in this work was model-measurement comparisons at daily and annual averaging resolutions, but concentration data were available at hourly resolution and the Supplement presents figures and discussion of the comparison statistics for NO 2 and O 3 averaged by hour-of-day. These data support the general observations presented above for the longer averaging periods, in particular that correlations between model and measurement hourly data were generally consistent throughout the day but that bias and RMSE showed systematic variation, which is interpreted as arising from error in the hour-of-day emissions factors used to disaggregate the annual NO x emissions totals in the model (and from overdilution of the NO x emissions into the model grid relative to the siting of the monitor at urban sites).
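The grouped statistics discussed in this section can be illustrated with a minimal Python sketch. It assumes the commonly used definitions of the four statistics (Pearson r; FAC2 as the fraction of modelled values within a factor of 2 of the measurements; NMB as the summed model-measurement difference divided by the summed measurements; RMSE as the root mean square difference). The function names and the example records are hypothetical and are not taken from the analysis code used in this study.

```python
import math
from collections import defaultdict
from datetime import date


def pearson_r(model, obs):
    """Pearson correlation coefficient between paired model and measured values."""
    n = len(model)
    mean_m, mean_o = sum(model) / n, sum(obs) / n
    cov = sum((m - mean_m) * (o - mean_o) for m, o in zip(model, obs))
    var_m = sum((m - mean_m) ** 2 for m in model)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(var_m * var_o)


def fac2(model, obs):
    """Fraction of pairs with 0.5 <= model/measurement <= 2 (measurements assumed > 0)."""
    return sum(1 for m, o in zip(model, obs) if 0.5 <= m / o <= 2.0) / len(obs)


def nmb(model, obs):
    """Normalized mean bias: sum of (model - measurement) over sum of measurements."""
    return sum(m - o for m, o in zip(model, obs)) / sum(obs)


def rmse(model, obs):
    """Root mean square error of model against measurement."""
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))


def grouped_stats(records, key):
    """Group (day, modelled, measured) records by key(day) and compute statistics per group."""
    groups = defaultdict(lambda: ([], []))
    for day, modelled, measured in records:
        mod, ob = groups[key(day)]
        mod.append(modelled)
        ob.append(measured)
    return {
        label: {"r": pearson_r(mod, ob), "FAC2": fac2(mod, ob),
                "NMB": nmb(mod, ob), "RMSE": rmse(mod, ob)}
        for label, (mod, ob) in groups.items()
    }


# Hypothetical daily mean values (µg m-3) for one site: (date, modelled, measured).
records = [
    (date(2005, 1, 3), 20.0, 25.0),   # Monday
    (date(2005, 1, 4), 30.0, 28.0),   # Tuesday
    (date(2005, 1, 8), 10.0, 30.0),   # Saturday
    (date(2005, 1, 9), 12.0, 8.0),    # Sunday
]
stats = grouped_stats(records, lambda d: "weekend" if d.weekday() >= 5 else "weekday")
```

Passing a different key function, for example `lambda d: d.year` or `lambda d: d.month`, gives the by-year and by-month groupings; per-site results would then be summarized across sites (e.g. by their median), as in the figures.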

Discussion
The work presented here was motivated by the use of the EMEP4UK-WRF model output for air pollution epidemiology and health burden assessment; therefore the model-measurement comparison focused on health-relevant metrics for the most important ambient air pollutants: specifically the annual and daily means for PM 10 , PM 2.5 , NO 2 , and O 3 (the daily maximum 8 h mean for O 3 ) (WHO, 2013a). The model-measurement comparison was comprehensive; all available data from all non-roadside monitors in the UK's national automated urban and rural network for 2001-2010 were used, which span the range of ambient environments in which people are exposed to air pollution in the UK. Focus was placed on two important statistics for evaluation of air quality model output against health-relevant standards, namely correlation (temporal and spatial) and bias (e.g. USEPA, 2007; Derwent et al., 2010; Thunis et al., 2012), and also on the RMSE statistic, as discussed further below.
Even for a well-specified Eulerian model (in terms of input data, transport, chemistry, etc.), model-measurement agreement may not be perfect for (at least) the following two reasons: (i) the model simulates a volume-averaged concentration, whereas the monitor records the composition of the air in one part of that volume, which may or may not reflect the average concentration for the whole volume over the relevant time-averaging period; and (ii) the measurement may be in error. A rural background monitor in homogenous terrain and well away from local sources may be anticipated to be sampling air that is more homogenous over the 5 km × 5 km model grid in which it is located than an urban background monitor. The representativeness of an urban background monitor for the air in the model grid in which it is located will be dependent on the extent of urban area within that grid (and hence to some extent dependent on the absolute size of the particular urban area), as well as the distance of the monitor from specific local pollutant emission sources.
The presence of measurement uncertainty constrains the extent to which model-measurement statistics can be used to evaluate the performance of a model. The FAIRMODE project (fairmode.jrc.ec.europa.eu) has developed a series of relationships, published in Thunis et al. (2012, 2013), Pernigotti et al. (2013), and in documents on the FAIRMODE website, that define minimum values for model-measurement statistics, given values for the measurement uncertainty, U: for example, $\mathrm{RMSE} < 2U$, $|\mathrm{NMB}| < 2U/\bar{O}$, and $r > 1 - 2(U/\sigma_O)^2$ (with the O nomenclature representing observations, i.e. measurements). Values for these statistics (termed model performance criteria, MPC) can be derived from the dataset of measurements (observations) at each site in two ways. First, it can be assumed that the uncertainty in each measurement is at the maximum level for uncertainty specified in the EU Air Quality Directive for measurement at the limit value of the respective pollutant. These are 15 % for daily maximum 8 h mean O 3 , 25 % for daily mean PM 10 and 25 % for daily mean PM 2.5 (EC Directive, 2008). Secondly, the above publications also provide formulae and associated variable values that allow for the measurement uncertainty to vary as a function of the concentration of the metric being measured, i.e. to allow for greater measurement uncertainty than specified in the EU data quality objectives when measuring concentrations lower than the relevant air quality limit value, as is the case for the majority of ambient concentrations. In these circumstances, the calculated MPC values for r are lower and those for |NMB| and RMSE are greater than for the constant relative measurement uncertainty case.
[Figure caption fragment: statistics (n = 2 and 28 for RB and UB sites, respectively), which are summarized in the superimposed box plot whose shading demarcates the interquartile range (IQR) and whose whiskers extend to the largest and smallest values within 1.58 × IQR from the box hinges.]
The mean MPC values for r, |NMB| and RMSE per site type and pollutant metric, calculated using both approaches of assigning uncertainties to the actual datasets of measurements, are presented in Table 3 for comparison against the r, NMB, and RMSE values derived in the present model-measurement comparison. No MPC values are presented for NO 2 because the FAIRMODE data relate to quantification of NO 2 as an hourly average, whereas the present study was based on daily average NO 2 .
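As an illustration of the first (constant relative uncertainty) approach, the three relationships quoted above can be evaluated for one site's measurements along the following lines. This is a sketch only: it assumes that U is taken as the root mean square of the per-measurement absolute uncertainties (relative uncertainty × concentration) and that σ O is the population standard deviation of the measurements, neither of which is specified here; the function names and example values are hypothetical.

```python
import math


def mpc_constant_relative_uncertainty(obs, rel_uncertainty):
    """MPC thresholds from RMSE < 2U, |NMB| < 2U/mean(O), r > 1 - 2(U/sigma_O)^2.

    Assumption (not from the source): U is the RMS of rel_uncertainty * O_i.
    """
    n = len(obs)
    mean_o = sum(obs) / n
    # Population standard deviation of the measurements (assumed convention).
    sigma_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    u = math.sqrt(sum((rel_uncertainty * o) ** 2 for o in obs) / n)
    return {
        "U": u,
        "RMSE_max": 2.0 * u,
        "NMB_abs_max": 2.0 * u / mean_o,
        "r_min": 1.0 - 2.0 * (u / sigma_o) ** 2,
    }


def meets_mpc(r, nmb, rmse, mpc):
    """True if the model-measurement statistics satisfy all three criteria."""
    return r > mpc["r_min"] and abs(nmb) < mpc["NMB_abs_max"] and rmse < mpc["RMSE_max"]


# Hypothetical daily maximum 8 h mean O3 measurements (µg m-3), 15 % relative uncertainty.
mpc = mpc_constant_relative_uncertainty([40.0, 60.0, 80.0, 100.0], 0.15)
ok = meets_mpc(r=0.73, nmb=0.11, rmse=17.1, mpc=mpc)
```

The concentration-dependent approach would replace the single `rel_uncertainty` by an uncertainty that grows in relative terms at low concentrations, using the formulae in the cited FAIRMODE publications, which are not reproduced here.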
The intention here is to provide an overview of how the EMEP4UK-WRF model-measurement statistics presented here compare with threshold criteria for evaluation of an air-quality model in the European air-quality context. It is recognized that satisfying the MPC is a necessary but not sufficient part of model validation. Nevertheless, Table 3 shows that in all instances the site-mean model-measurement r, NMB, and RMSE values from the EMEP4UK-WRF modelling described here are better than their respective MPC values derived assuming concentration-dependent measurement uncertainty, except for RMSE values for daily PM 10 at RB sites. In most cases, the EMEP4UK-WRF model-measurement r, NMB, and RMSE values are also better than, or comparable with, the MPC values derived assuming constant relative measurement uncertainty. For example, the site-mean model-measurement correlations of 0.73 and 0.76 for daily maximum 8 h mean O 3 at RB and UB sites exceed the estimated mean MPC r values of 0.42 and 0.69 for these sites, respectively, and the site-mean model-measurement NMB values of 0.11 and 0.26 are less than the estimated MPC NMB values of 0.31 and 0.33, for RB and UB sites, respectively. The site-mean model-measurement RMSE values of 17.1 and 21.8 µg m −3 for O 3 at RB and UB sites are similar to the estimated MPC values of 21.8 and 18.1 µg m −3 , respectively, at these sites. The pattern is similar when considering model-measurement metrics for PM 10 and PM 2.5 against their respective MPC values derived assuming constant relative measurement uncertainty.
Although MPC values cannot be calculated here for daily mean NO 2 , example values published for hourly mean NO 2 (Thunis et al., 2012;Pernigotti et al., 2013) suggest that MPC values for daily mean NO 2 are likely to be roughly similar to those for daily maximum 8 h mean O 3 . If so, then Table 3 shows that the model-measurement statistics for daily mean NO 2 are also generally in line with or better than their respective MPC values.
FAIRMODE also outlines an approach to defining a model quality objective (MQO) for bias relative to long-term average pollutant concentration measurement. The absolute values for this MQO bias, calculated using the measurements relevant to this study, are presented in Table 2 for each pollutant and site type, and are also demarcated by the green lines in the scatter plots of modelled versus measurement long-term means in Fig. 1. Minimum model performance is satisfied if ≥ 90 % of sites have a model-measurement bias less than the value given in Table 2, i.e. have data markers within the two green lines in the panels of Fig. 1. The proportion of sites meeting this condition per pollutant and site type is also given in Table 2. The condition is satisfied in all cases except for NO 2 and O 3 at UB sites, and PM 10 at RB sites. For PM 10 at RB sites, the one site (out of four) that has bias outside the MQO has a bias very close to the MQO (Fig. 1c). Similarly, the biases at the 20 % (i.e. six) of the UB O 3 sites not meeting the MQO are all very close to the MQO (Fig. 1b). Non-compliant model-measurement biases at UB NO 2 sites are somewhat greater than the MQO, but it should be noted that satisfactory model performance allows for 10 % of sites (i.e. four sites in this instance) to be outside the MQO. Examining Fig. 1a for the UB NO 2 sites reveals that when the four sites showing greatest bias are excluded, the biases for the remaining non-compliant sites are again generally quite close to the MQO.
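The 90 % condition described above amounts to a simple count over sites, sketched below in Python. The function name, the example biases, and the example MQO value are hypothetical and are not the values in Table 2.

```python
def mqo_bias_compliance(site_biases, mqo_bias, required_fraction=0.9):
    """Return (fraction of sites with |bias| < MQO bias, whether the condition is met)."""
    within = sum(1 for bias in site_biases if abs(bias) < mqo_bias)
    fraction = within / len(site_biases)
    return fraction, fraction >= required_fraction


# Hypothetical long-term mean biases (µg m-3) at four sites, one of them outside the MQO.
fraction, satisfied = mqo_bias_compliance([1.0, -2.0, 3.5, 0.5], mqo_bias=3.0)
```

With only four sites, a single site outside the MQO already takes the compliant fraction to 75 %, which is why small site numbers make this condition sensitive to individual sites.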
The UK AURN operates as a single network subject to standardized QA/QC procedures (as described in Sect. 2), so measurement uncertainty might be lower than the values derived by the FAIRMODE project for measurement across multiple networks. On the other hand, the MPC values in Table 3 show that allowing for increasing measurement uncertainty at lower concentrations very considerably relaxes the threshold of an MPC. Also, as described in Sect. 2.2, instrumentation for "real-time" measurement of PM 10 and PM 2.5 in the UK has varied and in some instances has necessitated post hoc application of correction factors, which introduces additional unaccounted-for measurement uncertainty for these species compared with measurement of NO 2 and O 3 . It should also be remembered that the above analysis of magnitudes of model-measurement statistics does not allow for uncertainty arising from lack of spatial representativeness of the measurement location within its model grid, as discussed already.
Although the EMEP4UK-WRF model-measurement statistics reported in Tables 2 and 3 are for the most part in line with or better than anticipated model performance criteria, there were also instances of trends in statistics with site type, month-of-year, and day-of-week. (In general there were no obvious inter-annual trends across the decade of comparisons.) Bias was least overall for rural sites (e.g. median normalized mean bias values for O 3 and NO 2 of 0.08 and 0.11, respectively), reflecting the smaller likelihood of sub-grid variations in sources, dispersion, and deposition perturbing concentrations at the monitor location away from the model grid average. There was a tendency for positive model bias for O 3 at UB sites (median NMB = 0.26) and for negative model bias in NO 2 (−0.29) and PM 2.5 (−0.26) at these sites. The negative biases may reflect both underestimation of primary emissions of NO x and PM and a tendency for air at urban background monitor locations to be more influenced by the primary emissions in the vicinity than simulated by the model, which effectively averages all emissions evenly across the 5 km × 5 km grid in which the monitor is located. Unless the urban area is very large (greater than a few kilometres in linear dimension), the air even at a background site in the centre of that urban area is likely to be more influenced by local primary emissions than peripheral (suburban) parts of the urban area included in the model grid average. A further contributor to model negative bias for PM is the known omission from the model of some PM components, including particle-bound water and some sources of dust resuspension, as noted in Sect. 2.1.
The positive model bias for O 3 at UB sites is consistent with the explanations given above for the negative model biases for NO 2 (and PM 2.5 ). The dilution of the NO x emissions in urban areas into the 5 km × 5 km model grid means that the model underestimates the reactive removal of O 3 by NO in the vicinity of the urban monitor, an effect that cannot be resolved even by the comparatively high resolution of the EMEP4UK-WRF ACTM.
Instances of trends in model-measurement bias with month or day-of-week are described in the Results section. The generally good daily temporal correlations discussed already indicate that the model captured the day-to-day changes in air mass movements which are the strongest influences on surface concentrations of pollutants at this temporal resolution. The observed seasonal and weekday-weekend variations in bias (and of diurnal variations in bias; see the Supplement) are therefore strongly suggestive of shortcomings in the monthly and weekday-weekend (and hour-of-day) emissions factors applied in the model to disaggregate the annual total emissions supplied by the emissions inventories.
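To make the role of these temporal factors concrete, the following Python sketch distributes an annual emissions total over the hours of a year in proportion to the product of a monthly, a day-of-week, and an hour-of-day factor. It is a generic illustration and not the scheme implemented in the EMEP model: the normalization (weights rescaled so that the hourly values sum to the annual total), the function name, and the factor values are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def hourly_emissions(annual_total, year, month_factors, dow_factors, hour_factors):
    """Split an annual total into hourly values weighted by month, weekday and hour factors.

    month_factors: 12 values (January first); dow_factors: 7 values (Monday first);
    hour_factors: 24 values (hour 0 first). Weights are normalized to conserve the total.
    """
    weights = {}
    t = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    while t < end:
        weights[t] = (month_factors[t.month - 1]
                      * dow_factors[t.weekday()]
                      * hour_factors[t.hour])
        t += timedelta(hours=1)
    total_weight = sum(weights.values())
    return {t: annual_total * w / total_weight for t, w in weights.items()}


# Hypothetical factors: no seasonal or diurnal variation, lower emissions at weekends.
emissions = hourly_emissions(
    annual_total=1000.0,
    year=2005,
    month_factors=[1.0] * 12,
    dow_factors=[1.1, 1.1, 1.1, 1.1, 1.1, 0.8, 0.7],
    hour_factors=[1.0] * 24,
)
```

In such a scheme an error in, say, the weekend factor shifts modelled concentrations on every weekend day in the same direction, which would alter the weekday-weekend pattern of bias while leaving day-to-day correlation driven by meteorology largely unchanged.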
As stated at the outset, the motivation here was use of the EMEP4UK-WRF model output for health studies. In the context of use of concentration data for epidemiology, in the broadest terms correlation is more important than bias, and for the model output reported here, model-measurement correlations (both temporal and spatial) were generally considerably better, particularly for the gaseous pollutants, than bias statistics. Epidemiological studies of association of ambient air pollution with health require an estimate of exposure for each subject, most usually from measurements from monitors, but increasingly from models. The difference between the estimates and a hypothetical gold standard, for example concentration outside the residence of each subject, is called exposure measurement error. (It is assumed here that it is the association of ambient pollution with health outcome at the small-area level that is important, because of the link to regulation (Dominici et al., 2000), rather than exposure at the level of the individual, and therefore issues of disparity between the concentration at a location and true personal exposure are not considered.) The consequences of measurement error are to reduce the power of the study to detect an association and to bias the magnitude of the association (Sheppard et al., 2005, 2012; Armstrong and Basagaña, 2015).
The agreement statistics determining the magnitude of this "blunting" depend on the specific context. Study power is simplest, depending only on the correlation between the true and estimated exposure. Of the two main types of epidemiological studies of air pollution, in "spatial studies" power is diminished according to the correlation of long-term true and estimated means over space, whereas in "time series studies" it depends on correlations of daily values over time. Thus the model-measurement correlations reported in Sect. 3.1 and 3.2 have a fairly direct implication for study power in those two study types, except that errors in the measured values as estimates of the mean over the population in the grid square (or wider area) are not allowed for. Because of this, the power of studies using modelled concentrations would be somewhat better than implied by the correlations reported (Butland et al., 2013).
Low correlation of "true" and estimated exposures also often reduces the estimated size of association (e.g. relative risk per unit exposure), but other aspects of the error distribution also matter, notably the extent to which the error is of Berkson or classical type (Butland et al., 2013; Armstrong and Basagaña, 2015). It is difficult and beyond the scope of this paper to separate Berkson and classical error, but in the absence of this it would be reasonable to consider the model-measurement correlations as broad guides to bias in association as well as power. Perhaps surprisingly, additive bias (e.g. estimating concentration 10 units too high on average) has little effect in epidemiological studies, at least if the exposure-health association is assumed linear, as it usually is (although bias in association is also dependent on relative magnitudes of variance in "true" and estimated exposures).
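The contrast between additive bias and poorly correlated exposure estimates can be illustrated with a small simulation. The Python sketch below assumes a linear exposure-health association and purely classical (independent, additive, random) error, which is a simplification of the mixed Berkson and classical error discussed above; all parameter values and names are hypothetical.

```python
import random


def ols_slope(x, y):
    """Ordinary least squares slope of y regressed on x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def simulate_slopes(n=5000, beta=0.5, exposure_sd=5.0, error_sd=5.0, seed=1):
    """Estimate the exposure-outcome slope using true, biased and noisy exposures."""
    rng = random.Random(seed)
    true_x = [rng.gauss(20.0, exposure_sd) for _ in range(n)]
    outcome = [beta * x + rng.gauss(0.0, 1.0) for x in true_x]
    # Additive bias: every exposure estimate is 10 units too high.
    biased_x = [x + 10.0 for x in true_x]
    # Classical error: independent random noise added to each exposure estimate.
    noisy_x = [x + rng.gauss(0.0, error_sd) for x in true_x]
    return {
        "true": ols_slope(true_x, outcome),
        "additive_bias": ols_slope(biased_x, outcome),
        "classical_error": ols_slope(noisy_x, outcome),
    }


slopes = simulate_slopes()
```

Under these assumptions the slope estimated from the additively biased exposure matches that from the true exposure, whereas classical error attenuates the slope by approximately the ratio of the true exposure variance to the total variance of the error-prone estimate (one half for the values chosen here).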
As well as the good temporal correlations for daily pollutant metrics, the good spatial correlations between long-term averaged modelled and measured concentrations across urban sites for all four pollutants selected encouragingly suggest that the EMEP4UK-WRF modelled pollutant concentration may broadly reduce exposure measurement error caused by using pollution measurements from air pollution monitors far from the population under consideration. On the other hand, a bias error in the simulations contributes to uncertainty in the investigation of any threshold in concentration-health effect, and in health impact assessments that apply concentration-response functions to estimated concentrations of exposure.
This study has worked with the EMEP4UK-WRF v4.3 model. Model-measurement statistics will be different for other models. However, other ACTMs are similarly constructed, and so the broad discussion points relating to intrinsic limitations to monitor versus grid-volume comparison statistics, unresolved sub-grid variabilities, and shortcomings in magnitudes and temporal trends in emissions are generalizable. Local dispersion models can better represent the sources and dispersion at high spatial resolution, but these can only be configured for specific urban areas at a time, are similarly constrained by the accuracy of the spatiotemporal emissions data, and require provision of boundary conditions of meteorology and atmospheric composition (often supplied by an ACTM). Dispersion models have also been combined with land-use regression models (Wilton et al., 2010; Michanowicz et al., 2016) but again for individual areas only. Some progress is being made in combining measurement (both ground-based and satellite) and model data through data assimilation (e.g. MACC-II: Monitoring Atmospheric Composition and Climate - Interim Implementation (http://www.gmes-atmosphere.eu/about/); Singh et al., 2011) and data fusion (Berrocal et al., 2011; Zidek et al., 2012; Friberg et al., 2016), but these approaches are computationally demanding, particularly for reactive species, and can only be applied to historic data. National-scale air pollution modelling as described here, despite acknowledged limitations for health studies (Butland et al., 2013), has the benefits of providing self-consistent chemical concentration fields, of providing data for air pollutant components that are either not, or only sparsely, measured, and of providing the capacity to investigate the potential effects of alternative possible futures.

Conclusions
This study was motivated by the use in air pollution epidemiology and health burden assessment of data simulated at 5 km × 5 km horizontal resolution by the EMEP4UK-WRF v4.3 atmospheric chemistry transport model. A spatially and temporally comprehensive set of model-measurement comparison statistics is presented for daily and annual concentrations of NO 2 , O 3 , PM 10 , and PM 2.5 across the UK for a 10-year period.
In general for epidemiology, capturing correlation is more important than bias and RMSE, and in this study model-measurement temporal correlation of daily concentrations generally exceeded minimum performance values calculated from methods reported in the literature that take into account potential measurement uncertainties. Model-measurement bias varied according to monitor site classification, with generally less bias at rural background compared with urban background sites, but bias was again better (i.e. smaller) than values that take account of uncertainties in the measurements. The greater consistency in temporal correlation with site type and across months and day-of-week, compared with variations in bias, is strongly indicative that the main driver of model shortcoming is inaccuracy of emissions (totals and the monthly and day-of-week temporal factors applied in the model to the totals) rather than in simulation of atmospheric chemistry and transport processes.
Despite discussed limitations, these detailed analyses support use of model data such as these in air pollution epidemiology. Air pollution modelling at the spatial coverage and spatial resolution described here has the benefit of increasing study power, of providing data for air pollutant components that are either not, or only sparsely, measured, and of enabling investigation of the potential effects of alternative future scenarios.
Code and data availability. This study used output from the EMEP4UK-WRF model, which is a regional application of the European Monitoring and Evaluation Programme (EMEP) MSC-W model (available at http://www.emep.int/, version vn4.3 used here) driven by meteorology from the Weather Research and Forecast model (http://www.wrf-model.org) version 3.1.1. As described and referenced in Sect. 2.1, the EMEP4UK model has increased spatial resolution over a British Isles inner domain and uses national emissions data for the UK. All EMEP4UK modifications are included in the official EMEP model. The model and measurement data used to derive the statistics presented in this work are archived at the University of Edinburgh at doi:10.7488/ds/2001 (Lin et al., 2017).
The Supplement related to this article is available online at doi:10.5194/gmd-10-1767-2017-supplement.