A regional air quality forecasting system over Europe : the MACC-II daily ensemble production

. This paper describes the pre-operational analysis and forecasting system developed during MACC (Monitor-ing Atmospheric Composition and Climate) and continued in the MACC-II (Monitoring Atmospheric Composition and Climate: Interim Implementation) European projects to provide air quality services for the European continent. This system is based on seven state-of-the art models developed and run in

V. Marécal et al.: A regional air quality forecasting system over Europe precursors) over eight vertical levels from the surface to 5 km height. The hourly analysis at the surface is done a posteriori for the past day using a selection of representative air quality data from European monitoring stations.
The performance of the system is assessed daily, weekly and every 3 months (seasonally) through statistical indicators calculated using the available representative air quality data from European monitoring stations. Results for a case study show the ability of the ensemble median to forecast regional ozone pollution events. The seasonal performances of the individual models and of the multi-model ensemble have been monitored since September 2009 for ozone, NO 2 and PM 10 . The statistical indicators for ozone in summer 2014 show that the ensemble median gives on average the best performances compared to the seven models. There is very little degradation of the scores with the forecast day but there is a marked diurnal cycle, similarly to the individual models, that can be related partly to the prescribed diurnal variations of anthropogenic emissions in the models. During summer 2014, the diurnal ozone maximum is underestimated by the ensemble median by about 4 µg m −3 on average. Locally, during the studied ozone episodes, the maxima from the ensemble median are often lower than observations by 30-50 µg m −3 . Overall, ozone scores are generally good with average values for the normalised indicators of 0.14 for the modified normalised mean bias and of 0.30 for the fractional gross error. Tests have also shown that the ensemble median is robust to reduction of ensemble size by one, that is, if predictions are unavailable from one model. Scores are also discussed for PM 10 for winter 2013-1014. There is an underestimation of most models leading the ensemble median to a mean bias of −4.5 µg m −3 . The ensemble median fractional gross error is larger for PM 10 (∼ 0.52) than for ozone and the correlation is lower (∼ 0.35 for PM 10 and ∼ 0.54 for ozone). This is related to a larger spread of the seven model scores for PM 10 than for ozone linked to different levels of complexity of aerosol representation in the individual models. In parallel, a scientific analysis of the results of the seven models and of the ensemble is also done over the Mediterranean area because of the specificity of its meteorology and emissions.
The system is robust in terms of the production availability. Major efforts have been done in MACC-II towards the operationalisation of all its components. Foreseen developments and research for improving its performances are discussed in the conclusion.

Introduction
The chemical composition of the air close to Earth's surface, generally referred as "air quality" (AQ), directly affects human and animal health and also the vegetation. For instance, ozone has a known impact on the respiratory system (e.g. WHO, 2004) and on the vegetation development (e.g. Fuhrer and Booker, 2003). Recently, the World Health Organization reported that in 2012 around 3.7 million deaths were attributable to ambient air pollution (http://www.who. int/phe/health_topics/outdoorair/databases/en/). This is why air quality has become a major concern, starting in the 1970s, particularly in Europe (e.g. WHO, 2013). Since the Helsinki Protocol in 1985, many regions and countries, including the European Union countries, have progressively put in place tools to regulate and to control the emissions of the main air pollutants. This has led to an important effort to monitor the air composition near the surface but also to develop air quality forecasting systems in experimental or operational modes (see reviews by Ebel et al., 2005;Menut and Bessagnet, 2010). These tools can be used in cases of high pollution episodes to inform people and to take emergency measures to prevent harming effects. They can also be used for policy makers for the regulations on air pollutant emissions and for monitoring the effect of these regulations on air quality (episodes and also background pollution).
The main pollutants under focus for air quality are ozone, nitrogen oxides (NO x = NO 2 + NO), sulfur dioxide (SO 2 ), volatile organic compounds (VOCs), ammonia (NH 3 ), particulate matter, heavy metals (Pb, Cd, Hg) and persistent organic pollutants (POPs, e.g. pesticides and dioxin). Ozone is a secondary pollutant, meaning that it is not emitted but produced from gaseous precursors (mainly VOCs and NO x ) originating from both natural and anthropogenic sources. Particulate matter (PM) corresponds to small size aerosols. PM is categorised as PM 10 (size < 10 µm), PM 2.5 (size < 2.5 µm) and PM 1 (size < 1 µm). These categories were chosen because of their known effects on health. In PM, the distinction between primary (dust, sea salts, black carbon and organic carbon) and secondary aerosols formed from gaseous precursors such as SO 2 , DMS (dimethyl sulfate), H 2 S, NH 3, NO x and VOCs is ignored when considering mass or number concentration only.
Besides the development of surface measurement networks for these main pollutants, there has been a sustained research effort on the atmospheric chemistry modelling for air quality forecasting purposes. Regional and local air quality forecasting systems (Kukkonen et al., 2012;Zhang et al., 2012) rely on limited area models that can be based either on an off-line or an on-line approach to take into account the effect of meteorological conditions on air composition. Offline chemistry models, known as chemistry-transport models (CTMs), use the meteorological parameters from the analyses or the forecasts provided by a separate numerical weather prediction model. On-line models are meteorological models in which chemical variables and processes are included (Baklanov et al., 2014). On-line models have the capability to represent the feedback of the chemical composition on meteorological parameters but they are computationally demanding by design. This is why CTMs are generally preferred for operational air quality forecasting systems.
The chemical composition of air depends on many processes that need to be well represented in models in Geosci. Model Dev., 8, 2777-2813, 2015 www.geosci-model-dev.net/8/2777/2015/ order to provide reliable air quality forecasts (e.g. Rao et al., 2011). The composition near the surface is very much driven by emissions but also by chemical processes (gaseous/heterogeneous reactions and photolysis) including the production of secondary pollutants, by the advection by winds, by the diffusion in the planetary boundary layer, by the scavenging by rain and by the dry deposition at the surface. Each of these processes has its own uncertainty. These uncertainties come, on the one hand, from the limit of our current knowledge and, on the other hand, from the need to simplify the process representation in models because of computational constraints. In meteorology and climate studies, and more recently in atmospheric dispersion and chemistry modelling, the approach based on a multi-model ensemble of forecasts has been developed to provide better information by combining information from different models. The methods vary from very simple such as the average or the median to more elaborated such as weighted averages based on past scores, Bayesian models or spectral methods (e.g. Delle Monache et al., 2006;Riccio et al., 2007;Potempski et al., 2010;Galmarini et al., 2013). The European Union is very much involved in air quality issues not only through a series of protocols on emissions and consecutive political actions but also by supporting research activities aiming at developing tools for air quality monitoring in Europe. These activities were initiated in the GEMS (Global and regional Earth-system (atmosphere) Monitoring using Satellite and in situ data, FP6, 2005FP6, -2009Hollingsworth et al., 2008) and PROMOTE (ESA PROtocol MOniToring for the GMES Service Element: Atmosphere, 2006Atmosphere, -2009://www.gse-promote.org/) projects and pursued in the MACC (Monitoring Atmospheric Composition andClimate, FP7, 2009-2011), MACC-II (Monitoring Atmospheric Composition and Climate: Interim Implementation, FP7, 2011FP7, -2014 and MACC-III (Monitoring Atmospheric Composition and Climate-III, H2020, 2014-2015) projects. One of the major achievements accomplished in GMES (Global Monitoring for Environment and Security), MACC and MACC-II for European AQ objectives is the development and the exploitation of a pre-operational analysis and forecasting system run on a daily basis. This system is based on the combined use of an ensemble of seven air quality models. The general objective of this system is not to provide air quality forecasts and analyses for precise local situations but at the pan-European scale. For this purpose, the horizontal resolution chosen for the individual models is between 10 and 20 km, thereby representing large scale phenomena and background air pollution. GEMS involved 10 research and operational models. Evolving towards a preoperational system, the MACC/MACC-II/MACC-III ensemble is, since 2009, based on the following seven state-of-theart regional CTMs, which are all developed and run in Europe and that have been extensively evaluated: CHIMERE (Menut et al., 2013a), EMEP (European Monitoring and Evaluation Programme; MSC-W version; Simpson et al., 2012), EURAD-IM (European Air pollution Dispersion Inverse Model;Haas et al., 1995;Memmesheimer et al., 2004), LOTOS-EUROS (Long Term Ozone Simulation -European Ozone Simulation; Schaap et al., 2008), MATCH (Multiscale Atmospheric Transport and Chemistry; Robertson et al., 1999;Andersson et al., 2015), MOCAGE (Model Of atmospheric Chemistry At larGE scale; Josse et al., 2004;Dufour et al., 2004) and SILAM (System for Integrated mod-eLling of Atmospheric coMposition; Sofiev et al., 2008). They are used to produce a multi-model ensemble for major monitored pollutants. Although each of these models can perform very well on particular days in particular areas, the ensemble approach aims at providing, on average, forecasts and analyses of better quality than any of them individually. It also gives an indication of the uncertainties through the spread between the models. Similarly to meteorological forecasts, the quality of the AQ forecasts needs to be routinely evaluated to provide information to users about its reliability. The performance of the individual and ensemble forecast products is evaluated on a daily basis from comparisons with available surface observations by the European AQ station network. Additionally, the system has been providing birch pollen forecasts at the surface during the pollen season since 2013. All the forecast and analysis numerical data are publicly available.
The objectives of the paper are, firstly, to provide a description of the pre-operational analysis and forecasting system in place within MACC and MACC-II to provide AQ services for the European continent and, secondly, to document and analyse the performance of the multi-model ensemble.
Since the system continuously evolves with time, we present here the configuration at the end of the MACC-II project (summer 2014) with a brief description of recent upgrades included before the end of 2014. An overview of the analysis and forecasting system, including the seven models and the ensemble median, is provided in Sect. 2. Section 3 is devoted to the system performance for case studies and on a seasonal basis. Section 4 gives a summary and the perspective on short-and mid-term developments of the MACC-II system and associated research.

General description of the system
The MACC-II air quality system aims at providing analyses and forecasts of the main pollutants at the regional scale over the European continent: from 25 • W to 45 • E and from 30 to 70 • N. Each of the seven models is run at its own horizontal and vertical resolutions, with the horizontal resolutions varying between ∼ 20 and ∼ 10 km. This range of resolutions is not designed to reproduce local aspects of air pollution but to provide concentrations of pollutants at the regional scale that can then be used in particular as boundary conditions for AQ forecasts at finer resolution. The range of the forecasts is 96 h from 00:00 UTC on Day0 with hourly outputs on eight vertical levels (surface, 50, 250, 500, 1000, 2000, 3000 and 5000 m). Day0 is defined as the day when the forecast is run. The forecast initial time/date is Day0 at 00:00 UTC and final time/date is Day3 at 24:00 UTC. For each timestep (1 h), the individual model fields are interpolated on these vertical levels and on the same regular 0.1 • latitude by 0.1 • longitude grid over the MACC-II European domain. It is from these re-gridded fields that the ensemble median and verification products are calculated. Before mid-May 2014, only the surface, 500, 1000 and 3000 m levels were produced. The forecast species include O 3 , NO 2 , SO 2 , CO, PM 10 and PM 2.5 , which are called core species hereafter. The core species are monitored in near-real time (NRT) by European air quality stations and forecasts can therefore be evaluated routinely against these observations. Forecasts of birch pollen concentrations at surface are also produced during the pollen season (1 March-30 June) since 2013. This product is not discussed in this paper since its description and validation is detailed in . Additionally, since mid-May 2014, the production has been extended to other species or aggregation of species (NO, NH 3 , PAN+PAN precursors, total non-methane volatile organic compounds -NMVOCs). Additional species are provided primarily for the use as initial and/or boundary conditions mainly for finer-scale models designed for local AQ purposes.
The analysis at the surface for Day0-1 (the day before Day0) is run daily a posteriori on Day0 using the assimilation of the hourly data from the AQ monitoring stations available in Europe between 00:00 and 23:00 UTC on Day0-1. Like for the forecasts, Day0 is defined as the day when the analysis is run. . The analysis initial time/date is Day0-1 at 00:00 UTC and final time/date is Day0-1 at 23:00 UTC. Similarly to the forecasts, the hourly individual model fields are interpolated on the same 0.1 • latitude by 0.1 • longitude grid. The analyses are only produced at the surface level. Table 1 gives the portfolio of the regional data products. All the additional species and vertical levels are not yet available from all models but this is planned to be completed in 2015. Table 2 gives the current times of delivery of the ensemble numerical data products. These production times have been shifted earlier since summer 2014 in order to fulfil the users' needs, in particular Day0 and Day1 forecasts, which are the mostly used products, are now available at 07:00 UTC. This has been made possible by an earlier delivery of the forecasts of each of the seven models and by replacing the bulk 96 h processing of the ensemble by processing 24 h segments. The delivery time of the analysis has also been shifted earlier in June 2014.
The NRT hourly observations of O 3 , NO 2 , SO 2 , CO, PM 10 and PM 2.5 from the European AQ monitoring stations are used for model assimilation to produce the daily analyses and also for the forecast and analysis evaluation. From 2009 until recently, they were gathered country by country through bilateral agreements with the project. Since 2014, a new system has been put in place to gather these observations from the centralised AirBase database maintained by the European Environment Agency (EEA). The database collects the NRT data and validated data from the European countries bound under Decision 97/101/EC to engage in a reciprocal exchange of information (EoI) on ambient air quality. The delivery time of the observations to EEA takes place earlier and there is on average more data available than when gathering them bilaterally country by country, although there is a large variability from one day to another in the number of data available. For the use in the production of the analyses, we chose after a dataflow monitoring of the EEA database a cut-off time at 07:00 UTC on Day0 for the data set covering Day0-1. At this time of the day, more than 90 % (on average) of all data are available. The 07:00 UTC cut-off time is therefore a compromise between having enough data available for the model assimilation and a reasonable production time for the ensemble analysis that was at 14:30 UTC at the end of MACC-II. This production time is still too late for the forecasts to be initialised from the analysis, meaning that the forecast and the analysis products are currently run in two separate chains for each model. For the product evaluation, the observations covering Day0-1 available in the EEA database at 23:00 UTC on Day0 are used since there is less constraint on the time of delivery of evaluation products. On average there is about 10 % more data available at 23:00 UTC than at 07:00 UTC. As shown in a MACC-II report (D16_3; http://www.gmes-atmosphere.eu/ documents/maccii/deliverables/obs/), the additional data collected at 23:00 UTC compared to 07:00 UTC are mainly data from the end of the previous day. This is because there is a significant number of stations that do not send their late afternoon and evening Day0-1 data before 07:00 UTC on Day0. This means that the 23:00 UTC data set used for verification is homogeneous with approximately the same number of observations in the morning, afternoon and evening.
Because the NRT AQ observations used are not validated data, sorting procedures are applied to reject unrealistic observations through a blacklist. The blacklist includes stations identified as unrealistic, such as for instance stations giving the same concentration for each hour of the day. Moreover, only the data representative of the horizontal resolution of the regional models (10-20 km) are selected. There is currently no uniform and reliable metadata on site representativeness available for all regions and countries of Europe. This is why we chose to follow the work that has been done by Joly and Peuch (2012) to build an objective classification of sites, based on past validated measurements available in the AirBase database (EEA). Stations are classified between 1 and 10 depending on the characteristics of their series of measurements (diurnal cycle, "weekend effect" and high frequency variability with periods lower than 3 days). The orig- This leads to a typical number in summer 2014 of ∼ 600 sites for ozone, ∼ 500 sites for NO 2 , ∼ 150 sites for SO 2 , ∼ 40 sites for CO, ∼ 400 sites for PM 10 , ∼ 150 sites for PM 2.5 . All these data are used for the verification of the forecast products. For the verification of analyses, the developments done during MACC-II were only put into place after the end of the project. This verification is done in the following way: a list of stations not used for the assimilation is kept aside for each pollutant for verification. This list is the same every day and it has been determined so that the stations are well spread inside the domain. The ratio of observations that are kept aside for the verification of analyses is roughly 20 % of the total amount of observations that are downloaded at 23:00 UTC. The plots of forecasts and analyses from the seven models and the ensemble median, as well as of their scores against observations are available daily at http: //macc-raq.copernicus-atmosphere.eu/. Numerical data are publicly available and can be accessed at http://www. gmes-atmosphere.eu/request_regional_data/. The full set of numerical data as listed in Table 1 is made available as soon as it is produced on the Météo-France FTP (file transfer protocol) server. A subset of these data can also be interactively accessed through the Deutsche Zentrum für Luft-und Raumfahrt (DLR) World Data Center.
Major sources of uncertainties in the regional AQ forecasts and analyses are the quality of the emissions used, the meteorological forcings, the representation of the atmospheric physical and chemical processes, the initial and boundary conditions for the chemical species and the uncertainties in observations and assimilation methods impacting the analysis. The approach chosen in MACC-II is to use the best available emissions over Europe, high quality meteorological forecasts and chemical boundary conditions in all seven chemistry-transport models. Therefore, the variability between the forecasts of the seven models used in the ensemble comes mainly from differences in the models in the treatment of the chemical processes (homogeneous and heterogeneous, photolysis), the advection, the convective transport, the turbulent mixing and the wet and dry depositions. Other differences stem from the use of different vertical and horizontal grids. For the production of the analysis, each model uses its own assimilation system.
The inventory used for anthropogenic emissions was built primarily for modelling purposes in the frame of the MACC-II project (Kuenen et al., 2014). This is an updated version of the MACC inventory (Kuenen et al., 2011). Its resolution is 1/8 • longitude × 1/16 • latitude, which is approximately 7 km × 7 km, and covers the UNECE (Economic Commission for Europe) countries for the years 2003-2009. The 2009 inventory is currently used in the MACC-II daily production. An important upgrade of the MACC-II inventory compared to the earlier MACC inventory is the provision of a particulate matter split between elemental carbon, organic carbon, SO 4 , Na and other aerosols. More details on this inventory can be found in Kuenen et al. (2014). For the biogenic sources, each model deals with its own emissions based on dynamical parameterisations and/or inventories that are detailed in the following individual model description subsections. Additionally, emissions from fires are taken into account using the GFASv1.1 product  available daily at 0.1 • × 0.1 • resolution. GFASv1.1 is based on fire radiative power retrievals from data of the Moderate Resolution Imaging Spectroradiometer (MODIS) instruments aboard the Terra and Aqua satellites. The GFAS product for Day0-1 is available around 06:00 UTC on Day0. This is soon enough to be used in the daily analysis of individual production chains. At the time the individual forecasts begin for Day0, only the fire emissions from Day0-2 are available. To have a smaller time gap between the fire emissions and the starting time of the regional forecast runs (usually around 20:00 UTC), an additional fire emission product available around 20:30 UTC on Day0-1 using satellite observations from 15:00 UTC on Day0-2 to 15:00 UTC on Day0-1 is currently under testing. In the forecasts, a persistence of the fire emissions of 3 days is assumed. This is a rounded average of the fire duration obtained by Turquety et al. (2014) from the Euro-Mediterranean region from the MODIS MCD64 product (Giglio et al., 2010)   For the analysis on Day0-1, the IFS forecast starting at 00:00 UTC on Day0-1 is used. The regional domain boundary conditions for the aerosols and gaseous species are provided by the MACC-II global assimilation and forecasting system. This forecasting system is an extension of the ECMWF meteorological IFS running at lower resolution, providing concentrations of dust, sea salt, organic matter, black carbon and sulfate aerosols Benedetti et al., 2009) that are used to force the aerosols in the regional CTMs at the boundaries. At the end of MACC-II project (summer 2014), for the chemical species the IFS was two-way coupled to the off-line MOZART (Model for OZone And Related chemical Tracers) global CTM. This allowed for assimilation of satellite data for O 3 , NO 2 , and CO in the IFS itself, while the detailed chemical processes were handled in the MOZART model (Flemming et al., 2009;Stein et al., 2012;Inness et al., 2013). Since 18 September 2014, the MACC-II global assimilation and forecasting system has been upgraded to a fully integrated system for aerosols and chemical species. Instead of the coupling with the MOZART model, the chemistry is now treated on-line in the IFS using chemistry modules based on the TM5 model (Huijnen et al., 2010). This new system is named Composition-IFS (C-IFS) and is further described in Flemming et al. (2015). The chemical mechanism in the TM5 operational version of C-IFS is based on a modified version of the Carbon Bond 5 (CB05) scheme Yarwood et al., 2005).
Based on all the inputs described above, each of the centres in charge of the seven models runs its production locally and transfers its forecast and analysis files to Météo-France (referred to central production centre hereafter). The general organisation of the MACC-II air quality forecasts and analysis system is summarised in Fig. 1. Tables 3 and 4 give the general features of the seven individual models and of their analysis system. A short description of the seven individual models and of the ensemble median is given in the following sections. More details can be found in the MACC-II 6-month reports (http://www.gmes-atmosphere.eu/documents/maccii/ deliverables/ens/).

CHIMERE forecast and analysis system
CHIMERE is an Eulerian chemistry-transport model able to simulate concentration fields of gaseous and aerosols species at a regional scale (Menut et al., 2013a). The model is developed under the General Public License licence (http://www. lmd.polytechnique.fr/chimere/). CHIMERE is used for analysis of pollution events, process studies, Beekmann and Vautard, 2010), experimental and operational forecasts (Rouïl et al., 2009), regional climate studies and trends (Colette et al., 2011), among others. CHIMERE calculates and provides the atmospheric concentrations of tens of gas-phase and aerosol species over local (e.g. urban) to continental domains (from 1 km to 1 • resolution). Vertically, the model is able to simulate the whole troposphere. The gaseous species are calculated using the MELCHIOR2 scheme and the aerosols using the scheme developed by Bessagnet et al. (2004). This module takes into account species such as sulfate, nitrate, ammonium, primary organic matter (POM) and elemental carbon (EC), secondary organic aerosols, sea salt, dust and water. These aerosols are represented using eight bins, from 40 nm to 40 µm, in diameter. The life cycle of the aerosols is completely represented with nucleation of sulfuric acid, coagulation, adsorption/desorption, wet and dry deposition and scavenging. This scavenging is both represented by coagulation with cloud droplets and precipitation. The formation of SOA (secondary organic aerosol) is also taken into account .
Biogenic emissions are calculated using the MEGAN (Model of Emissions of Gases and Aerosols from Nature) emissions scheme (Guenther et al., 2006) which provides fluxes of isoprene and monoterpenes. The mineral dust emissions are calculated using the (Alfaro and Gomes, 2001) scheme, forced by satellite soil and surface data (Menut et al., 2013b). The CHIMERE assimilation system for operational products is based upon hourly optimal interpolation processing of surface observations for O 3 and PM 10 (Honoré et al., 2008). During MACC-II, an ensemble Kalman filter was also developed for ozone analysis (Gaubert et al., 2014).
CHIMERE is fully dedicated to regional air pollution modelling. It includes a comprehensive representation of the aerosol with SOA and secondary inorganic aerosols (SIA). CHIMERE has a chemical scheme specifically designed to reproduce the photochemical activity in the lower part of the troposphere (for air quality purposes). In terms of points that may need to be improved, the vertical resolution is composed of eight levels up to 500 hPa, meaning that the models need to be fed with realistic top conditions. The assimilation is thus far limited to O 3 and PM 10 and for the surface layer.

EMEP forecast and analysis system
The EMEP/MSC-W model (hereafter referred to as "EMEP model") has been developed at the EMEP Meteorological Synthesizing Centre-West at the Norwegian Meteorological Institute. The model has been publicly available as opensource code since 2008, and a detailed description is given in Simpson et al. (2012).
The numerical solution of advection is based on Bott (1989). The turbulent diffusion coefficients are calculated for the whole 3-D model domain on the basis of local Richardson number, and the planetary boundary layer (PBL) height is calculated using methods described in Simpson et al. (2003). Dry deposition uses a resistance analogy combined with stomatal and non-stomatal conductance algorithms (Simpson et al., 2003;Tuovinen et al., 2004), whereas wet deposition uses scavenging coefficients applied to the 3-D rainfall, including both in-cloud and sub-cloud scavenging of gases and particles. The chemical scheme couples the sulfur and nitrogen chemistry to the photochemistry using about 140 reactions between 70 species (Andersson-Sköld and Simpson, 1999;Simpson et al., 2012).
The methodology for biogenic emissions builds on maps of 115 forest species generated by Köble and Seufert (2001). Emission factors for each forest species and for other land classes are based on Simpson et al. (1999), updated with recent literature (see Simpson et al., 2012, and references therein), and driven by hourly temperature and light using algorithms from Guenther et al. (1995). Other natural emissions include marine emissions of dimethyl sulfide and SO 2 from volcanoes.
The standard model version distinguishes two size fractions for aerosols, fine aerosol (PM 2.5 ) and coarse aerosol (PM 10 excluding PM 2.5 ). The aerosol components presently accounted for are sulfate, nitrate, ammonium, anthropogenic primary particulate matter, sea salt and desert dust. Aerosol water is also calculated. The parameterisation of dry deposition for aerosols follows standard resistance formulations, accounting for diffusion, impaction, interception, and sedimentation. Wet scavenging is treated with simple scavenging ratios, taking into account in-cloud and sub-cloud processes. For SOA the so-called "EmChem09soa" scheme is used, which is a slightly simplified version of the mechanism described by Bergström et al. (2012).
The EMEP data assimilation system (EMEP-DAS) is based on the 3DVar (3-dimensional variational) implementation for the MATCH model (Kahnert, 2008(Kahnert, , 2009). The background error covariance matrix is estimated following the socalled NMC (National Meteorological Center) method (Parrish and Derber, 1992). Currently, the EMEP-DAS delivers analyses for NO 2 , using NO 2 columns of OMI (Ozone Monitoring Instrument) and in situ measurements of NO 2 surface concentrations. The assimilation window is 6 h, 4 times per day.
The EMEP model performs well especially for particulate matter, as it includes carefully evaluated representations of both primary and secondary organic aerosols, in addition to inorganic aerosols, elemental carbon, sea salt, mineral dust and water. Another strength is that its domain extends throughout the whole troposphere, thus taking accurate account of long-range transport of pollutants in the free troposphere. As the EMEP model is designed mainly for background concentrations, urban increments have not been implemented as in some other models with equally coarse resolution, leading to somewhat lower performance in urban and sub-urban areas. However, being one of the main research tools under the UN LRTAP (Long-range Transboundary Air Pollution) convention, the EMEP model is evaluated continuously against measurements of a large range of chemical parameters (including air concentrations, depositions, and trends) ensuring modelling capability with very good overall performance (e.g. Jonson et al., 2006;Fagerli and Aas, 2008;Genberg et al., 2013). A weakness of the analysis chain until the end of 2014 was that only NO 2 was assimilated. However, since early 2015 ozone has been assimilated.

EURAD-IM forecast and analysis system
EURAD-IM is an Eulerian meso-scale chemistry transport model involving advection, diffusion, chemical transformation, wet and dry deposition and sedimentation of tropospheric trace gases and aerosols (Hass et al., 1995, Memmesheimer et al., 2004. It includes 3DVar and 4DVar chemical data assimilation  and is able to run in nesting mode. EURAD-IM has been applied on several recent air pollution studies (Monteiro et al., 2013;Zyryanov et al., 2012;Monteiro et al., 2012;Elbern et al., 2011;Kanakidou et al., 2011).
The positive definite advection scheme of Bott (1989) is used to solve the advective transport. An eddy diffusion approach is used to parameterise the vertical sub-grid-scale turbulent transport. The calculation of vertical eddy diffusion coefficients is based on the specific turbulent structure in the individual regimes of the PBL according to the PBL height and the Monin-Obukhov length (Holtslag and Nieuwstadt, 1986). A semi-implicit (Crank-Nicholson) scheme is used to solve the diffusion equation.
Gas-phase chemistry is represented by the Regional Atmospheric Chemistry Mechanism (RACM; Stockwell et al., 1997) and an extension based on the Mainz Isoprene Mechanism (MIM; Geiger et al., 2003). A two-step Rosenbrock method is used to solve the set of stiff ordinary differential equations (Sandu et al., 2003;Sandu and Sander, 2006). Photolysis frequencies are derived using the FTUV (fast tropospheric ultraviolet-visible) model according to Tie et al. (2003). The radiative transfer model therein is based on the TUV model developed by Madronich and Weller (1990). The modal aerosol dynamics model MADE (Modal Aerosol Dynamics Model for Europe; Ackermann et al., 1998) is used to provide information on the aerosol size distribution and chemical composition. To solve for the concentrations of the secondary inorganic aerosol components, a FEOM (fully equivalent operational model) version, using the HDMR (high dimensional model representation) technique (Rabitz et al., 1999;Nieradzik, 2005), of an accurate mole-fraction-based thermodynamic model (Friese and Ebel, 2010) is used. The updated SORGAM module (Secondary Organic Aerosol Model; Li et al., 2013) simulates secondary organic aerosol formation. Biogenic emissions are calculated in the EURAD-IM CTM with MEGAN (Guenther et al., 2012).
The gas-phase dry deposition modelling follows the method proposed by Zhang et al. (2003). Dry deposition of aerosol species is treated as size dependent, using the resistance model of Petroff and Zhang (2010). Wet deposition of gases and aerosols is derived from the cloud model in the EPA Models-3 Community Multiscale Air Quality (CMAQ) modelling system (Roselle and Binkowski, 1999).
The EURAD-IM assimilation system includes (i) the EURAD-IM CTM and its adjoint, (ii) the formulation of both background error covariance matrices for the initial states and the emission factors, (iii) the observational basis and its related error covariance matrix, and (iv) the minimisation including the transformation for preconditioning. The quasi-Newton limited memory L-BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm described in Nocedal (1980) and Liu and Nocedal (1989) is applied for the minimisation. Following Weaver and Courtier (2001) with the promise of a high flexibility in designing anisotropic and heterogeneous influence radii, a diffusion approach for providing the background error covariance matrices is implemented.
One of the EURAD-IM strengths is that it includes a comprehensive treatment of aerosol dynamics and chemistry. Parameterisations of the formation of secondary particles are temperature dependent for both the inorganic and organic components. However, the complexity of the aerosol components of EURAD-IM is as yet not supported by sufficiently known emission rates of particle types, nor for organic gaseous precursor compounds, especially from biogenic sources. Another strength of the EURAD-IM system is its ability to assimilate chemical data from a wide range of instruments ranging from surface or airborne in situ data to retrievals from several satellites, which are then defining the initial values.

LOTOS-EUROS forecast and analysis system
The 3-D chemistry-transport model LOTOS-EUROS  is developed by the Dutch institutes TNO (www.tno.nl), RIVM (www.rivm.nl) and, more recently, KNMI (www.knmi.nl). It is used for regional-scale air quality forecasts in Europe and the Netherlands (De Ruyter de Wildt et al., 2011). The LOTOS-EUROS model has participated in several international model intercomparison studies addressing ozone (Van Loon et al., 2007;Solazzo et al., 2012a) and particulate matter Vautard et al., 2007;Stern et al., 2008;Solazzo et al., 2012b). These studies have shown that the model has a performance comparable to other European regional models. In the past 3 years, three major updates of the LOTOS-EUROS model have been implemented, moving from version 1.7 to version 1.10. Detailed update information can be found on the model's web page, http://www.lotos-euros.nl. Since the end of MACC-II, the latest update to v1.10 implemented operationally consists of changes in the SO 2 to SO 4 conversion rate, use of AQMEII (air quality model evaluation international initiative) conventions for the fine/coarse dust assignment, update of resistances for e.g. ozone (leading to an overall ozone increase), and improvement of the treatment of fire emissions.
The model extends up to 3.5 km above sea level, with three dynamic layers and a fixed 25 m thick surface layer. The lowest dynamic layer is the mixing layer, followed by two reservoir layers. The height of the mixing layer is obtained from the ECMWF meteorological input data used to drive the model. Transport is based on the monotonic advection scheme developed by Walcek (2000). Gas-phase chemistry is described using the TNO CBM-IV scheme . Hydrolysis of N 2 O 5 is described following Schaap et al. (2004). Aerosol chemistry is represented using ISORROPIA-2 (Fountoukis and Nenes, 2007). The aerodynamic resistance is calculated for all land use types separately. Below, cloud scavenging is described using simple scavenging coefficients for gases (Schaap et al., 2004) and particles (Simpson et al., 2003). Dry deposition is based on the well-known resistance approach, with the DEPAC pa-rameterization for gases (Wichink Kruit et al., 2012) and the Zhang et al. (2001) parameterization for particles.
Biogenic isoprene emissions are calculated following the mathematical description of the temperature and light dependence of the isoprene emissions, proposed by Guenther et al. (1993), using the actual meteorological data. For land use the CORINE/Smiatek database has been enhanced using the tree species map for Europe made by Koeble and Seufert (2001). Total PM 10 in the LOTOS-EUROS model is composed of chemically unspecified PM in the fine and coarse modes, black carbon, dust, ammonium, sulfate, nitrate and sea salt (Na in the fine and coarse modes).
The LOTOS-EUROS model is equipped with a data assimilation package with the ensemble Kalman filter technique (Barbu et al., 2008;Timmermans et al., 2009;Curier et al., 2012). Data assimilation for the MACC-II daily analyses is performed with surface ozone observations (Curier et al., 2012). An extension to other surface and satellite data is foreseen in the near future.
The LOTOS-EUROS model has been designed as a model of intermediate complexity, to favour short computation times. For this, the vertical top of the operational model version is limited and covers only the boundary layer and reservoir layers (up to 3.5 km); effectively, the model therefore employs only four dynamic layers. Concentrations from the free troposphere are taken from the global boundary conditions, and therefore fully incorporate the knowledge, assimilations, and validation efforts present in the global model. A major weakness is that secondary organic aerosols are currently not included; instead, a bias correction for total PM is used to account for the missing aerosols. In spite of the limited complexity, the model performs well in simulation of O 3 (Curier et al., 2012) and has a skill to forecast the observed variability in PM 10 (De Ruyter de Wildt et al., 2011). Apart from the relative short run-through time, the strength of the model is in the detailed description of anthropogenic emissions, given the close cooperation with the developers of the TNO-MACC emission inventory; this is for example shown by excellent simulation of boundary layer NO 2 (Vlemmix et al., 2015).

MATCH forecast and analysis system
The MATCH model has been developed at SMHI over the past 20 years and is applied for emergency purposes as well as for regional-scale chemistry modelling (Langner et al., 1998;Robertson et al., 1999).
The transport is described by a Bott-like mass conservative scheme (Bott, 1989;Robertson et al., 1999). For the vertical diffusion an implicit mass conservative scheme is used where the turbulent exchange coefficients for neutral and stable conditions are parameterized following Holtslag and Moeng (1991). In the convective case the turbulent Courant number is directly determined from the turnover time in the atmospheric boundary layer.
The dynamical core of the model contains initialization and adjustment of the horizontal wind components based on a procedure proposed by Heimann and Keeling (1989). This is important to ensure mass conservative transport for interpolated input weather data, specifically for the transport scheme used.
Boundary layer parameterization is determined from surface heat and water vapour fluxes as described by Van Ulden and Holtslag (1985) for land surfaces, and Burridge and Gadd (1977) for sea surfaces. The boundary layer height is calculated from formulations proposed by Zilitinkevich and Moronov (1996) for the neutral and stable case and from Holtslag et al. (1995) for the convective case. These parameterisations drive the formulations for vertical diffusion and dry deposition where for the latter a resistance approach is used (Andersson et al., 2007). In-cloud and sub-cloud wet deposition is implemented following Andersson et al. (2007). The photochemistry scheme is to large extent based on the EMEP chemistry scheme (Simpson et al., 1993), with some updates where a modified production scheme for isoprene is the most notable based on the so-called Carter-1 mechanism (Carter, 1996;Langner et al., 1998).
Aerosols are described for four bins and only for secondary inorganic aerosols, dust and primary organic compounds at the moment. Inclusion of SOA is under testing. Sea salt emissions are dynamically described following Foltescu et al. (2005). A module for wind driven dust emissions is under testing that follows Schaap et al. (2005).
A 3-D variational data assimilation scheme is used with spectral transformation (Kahnert, 2008). The limitation then is that background covariance structures are described as isotropic and homogeneous, however, not necessarily the same for different wave numbers and derived from the socalled NMC method (Parish and Derber, 1992). The advantage though is that the background error matrix becomes block diagonal and there are no scale separations as the covariance between spectral components are explicitly handled. The block diagonal elements are the covariance between wave components at different model layers and chemical compounds.
The strength of the MATCH model is that it spans vertically the troposphere and makes use of the same vertical layers as provided from the IFS model up to 300 hPa. This means about 50 layers in the vertical and the lowest one just 20 m thick and about 15 in the boundary layer. Using the same vertical resolution as the IFS model is an advantage because no vertical interpolation is required. Nevertheless, since the MATCH model has been developed mainly using HIRLAM (HIgh-Resolution Limited Area Model) data with a coarser vertical resolution, the use of the high-resolution vertical levels from IFS may lead to less accurate chemistry forecasts compared to the HIRLAM version. A weakness is missing SOA and wind-blown dust in the PM description.

MOCAGE forecast and analysis system
The MOCAGE model (Josse et al., 2004;Dufour et al., 2004) has been developed at Météo-France since 2000. Its assimilation system has been developed jointly with CERFACS. This model and its assimilation system have been successfully used for tropospheric and stratospheric research (e.g. Bousserez et al., 2007;Barré et al., 2013Barré et al., , 2014Lacressonnière et al., 2014) and also for operational purposes (Rouïl et al., 2009).
MOCAGE uses the semi-lagrangian advection scheme from Williamson and Rasch (1989) for the grid-scale transport, the parameterization of convective transport from Bechtold et al. (2001) and the turbulent diffusion parameterization from Louis (1979). Dry deposition is based on the approach proposed by Wesely (1989). The wet deposition by the convective and stratiform precipitations follows Mari et al. (2000) and Giorgi and Chameides (1986). MOCAGE includes the RACM scheme for tropospheric chemistry (Stockwell et al., 1997) and the REPROBUS scheme for stratospheric chemistry (Lefèvre et al., 1994). Biogenic emissions in MOCAGE are fixed monthly biogenic emission from Guenther et al. (1995).
The aerosol module of MOCAGE follows a bin approach and includes so far the primary aerosols: dust (Martet et al., 2009), sea salts, black carbon (Nho-Kim et al., 2005) and organic carbon. Recent updates of the primary aerosol module and corresponding evaluation can be found in Sič et al. (2015).
MACC-II operations use a variational assimilation system based upon MOCAGE and the PALM coupler, which has been developed during the ASSET European project (Geer et al., 2006;Lahoz et al., 2007). The system, recently renamed VALENTINA, has been used to compute global and regional re-analyses of atmospheric composition in multiple studies (El Amraoui et al., 2008;Massart et al., 2009;Barré et al., 2013Barré et al., , 2014Emili et al., 2014). The assimilation algorithm employed for MACC-II analyses is a 3DVar with assimilation windows of 1 h length (Jaumouillé et al., 2012), corresponding to the frequency of surface measurements. The assimilation has first been set for surface ozone analyses and in MACC-III it has been extended to surface NO 2 . The specification of the background and observation errors is done based on the evaluation of historical time series of observations and model values. The horizontal error correlation has a Gaussian shape and its typical length is set to 0.4 • for ozone and 0.1 • for NO 2 , to account for the larger variability of NO 2 at fine spatial scales. The vertical error correlation length is set to one model grid point for all species (∼ 100 m). As a consequence, assimilation increments linked to surface observations are confined in the planetary boundary layer.
The strength of MOCAGE is that it simulates the air composition of the whole troposphere and lower stratosphere. Thus, it provides a full representation of transport processes, in particular boundary layer-troposphere and troposphere-stratosphere exchanges, and the time evolution of stratospheric conditions for accurate photolysis rate calculations at the surface. The MOCAGE assimilation system in its MACC configuration produces robust analyses for both O 3 and NO 2 as illustrated in the annual re-analysis reports (http://www.gmes-atmosphere.eu/documents/maccii/ deliverables/eva/). At the end of the MACC-II project, the main weakness of MOCAGE was the lack of secondary aerosols. Inorganic secondary aerosols have been developed recently and will be included in the next MACC operational version . This new feature is also used in the current development of PM 10 assimilation.

SILAM forecast and analysis system
SILAM is a meso-to-global-scale dispersion model (Sofiev et al., 2008), see also the review Kukkonen et al. (2012), http://silam.fmi.fi) that is used for atmospheric composition, emergencies, composition-climate interactions, and air quality modelling purposes. The model has been applied with resolutions ranging from 1 km up to 3 • , incorporates eight chemical and physical transformation modules and covers the troposphere and the stratosphere. The model is publicly available since 2007 and is used as an operational and research tool.
The model has two dynamic cores: Lagrangian (Sofiev et al., 2006), primarily used in emergency-type applications, and Eulerian (Galperin, 2000;Sofiev, 2002) used in atmospheric composition, climate, and air-quality-related applications, including MACC-II. The MACC-II operational SILAM v.5.2 uses the simple dry deposition scheme of Sofiev (2000) for gases and a new approach for aerosols Kouznetsov and Sofiev, 2012), which covers particle sizes from 1 nm up to ∼ 50 µm of effective aerodynamic size. The wet deposition scheme used in MACC-II simulations calculates the 3-D removal coefficient and distinguishes between sub-and in-cloud scavenging, large-scale and convective precipitations, as well as between rain and snow (Sofiev et al., 2006). Boundary layer parameterization follows (Sofiev et al., 2010), whereas in the free troposphere and the stratosphere turbulence is computed following the IFS approach and corresponding turbulent length scale.
Two chemical schemes are used: the CBM-4 gas-phase chemistry mechanism and own development for heterogeneous chemical transformations and inorganic aerosol formation after Sofiev (2000). Aerosols in SILAM are represented via sectional approach with species-specific size spectra. The aerosol species include primary anthropogenic aerosols, divided into PM 2.5 and PM 10 , secondary inorganic aerosols (sulfates, nitrates and ammonia), and sea salt aerosols.
The forecasts utilise the BVOC (biogenic volatile organic compound) emission term based on the NatAir project results (Poupkou et al., 2010) and own development for the sea salt emissions (Sofiev et al., 2011). The data assimilation system of SILAM consists of 3DVar and 4DVar modules (Vira and Sofiev, 2012). The MACC-II near-real time analysis suite uses the 3DVar method and assimilates hourly surface observations of NO 2 , O 3 and SO 2 . PM observations have been assimilated in re-analysis simulations . The 4DVar methodology is utilised in re-analysis mode for pollen.
The model evolution from the MACC-II v.5.2 towards v.5.4, which will become operational in early 2015, includes several important updates. The dry deposition scheme will follow the resistance analogy with extensions after (Simpson et al., 2003). Wind-blown dust will be included via lateral boundary conditions in the next release of operational SILAM v.5.4, together with a secondary organic aerosol module and fire emission.
A strong point of SILAM is the extensive treatment of secondary inorganic aerosol formation, which is reproduced quite well, according to several evaluation exercises and model intercomparisons. Together with the detailed deposition scheme, this leads to good scores for PM 2.5 , especially in winter when inorganic aerosols are dominant. The current limitation of the model is the secondary organic aerosols formation that makes use of the volatility-based model but it is not yet incorporated in the operational simulations, being tested in research projects. A workaround of this limitation is included in the data assimilation modules, which allow for assimilation of both in situ and remote-sensing measurements of gaseous and particulate species. The module now allows for the PM and aerosol optical depth observations being assimilated into an unspecified particulate matter, which is then treated as inert aerosol, thus compensating for the lack of secondary organic particles.

ENSEMBLE forecast and analysis system
To process the ensemble median, all seven individual models are first interpolated to a common 0.1 • × 0.1 • horizontal grid. For each grid point, the ensemble model (referred to as ENSEMBLE hereafter) value is calculated as the median value of the individual model forecasts or analyses available. The median is defined as the value having 50 % of individual models with higher values and 50 % with lower values. This method is rather insensitive to outliers in the forecasts or analyses and is very efficient computationally. These properties are useful from an operational point of view. The method is also little sensitive if a particular model forecast or analysis is occasionally missing. The performances of the ensemble median are discussed in Sect. 3. For the forecasts, the ENSEMBLE is produced for all levels and all species (core and additional). For the analyses, the individual assimilation systems provide only analyses at the surface level and do not produce analyses for all species yet. At the end of MACC-II, ozone was the only species that was produced by six of the models. For other species, analyses from less than five models were available. This is why the ENSEMBLE analy-Geosci. Model Dev., 8, 2777-2813, 2015 www.geosci-model-dev.net/8/2777/2015/ sis in MACC-II was only calculated for ozone. It has been extended to NO 2 in 2015 since more models will produce NO 2 analyses.

General description
The evaluation of the performances of a forecast system is a necessary step for rating its quality and thus proving its usefulness. The MACC-II air quality forecasts are evaluated against the NRT AQ surface monitoring data detailed in Sect. 2.1. Note that this set of data is fully independent of the forecast since the analyses assimilating the NRT AQ data are produced too late to be used to initialise the forecasts. The tools to assess the performances of the analyses are not yet in place but this is planned to be ready in 2015. Since the focus of the MACC-II regional system over Europe is on air quality, meaning air composition close to the surface, no column observations (ground based or from satellite) or upper air in situ measurements (i.e. on board aircraft) are used operationally to evaluate the system performances.
The forecast performances are measured using the five statistical indicators detailed in the Appendix A: the mean bias (MB), the root mean square error (RMSE), the modified normalised mean bias (MNMB), the fractional gross error (FGE) and the correlation. These statistical measures, when taken together, provide a valuable indication of the model performances. Taylor diagrams are also used to combine root mean square errors and correlations.
The performances of the MACC-II regional AQ forecasts are assessed operationally by several means: -on a daily basis with plots of statistical indicators and charts available on the MACC-II regional website (http: //macc-raq.gmes-atmosphere.eu/), -on a 6-month basis in reports including plots of statistical indicators over two periods of 3 months (winter+spring or summer+autumn) and analysis of these indicators (http://www.gmes-atmosphere.eu/ documents/maccii/deliverables/ens/).
Additionally, on a 6-month basis, reports are especially dedicated to the scientific analysis of the forecasts of the seven models and of the ENSEMBLE in the Mediterranean area (http://www.gmes-atmosphere.eu/documents/ maccii/deliverables/ens/). The Mediterranean area is recognised as challenging for models, in particular under summer conditions with very active photochemistry and because of its large variety of emission sources. The performances of the NRT analysis are not presented in this paper since there is only an ENSEMBLE production of one species (ozone) and the daily verification procedure against an independent data set was not yet in place at the end of MACC-II project.

Availability statistics
The MACC-II regional air quality forecasting and analysis system is currently under a pre-operational status that can be seen as the demonstrator of a future operational system. The proper function of the seven model chains and of the EN-SEMBLE chain is monitored on working hours only since, at this stage, there had been no funding yet for a 7-day/7-day, 24 h/24 h control. Nevertheless, in their pre-operational configuration the production chains are reliable with availability in time (see Table 2) of the seven individual forecasts and analysis generally above 85 % during MACC-II. During the past year, the production suffered from failures because of the many changes that were applied to the individual and central systems to fit with fully operational standards (data format, file transfer, databases, processing softwares, etc.). With the operationalisation being nearly fully settled, the reliability has been improved since the end of MACC-II (generally above 90 %). The ENSEMBLE forecast and analysis productions have been available 100 % of the time since September 2012. This high performance was achieved because the EN-SEMBLE can be produced even if all seven models are not available.

Example of the forecast of two ozone episodes between 10 and 13 June 2014
In this section, we illustrate the performances of the MACC-II AQ forecasts for a case study of ozone pollution events that took place between 10 and 13 June 2014. A more in-depth analysis of the individual model and of the ENSEMBLE performances is done over longer time periods in Sect. 3.4. During the case study period, there were two regional areas with high ozone concentrations (> 120 mg m −3 ) occur-ring at the same time, one over Austria and surrounding regions (south of Germany and Hungary), and one over the south-east of France and the north of Italy. This is illustrated by Fig. 2 displaying the maps of the 15 h forecasts, for the June at 15:00 UTC, of ozone at the surface from the EN-SEMBLE together with the available observations. Note that unfortunately there are no observations from Italy available during the time period considered. Even if the comparison is limited by the missing observations, Fig. 2 shows that the ensemble median captures the two ozone episodes.
For illustration of the system performance, the surface station measurements are compared in Fig. 3 to the forecast. We plot the model forecasts using EPSgrams that give a graphical representation of the spread of the seven models and therefore an estimate of the uncertainty over the 4 days of the forecast. Operationally, EPSgrams are built daily for 40 major cities in Europe and made available on the MACC regional website. Here, EPSgrams are calculated and plotted for the same locations as the measurements (Fig. 3) from the forecast started on 10 June at 00:00 UTC. Note that, in Fig. 3, EPSgrams are 3-hourly while observations are hourly. In the observations (left panel), the "Austrian" episode is highest on 10 June in Fechenheim (Germany) and on 11 June in Hallein (Austria) and Sopron (Hungary) with values reaching 200 µg m −3 at Sopron. The "French" episode peaks at 250 µg m −3 at Sausset (France) with daily maxima over 150 µg m −3 from 10 to 13 June for all three stations.
For the ozone peak event around Austria there is generally a good consistency of the day-to-day trend provided by the seven models compared to the AQ station observations. For the Sopron station there is a main peak in the model on 11 June as in the observations. For Fechenheim, the forecast gives highest concentrations on 10 June as measured and also reproduces the anomaly recorded in the observations in the morning of 11 June. For this "Austrian" ozone episode, the spread between the seven models is very reasonable (generally less than 30 µg m −3 for the 25-75 % range), showing the good consistency between the models with slightly larger spread between the forecasts for the highest peak times. There is an exception in Fechenheim on 10 June where the seven models exhibit a large spread. This can be explained by the effect of complex topography combined with specific meteorological conditions that lead to different behaviours of the models which have different horizontal grids and orography.
Even if the day-to-day trend is well reproduced by the models, ozone median values are often lower during daytime peaks than the observations by 30-50 µg m −3 but the maxima of the seven models are nevertheless close to those observed. There are also cases when the ensemble median forecasts higher peaks than measured as in Fechenheim on 10 June. This can also be seen in the map in Fig. 2 where some observations are lower than the ensemble median. During night-time, the ozone median is close to the observed values.
For the ozone event in the south of France, the comparison shows also a good consistency between the diurnal variations of the models compared to observations. Most of the seven models are over 120 µg m −3 for each of the 4 days forecasted. Nevertheless, at Sausset the very high ozone peak measured on 10 June (over 240 µg m −3 ) is underestimated in all seven forecasts. The very small spread of the models indicates a possible error in the meteorological forecast for this day and/or in the emissions. Sausset is located on the Mediterranean coast, west of Marseille, an industrial city. On this particular day, IFS forecasts an eastern wind with high NO x from Marseille limiting the daytime production of ozone by the models compared to observations. For Plan d'Aups, there is a very large spread of the models, particularly on 10 June. This can be explained by the effect of sea and land breezes on this date combined with steep orography and the presence of a pollution plume nearby with NO x titrated ozone that leads to model differences. In this case (Plan d'Aups), models do not reproduce observations. While for the other stations, the behaviour of the ensemble median is good. For St Rémi, the models perform well with a small spread and diurnal variations close to the measurements. In particular, the increase of the night minimum concentration from day to day is well forecasted. Similarly to the Austrian area, the ozone median concentrations are more often lower than observed, but not always, as shown at Sausset on 13 June and St Rémi on 10 June. In the case of underestimation of the ensemble median, the maximum of the individual models are generally close to those measured at the French stations. Note that for both the "Austrian" and "French" ozone episodes there is no significant degradation of the forecast skills at Day3 and Day4, indicating that uncertainty in ozone forecasts is more driven by inherent uncertainty in chemistry-transport models and part of its input than by uncertainty of the meteorological forecast.
For further evaluation, Fig. 4 displays the MB, MNMB, RMSE, FGE and R (defined in Appendix A) of ozone of the seven individual models and the ENSEMBLE calculated using the representative observations available over the whole European domain. These statistics are based on seven consecutive 96 h forecasts run every day from 9 to 15 June. This figure shows that there is a spread of the seven models and that the ENSEMBLE generally gives the best scores with MNMB between 0.2 and −0.1, FGE between 0.15 and 0.4 and correlations up to 0.75 during daytime. All the models, including the ENSEMBLE, exhibit a diurnal cycle with higher correlations and lower RMSE and FGE during daytime (when ozone is high) than during night-time. Five of the models have a positive MB on average and the other two a negative MB on average. The statistics are only calculated here over 1 week but there is a good consistency with scores based on longer time series, as shown and analysed in detail in Sect. 3.4.
To illustrate the behaviour of the MACC-II system in the case when some of the seven models are missing for the Geosci. Model Dev., 8, 2777-2813, 2015 www.geosci-model-dev.net/8/2777/2015/ production of the ENSEMBLE, we selected as an example the period of 9-15 June 2014 corresponding to the ozone episodes discussed in Sect. 3.2 and we compared the following ensembles: -"MEDIAN 7", the operational ensemble method which is the median of the seven models (i.e. ENSEMBLE) as presented in Fig. 4; -"MEDIAN 5", built on five individual models, after filtering out the best and the worst models, according to the criterion described below; -"MEDIAN 3", built on three individual models, after filtering out the two best and the two worst models, according to the criterion described below; -"1BEST", the best model.
By removing at the same time the best (or two best) and the worst (or two worst) models we estimate an "average situation". Since the relative performances of individual models vary in time and space, the criterion to order the seven individual models from worst to best is measured by their RMSE over the seven days of the verification between 12:00 and 18:00 UTC (ozone peak time). This criterion is chosen on the basis that we look for the model best reproducing the high daytime ozone levels. RMSE is seen as the most objective criterion since MB and MNMB can include compensating effects and since there is a low spread between the models in the FGE. From this, the best model is displayed in purple in Fig. 4c and the worst in brown.
Geosci. Model Dev., 8, 2777-2813, 2015 www.geosci-model-dev.net/8/2777/2015/ Results of the sensitivity experiments are shown in Fig. 5. This figure confirms that the ensemble median (MEDIAN 7) using all seven models performs generally better than the best model on all statistical indicators. When only five models (excluding the best and the worst) are available to calculate the ensemble, all scores show only very slight differences with the ENSEMBLE (MEDIAN 7) based on seven models.
Going to only three models to calculate the ensemble (ME-DIAN 3) leads to statistical indicators degraded compared to the ensemble from seven (MEDIAN 7) or five (MEDIAN 5) models but performs generally better than the best model (1BEST). This indicates that using an ensemble of models, even if reduced, is more useful than using a single model event of very good quality. This also shows that with five models available (which may happen in case of problems of production of two of the seven models), the ensemble median is still robust compared to observations. In our tests we disregarded the worst (or two worst) and best (or two best) models on a RMSE criterion, but Kioutsioukis and  showed that there is an impact of the quality of the models chosen on AQ ensemble performances. To go a step further, a more comprehensive study is done in Sect. 3.4 over a longer time period.

Statistical performances of the forecasts on a seasonal basis
In addition to the production of daily skill scores, statistical indicators are calculated for ozone, NO 2 and PM 10 at the surface on a seasonal basis since September 2009 for each of the seven models and for the ENSEMBLE. These skill scores and the analysis of their seasonal and year-to-year evolutions are presented in 6-month reports, each including two seasons. The model's statistical indicators are calculated against measurements from the European AQ surface station network available in NRT and selected as detailed in Sect. 2.1. So far, the data provision in NRT is not fully operational. Therefore, there is some variability with time of the number of data available and of their location. Also, the spatial coverage of the surface AQ network in Europe is very inhomogeneous with a high density of stations in France, Germany, UK, Belgium and the Netherlands. Thus, the statistical indicators are more representative of the system skills for these countries.
Here we only focus on the performances of the system for the last year of MACC-II to document its status at the end of the project. We only analyse the two main pollutants for the season during which exceedances of regulatory levels are more likely to be encountered: ozone in summer (1 June-1 September 2014) and PM 10 in winter (1 December 2013-1 March 2014). We do not show scores for previous years since the use of a different set of surface observations from one year to another does not allow for a fair comparison of the model skills.
MB, MNMB, RMSE, FGE and R for ozone in summer 2014 from the seven models and the ENSEMBLE are shown in Fig. 6. One main feature, which is common to all models including the ENSEMBLE, is that there is no day-today degradation of MB, MNMB, and FGE indicators from Day0 to Day3 and a slight increase of the RMSE around 15:00 UTC (about 1 µg m −3 ). This indicates that the values of the surface ozone concentrations are not affected on average by the day-to-day degradation of the meteorology but are rather driven by other processes such as emissions of precursors and chemistry. However, correlations (Fig. 6e) tend to decrease from Day0 to Day3. This tendency was also found in scores calculated for previous years. Correlations give a measure of the ability of each model to fit the time variations of ozone regardless of concentration biases. Therefore, correlations are more sensitive to the meteorological forecast skills than MB, MNMB, RMSE and FGE. Nevertheless, the decrease of the correlation with forecast day is slow.
In Fig. 6 there is a marked diurnal cycle of all statistical indicators for all models which leads to a similar diurnal cycle in the ENSEMBLE scores. MNMB, FGE and R show best performances peaking at 15:00 UTC and worst peaking at 06:00 UTC for each of the 4 days of the forecast. This means that all models are able to simulate the ozone daytime photochemistry with the given setup of MACC-II (IFS forecasts for meteorology, C-IFS for chemical boundary conditions and GFAS and TNO emissions). For all models, the diurnal cycle in the statistical indicators can be at least partly explained by uncertainties in the diurnal cycle of the emissions of ozone precursors used in the individual models. This is illustrated with CHIMERE correlation at night, which is better than most of the other models. CHIMERE has developed diurnal factors for traffic emissions based on an objective analysis of NO 2 measurements in the different countries in Europe, which improves ozone titration at night . Other reasons of the diurnal cycle in the model scores could also be errors in the diurnal cycle of the boundary layer height and associated vertical diffusion. For instance, the boundary layer in the LOTOS-EUROS simulations is described with a single model level, with a diurnal variation in the boundary layer height obtained 3-hourly from the ECMWF forecasts. This differs from the description of vertical mixing in the other models and may be responsible for the low correlation feature at around 09:00 UTC. MATCH shows the largest diurnal variability that can be partly related to a combination of chemistry, deposition and the vertical resolution, where the latter is inherited from the IFS model with a rather shallow lowest model layer (∼ 20 m). The ozone depletion processes at the surface appear too strong and not enough compensated by the vertical diffusion. The MB is then more pronounced during night-time, and a modification of the vertical diffusion has shown to improve MATCH's skill. Figure 6 also shows that there is generally a positive bias (both in MB and MNMB) of the ENSEMBLE for each of the 4 days of the forecast except around the end of the afternoon when there is a slight ozone underestimation. The ENSEM-BLE MB varies from −6 to 15.5 µg m −3 . This is consistent with the behaviour of the individual models, most of them having a positive bias on average. Only the MATCH model shows a negative bias (MB and MNMB) which has the same explanation as described above. The SILAM model has the highest positive ozone bias (MB and MNMB) on average. High ozone concentrations in SILAM are largely explained by too low dry deposition velocity over terrestrial areas, especially on the vegetated surfaces. The new scheme explicitly accounting for the leaf area index is being tested in a preoperational regime.
The normalised indicators (MNMB and FGE) of all seven models vary from −0.45 to 0.45 and from 0.16 to 0.59, respectively. This means that all models performed fairly well Geosci. Model Dev., 8, 2777-2813, 2015 www.geosci-model-dev.net/8/2777/2015/ The seven models show a fairly similar overall behaviour against observations because of their common framework (meteorology, chemical boundary conditions and emissions). Nevertheless, there are differences between ozone forecasted by each of the individual models because of their specificities (different chemistry schemes, different implementations for use of input data, different physical parameterisations). The ENSEMBLE gives generally better scores for ozone than any of the individual models.
In Sect. 3.3, we made a first investigation of the robustness of the ensemble median method with regard to the number of models available on a case study of 1 week. To go a step further, we ran a series of tests by removing one or more models in the calculation of the ensemble median over the 3 months of summer 2014. The removal is done randomly on each of the daily forecasts. Figure 7 shows the statistical results against observations for the ENSEMBLE (seven models) and the other ensemble medians calculated by removing randomly one, two, three or four models. For MB, MNMB and FGE, there is hardly any difference between all ensembles. Only RMSE and R (correlation) give significant changes. As expected, decreasing the number of models used in the ensemble tends to degrade its performances. Using six models gives a RMSE and R close to that of the full ensemble based on seven models. The scores for ensembles with four and five models are close to each other but are degraded compared to when seven or six models are used. When only three models are used, RMSE and R are worse compared to the other configurations by ∼ 0.5 µg m −3 and ∼ 0.05, respectively. This shows that the multi-model ENSEMBLE at the end of MACC-II, which is based on the median of seven models, is robust even if two or three models are unavailable. These results are consistent with the results discussed in Sect. 3.3 that were calculated over 1 week and with a different method for the model removal. Figure 8 shows the seven models and the ENSEMBLE scores for PM 10 for the last winter (2013-2014) of MACC-II. Since there were much fewer observations available at 00:00 UTC compared to other times of the day, the values given at the forecast times of 0, 24, 48, 72 and 96 h show a specific behaviour that is not analysed since it is not typical. Among the seven models, MATCH PM 10 scores are the poorest for the period. This has been traced down to an error in the sea salt emissions leading to too strong emissions and a coding error regarding summation of the various aerosol components that build up PM 10 in the model. Secondary organic aerosols are not yet included and there should be an underestimate, rather than an overestimate of PM 10 by MATCH. The poor correlations in the period are partly related to a too strong signal from sea salt. The verification for 2015 shows clearly that correction of these errors has improved the simulation of PM 10 . There is also a positive bias (∼ 3.5 µg m −3 on average for MB and 0.3 for MNMB) for CHIMERE. This could be due to the specific setup in CHIMERE being more efficient in the detection of PM 10 threshold exceedances during wintertime. It is achieved with a correction for lowering the wind over urban areas and with the modulation of the emissions from domestic heating to account for the impact of extremely low temperatures occurring during a cold surge for instance. The other extreme models that exhibit a negative bias of about −6 µg m −3 on average for MB and −0.37 for MNMB are MOCAGE and SILAM. For MOCAGE, this is due to the lack of secondary aerosols in the model. Although secondary aerosols are not dominant in PM 10 in winter, there is an expected contribution of sulfates mainly in eastern Europe. Substantial bias of SILAM PM 10 is caused by the missing SOA. However, the model showed the highest spatial correlation with the PM 10 observations, which largely follows from the detailed treatment of sea salt emission and transport. As a result, SILAM also showed among the lowest RMSEs. LOTOS-EUROS, EURAD-IM, and EMEP have a smaller negative MB and MNMB. LOTOS-EUROS has an advanced treatment of the SIA fraction but the secondary organic aerosols (SOA) are not included yet, which explains part of the negative bias. The bias in EMEP and EURAD-IM is relatively small as these two models include a comprehensive treatment of SIA and SOA. The ENSEMBLE MB and MNMB both indicate a low bias related to the fact that five of the seven models have a negative bias. RMSE and FGE (Fig. 8c,d) are consistent with bias scores with largest values for MATCH and MOCAGE.
In Fig. 8, MB, MNMB, RMSE and FGE are best during daytime (generally around 06:00-07:00 and 15:00 UTC) with diurnal variations fairly similar for all models. This is related to the fact that PM 10 is dominated by primary anthropogenic emissions of black and organic carbon which are prescribed in all models by the same TNO inventories and which have maxima in the morning and in the afternoon. The worst MB, MNMB, RMSE and FGE are at night, as for ozone. This may be linked to uncertainties in the boundary layer height at night, in vertical diffusion and/or to an underestimation of emissions. The diurnal cycle is less marked in the correlation but there is a significant day-to-day decrease of skill. As for ozone, this decrease is likely linked to a decrease of the meteorological forecast skills with time which affects more the correlation of pollutants than the other statistical indicators. Correlation values are fairly low, a bit higher than 0.4 at maximum. This is due to the lack of certain types of aerosols (SIA and/or SOA) in some models but also likely to uncertainties in the diurnal cycle of the anthropogenic emissions prescribed in the models and of the boundary layer height and vertical diffusion. The important contribution of traffic and residential heating to the anthropogenic emissions of aerosols in Europe is generally modelled as two fixed peaks that do not fully take into account the differences in habits between the countries in Europe. Also, sea salts contribute to PM 10 on the western side of Europe with emissions and scavenging depending closely on meteorological conditions and therefore directly affected by meteorological uncertainties.
The ENSEMBLE scores are best compared to the seven individual models for RMSE and FGE. For MB and MNMB, the EURAD-IM and EMEP models perform better than the ENSEMBLE. This shows that there are compensating positive/negative biases in EURAD-IM and EMEP that are removed when RMSE and FGE scores are considered. The EN- SEMBLE correlation is higher than the other models except SILAM. Figures 6 and 8 show that the seven forecasts on which the ENSEMBLE is calculated are less skillful in modelling the aerosols than ozone. This is a common feature of most chemistry models since there are still large uncertainties on primary aerosol emissions and processes of production and evolution of secondary aerosols, particularly of secondary organic aerosols. Moreover, because of the operational context of MACC-II production, the seven forecast models are optimized to run in short times. This constrains the level of detail of aerosol processes that can be afforded.

Example of the specific evaluation for the Mediterranean area
Within the European continent, the Mediterranean area is characterized by special features -high emission densities due to concentration of human activities in surrounding coastal areas, intense photochemistry, high background pollution, small-scale meteorology -that make air quality forecasting specially challenging. This is why work has been specifically carried out to evaluate the seven models and the ENSEMBLE in this region. This is complementary to the systematic daily and seasonal evaluation performed over the whole European continent. Its aim is not about scor-Geosci. Model Dev., 8, 2777-2813, 2015 www.geosci-model-dev.net/8/2777/2015/ ing the system but to achieve better scientific understanding of the behaviour of the seven models and the ENSEMBLE in the Mediterranean region. This work is based, firstly, on two high-resolution models run daily over eastern (Greece) and western (Spain) Mediterranean areas and surface station measurements that are not used in the operational MACC evaluation and, secondly, on scientific analyses of case studies.
For the eastern Mediterranean area, the LAP-AUTH forecasting system is run daily. It consists of the Weather Research and Forecasting mesoscale meteorological model (WRF version 3.2) (Skamarock et al., 2008) and the chemistry-transport model Comprehensive Air quality Model with extensions (CAMx version 5.30) (ENVIRON, 2010). The anthropogenic emission data, used as CAMx input data, are from Kuenen et al. (2014) for the reference year 2009. Anthropogenic emissions data are temporally processed using the Model for the Spatial and tEmporal diStribution of emissionS (MOSESS) (Markakis et al., 2013). The emissions originating from natural sources are calculated with the use of the emission model NEMO (Natural Emission MOdel) (Markakis et al., 2009). Wind erosion dust, sea salt and biogenic NMVOC emissions are calculated using the WRF model meteorology. The air quality forecasting system derives meteorological initial and boundary conditions from the operational 12:00 UTC forecast of ECMWF while chemical boundary conditions derived from the IFS-MOZART global model forecast and replaced by C-IFS from September 2014. The domain of the WRF-CAMx implementation is the south-eastern Europe/eastern Mediterranean region 18-30 • N, 34.9-44.5 • E. The grid resolution is 10 × 10 km 2 . The air quality modelling system runs on a daily basis in order to produce 72 h air quality forecasts. For the verification, the WRF-CAMx, the ENSEMBLE and the seven models are compared with available air quality data from the GMEECC (Greek Ministry of Environment Energy and Climatic Change) air pollution monitoring network as well as from the background station of Finokalia, operated by the University of Crete (Greece).
AEMET (the Spanish Meteorological State Agency) runs daily a version of the MOCAGE (Josse et al., 2004) model at 0.05 • horizontal resolution in the western Mediterranean coast, over a 48 h time range using the ENSEMBLE forecasts as chemical boundary conditions. Meteorological forcings for the high-resolution domain come from an operational HIRLAM run every 6 h at AEMET (Navascues et al., 2013). Emissions over land in this domain come from the GEMS-TNO inventory (Visschedijk et al., 2007). The domain is 44-36 • N, 5 • W-5 • E. The ENSEMBLE has been compared to the AEMET forecasts and to observations from EMEP/GAW Spanish stations and from different local and regional air quality monitoring networks. From these highresolution daily forecasts, a collection of case studies in which high resolution could have been an advantage, has been selected and analysed. These comparisons show the high variability of results between model forecasts depending on the location, time and day, whereas, sometimes, model forecast agreement is quite noticeable.
We are presenting here a brief summary of the analysis of the case study that occurred between 15 and 18 July 2013, when high values of ozone were measured in many Spanish air quality monitoring stations due to very strong solar radiation and high temperatures together with persistent anticyclonic conditions and very weak pressure gradients. Ozone concentrations at the surface above 140 µg m −3 were not rare at the stations used in this period and values above 120 µg m −3 were common. Figure 9 shows two maps with the 18 July 2013 ENSEMBLE and AEMET model at H+18 forecasts and the observations overplotted using the same colour intervals. The ENSEMBLE forecasts generally fit well to the measurements. The main characteristic of the ENSEMBLE forecasts (left) is that it is too smooth to capture all the small-scale features occurring in reality because of its horizontal resolution (∼ 15 km). As an example, we can look at Fig. 9 in which the Madrid area has been magnified to observe how ozone values between 100 and 160 µg m −3 were measured by different air quality networks (belonging to the Madrid Regional Authorities and Madrid City Council) whereas in the ENSEMBLE forecasts all the concentrations lie in the 100-120 µg m −3 interval. In the same period, the AEMET forecasts provide values in this area with a higher spread, between 100 and 160 µg m −3 , which fit better to observations. Something similar can be observed in the eastern Spain area, also magnified in the same figure.
Illustrations of the results of the ENSEMBLE and AEMET models compared to three EMEP stations are given in Figs. 10-12 for ozone during summer. For Cap de Creus (Fig. 10), there is a good agreement between observations and models, but the two models have a wider diurnal cycle in concentrations. This behaviour can be related to the local dynamics. The area is located in a strong wind zone on the north-eastern coast of Spain. Therefore, ozone formed in this area can be transported rapidly, prohibiting ozone accumulation and leading to a smoother diurnal cycle than in the models that are not able to represent this local effect. For Mahón (Fig. 11), both models fit fairly well the observations but the observations have generally a wider diurnal cycle in concentrations, contrarily to Cap de Creus. The Mahón station is located in the small island of Menorca in the Balearic Archipelago. It is sometimes exposed to the pollution produced by ships entering the port early in the morning, leading to high NO x conditions and low ozone. This could explain the observed decrease of ozone at this time. This local effect is not captured by the two models. The San Pablo de Montes measurements (Fig. 12) show a different behaviour with generally higher concentrations than the ENSEMBLE and AEMET models. The discrepancy is particularly important during July. This can be explained by an underestimation of the local isoprene emissions in the two models. Isoprene concentrations measured at San Pablo de los Montes exhibit higher values than other sites in the same area because of the oak vegetation surrounding the station.
Overall, the quality of the ENSEMBLE forecasts is generally good and the verification scores of the forecasts calculated for the whole period of the project show, most of the time, better results for the ENSEMBLE than for the AEMET forecasts. The limitations of the verification carried out (only the seven EMEP background air quality stations within the domain have been considered) and the different high-resolution emission inventories used in AEMET and ENSEMBLE can be part of the reason for these different results.
Another product we have started to generate at the end of the project is the behaviour of forecasts of the seven models together with the ENSEMBLE and the AEMET forecasts against observations from the EMEP air quality network. An example is presented in Fig. 13. In this figure, we can see the ozone forecasts at the ES10 station, which is located at Cap de Creus in the north-eastern corner of Spain (42.32 • N, 3.32 • E). We observe that the spread between the seven model forecasts in the H+24-H+48 forecast periods from 9 April 2014 is fairly low with most of the members producing similar forecasts. It changes quickly on the next day at the same place with the seven models providing very different concentrations leading to a high spread. We have also observed differences in the spread of the members at other locations on the same day and forecast time. More generally, this pattern with very different spreads (ranging from low to high) depends on the case studies: day, time period and location. The analysis of the spread between different model forecasts in the same period can help modellers understand how their models behave in the Mediterranean area.

Conclusion and future developments
In this paper, we give an overview of the current state and performances of the forecasting system for European air quality that was put in place in the framework of the MACC project and continued during the MACC-II project and now in the MACC-III project. Its strength comes from the fact that it is based on an ensemble of seven state-of-the-art chemistry-transport models (CHIMERE, EMEP, EURAD-IM, LOTOS-EUROS, MOCAGE, MATCH, SILAM) that are developed and run by recognised institutes in Europe. It also relies on good quality inputs for meteorological forcings, emissions and chemical boundary conditions. It provides daily 4-day forecasts for six major pollutants (O 3 , NO 2 , SO 2 , CO, PM 10 and PM 2.5 ) and birch pollen during the pollen season, as well as additional species for downscaling air quality modelling purposes. The production also includes hourly analysis for the previous day. Daily statistical performances of the forecasts against available European air quality monitoring stations are processed daily, weekly and every 3 months, giving an objective assessment of the products to users. They are also used to monitor the seasonal and yearly evolutions of the forecast scores.
Because of the resolution of the seven models (10-20 km), this system is not designed and does not attempt to fore- cast very local concentrations but large-scale phenomena and background air pollution. The ENSEMBLE has the capability to forecast pollution episodes at the regional scale as illustrated over the period from 10 to 13 June 2014. On a seasonal basis, the seven models show good statistical performances for ozone in summer 2014 and the ensemble median outperforms any of the individual models. The normalised indicators, which are less sensitive to outliers than MB and RMSE, are low, varying for the ENSEMBLE from −0.03 to 0.33 for MNMB and from 0.16 to 0.45 for FGE. The diurnal ozone peak is underestimated by the ENSEMBLE by about 4 µg m −3 on average during summer 2014. The underestimation is larger during ozone episodes, such as during 9-15 June 2014, when the ENSEMBLE underestimates ozone daily maxima on average by about 10 µg m −3 . Comparing to local surface station measurements within the pollution episode areas, the ENSEMBLE low bias is often between 30 and 50 µg m −3 but one of the models is often close to the observed values. For PM 10 , there is a negative bias of the EN-SEMBLE with MB ∼ −4.5 µg m −3 and MNMB ∼ −0.1, on average. The ENSEMBLE FGE is larger than for ozone (∼ 0.52 on average for PM 10 and 0.30 for ozone) and the correlation is lower (∼ 0.35 on average for PM 10 and 0.54 for ozone). There is a large variability of the statistical indicators of the seven models for the last winter of the MACC-II period (winter 2013-2014). This is related to different levels of complexity in the representation of aerosols in the models. PM consists of regulatory pollutants that are difficult to forecast mainly because of uncertainties in primary aerosol emissions, our partial knowledge on secondary organic aerosol processes and the constraint of timely production that prevents using very sophisticated representations of secondary aerosols. A possible improvement to the ensemble median performances at low cost could be to remove the bias of the individual models before the ensemble median calculation. To complement the statistical evaluation done over the whole European domain, a scientific evaluation of the seven models and of the ENSEMBLE is also done for the Mediterranean region because of its specificities (emissions, population, topography, meteorology, photochemistry). Another important point to note is that major efforts have been put during MACC-II towards the full operationalisation of the system in order to improve its robustness.
The regional air quality production was extended during MACC-II and further developments are underway to improve the quality, the variety and the timeliness of its products based on users' feedbacks. In the very short term, the EN-SEMBLE analysis that is only provided for ozone will be extended to NO 2 from January 2015 and verification statistics with independent data will be produced. A shift to an earlier ENSEMBLE analysis production time, at 11:00 UTC, is also planned from early 2015 following the users' recommendation. One planned change in the mid-term will be to have all individual models run at a ∼ 10 km horizontal resolution. This should improve the performances of the system compared to observations. Also, the regional production benefits and will continue to benefit from the evolutions and improvements of the global production, such as from the use of the Geosci. Model Dev., 8, 2777-2813, 2015 www.geosci-model-dev.net/8/2777/2015/ newly operated C-IFS (fully coupled chemistry to the IFS meteorological model) since September 2014 for regional boundary conditions for chemical species and aerosols. In parallel, a dedicated fire emission product for regional forecast purposes, available earlier than the current operational product, is progressively implemented in the seven models and its usefulness will be assessed. Product evaluation is done on the basis of the available NRT measurements from the European AQ monitoring network. Stations used are selected to be as representative as possible of the model horizontal resolution, by retaining only classes 1-5 from the Joly and Peuch (2012) classification. There is ongoing work to improve the station selection, still based on Joly and Peuch's classification, by determining the best class ranges to be used individually for each of the six pollutants (O 3 , NO 2 , SO 2 , CO, PM 10 , PM 2.5 ) based on pollutant lifetime. Continuous research is pursued to improve the seven individual models and their assimilation systems. In particular, there is an important effort for the use of new satellite data or combinations of satellite data with surface measurements in the assimilation systems. Also, there is ongoing work on ensemble methods in order to extract as much value as possible from the seven model forecasts. Alternative methods to the median are currently tested: application of weights on the individual models at each grid point related to the performances from the day before or spectral decomposition . The results of these alternative methods applied to the MACC-II multi-model ensemble will be the subject of a forthcoming paper. Another goal in MACC-II was the start of research and developments for the modelling of CO 2 in the regional models in view of potential future high-resolution surface CO 2 flux inversion products over Europe. This work will be pursued.
In the next few years, the availability of more daily European surface observations in a wider European area (i.e. from more countries) and at earlier times is foreseen. More data on a wider area would improve the strength of the statistical product evaluation. The continuous improvement of the quality of the surface monitoring data is also important for performance evaluation. Earlier availability of the surface station data would allow for earlier production of the analyses with the goal of using the analyses as the initial state for the forecasts.
Other studies will be conducted on the possibility to provide complementary indicators such as the exceedances in ozone or PM 10 . In future, the production could be extended to types of pollens other than birch. There are currently some developments to test olive, grass and ambrosia pollens based on work done at the Finnish Meteorological Institute. Also, the possibility to produce additional species will be considered for users running forecast systems at finer scales than the MACC-II system, such as the concentrations of different types of aerosols.

Appendix A: Statistical indicators
The forecast performances are measured using five statistical indicators: the mean bias, the root mean square error, the modified normalised mean bias, the fractional gross error and the correlation.
The mean bias captures the average deviations between two data sets and is defined as where f i and o i are the forecast value at the observation location and the observation value, respectively. The root mean square error combines the spread of individual error and is defined as It should be noted that the RMSE is strongly dominated by the largest values, due to the squaring operation. Especially in cases where prominent outliers occur, the usefulness of the RMSE is questionable and the interpretation becomes more difficult. MB and RMSE are not dimensionless variables but have the same dimension as the modelled/observed quantity and require knowledge of typical mean values. By scaling the MB and RMSE to the observations, these metrics can be made relative, dimensionless, and hence more appropriate for use as a score. This is relevant when comparing the bias and RMSE of atmospheric species whose concentrations can vary by orders of magnitude. This is why the modified normalised mean bias (MNMB) and the fractional gross error (FGE) are also used. MNMB is defined as This gives a measure of the forecast bias bounded by the values −2 to +2. It performs symmetrically with respect to under-and overprediction of the observations, which is a desirable feature.
FGE is defined as FGE gives a measure of the overall forecast error. This is proposed in addition to the more traditional RMSE because due to the squaring procedure the RMSE gives the largest weight to the (possibly spurious) largest observations. FGE is bounded between 0 and 2.
In addition, the correlation coefficient is needed to indicate the extent to which patterns in the forecast match those in the observations. The correlation coefficient R between the forecast and observed values is defined as wheref andō are the mean values of the forecast and observed values and σ f and σ o are the corresponding standard deviations. The correlation coefficient has a maximum value of unity when, for each observation site, where c is a positive constant. In this case the two data sets have the same pattern of variation but are not identical unless c = 1 for all sites.
Acknowledgements. This study was funded by the European Commission under the EU Seventh Research Framework Programme (grant agreement no. 283576, MACC II). In situ air quality data were provided by the European Environment Agency. Additional financial support at the national level was given by the French Ministère de l'écologie, du développement durable et de l'énergie through the ADONISS project. This work was granted access to the HPC resources of CCRT under the allocation 2013-6695 made by GENCI (Grand Equipement National de Calcul Intensif). IASI has been developed and built under the responsibility of the Centre National d'Etudes Spatiales (CNES, France). Developments of the SILAM system were supported by projects ASTREX and IS4FIRES of the Academy of Finland. NO 2 column retrievals from AURA/OMI and METOP/GOME-2, and CO profiles from TERRA/MOPITT have been used for assimilation in some of the individual models.
Edited by: A. Lauer