Validation of reactive gases and aerosols in the MACC global analysis and forecast system

The European MACC (Monitoring Atmospheric Composition and Climate) project is preparing the operational Copernicus Atmosphere Monitoring Service (CAMS), one of the services of the European Copernicus Programme on Earth observation and environmental services. MACC uses data assimilation to combine in situ and remote sensing observations with global and regional models of atmospheric reactive gases, aerosols, and greenhouse gases, and is based on the Integrated Forecasting System of the European Centre for Medium-Range Weather Forecasts (ECMWF). The global component of the MACC service has a dedicated validation activity to document the quality of the atmospheric composition products. In this paper we discuss the approach to validation that has been developed over the past 3 years. Topics discussed are the validation requirements, the operational aspects, the measurement data sets used, the structure of the validation reports, the models and assimilation systems validated, the procedure to introduce new upgrades, and the scoring methods. One specific target of the MACC system concerns forecasting special events with high-pollution concentrations. Such events receive extra attention in the validation process. Finally, a summary is provided of the results from the validation of the latest set of daily global analysis and forecast products from the MACC system reported in November 2014. Published by Copernicus Publications on behalf of the European Geosciences Union.


Introduction
Air pollution is a major issue worldwide, and evidence is accruing on its adverse effects on human health (e.g. WHO, 2013) and ecosystems (e.g. Krupa et al., 2006). Since some air pollutants are also radiatively active, climate change and air pollution are tightly linked problems (IPCC, 2013; Alapaty et al., 2012). Air pollutant concentrations are not only influenced by very local sources (traffic, industry, local heating) but also contain a long-range component (HTAP, 2010; Schere et al., 2012). Greenhouse gases and certain pollutants like carbon monoxide (CO) and ozone (O3) have long residence times and can easily travel around the globe, while chlorofluorocarbons can enter the stratosphere, harming the ozone layer (WMO, 2014). Desert dust, volcanic ash, and sulfur dioxide (SO2), or pollution plumes from major fires often travel far, even between continents, and long-range transported air masses can have a major influence on pollution concentrations at the surface. The day-to-day variability of pollution levels is large, and strongly influenced by local and large-scale weather patterns.
The European Copernicus programme (http://www.copernicus.eu) is focusing on Earth observation activities in the field of land, marine, atmosphere, emergency monitoring, climate change, and security. This programme includes a series of satellite missions, the so-called sentinels. Sentinel 5 precursor (Veefkind et al., 2012; launch planned in 2016), Sentinel 4, and Sentinel 5 are missions dedicated to the atmosphere.
The atmospheric component of the Copernicus programme is the Copernicus Atmosphere Monitoring Service (CAMS). This service has been established to help Europe respond to air quality problems and a changing climate. The purpose of the CAMS and the precursor project MACC (Monitoring Atmospheric Composition and Climate) is to combine satellite and other observations into a data assimilation modelling system in order to provide daily analyses and forecasts of the variability in atmospheric pollutant concentrations. CAMS covers global and regional scales, providing boundary conditions to finer-scale air quality models.
The CAMS system will provide operational services for the composition of the atmosphere from 2015 onward, and was developed in the past 10 years by a series of European projects including Global and Regional Earth System Monitoring Using Satellite and In situ Data (GEMS; Hollingsworth et al., 2008), MACC-I, MACC-II, and the current MACC-III (http://www.copernicus-atmosphere.eu). For the global component of MACC, the numerical weather prediction Integrated Forecasting System (IFS) of the European Centre for Medium-Range Weather Forecasts (ECMWF) was extended to provide daily forecasts, analyses, and reanalyses of atmospheric composition, by combining satellite observations of atmospheric composition with state-of-the-art atmospheric modelling. Modules for aerosols (Benedetti et al., 2009) and greenhouse gases (Engelen et al., 2009; Agustí-Panareda et al., 2014) were added to the IFS model code. Originally, atmospheric chemistry was not included online in the IFS; rather, the chemistry transport models were run alongside the meteorological analysis system IFS, with meteorological fields and chemical tendencies exchanged by a coupler (Flemming et al., 2009). Two such systems were developed, coupling the IFS to the chemical transport models (CTMs) MOZART (Kinnison et al., 2007) or TM5. More recently, this reactive chemistry component has been integrated in the IFS, creating the Composition-IFS (C-IFS) system.
Through continued quantitative validation of forecasts and analyses, the performance of the MACC model and data assimilation system is documented. Awareness of issues relating to the uncertainties and representativeness of observations is crucial for interpreting the comparisons between the analysis and the independent measurements. In MACC the validation work is conducted by groups directly involved in the measurements or with strong links to the measurement teams. Verification and validation start with direct comparisons of model results with independent measurements, followed by the evaluation of a set of accuracy measures and/or skill scores (Wilks, 2006). For users of the MACC products, it is important to present the skill of the system in a way that is intuitively easy to understand and which documents the improvements of the system over time. Standard practices in the evaluation of meteorological forecasts, and the use of headline scores (e.g. Haiden et al., 2014), serve as inspiration for the MACC validation activity.
The validation (VAL) sub-project in MACC has the task of evaluating the quality of the global service products on aerosol and reactive trace gases, including not only the daily forecasts but also the 2003-2012 MACC reanalysis. This paper provides an overview of the VAL approach to the evaluation of the MACC global modelling system developed over the past 3 years (Sect. 2). Topics addressed are the validation reports (Sect. 3), the procedure for model upgrades (Sect. 4), and scoring methods (Sect. 5). The models evaluated, and the measurements used for these evaluations are listed in Sect. 6. A summary is provided of the main validation results for the daily global forecasts (Sect. 9), but it is not the purpose of this paper to describe these results in detail. Finally, we discuss current developments and future aspects (Sect. 10).
More detailed validation results have been (and will be) described in several scientific papers from the individual partners of VAL (Cuevas et al., 2015; Wagner et al., 2015; Langerock et al., 2015; Katragkou et al., 2015) or contributions to papers led by partners from other sub-projects of MACC (Flemming et al., 2015; Pérez García-Pando et al., 2014; Stein et al., 2014; Cesnulyte et al., 2014). Several of these papers are submitted to the MACC special issue of the EGU Copernicus journals Atmospheric Chemistry and
Table 1. Overview of the trace gas species and aerosol quantities relevant for the real-time global atmospheric composition service. Shown are the data sets assimilated (second column) and the data sets used for validation (third column). Normal text indicates that substantial data are available to either constrain the species in the analysis, or substantial data are available to assess the quality of the analysis. Italic text indicates that measurements are available, but that the impact on the analysis is not very strong or indirect (second column), or that only certain aspects are validated (third column).

Validation of the global MACC services
Quality assurance is an essential element of a pre-operational monitoring service such as MACC. Validation information needs to be supplied regularly and accompany the data products and services provided on the MACC website. The main purpose of the MACC validation effort is to provide the users of the future CAMS with appropriate information to judge the quality of the data sets. A secondary aim of the validation work is to provide feedback to the MACC modelling teams so as to guide model improvement and further development and to contribute to scientific studies and the evaluation of new model versions.
In MACC it was decided to provide 3-monthly updates of the validation reports of the near-real-time analysis and forecasts services. This high update frequency of the validation is implemented both for the global production of daily aerosol and trace gas analyses (Eskes et al., 2014b), as well as for the regional air quality forecast service, which is based on a decentralized ensemble of seven models (Marécal et al., 2015). In this paper we discuss the activities for the global aerosol and reactive gas services. The greenhouse gas sub-project of MACC (Bergamaschi et al., 2013;Chevallier et al., 2014;Massart et al., 2014) has its own validation activity, which will not be discussed in this paper.
For the other global services, the update frequency of validation reports depends on the product. During the production of the MACC reanalysis in MACC-II, the corresponding validation report was updated roughly each half year, corresponding to one more year added to the reanalysis data record. These reports (Eskes et al., 2014a) are available on the MACC website at http://macc.copernicus-atmosphere.eu/services/aqac/global_verification/validation_reports/. The VAL sub-project also provided a validation report for the MACC 30-year ozone column reanalysis (the Multi-Sensor Reanalysis (MSR); van der A et al., 2010), which is available on the MACC website.
The VAL sub-project is maintaining a set of web pages with more detailed verification plots for individual seasons, months, or days (http://www.copernicus-atmosphere.eu/services/aqac/global_verification/). Some of these pages are based on near-real-time data, and they are complemented by the near-real-time (NRT) monitoring information from the data assimilation system.
For a good understanding of the quality of the MACC system, it is important to consider which species in the global assimilation system are constrained by the observations, and which species are covered by the validation data sets used; this is summarized in Table 1. The MACC aerosol and reactive gas models contain on the order of 100 species with global coverage and range from the surface into the mesosphere. Clearly, only a small fraction of this is observed and constrained by the available observations.
-Assimilation: the MACC assimilation is focusing on aerosol optical depth (AOD), ozone, CO, NO2, and SO2. Note that the species are treated in a univariate way and correlations in background errors of different species are neglected. An analysis update of one trace gas will nevertheless influence others through the chemical reactions.
-Validation: the validation is also constrained by the limited number of trace gas and aerosol properties for which validation data are available. Furthermore, validation is limited by the amount of external data that are available in real time or at least within a few weeks after measurement, and with a reasonable global coverage.
For the validation work MACC has the following requirements.
-For near-real-time verification of the analyses, the independent measurements should become available within a few days.
-For the evaluation of the daily analysis and forecast service through the 3-monthly validation reports, data that become available within 6 weeks can be used.
-For the 10-year reanalysis produced by MACC (or planned reanalyses in the future CAMS), the requirements are more relaxed and observations several years old can also be accommodated.
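The three timeliness requirements above can be summarized as a simple classification of a data set by its delivery delay. The following is only a minimal sketch: the function name and the labels are hypothetical, and "a few days" is taken as 3 days purely for illustration (the text does not give an exact number); only the 6-week limit comes from the text.

```python
from datetime import timedelta

# Assumption: "within a few days" is read as 3 days (illustrative only).
NRT_LIMIT = timedelta(days=3)
# Stated in the text: data for the 3-monthly reports may arrive within 6 weeks.
REPORT_LIMIT = timedelta(weeks=6)


def validation_uses(delivery_delay: timedelta) -> list[str]:
    """Return the validation activities a data set can support, given its delivery delay."""
    uses = []
    if delivery_delay <= NRT_LIMIT:
        uses.append("near-real-time verification")
    if delivery_delay <= REPORT_LIMIT:
        uses.append("3-monthly validation report")
    # Relaxed requirement: observations several years old can still be used.
    uses.append("reanalysis evaluation")
    return uses
```

For example, a data set delivered 4 weeks after measurement would qualify for the 3-monthly reports and for reanalysis evaluation, but not for near-real-time verification.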
Because of these requirements, the MACC consortium is keeping close contacts with major worldwide networks.
-MACC maintains close links with the World Meteorological Organization, Global Atmosphere Watch (WMO-GAW) (http://www.wmo.int/pages/prog/arep/gaw/gaw_home_en.html) to improve the use of the measurements performed at the numerous stations worldwide contributing to this programme, and some stations have begun to submit data sets with weekly or monthly update frequencies for use in the MACC validation.
-Regarding aerosols, MACC has negotiated access to level 1.5 AERONET (AErosol RObotic NETwork; http://aeronet.gsfc.nasa.gov) data, as level 2.0 data only become available after re-calibration of the instruments which have been in the field.
We note that Table 1 represents the current status of the system. In collaboration with networks like GAW and NDACC, other data sets are investigated for inclusion in the future CAMS validation activity. For instance, in the coming years the IAGOS aircraft will provide observations of aerosols, NOx, NOy, CO2, and CH4, in addition to O3 and CO that are currently used.

Validation reports for the atmosphere composition forecast and analysis service
The main aim of the 3-monthly validation reports (e.g. Eskes et al., 2014a, b) is to provide the users of the services with up-to-date information on the quality of the products through comparison with independent observations. The reports contain the following sections.
-An extended summary (typically seven pages) of the main findings of the validation work. This summary targets the different user areas, which are defined in the reports as climate forcing, regional air quality, ozone layer, and UV.
-A system summary section. This section contains an overview of the model configurations; description of the models and assimilation; overview of the assimilated data sets; evolution of the system and overview of major model changes; MACC products overview; availability and timing of the daily MACC analyses/forecasts. The document refers to the detailed change logs and model information that are available on the MACC website.
-A detailed section on the validation results obtained for the different species in troposphere and stratosphere. This is the bulk of the document.
-A section to discuss a number of high concentration events and the ability of the MACC forecast and analysis to capture these events.
-An annex providing traceability information on the validation methodology used.

New updates: e-suite reports
The MACC project follows a well-defined procedure to introduce model upgrades of the operational data assimilation and model system, which is called the "o-suite". First, model changes that are developed by ECMWF's research department or the scientific partner institutions in the MACC project are tested offline, and quick checks are performed to test the improvement of the model or assimilation aspects targeted by the update. Once these tests are satisfactory, a new model version is earmarked for operational use. At this point, a series of hindcasts for a period between 3 and 6 months is generated in a set-up that closely mimics the o-suite. This parallel assimilation system is called the "e-suite", or experimental suite. A change log for this e-suite is provided on the MACC website. Near the end of the e-suite production phase, VAL performs an evaluation, comparing the performance of the operational o-suite and the new e-suite against the independent observations. If this test shows improved (or at least comparable) scores, positive advice is given to replace the o-suite. If problems are identified, the VAL results may instead lead to a delayed installation of the new model version, after the weaknesses have been corrected.
In the period January 2012-November 2014, four upgrades of the o-suite were introduced, and for each of them an "upgrade verification note" was produced. These reports are part of the production system description pages that can be found in the "operational info" section of the MACC website. In one case a negative upgrade advice was given because the e-suite showed a strong loss of aerosol mass during the forecast (see Fig. 1).
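The decision rule of the upgrade procedure ("improved or at least comparable scores") can be illustrated with a small sketch. This is not the procedure used by VAL, which relies on expert evaluation against many data sets; the function name, the use of a single error score per species, and the tolerance value are all hypothetical assumptions made for illustration.

```python
def upgrade_advice(o_suite_error: dict, e_suite_error: dict, tolerance: float = 0.05) -> str:
    """Compare per-species error scores (lower is better) of the e-suite and the o-suite.

    Hypothetical rule: the e-suite counts as "comparable" if its error exceeds the
    o-suite error by no more than `tolerance` (an assumed value) for every species.
    """
    degraded = sorted(
        species
        for species, o_score in o_suite_error.items()
        if e_suite_error[species] > o_score + tolerance
    )
    if degraded:
        return "negative: degraded for " + ", ".join(degraded)
    return "positive"
```

With such a rule, a strong degradation in a single quantity (for instance aerosol) is enough to produce a negative advice, even if the other species improve.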

Accuracy measures and scoring methods
The VAL sub-project maintains a living document on the evaluation methodology with project-wide recommendations on scoring approaches. The aims of this evaluation methodology report are
-to "harmonize" the scoring methods by proposing a "default" set of accuracy measures for VAL as well as the other sub-projects in MACC;
-to develop a set of "headline scores" which may be used in the future to document the improvements of the Copernicus Atmosphere Monitoring Service products over time (discussed in the Discussion and Future Perspectives section);
-to introduce uniform graphics styles and a uniform presentation of validation results on the MACC-II website;
-to briefly discuss the value of alternative scoring approaches (e.g. threshold scores, ranking scores).
The main scoring recommendations are the following.
-Initial evaluation: verification-validation starts with basic evaluation of the model results against individual independent observations. This includes time series plots and scatter plots. For a large number of points (> 200) it is recommended to replace the scatter plot by a scatter density plot.
-Accuracy measures: it is recommended to use a minimal set of accuracy measures to evaluate and compare model results. These are the modified normalized mean bias, the fractional gross error and the correlation coefficient.
-Data stratification: it is recommended to apply a baseline temporal aggregation of the individual model-observation comparisons on a (3-monthly) seasonal basis. For the global models and for the troposphere it is recommended to apply a baseline spatial data stratification using pre-defined regions. It is recommended that verification is done both against (a) gridded observations (model-oriented verification) on a common latitude-longitude grid, and (b) station observations (user-oriented verification) whenever possible.
-Presentation: within VAL we adopted a uniform presentation in the figures. The colours of the curves are reserved for the different model configurations. Black is generally used for the independent data.
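The data stratification recommended above (3-monthly seasonal aggregation combined with pre-defined regions) can be sketched as a simple grouping step. The sketch below is illustrative only: it assumes the conventional meteorological seasons (DJF, MAM, JJA, SON), counts December with the following year, and uses hypothetical region labels; none of these choices is specified in the text.

```python
from collections import defaultdict
from datetime import date

# Assumption: conventional meteorological seasons.
SEASONS = {12: "DJF", 1: "DJF", 2: "DJF", 3: "MAM", 4: "MAM", 5: "MAM",
           6: "JJA", 7: "JJA", 8: "JJA", 9: "SON", 10: "SON", 11: "SON"}


def season_key(day: date) -> tuple:
    """Label a date with (year, season); December is counted with the following year."""
    year = day.year + 1 if day.month == 12 else day.year
    return year, SEASONS[day.month]


def stratify(records):
    """Group (date, region, forecast, observation) records by (year, season, region)."""
    groups = defaultdict(list)
    for day, region, forecast, observation in records:
        groups[season_key(day) + (region,)].append((forecast, observation))
    return dict(groups)
```

Accuracy measures can then be evaluated separately for each (year, season, region) group of forecast-observation pairs.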
The scoring recommendations are used not only in VAL, but also, for instance, for the evaluation of the MACC European ensemble air quality forecasts (Marécal et al., 2015). Representativity issues should be taken into account, given that model predictions represent averaged concentrations over a grid box, whereas observed values are either taken at individual locations that are unequally distributed over the globe, in the case of in situ observations, or integrated over space, in the case of observations from remote sensing instruments.
The modified normalized mean bias (MNMB) $B_n$, fractional gross error (FGE) $E_f$ and correlation coefficient $r$ are computed using the following formulas:
$$B_n = \frac{2}{N} \sum_{i=1}^{N} \frac{f_i - o_i}{f_i + o_i},$$
$$E_f = \frac{2}{N} \sum_{i=1}^{N} \left| \frac{f_i - o_i}{f_i + o_i} \right|,$$
$$r = \frac{\frac{1}{N} \sum_{i=1}^{N} (f_i - \bar{f})(o_i - \bar{o})}{\sigma_f \sigma_o},$$
where $f_i$ and $o_i$ are the individual forecast and observed values, $\bar{f}$ and $\bar{o}$ are their mean values, $\sigma_f$ and $\sigma_o$ are the corresponding standard deviations (SDs), and $N$ is the number of observations. $B_n$ can have values between −2 and 2, and is symmetric around zero. $E_f$ ranges from 0 to 2, where 0 is perfect agreement, and values close to 1 or larger indicate a very poor agreement. $r$ ranges between −1 and 1, where −1 means perfect anti-correlation, 0 means uncorrelated, and 1 indicates perfect correlation.
The MNMB and FGE are alternatives for the more commonly used mean bias and the root mean square error, respectively. The normalized approach in the MNMB and FGE provides errors in a relative sense, which is easier to comprehend by users not very familiar with the concentration ranges and their units. The fractional gross error is a linear measure, and has the advantage over the more common root mean square measure that it is not dominated by outliers. Both MNMB and FGE are defined relative to the mean of the observation and the model value, $(f_i + o_i)/2$, which improves over expressions where the observation alone is used as reference. For instance, surface ozone observations do in practice give readings equal to 0, for which a normalization by $o_i$ alone would diverge.
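As an illustration of the three accuracy measures, the sketch below computes them for paired forecast and observation values. It assumes the usual definitions, in which each difference f_i − o_i is normalized by the pair mean (f_i + o_i)/2 as described above, and it assumes population standard deviations (division by N) in the correlation coefficient. The function names are hypothetical, and pairs with f_i + o_i = 0 are not handled.

```python
import math


def mnmb(forecast, observed):
    """Modified normalized mean bias: 2/N * sum((f - o) / (f + o)); range [-2, 2]."""
    n = len(forecast)
    return 2.0 / n * sum((f - o) / (f + o) for f, o in zip(forecast, observed))


def fge(forecast, observed):
    """Fractional gross error: 2/N * sum(|(f - o) / (f + o)|); range [0, 2]."""
    n = len(forecast)
    return 2.0 / n * sum(abs((f - o) / (f + o)) for f, o in zip(forecast, observed))


def correlation(forecast, observed):
    """Correlation coefficient using population (1/N) covariance and SDs."""
    n = len(forecast)
    f_mean = sum(forecast) / n
    o_mean = sum(observed) / n
    cov = sum((f - f_mean) * (o - o_mean) for f, o in zip(forecast, observed)) / n
    sd_f = math.sqrt(sum((f - f_mean) ** 2 for f in forecast) / n)
    sd_o = math.sqrt(sum((o - o_mean) ** 2 for o in observed) / n)
    return cov / (sd_f * sd_o)
```

Note that an observation equal to zero paired with a positive forecast contributes the finite value 2 to the MNMB sum, whereas a normalization by the observation alone would be undefined.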
In the coming years, the resolution of the CAMS system is expected to increase to below 1°. The MNMB and FGE scores in this case become less appropriate to monitor the model improvements. Small filaments of polluted air may be slightly displaced, and the mean norms will lead to a "double penalty" for the higher resolution model, even though the simulated peak values are more realistic. The introduction of new metrics is needed for a more appropriate evaluation of the improvements, and this is one of the tasks of the future validation sub-project of CAMS.
During the projects GEMS and MACC, three modelling systems were developed and used to describe reactive gases in troposphere and stratosphere (Hollingsworth et al., 2008). These were constructed by coupling the ECMWF IFS system to a CTM. The CTM can be MOZART, TM5, or MOCAGE, resulting in a small ensemble of models. In this coupled system, the IFS simulates only the transport of a limited number of chemical species (O3, CO, NOx, SO2, HCHO), and the CTM provides concentration tendencies due to emissions, deposition, and chemical conversion to IFS. Satellite observations of these species (apart from HCHO) are assimilated into the IFS using the 4D-VAR analysis system, together with the full suite of meteorological observations. The resulting analyses for the five species are subsequently passed to the CTM. The CTMs maintain their own transport schemes and are driven by meteorological data at hourly resolution from the IFS. More details on the coupled systems, and references for the three models involved, can be found in Flemming et al. (2009).
During MACC, the MOZART and TM5-based systems have been used to produce daily forecasts. Because of the computing costs of running the MACC 4D-VAR system, and in order to provide one single pre-operational product, it was decided to have only one operational analysis. This MACC o-suite was based on the IFS-MOZART coupled system. This system was used both for the daily analyses and forecasts, and for the production of the MACC 2003-2012 reanalysis. Apart from the analysis runs, the two coupled systems are operated without data assimilation to produce daily forecasts. The IFS-MOZART runs apply the same settings as the o-suite, except that data assimilation is not switched on and the spatial resolution is lower: T159L60 (where "T" is the spectral resolution and "L" is the number of vertical layers) compared to T255L60 for the IFS part, and this model version does not contain aerosol. The IFS-TM5 runs apply similar emissions as IFS-MOZART, but chemical reactions, deposition and transport are described by the TM5 model. More details on the model configuration and the change log can be found on the MACC website or in the validation report (Eskes et al., 2014a).
The aerosol model is integrated in the IFS and includes 12 prognostic variables: 3 bins each for sea salt and desert dust, a hydrophobic and a hydrophilic component each for organic matter and black carbon, sulfate aerosol, and its precursor trace gas SO2. Satellite AOD measurements from the Moderate Resolution Imaging Spectroradiometer (MODIS) are assimilated in this system. Changes of the operational system compared to the aerosol model described in the above papers can be found on the MACC website or in the VAL reports. The aerosol system is based on one model, and there is no stand-alone version of the model operated without data assimilation.
The reactive gas and aerosol modelling systems use real-time aerosol fire emissions from the Global Fire Assimilation System (GFASv1; Kaiser et al., 2012) developed within GEMS and MACC.
The VAL project evaluates all of these model configurations. For the near-real-time reports (Eskes et al., 2014a) three model configurations are considered: the o-suite, the free-running IFS-MOZART, and free-running IFS-TM5 coupled systems. The aerosol model is only switched on in the o-suite. The comparison between the o-suite simulated gas concentrations and the free-running model provides important information on the impact of the observations through the assimilation. The comparison between the MOZART and TM5 configurations provides information on the variability between the CTMs.

After September 2014: C-IFS
A major change occurred in September 2014 when the o-suite based on the coupled system was replaced by an o-suite based on a version of IFS with online chemistry (C-IFS). Currently the chemistry modules from the TM5 model are used, which are based on a modified Carbon Bond (CB05) chemical mechanism. This C-IFS (CB05) model is described in detail in Flemming et al. (2015) and the reactive gas data assimilation results with C-IFS (CB05) are reported in . The aerosol scheme is basically unchanged, and was already fully integrated into the IFS code.
The daily production of the analyses and forecasts consists of operating the full system with 4D-VAR assimilation (the o-suite). In parallel, daily forecasts are produced by running the same model without assimilation. Both model configurations are evaluated by the VAL team. A precursor of the C-IFS (CB05) system without data assimilation was producing daily forecasts from December 2012 to September 2014. This version was also evaluated by the VAL team, and results for this version are shown below.
We remind the reader that o-suite always refers to the IFS-based analysis and forecast system including the assimilation of the full suite of aerosol, chemical, and meteorological observations.

Measurements used for validation
The following independent data sets are presently used (year 2014) to produce the validation reports. Typical uncertainties and geographical details are provided in Table 2.
-... summer 2003 heat wave over Europe and summer 2004 Canadian boreal forest fires have been studied. Two versions of IAGOS data are used to assess the model. The first one is the validated data used to assess the NRT model runs qualitatively in terms of vertical, daily, and regional O3 variability. The second and final version of IAGOS data is fully calibrated and hence more reliable for an accurate model evaluation. This is usually available within 6 to 12 months after recording.
-... H2CO and aerosol using UV-VIS DOAS; and NO2 using FTIR and UV-VIS measurements. The number of sites is continuously expanding as more sites start submitting data in rapid delivery and in GEOMS format.
-Independent DOAS-based retrievals of NO2 and HCHO columns (Richter et al., 2005, 2011; Wittrock et al., 2006) from the UV-VIS sensors SCIAMACHY (Scanning Imaging Absorption spectroMeter for Atmospheric ChartographY; Bovensmann et al., 1999) onboard ENVISAT and GOME-2 (Global Ozone Monitoring Experiment-2A; Callies et al., 2000) onboard MetOp-A. These global data sets provide a large number of comparison points at all latitudes and seasons, but do not offer vertical resolution and have larger uncertainties than many in situ observations. As the European Space Agency lost contact with the ENVISAT satellite in April 2012, SCIAMACHY is used for model validation up to March 2012, while model results are compared to GOME-2 from April 2012 onwards.
-AOD and Ångström exponent (AE) data sets from the AERONET sun photometer network. NRT level 1.5 data are made available on a monthly basis by NASA Goddard (Holben et al., 2001; Smirnov et al., 2000) and are used for a real-time verification of the analyses and forecasts. Supporting graphs were generated with the AeroCom tools (http://aerocom.met.no/cgi-bin/aerocom/surfobs_annualrs.pl?Project=MACC).
-AOD, AE and dust aerosol optical depth (DOD) from 36 AERONET stations, combined with AOD from MODIS (Aqua) and with lidar vertical extinction profiles at Tenerife station. These data sets are used for the quarterly assessments of mineral dust content, and analyses of outstanding dust events over northern Africa, Middle East, and Europe. This is a relevant geographical region where two of the most important mineral dust sources of the world (the Sahara-Sahel and Middle East) are present. Previous dust evaluations have extensively used AERONET and ground-based and space-borne lidar data to assess the column dust content provided by dust models (e.g. Pérez et al., 2006; Schmechtig et al., 2011; Tegen et al., 2013; Cesnulyte et al., 2014), and PM10 for surface dust concentration validation.
Apart from the GAW and ESRL in situ observations, measurements from rural and remote surface air quality measurement sites are also considered. The sites have to be carefully selected because they should be representative for a larger area of the size of the model resolution. Furthermore, validated data sets are typically only available after a few years and only unvalidated data can be used for the near-real-time evaluations. In particular, observations from the European Monitoring and Evaluation Programme (EMEP; http://www.emep.int), and the European air quality database "AIRBASE" (http://www.eea.europa.eu/themes/air/air-quality/map/airbase) are used to evaluate the reanalysis results. Evaluations based on the USA "AirNow" observations (http://www.airnow.gov) are also in preparation. Apart from ozone, the aerosol composition measurements from these networks will also be considered, as well as other compounds like CO and NO2.
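The switch of satellite reference data described in the list above (SCIAMACHY up to March 2012, GOME-2 from April 2012 onwards) amounts to a date-based selection rule. A minimal sketch, with a hypothetical function name, could look as follows; it only encodes the switch-over date given in the text and does not check the start of either data record.

```python
from datetime import date

# From the text: model results are compared to GOME-2 from April 2012 onwards.
GOME2_START = date(2012, 4, 1)


def column_reference_sensor(day: date) -> str:
    """Return the sensor used as reference for NO2 and HCHO columns on a given day."""
    return "GOME-2" if day >= GOME2_START else "SCIAMACHY"
```

Because the two instruments differ in overpass time and retrieval characteristics, scores computed on either side of the switch-over date should be compared with care.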
The teams involved in MACC maintain close links with many of the observation networks from which the above mentioned observational data are obtained.

Case studies
One prominent application of MACC is the description and forecasting of the variability of trace gas and aerosol concentrations and the occurrence of high concentration events. These events include dust storms , major wildfire or biomass-burning events Huijnen et al., 2012), ozone and aerosol pollution episodes , ash and SO 2 from volcanic eruptions , and the rapid depletion of ozone over the Antarctic and Arctic . The VAL group studied more than 10 events in the period 2013-2014, and the results have been included in the validation reports.
A first example of a case study is shown in Fig. 2. In June 2014 a huge desert dust plume occurred that originated in the Sahara and travelled more than 6000 km over the Sahel and the North Atlantic, impacting the Amazon and the Caribbean. The path travelled by the plume was well captured by the MACC global system, as is shown by the comparison with MODIS. The correct timing of the dust event in the MACC o-suite is further confirmed by the time series at the available AERONET sites (black dots), although the modelled optical depth has a moderate low bias of about 0.1 compared to the observations. Note that the MODIS DeepBlue data, which is providing aerosol observations over bright land surfaces, is used in the figure but not in the assimilation. A second example is the observation of a prominent biomass-burning plume from Canada by ceilometer instruments in Germany. Active fires in Canada in June/July 2013 produced a large amount of biomass-burning aerosols which were transported to Europe. The features of this biomass plume were observed by German ceilometers. In Fig. 3 measured and modelled 2-D time-height sections of biomassburning plumes at the station Soltau (northern Germany) are compared. Though total extinction is displayed, the plumes are only made of smoke particles. The uncertainty of the ceilometer extinction coefficients is estimated to be ±50 %. Areas with noisy or missing ceilometer data, e.g. above clouds, are masked to prevent misinterpretations. During this period, which is characterized by fast transport of the air masses across the Atlantic, the heights of individual plumes and even their internal structure (7 and 9 July, early 10 July) are reproduced with remarkable detail by the model. This indicates that injection heights and plume dispersion are realistic. 
The plume observed on 8 July at Soltau appears too weak in the model because it had a meridional extent of only about 100 km and was displaced southward with respect to the model grid cell. Absolute extinctions, however, are about a factor of 2 too small in the model due to its much coarser resolution (a high resolution is maintained for the ceilometer data in order to prevent artefacts from averaging over regions with low signal/noise ratios). Many aspects influence the quantitative comparison, including uncertainties in the source strength (fire radiative power observation and aerosol mass produced), uncertainties in the transport over several days, removal processes, the resolution of the model, and local representativity issues. Part of these modelling errors may have been corrected by the assimilation of the MODIS observations.
The widespread use of ceilometers and their capability to measure the backscatter coefficient offer a level of information content that is well suited to the evaluation of aerosol models. The uncertainty of their extinction coefficients can be below 30 %, depending on the instrument used; see, e.g. Heese et al. (2010) or Wiegner and Geiß (2012). The adequate representation of sources and dispersion of different aerosol types is still a challenge for aerosol models. The evaluation of the MACC analyses with ceilometer observations from the German Weather Service (DWD; http://www.dwd.de/ceilomap) showed the usefulness of the ceilometer data to track fire plumes and (Saharan) dust plumes, and to validate the modelled boundary layer heights.
Data from major international measurement campaigns are also used to evaluate whether the MACC system is able to describe mean concentrations, transport of pollutants, and observed variability. Examples are ACCESS (Roiger et al., 2014) and POLARCAT/POLMIP (Emmons et al., 2015). Note that MACC also provides support to flight planning during field campaigns like ACCESS.

Validation of the MACC o-suite
Below we give a summary of the results from the latest (November 2014) validation update for the MACC o-suite. This provides an overview of the extent of the validation work and validation methodology for the global aerosol and reactive gas service, and at the same time it serves to document the performance status of the recent MACC system against independent observations for the period up to August 2014. More detailed validation results and plots can be found in the validation report (Eskes et al., 2014a), on the MACC website and in the papers mentioned in the introduction.
The runs discussed here contain the o-suite, for this period based on analyses and forecasts from the coupled IFS-MOZART assimilation system including the MACC prognostic aerosol module. The impact of other chemistry schemes and of the use of data assimilation is furthermore assessed by comparing the validation results from the o-suite to those of the two other MACC model configurations, both without assimilation. These are the coupled IFS-MOZART system, and C-IFS (CB05), which is an earlier version of the model described in Flemming et al. (2015).

Tropospheric ozone
Model tropospheric ozone is validated with respect to surface and free tropospheric ozone observations from the GAW network, IAGOS airborne data, and ozone sondes, hence covering the model performance at the surface, in the bound-  Fig. 4. The best performance is generally achieved over the northern mid-latitudes, with MNMB often less than 0.1. This is also the region with the largest coverage of ozone sonde data. In the northern mid-latitudes and tropics, the coupled IFS-MOZART system shows in most cases larger positive MNMBs: in the northern mid-latitudes a positive offset of up to 0.2, in the tropics of up to 0.3 which appears mostly during November to March. This demonstrates that the ozone data assimilation, using stratospheric profiles (MLS) and ozone column observations, on average has a positive impact on the tropospheric ozone profile . For high-latitude regions, where data assimilation is less effective, larger biases (±0.4) are observed (Fig. 4) and the o-suite partly shows larger biases than the version without assimilation.
At the surface, the o-suite bias against GAW stations is generally slightly positive, especially during the summer months for European stations, which is broadly in line with the evaluation against ozone sondes.
For tropical stations, biases are generally larger than over the northern mid-latitudes. GAW coverage of the Southern Hemisphere is sparse, so the model is scarcely evaluated there. For both Arctic and Antarctic stations the variability between the three model versions is generally larger than for mid-latitude and tropical stations, while biases with respect to observations are significant. This indicates the poorer constraints from data assimilation and also the larger uncertainty arising from the chemistry model.
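The MNMB values quoted in this section can be reproduced from paired model and observation values. The sketch below assumes the commonly used definition of the modified normalized mean bias, MNMB = (2/N) Σ (f_i − o_i)/(f_i + o_i); the function name and the ozone values are purely illustrative and are not taken from the validation reports.

```python
def mnmb(forecast, observed):
    """Modified normalized mean bias of paired, positive values.

    Assumed definition: 2/N * sum((f - o) / (f + o)).
    The score is symmetric around 0 and bounded by -2 and +2.
    """
    pairs = list(zip(forecast, observed))
    if not pairs:
        raise ValueError("need at least one forecast/observation pair")
    return 2.0 / len(pairs) * sum((f - o) / (f + o) for f, o in pairs)


# Hypothetical surface ozone mixing ratios (ppb): model 10 % above observations.
model_o3 = [44.0, 55.0]
observed_o3 = [40.0, 50.0]
print(f"MNMB = {mnmb(model_o3, observed_o3):.3f}")
```

Because the difference is normalized by the sum of model and observation, over- and underestimates of the same relative size give scores of equal magnitude and opposite sign, which is convenient for species whose concentrations span several orders of magnitude.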

Tropospheric nitrogen dioxide
Retrievals of tropospheric NO2 columns from SCIAMACHY and GOME-2 observations are used for the validation of the three MACC systems. Nitrogen dioxide (NO2) satellite observations are also assimilated, but these come from the OMI instrument, which has a later overpass time, and are based on a different retrieval scheme. Comparisons to SCIAMACHY/GOME-2 monthly mean tropospheric NO2 columns on a global map (Eskes et al., 2014a) show that the spatial distributions of tropospheric NO2 columns are well reproduced by all three NRT model runs throughout all seasons, indicating that emission patterns and NOx photochemistry are generally well represented. A general feature is the underestimation of NO2 columns over the continents, particularly in China (the latter is also evident from Fig. 5), which may point to an underestimation of anthropogenic NO2 emissions in the inventories. The relatively low model resolution will also lead to an underestimate of strong localized emission sources. Unresolved nonlinearities in NOx photochemistry at the coarse model resolution might also play a role, as well as larger retrieval uncertainties in the winter months. Another observation is the occurrence of localized high-bias regions of NO2 in the northern high latitudes during spring/summer, which indicates that the NO2 produced by boreal fires in Siberia, Canada, and Alaska, as derived from the GFAS system, may be overestimated.

Tropospheric carbon monoxide
Carbon monoxide (CO) is validated using GAW network surface observations, IAGOS airborne data, FTIR observations, and satellite retrievals, hence providing good coverage both horizontally and vertically. This evaluation consistently shows that, even though the seasonality of CO can be reproduced well, there is a systematic underestimation of CO surface mixing ratios by all model versions in the Northern Hemisphere, with seasonal MNMBs up to −0.3 in comparison with GAW observations. The biases are largest during winter and early spring. During take-off and landing the IAGOS in-flight profile observations frequently capture layers with elevated levels of CO, and have been used to evaluate the model's ability to describe the magnitude and transport of plumes originating from biomass burning.
We note that MOPITT and IASI satellite retrievals of CO are assimilated in the o-suite, so such an evaluation is not an independent source of information. Nevertheless, these retrievals provide a good reference for the ability of the models to capture spatial patterns and seasonal cycles in free tropospheric CO and also clearly quantify the effect of the bias correction applied in the o-suite.
During the fire season over Siberia and Alaska an underestimation of up to 10 % is observed with respect to MOPITT, in contrast to the significant overestimate in NO2 and a positive bias in aerosol. It should be noted that MOPITT and IASI show significant differences in this region.
A clear improvement in performance of the o-suite against the free-running IFS-MOZART coupled system was found, especially during summer seasons, indicating that data assimilation is more effective in summer compared to the winter season. This is confirmed by validation with FTIR profile observations. The GAW surface observations with high temporal resolution are used to evaluate the small-scale model variability. For instance, a rather remarkable improvement of the temporal correlation between the o-suite and C-IFS (CB05) is found for most stations. This is illustrated by the

Formaldehyde
Model validation based on SCIAMACHY and GOME-2 HCHO satellite observations shows that overall, mean concentrations and spatial patterns show a good match; see, e.g. Flemming et al. (2015). A more detailed comparison reveals differences between satellite data and models, particularly over the emission regions central Africa, South America, south-eastern USA as well as Southeast Asia, indicating the significant modelling uncertainties associated with this trace gas. For instance, time series over East Asia and the eastern USA, which are both regions where HCHO columns are likely dominated by biogenic emissions, show that the MOZART-based model versions are well in line with satellite retrievals in terms of magnitude and seasonality, whereas the C-IFS (CB05) shows larger biases. In the African regions, dominated by biogenic and biomass-burning HCHO (precursor) emissions, model performance is reasonable although the C-IFS (CB05) chemistry run overestimates satellite values. In contrast to NO 2 , the HCHO columns for boreal fire regions are well reproduced by all models. It should be noted that no formaldehyde observations are assimilated, and these results reflect the performance of the unconstrained models.

Aerosol
Bulk optical properties of the MACC aerosol model are validated against NRT level 1.5 AERONET observations (see Fig. 7). Level 1.5 data are the only observations available for validation within days or weeks after sensing. The correlation coefficients are based on consistent daily mean values from all stations, for days on which observations are available. The figure reveals that the latest model version has on average a positive MNMB of about +20 % for AOD. The positive bias is smaller in winter (+5 %) but increases in spring. A month-to-month variation is observed in the correlation, ranging from 0.65 to 0.8. On average, approximately 50 % of the day-to-day AOD variability is predicted by the o-suite. The +3-day forecast aerosol distributions are also routinely evaluated and show 5-10 % less AOD than the initial day. This indicates that the model AOD at equilibrium between emissions and removal is somewhat lower than the AOD of the IFS analysis, possibly implying a bias in the MODIS observations used in the assimilation. These forecasts additionally show slightly lower correlation, as a consequence of imperfectly forecast meteorology and a fading impact of the initial assimilation of MODIS AOD and MODIS fire information on model performance.
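The link between the correlation range and the fraction of variability predicted can be illustrated with a short calculation. The sketch below assumes that "variability predicted" is read as the squared Pearson correlation of daily mean AOD pairs, which is an interpretation rather than a statement of the method used in the reports; the function names and sample values are hypothetical.

```python
import math


def pearson_r(model, observed):
    """Pearson correlation coefficient of two equally long series."""
    n = len(model)
    if n != len(observed) or n < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_m = sum(model) / n
    mean_o = sum(observed) / n
    cov = sum((m - mean_m) * (o - mean_o) for m, o in zip(model, observed))
    var_m = sum((m - mean_m) ** 2 for m in model)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_m * var_o)


def explained_fraction(r):
    """Fraction of variance explained by a linear relation with correlation r."""
    return r * r


# Hypothetical daily mean AOD values at one station (model, AERONET).
model_aod = [0.12, 0.30, 0.22, 0.45, 0.18]
aeronet_aod = [0.10, 0.24, 0.25, 0.36, 0.20]
r = pearson_r(model_aod, aeronet_aod)
print(f"r = {r:.2f}, explained fraction = {explained_fraction(r):.2f}")

# Monthly correlations quoted in the text, converted to explained fractions.
for r_month in (0.65, 0.8):
    print(r_month, round(explained_fraction(r_month), 2))
```

Under this reading, correlations between 0.65 and 0.8 correspond to explained fractions between about 0.42 and 0.64, which brackets the value of approximately 50 % stated above.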
The model Ångström exponent (AE) is evaluated with the AERONET data, and has proved to be a good indicator of aerosol size changes resulting from aerosol parameterization changes. The current model version shows a positive global bias, indicating too fine particles in the model. A significant variation of the Ångström exponent was seen over the last 3 years, which is a result of changes in the contributions from fine and coarse aerosol components to total AOD, the latter being constrained through the assimilation method.
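For reference, the Ångström exponent can be computed from the AOD at two wavelengths. The sketch below assumes the standard two-wavelength definition, AE = −ln(τ1/τ2)/ln(λ1/λ2); the 440 and 870 nm wavelength pair and the AOD values are illustrative choices, not values taken from the MACC validation.

```python
import math


def angstrom_exponent(aod_1, wavelength_1, aod_2, wavelength_2):
    """Two-wavelength Angstrom exponent.

    Assumed definition: AE = -ln(aod_1 / aod_2) / ln(wavelength_1 / wavelength_2).
    Wavelengths only need to share the same unit (here nm).
    """
    return -math.log(aod_1 / aod_2) / math.log(wavelength_1 / wavelength_2)


# Hypothetical AOD pairs at 440 and 870 nm.
fine_dominated = angstrom_exponent(0.40, 440.0, 0.14, 870.0)
coarse_dominated = angstrom_exponent(0.40, 440.0, 0.36, 870.0)
print(f"fine-mode case:   AE = {fine_dominated:.2f}")
print(f"coarse-mode case: AE = {coarse_dominated:.2f}")
```

A steep decrease of AOD with wavelength gives a large exponent (small particles), whereas a nearly flat spectrum gives an exponent close to zero (large particles such as dust). A positive model bias in AE therefore points to particles that are too fine, as noted above.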
The NRT aerosol model evaluation remains limited. One limitation is the quality of the NRT AERONET data, which are of a preliminary nature. Retrospective analysis of the year 2011 shows that these level 1.5 NRT AERONET AOD data are, due to undetected cloud contamination and uncorrected instrumental drift, on global average 20 % higher than the quality-assured level 2.0 AERONET data (see Fig. 8). This suggests that the o-suite bias in AOD is likely to be larger than suggested by the comparison with the NRT observations. Another limitation is that little information on the aerosol composition is available, and this can only be assessed indirectly, e.g. through the AE.
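The size of this effect can be estimated with a rough back-of-the-envelope calculation. The sketch below rests on strong simplifying assumptions that are not made in the validation reports: the MNMB is treated as if it were computed from a single global mean model/observation pair, and the 20 % excess of level 1.5 over level 2.0 data is applied as one multiplicative factor. All function names are hypothetical.

```python
def mnmb_to_ratio(mnmb):
    """Model/observation ratio giving this MNMB for a single pair.

    Follows from MNMB = 2 (f - o) / (f + o), solved for f / o.
    """
    return (2.0 + mnmb) / (2.0 - mnmb)


def ratio_to_mnmb(ratio):
    """MNMB of a single pair with the given model/observation ratio."""
    return 2.0 * (ratio - 1.0) / (ratio + 1.0)


def mnmb_against_level2(mnmb_level15, level15_excess):
    """Rescale an MNMB against level 1.5 data to level 2.0 data.

    level15_excess is the fractional amount by which level 1.5 AOD
    exceeds level 2.0 AOD (0.20 for 20 %).
    """
    ratio_level15 = mnmb_to_ratio(mnmb_level15)
    ratio_level2 = ratio_level15 * (1.0 + level15_excess)
    return ratio_to_mnmb(ratio_level2)


adjusted = mnmb_against_level2(0.20, 0.20)
print(f"MNMB against level 2.0 data: {adjusted:.2f}")
```

Under these assumptions, an MNMB of +20 % against level 1.5 data would correspond to roughly +38 % against level 2.0 data. The actual value depends on how the excess varies between stations and seasons, so this number is indicative only.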
MACC o-suite dust parameters have been routinely assessed over northern Africa, the Middle East, the Mediterranean basin, and southern Europe, using AERONET, MODIS (Aqua), and lidar observations. A specific evaluation has also been performed for the MACC-II short (2007-2008) reanalysis with improved dust parameterizations. The spatial agreement between MACC o-suite AOD and MODIS AOD is very good, confirming that the MACC o-suite captures almost all dust outbreaks and tracks their spatiotemporal evolution fairly well, both over the North Atlantic and the Mediterranean. The results of the comparisons of the o-suite AOD/DOD with AERONET AOD/DOD, MODIS AOD, and the WMO Sand and Dust Storm Warning Advisory System (SDS-WAS) multi-model DOD median (http://sds-was.aemet.es/forecast-products/forecast-evaluation/model-evaluation-metrics), formed with seven to nine models, indicate an excellent agreement in all regions, except over the Sahara. In this region the o-suite tends to overestimate, showing an averaged seasonal MB (with AERONET) ranging from 0.08 to 0.24 in winter and spring, respectively. The o-suite behaves quite well compared with other regional and global dust models, providing similar results to those of the SDS-WAS multi-model median.

Stratospheric ozone
Ozone profiles are routinely evaluated with vertical profiles from balloon-borne ozone sondes, ozone profile retrievals from the MLS, OMPS, and OSIRIS satellite instruments, and ground-based remote sensing observations at a selection of stations from NDACC, including microwave, FTIR, and lidar observations.
The daily stratospheric analyses from the three model configurations are further compared with three offline stratospheric analysis systems: BASCOE (Errera et al., 2008; Viscardy et al., 2010), SACADA, and TM3DAM (van der A et al., 2010). Lefever et al. (2015) compared the analyses of stratospheric ozone by the o-suite (IFS-MOZART) with the results of these three offline systems and showed that its quality is primarily determined by the availability and vertical range of Aura-MLS observations.
Relative monthly mean biases of the o-suite are on average between −5 and 17 % compared with ozone sondes. The Antarctic ozone hole in 2013 was reproduced by the o-suite with relative biases less than 10 %. The validation results of the o-suite in comparison to other model versions clearly reveal that data assimilation, and especially the use of profile observations by limb-sounding instruments such as MLS, is essential for a correct representation of the vertical distribution of ozone in the stratosphere (Lefever et al., 2015). The impact of data assimilation at other locations can also be seen in the evaluations based on NDACC stations, for example at Izaña (Fig. 9).
Total ozone columns in the o-suite show an overall good agreement with TM3DAM. This system can serve as a reference for the ground truth since it applies bias corrections to GOME-2 data based on ground-based Brewer and Dobson spectrophotometer measurements.
Ozone daily mean time series from the o-suite are further compared to the BASCOE assimilation system and to OMPS, OSIRIS, and MLS satellite data for different latitudes at 20 km (lower stratosphere), which is relevant for future validation and operation of forecast models; see Fig. 10. This evaluation illustrates that the o-suite and BASCOE are usually very close (< 5 %). There are in fact significant biases between satellite instruments, with an ozone abundance in OMPS that is in general 25-30 % lower than MLS data for all latitudes at 20 km. A similar behaviour is found for OSIRIS in the tropics, while the agreement with MLS is much better at the poles. It should be noted that the product from OMPS is relatively new, and the comparisons may improve with future retrieval algorithm updates.

Discussion and future perspectives
In this paper we provided an overview of the validation approach for the global MACC service products. The principle behind this work is that every product in the MACC catalogue should be accompanied by validation information that is based on independent observations and summarized in validation reports, which is essential for the users. For the global forecast/analysis service this validation report is updated every 3 months to provide up-to-date information on the product quality. The validation team operates largely independently of the modelling teams. The VAL activity is targeted at users, but it also provides feedback to the modelling and data assimilation teams in MACC concerning new model test versions.
The assimilation and validation activity within MACC is clearly limited by the finite amount of high-quality observations available for comparison in NRT. The model contains a large number of trace gases and aerosol components simulated with global coverage at as high a resolution as practically feasible. Only a small number of these variables are constrained, as was indicated in Table 1. Additional constraints can occasionally be obtained from in-depth analyses of field campaigns, e.g. Emmons et al. (2015). The focus in VAL is mainly on those modelling aspects that are strongly influenced by the assimilation process: tropospheric and stratospheric ozone, tropospheric CO, aerosol optical properties, and, to a lesser extent, NO2, SO2, and HCHO. Apart from this, the availability of observations in near-real time is crucial for the assimilation. For the validation reports the requirements are somewhat more relaxed: observations should be available within 1 month to 6 weeks.
In the near future more focus will be given to the evaluation of the MACC system in terms of trace gas and aerosol boundary conditions for regional air quality models. Suitable evaluation data sets and good quality metrics are currently under investigation. Another aspect not yet well covered in the VAL activity is the evaluation of the aerosol composition and vertical distribution, in particular because no, or very limited, NRT observations are available. Additional research will be based on the climatological aerosol composition and variation (as used for AeroCom model evaluations) to obtain relevant information on the quality of the IFS forecast system. Validation of the vertical distribution of some components, such as aerosols, could be improved in future by incorporating observations from operational networks of ceilometers and micropulse lidars. However, for these measurements to be truly useful in MACC validation, calibration constraints must first be overcome.
Apart from the observational data sets listed in Sect. 7, which are currently used for the validation of the MACC system, VAL is also expanding its scope by looking at promising new data sets. Previous (e.g. ACCESS) and future field campaign data provide interesting case studies and allow for a more extensive evaluation in the free troposphere. One data set that was considered in MACC consists of ceilometer observations; the use of ceilometer networks was discussed in Sect. 8.
A second type of new observations studied in MACC involves ground-based MAX-DOAS instruments. These instruments are well suited to probe the amount of pollutants in the boundary layer above urban areas. Because several of the instruments are located close to large cities, these observations are especially valuable to test regional air quality models with enough spatial resolution to simulate fine-scale variability (see, e.g. Vlemmix et al., 2011). The models can be tested on an hourly basis during daytime, which offers the possibility to investigate diurnal, weekly, and seasonal dependencies, as well as dependencies on the meteorological conditions. For a continuous validation, a mix of stations at background locations in polluted and unpolluted regions, as well as close to emission hot spots such as cities or industrial areas, would be ideal.
The near-future C-IFS system is foreseen to include a set of three different chemistry modules for tropospheric and stratospheric chemistry, and a more comprehensive aerosol model based on the GLObal Model of Aerosol Processes (GLOMAP; Mann et al., 2010). These independent model configurations will be employed routinely to provide a small ensemble of forecasts (without assimilation) to complement the o-suite. This ensemble will be evaluated by the validation team. This intercomparison between the model configurations will provide a better interpretation of the validation results, identifying model-related aspects and quantifying the improvement brought by the assimilation.
In the long term there are several more generic aspects which are of concern for the validation activity in CAMS:
1. There is a clear need for a set of summary skill scores, which can be used to document the performance and monitor the improvements of the MACC system over time. This is related to the concept of "headline scores", which are used by meteorological centres to monitor and intercompare the performance evolution of the forecast system in time. A prominent example is the 500 hPa height anomaly score. In MACC we are developing a methodology to arrive at a set of skill scores. The application of this approach is work in progress.
2. The validation reports are written first of all for the users of the services. The information should be digestible by those user groups, and should be presented in a user-friendly way, e.g. through intuitively meaningful skill scores. Interaction with the users is facilitated by a dedicated "interface" sub-project in MACC through user surveys and workshops, and VAL is responding to the validation-related user feedback. One example is the provision of information on how well the global model is able to simulate surface ozone observations in Europe, which is currently being implemented. It is recommended that the interaction with the users be intensified in CAMS, for instance by asking specific users for feedback on a more detailed level.
3. The CAMS validation work should be tested for compliance with general quality assurance principles. During MACC a "validation protocol" was developed (Lambert, 2013). In part this is based on principles developed in the Quality Assurance Framework for Earth Observation (QA4EO; http://www.qa4eo.org) activity of the Group on Earth Observations (GEO). Some aspects have been incorporated in the VAL practice, but regular testing against these principles is foreseen.
4. The user-driven future service evolution has been the topic of the EU project GMES-Pure (http://www.gmes-pure.eu). The definition of service data requirements (SDRs) was found to be a crucial intermediate step in the systematic approach to service evolution. The validation activity in the future CAMS forms an essential element for the translation of (i) the end-user requirements into SDRs and of (ii) the SDRs into observational requirements for both space and non-space components for assimilation as well as validation purposes.
5. Surface and airborne observations are crucial for CAMS, but the funding of these observations is not covered by Copernicus. Strong links with the major global networks and data providers will be maintained to ensure NRT access and data quality standards. We note that various MACC management team members and partners are strongly involved in observational network activities, in particular those coordinated by WMO.
The operational CAMS will start in 2015. It is foreseen that the validation of CAMS will proceed in a similar way as was developed in MACC, with, e.g. regular 3-monthly reports. These regular updates allow the validation teams to continuously improve the presentation of the information, taking into account the more long-term aspects mentioned above.
We are grateful to the numerous operators of the AERONET network and to the central data processing facility at NASA Goddard Space Flight Center for providing the NRT sun photometer data, especially Ilya Slutker and Brent Holben for sending the data. Many of the AERONET sun photometers used in MACC aerosol validation have been calibrated within AERONET-Europe TNA supported by the European Community-Research Infrastructure Action under the FP7 "Capacities" specific programme for Integrating Activities, ACTRIS grant agreement no. 262254. The AeroCom tools have been recently developed with support from the ESA project cci-aerosol, the EU funded project ACTRIS (EFP7/2007-2013), and the Norwegian research council project Aero-ComP3. The World Meteorological Organization Sand and Dust Storm Warning Advisory and Assessment System (SDS-WAS) Regional Center for northern Africa, Middle East, and Europe has contributed to MACC dust evaluation, and provided the multi-model dust aerosol optical depth median.