Representativeness errors in comparing chemistry transport and chemistry climate models with satellite UV–Vis tropospheric column retrievals

Ultraviolet–visible (UV–Vis) satellite retrievals of trace gas columns of nitrogen dioxide (NO2), sulfur dioxide (SO2), and formaldehyde (HCHO) are useful to test and improve models of atmospheric composition, for data assimilation, air quality hindcasting and forecasting, and to provide top-down constraints on emissions. However, because models and satellite measurements do not represent the exact same geophysical quantities, the process of confronting model fields with satellite measurements is complicated by representativeness errors, which degrade the quality of the comparison beyond contributions from modelling and measurement errors alone. Here we discuss three types of representativeness errors that arise from the act of carrying out a model–satellite comparison: (1) horizontal representativeness errors due to imperfect collocation of the model grid cell and an ensemble of satellite pixels called superobservation, (2) temporal representativeness errors originating mostly from differences in cloud cover between the modelled and observed state, and (3) vertical representativeness errors because of reduced satellite sensitivity towards the surface accompanied with necessary retrieval assumptions on the state of the atmosphere. To minimize the impact of these representativeness errors, we recommend that models and satellite measurements be sampled as consistently as possible, and our paper provides a number of recipes to do so. A practical confrontation of tropospheric NO2 columns simulated by the TM5 chemistry transport model (CTM) with Ozone Monitoring Instrument (OMI) tropospheric NO2 retrievals suggests that horizontal representativeness errors, while unavoidable, are limited to within 5–10 % in most cases and of random nature. These errors should be included along with the individual retrieval errors in the overall superobservation error. 
Temporal sampling errors from mismatches in cloud cover, and, consequently, in photolysis rates, are of the order of 10 % for NO2 and HCHO, and systematic, but partly avoidable. In the case of air pollution applications where sensitivity down to the ground is required, we recommend that models should be sampled on the same mostly cloud-free days as the satellite retrievals. The most relevant representativeness error is associated with the vertical sensitivity of UV–Vis satellite retrievals. Simple vertical integration of modelled profiles leads to systematically different model columns compared to application of the appropriate averaging kernel. In comparing OMI NO2 to GEOS-Chem NO2 simulations, these systematic differences are as large as 15–20 % in summer, but, again, avoidable.


Introduction
Chemistry transport models (CTMs) are increasingly being evaluated with satellite column retrievals from ultraviolet–visible (UV-Vis) solar backscatter satellite instruments. Satellite retrievals of trace gas concentrations constitute a rich source of information on key tropospheric species such as nitrogen dioxide (NO2), sulfur dioxide (SO2), and formaldehyde (HCHO) that is beginning to be exploited on an ever-larger scale. UV-Vis satellite observations are being used to

Published by Copernicus Publications on behalf of the European Geosciences Union.
When comparing model simulations to satellite measurements, both modelling errors and measurement errors are usually taken into account. Measurement errors are often reasonably well characterized (e.g. Boersma et al., 2004; De Smedt et al., 2012; Lee et al., 2009), but modelling errors are more difficult to establish, because of the large number of uncertain model processes, uncertain boundary (e.g. emissions) and initial conditions, and unresolved or misrepresented aspects of atmospheric physics and chemistry. Modelling errors are best characterized by comparing model simulations to observations. Unfortunately, the observations available for such comparisons are mostly limited in vertical range and regional coverage, such as in the case of ground-based networks, or they are merely sporadic in space and time, such as for aircraft campaigns. Satellite data records are based on robust retrieval methods, provide global coverage, and cover decadal time spans. Satellite data have recently been successfully used for dedicated modelling error studies (e.g. Lin et al., 2012; Stavrakou et al., 2013). When using satellite data, modellers need to be aware that most UV-Vis retrievals contain little information on the vertical distribution of a species (the exception is stratospheric ozone profile retrieval in the far UV of the spectrum, but this species will not be considered in this study).
Here we focus on the application of tropospheric UV-Vis retrievals, and we limit ourselves to retrievals of tropospheric species NO2, SO2, and HCHO for comparison with models. These species are all relatively short-lived and their retrievals are generally based on differential optical absorption spectroscopy (DOAS; Platt and Stutz, 2008). DOAS retrievals in the UV-Vis match relevant absorption cross-section spectra to the solar backscatter spectrum measured by the satellite instrument in order to infer the column integral (slant column density, expressed in molecules cm⁻²) of a species along the effective atmospheric photon path. The subsequent retrieval step requires the conversion of the slant column density into a vertical column density, and this conversion depends on knowledge (assumptions) of the state of the atmosphere, e.g. on the presence of clouds and aerosols, the vertical distribution of the species, and surface properties. When these assumptions are very different from the atmospheric state modelled by a CTM, this will lead to inflated differences between modelled (by, say, CTM 1) and retrieved columns (aided by CTM 2). Such differences, however, can be avoided or in any case minimized, if the user of satellite data accounts for the representativeness and averaging kernels of the satellite data while interpreting model simulations. Representativeness here is defined as the context in which the satellite measurement holds, i.e. the horizontal coverage, the temporal representativeness, and the vertical information content of the retrieval. It is the goal of this study to provide guidelines on how users can take the representativeness of the UV-Vis column retrievals into account when comparing CTM simulations to satellite retrievals, and to quantify by how much the model-retrieval differences would inflate if aspects of representativeness are neglected.
In Sect. 2, we introduce the definitions and terminology for sources of error in the comparison of models and observations, and relate these to what is common practice in the data assimilation community. In doing so, we follow the notation proposed by Ide et al. (1997), also used in relevant work by Rodgers and Connor (2003) and Migliorini et al. (2008). Section 3 will give an overview of the common features shared by various UV-Vis retrievals with an emphasis on the assumptions made in the retrieval approach that are relevant to modellers and other data users, and it provides a recipe for constructing an appropriate observation operator. Section 4 introduces the TM5 and GEOS-Chem models that we will evaluate to demonstrate the nature and magnitude of representativeness errors. In Sect. 5, we discuss the error budgets associated with a confrontation of CTM simulations with satellite measurements, and, in particular, how the representativeness errors contribute to that budget. Section 6 presents the result of a practical assessment of representativeness errors made when comparing global CTM simulations of tropospheric NO2 to satellite measurements from the Ozone Monitoring Instrument, and provides recommendations on how to minimize these.

use a two-step approach, based on the DOAS technique. In step 1, the reflectance spectra measured by the satellite instruments are modelled with a fitting routine that accounts for the spectral signatures from trace gas absorption, inelastic scattering, and (broadband) Rayleigh, Mie, and surface scattering. For each of the above species, spectral regions are selected where the absorption structures are most distinct, and spectral interference from other species is minimal. The species' slant column density is then calculated from the inferred absorption in combination with knowledge of the species' absorption cross-section. Before converting the slant column densities into tropospheric vertical columns, background corrections may be required to account for the fact that a portion of the slant column has originated from the species' absorption of light in the stratosphere.
In step 2 of the retrieval, the tropospheric slant column densities are converted into vertical column estimates, using a radiative transfer (forward) model and forward model parameters that influence the retrieval. For DOAS UV-Vis retrievals, forward model parameters typically include the sensor viewing geometry, and best estimates of the surface albedo, terrain height, cloud and aerosol properties (or an effective representation thereof), as well as the a priori vertical distribution of the species of interest ($x_a$). The radiative transfer calculations are expressed as so-called air mass factors, defined as the (forward) modelled ratio of slant ($N_S$) and vertical columns ($N_V$), given the set of forward model parameters: $M = N_S / N_V$. Tropospheric air mass factors have been shown to be very sensitive to choices for surface albedo, for cloud correction, and for a priori vertical distribution, and, consequently, air mass factor uncertainties are large, and dominate the retrieval error budget for tropospheric columns (e.g. Boersma et al., 2004; Millet et al., 2006; Lee et al., 2009).
Data users need to be aware of the important role played by clouds in UV-Vis retrievals. With the exception of elevated plumes resulting from volcanoes, lightning, and aircraft, most tropospheric NO2, SO2, and HCHO generally resides in the lower atmosphere, close to their surface sources. Clouds thus typically obscure the absorbing species from (satellite) view, leading retrieval groups to advise against the use of their satellite data when taken under cloudy conditions. Trace gas retrievals under cloudy situations suffer from larger errors (e.g. Schaub et al., 2006), because the detectable fraction corresponds to the column above the cloud, leaving a so-called "ghost column" below the cloud to be added somehow. Because ghost columns are generally taken from climatology or a CTM, they do not contribute to the measured information in any way, so that inclusion of columns under cloudy situations compromises a model-satellite comparison, unless the averaging kernels are taken into account (Schaub et al., 2006). In data assimilation systems, cloudy measurements still provide valuable information on the abundance and vertical information of trace gases above the cloud, for instance for constraints on lightning-produced NO2 (Boersma et al., 2005) and in recent cloud-slicing techniques (Choi et al., 2014; Belmonte-Rivas et al., 2015).
DOAS UV-Vis nadir retrievals are characterized by a vertical sensitivity that generally reduces with increasing atmospheric pressure, and require an a priori vertical profile of the species $x_a$ to interpret the slant column (e.g. Palmer et al., 2001; Richter et al., 2006). Because Rayleigh scattering of sunlight is more effective in the UV, fewer photons reach the lower atmosphere in the spectral range where SO2 has distinct absorption spectral features (300–330 nm), compared to the spectral windows for HCHO (340–360 nm) or NO2 (400–500 nm). This implies that the measurement sensitivity to species in the lower atmosphere is lowest for SO2, followed by HCHO, and highest for NO2. The contribution of the a priori profile to the retrieved column increases with decreasing sensitivity of the measurement. Uncertainty in the species a priori vertical profile thus propagates more strongly for SO2 (up to 22 % error; Lee et al., 2009), and somewhat less for NO2 (10–15 % error, e.g. Hains et al., 2010; Vinken et al., 2014).
This a priori profile error contribution to model-satellite comparisons can be eliminated by application of the averaging kernel to the model output (Eskes and Boersma, 2003; Boersma et al., 2004; Rodgers and Connor, 2003). The averaging kernel for UV-Vis retrievals describes the relationship between the true column and the estimated, or retrieved, column $\hat{y}_o$, where the hat denotes that the retrieval represents an estimated value of the true column:

$$\hat{y}_o = A\, x_{\mathrm{true}} \qquad (1)$$

with $x_{\mathrm{true}}$ the true vertical distribution of the species and $A$ the averaging kernel whose discretized elements can be described as $A_l = m_l / M(x_a)$, with $m_l$ the scattering weights (Palmer et al., 2001), or box air mass factors for layer $l$ (see Eskes and Boersma, 2003, and Boersma et al., 2004, for more detail). Note that the retrieval problem has been linearized around $x_a = 0$, related to the weak absorber character of the species, which implies that the a priori state does not explicitly appear in Eq. (1).
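The role of the averaging kernel can be illustrated with a minimal Python sketch. It assumes that the air mass factor is the a priori-weighted mean of the scattering weights, $M(x_a) = \sum_l m_l x_{a,l} / \sum_l x_{a,l}$ (the formulation of Palmer et al., 2001, without temperature correction). The function names, the four-layer atmosphere, and all numbers are hypothetical and serve only as an illustration.

```python
def air_mass_factor(scattering_weights, apriori):
    """Assumed form of M(x_a): a priori-weighted mean of the scattering weights."""
    weighted = sum(m * x for m, x in zip(scattering_weights, apriori))
    return weighted / sum(apriori)


def averaging_kernel(scattering_weights, apriori):
    """Kernel elements A_l = m_l / M(x_a)."""
    amf = air_mass_factor(scattering_weights, apriori)
    return [m / amf for m in scattering_weights]


def apply_kernel(kernel, profile):
    """Kernel-weighted column: sum over layers of A_l * x_l."""
    return sum(a * x for a, x in zip(kernel, profile))


# Hypothetical 4-layer troposphere, surface layer first.
# Sub-columns are in units of 1e15 molecules cm^-2.
m = [0.4, 0.8, 1.2, 1.6]        # sensitivity decreases towards the surface
x_a = [3.0, 1.5, 0.4, 0.1]      # a priori profile used in the retrieval
x_true = [4.0, 1.0, 0.3, 0.2]   # illustrative "true" profile

A = averaging_kernel(m, x_a)
slant_column = sum(ml * xl for ml, xl in zip(m, x_true))    # N_S
retrieved_column = slant_column / air_mass_factor(m, x_a)   # N_V = N_S / M

print(f"retrieved column: {retrieved_column:.3f}")
print(f"A . x_true:       {apply_kernel(A, x_true):.3f}")
print(f"true column:      {sum(x_true):.3f}")
```

In this sketch the retrieved column equals the kernel-weighted true profile, and it differs from the true column only because the a priori profile shape differs from the true one. Applying the same kernel to a model profile therefore removes the a priori dependence from the comparison.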

Model evaluation with UV-Vis satellite retrievals
A comparison between satellite measurements $\hat{y}_o$ (e.g. the retrieved tropospheric NO2 columns within a model grid cell), and the model state $x_m$ (e.g. the modelled vertical NO2 distribution in the troposphere), in the form of measurement-minus-model departures ($d$) is expressed as

$$d = \hat{y}_o - H x_m \qquad (2)$$

with $H$ the observation operator that describes the relation between the observed data and the modelled state. Apart from the observation errors ($\sigma_o$ in the following) and the modelling errors ($\sigma_m$), we also need to take into account representativeness errors ($\sigma_r$) associated with the fact that model simulations and satellite measurements provide different representations of a geophysical quantity. We generalize the representativeness errors as the errors introduced in a satellite-to-model evaluation by an incorrect description of the relation between the grid cell mean concentrations and the satellite retrieval(s), i.e. we can think of them as errors in the observation operator $H$. In data assimilation, representativeness errors are normally included in the observation errors (e.g. Jones et al., 2003; Miyazaki et al., 2012). Substantial representativeness error may arise when the observation operator $H$ is simplified and the model is not sampled in a manner fully consistent with the satellite observation. We can identify three types of representativeness errors associated with model-satellite comparisons:

1. Spatial representativeness errors. Such errors will arise because models provide a spatially smoothed representation of the atmospheric state, whereas satellite measurements provide "snapshots", and often resolve variability at scales (pixels) smaller than the model grid cell.
2. Temporal representativeness errors. In applications focusing on clear-sky situations such as emission estimates, failure to sample the model for the same clear-sky conditions and overpass time as the satellite measurements will lead to systematic sampling errors.
3. Vertical representativeness errors. Because the sensitivity of the UV-Vis satellite retrievals is altitude-dependent (Palmer et al., 2001), UV-Vis retrievals should be regarded as estimates of the state weighted by the averaging kernel (Eskes and Boersma, 2003). Neglecting the averaging kernel or vertical sensitivity of the retrieval in the comparison will inevitably introduce additional representativeness errors to the comparison in Eq. (2).
To minimize these representativeness errors in comparing CTMs and satellite measurements, we recommend following the recipe given in Sect. 3.2. This recipe on how to compare a CTM with satellite observations is a set of mathematical operations on satellite and model data. This is particularly relevant for short-lived species that have a high spatial and diurnal variability such as NO2, SO2, and HCHO (e.g. Boersma et al., 2008; Vrekoussis et al., 2009; Barkley et al., 2013). Details of the approach may differ (e.g. spatial interpolation of the model state to the location of the pixel, averaging over different model times close to the satellite measurement time, replacing the a priori profile with the model profile in the retrieval), as long as the general principle of consistent sampling is observed. We advise against a comparison of the original satellite column (retrieved with a priori profile $x_a$) to the model column $x_m$ because in that case differences between the a priori and modelled vertical profiles would inflate the overall error $d$; see Sect. 6 and recommendations in Sect. 2.3 of Boersma et al. (2004), and Duncan et al. (2014).

Sources of errors in evaluating CTMs with UV-Vis retrievals
A comparison between model simulations and satellite retrievals begins with a comparison of their theoretical capabilities. A model-satellite comparison will be influenced by:

1. modelling errors $\sigma_m$, related to an incomplete knowledge and description of the atmospheric state $x_m$,
2. retrieval errors $\sigma_o$, because of instrument noise and uncertainty in the (external) forward model parameters, and
3. representativeness errors $\sigma_r$, arising from fundamental differences between the atmospheric sampling by models and satellites, i.e. errors in the observation operator $H$.
Assuming that these error terms are independent, the error analysis for a satellite-model column difference $\hat{y}_o - H x_m$ can be written as:

$$\sigma_d^2 = \sigma_o^2 + \sigma_m^2 + \sigma_r^2 \qquad (3)$$

with $\sigma_d^2$ the error variance of the difference, $\sigma_o^2$ the best estimate for the (relative) column retrieval errors, $\sigma_m^2$ for the (relative) modelling error, and $\sigma_r^2$ the contribution to the error arising from the act of carrying out the comparison itself (i.e. from errors in the observation operator). Some studies (e.g. Jones et al., 2003) include representativeness errors in the observation errors. Below we will show that representativeness errors may contribute substantially to the overall error in satellite-model confrontations.
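As a small worked illustration of this error budget, the sketch below adds three independent relative error terms in quadrature and reports the share of the total variance that each term contributes. The function names and the numerical values are hypothetical and are not results from this study.

```python
import math


def total_error(sigma_o, sigma_m, sigma_r):
    """Quadrature sum of independent retrieval, model, and representativeness errors."""
    return math.sqrt(sigma_o ** 2 + sigma_m ** 2 + sigma_r ** 2)


def variance_fractions(sigma_o, sigma_m, sigma_r):
    """Fraction of the total error variance contributed by each term."""
    total_var = sigma_o ** 2 + sigma_m ** 2 + sigma_r ** 2
    return {
        "retrieval": sigma_o ** 2 / total_var,
        "model": sigma_m ** 2 / total_var,
        "representativeness": sigma_r ** 2 / total_var,
    }


# Hypothetical relative errors (fractions of the column).
sigma_o, sigma_m, sigma_r = 0.25, 0.25, 0.15

print(f"total relative error: {total_error(sigma_o, sigma_m, sigma_r):.3f}")
for name, share in variance_fractions(sigma_o, sigma_m, sigma_r).items():
    print(f"{name}: {share:.0%} of the variance")
```

Because the terms add as variances, a representativeness error that looks modest next to the retrieval error can still account for a noticeable part of the total.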
The retrieval, modelling, and representativeness errors will all have systematic and random components. In principle, one would like to distinguish between the random and systematic contributions, but in practice this is very complicated, because many systematic contributions to retrieval and model errors are only weakly correlated in space and time. Examples of subtle systematic retrieval effects are errors in individual albedo values with a small spatial correlation length but with 100 % correlation in time (for instance because residual cloud effects in the albedo climatology are strongly variable from one location to the other; Kleipool et al., 2008). When averaged over a larger region such as the spatial extent of a coarse model grid cell, the impact of such errors tends to reduce. Likewise, models will suffer from systematic errors in, for instance, the description of vertical transport. In particular circumstances, such as strong, small-scale convective activity, such errors tend to be acute, but in an average sense, such as comparisons aggregated over a month and a region, we may expect these errors to be smaller.

Recipe for minimizing representativeness errors
(1) The first step in comparing satellite observations to model simulations is to ensure that the satellite measurements are spatially representative for the area of the model grid cell. This is achieved by calculating the weighted average of all individual retrievals $\hat{y}_{o,i}$ within the superobservation model grid cell over the entire area covered by all (valid) retrievals, where the weight is given by the pixel area $w_i$ (in km²):

$$\hat{y}_o = \frac{\sum_i w_i\, \hat{y}_{o,i}}{\sum_i w_i} \qquad (4)$$

If the model grid cell happens to be smaller than the satellite pixel, Eq. (4) will reduce to $\hat{y}_o = \hat{y}_{o,1}$ for grid cells that are completely overlapped by a single satellite pixel ($w_1 = 1$).
(2) The second step is to sample the CTM field sequence $x_m[t]$, here expressed as a discrete series of periodic fields with $t$ an integer, when model time $t$ is closest to the satellite overpass time $t_o$:

$$x_m = x_m[t^*], \quad t^* = \arg\min_t |t - t_o| \qquad (5)$$

The model sequence is sometimes also sampled with somewhat looser criteria, by requiring that the absolute model-satellite time difference stays within 1–2 h (e.g. Martin et al., 2003).
(3) The third step is to apply the averaging kernel on the model vertical distribution $x_m$ to obtain the model estimate $\hat{y}_m$ that can be directly compared to the observed state $\hat{y}_o$:

$$\hat{y}_m = \sum_l A_l\, S_l\, x_{m,l} \qquad (6)$$

where $S_l$ are the components at the $l$th vertical layer of an operator that executes a mass-conserving vertical interpolation or integration followed by a conversion to sub-columns (molecules cm⁻²) in the case that the model vertical distribution $x_{m,l}$ is not yet given in those units. The product of the mathematical expressions (5) and (6) forms the observation operator $H$ in Eq. (2), which describes the relation between the superobservation and the modelled state.
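The three steps of the recipe can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the implementation used in this study: a single representative averaging kernel is used for the grid cell, the model profile is assumed to be already expressed as sub-columns on the retrieval layers (so the operator $S$ has been applied), and all function names and numbers are hypothetical.

```python
def superobservation(columns, weights):
    """Step 1: area-weighted mean of the valid retrievals in one grid cell."""
    return sum(w * y for w, y in zip(weights, columns)) / sum(weights)


def sample_nearest(model_times, model_fields, t_overpass, max_diff=None):
    """Step 2: pick the model field closest in time to the satellite overpass."""
    idx = min(range(len(model_times)),
              key=lambda i: abs(model_times[i] - t_overpass))
    if max_diff is not None and abs(model_times[idx] - t_overpass) > max_diff:
        raise ValueError("no model field within the allowed time difference")
    return model_fields[idx]


def apply_averaging_kernel(kernel, subcolumns):
    """Step 3: kernel-weighted model column, sum over layers of A_l * x_l."""
    return sum(a * x for a, x in zip(kernel, subcolumns))


# Hypothetical pixel columns (1e15 molecules cm^-2) and fractional coverages.
pixel_columns = [6.0, 4.0, 2.0]
pixel_weights = [0.5, 0.3, 0.2]

# Hypothetical hourly model profiles (sub-columns per layer, surface first).
model_times = [12.0, 13.0, 14.0]          # local time in hours
model_fields = [[2.4, 1.1, 0.5],
                [2.0, 1.0, 0.5],
                [1.8, 0.9, 0.5]]
overpass_time = 13.4

kernel = [0.5, 1.0, 1.5]                  # one representative kernel per cell

y_obs = superobservation(pixel_columns, pixel_weights)
profile = sample_nearest(model_times, model_fields, overpass_time, max_diff=1.0)
y_mod = apply_averaging_kernel(kernel, profile)
d = y_obs - y_mod                         # measurement-minus-model departure

print(f"superobservation: {y_obs:.2f}, model estimate: {y_mod:.2f}, d: {d:.2f}")
```

Keeping the three steps as separate functions makes it easy to switch one of them off and see how much the departure changes when, for example, the kernel is replaced by a simple vertical sum.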

Representativeness errors in evaluating CTMs with UV-Vis retrievals
The total representativeness error $\sigma_r$ is composed of horizontal representativeness errors, (temporal model) sampling errors, and vertical smoothing errors, and these three contributions may be assumed to be largely uncorrelated:

$$\sigma_r^2 = \sigma_{r,h}^2 + \sigma_{r,t}^2 + \sigma_{r,v}^2 \qquad (7)$$

with $\sigma_{r,h}$, $\sigma_{r,t}$, and $\sigma_{r,v}$ the horizontal, temporal, and vertical contributions, respectively. For an appropriate comparison between model simulations and satellite retrievals, it is important to sample the CTM as closely as possible to the satellite's sampling of the atmosphere (see Sect. 3.2). These may seem like trivial conditions for comparison, yet one or more of these conditions are often violated.

Data used in this study

Satellite data
In this study, we use tropospheric NO2 retrievals from the Dutch OMI NO2 (DOMINO) algorithm v2.0 (Boersma et al., 2011). These retrievals proceed along the lines discussed above, with spectral fitting of NO2 in the 405–465 nm window (van Geffen et al., 2015), data assimilation of the NO2 slant columns in the TM4 chemistry transport model (Williams et al., 2009) to estimate the stratospheric background (Dirksen et al., 2011), and final conversion of the tropospheric slant columns with air mass factors based on radiative transfer calculations with the DAK model. In the DOMINO algorithm, altitude-dependent air mass factors (AMFs) are interpolated from pre-calculated look-up tables using the best available information on the satellite viewing geometry, surface albedo (Kleipool et al., 2008), and terrain height (3 km resolution elevation data provided with Aura data). Subsequently, the local altitude-dependent AMFs are combined with the predicted local vertical NO2 distributions (from TM4), to produce the (tropospheric) AMFs. The AMF step also includes a correction for the temperature dependency of the NO2 absorption cross-section (Boersma et al., 2004), because only the 220 K cross-section is used in the spectral fit. The DOMINO v2.0 data have been evaluated in a number of validation exercises (e.g. Irie et al., 2012; Ma et al., 2013; Lin et al., 2014), showing their quality and use, although a number of relevant improvements are planned and currently being implemented (Maasakkers, 2013; van Geffen et al., 2015). DOMINO v2.0 has been used in many applications and model studies (e.g. Stavrakou et al., 2013; Castellanos et al., 2014; McLinden et al., 2014; Verstraeten et al., 2015), which makes the data product well suited for evaluating satellite-to-model comparisons and the errors associated with such comparisons, which is the purpose of this study.
CTMs are the central tools to simulate tropospheric concentrations of NO2, SO2, and HCHO, and to help interpret and use satellite measurements of these species. For the short-lived species studied here, previous studies indicate modelling biases of ±20–30 % for NO2 (e.g. van Noije et al., 2006), and 20–50 % for HCHO (e.g. Dufour et al., 2009; Williams et al., 2012) over regions with substantial pollution.

TM5
We use TM5, a global 3-D CTM, version 3.0 (Huijnen et al., 2010b), with a grid of 3° longitude × 2° latitude × 34 vertical layers, and a model top at 0.1 hPa (Krol et al., 2005). The TM5 model is used in many studies for atmospheric chemistry (e.g. Williams et al., 2014), aerosol haze (e.g. von Hardenberg et al., 2012), data assimilation, and inversion applications (e.g. Hooghiemstra et al., 2012; Krol et al., 2013). The model is driven by ERA-Interim meteorological reanalysis data from the European Centre for Medium-Range Weather Forecasts (ECMWF) (Dee et al., 2011) and the base time step is 1 h. In the version used here, TM5 operates with Carbon Bond Mechanism 4 chemistry (Gery et al., 1989) to describe the production of ozone, hydrogen oxide radicals (HOx = OH + HO2) and oxidation of nitrogen oxides (NOx = NO + NO2), SO2, and volatile organic compounds (VOCs), with 40 species, 64 gas-phase, and 16 photolysis reactions. In TM5, SO2 is oxidized in clouds and on aerosols, and nighttime hydrolysis of N2O5 into nitric acid (HNO3) is parametrized with a global mean uptake coefficient of 0.02 following recommendations by Evans and Jacob (2005). NOx emissions are from the RETRO inventory for the anthropogenic sectors (Regional Emission inventory in ASia, REAS, for Asia) with a total of 33 Tg N yr⁻¹, 9 Tg N yr⁻¹ from soil, 5 Tg N yr⁻¹ from biomass burning (from the Global Fire Emissions Database v2, GFED2; van der Werf et al., 2006), and 6 Tg N yr⁻¹ for lightning. Global anthropogenic SO2 emissions are taken from the AeroCom project at 108 Tg SO2 yr⁻¹ (Dentener et al., 2006). Biogenic VOC emissions, including the important HCHO and its precursor isoprene, are from the ORCHIDEE database (Lathière et al., 2006), and are 10 Tg C yr⁻¹ for HCHO and 565 Tg C5H8 yr⁻¹ for isoprene. We simulated the year 2006 with a 1-year spin-up.
TM5 simulations of NO2 and HCHO have been evaluated by Huijnen et al. (2010b) and Williams et al. (2012). These studies indicate that tropospheric NO2 columns in TM5 are 20–30 % low compared to DOMINO v2.0 columns, but the model captures the seasonality, and shows realistic vertical distributions of NO2 relative to INTEX-B aircraft measurements. TM5 captures the seasonality of HCHO tropospheric columns but also overestimates these columns by 0–50 %, partly because of inadequate photolysis rates in the model (Williams et al., 2012).

GEOS-Chem
We also use the GEOS-Chem model, v9-02i, with a grid of 2.5° longitude × 2° latitude × 47 vertical layers, and the model top at 80 km. The GEOS-Chem model is a CTM in use by a large community of scientists for a wide range of applications, including shipping NOx plume-in-grid chemistry (Vinken et al., 2011), and estimating isoprene and ammonia emissions (e.g. Millet et al., 2008; Paulot et al., 2014). GEOS-Chem is driven by GEOS-5 meteorological fields from NASA GMAO, with a time step of 30 min. Like TM5, GEOS-Chem uses a condensed O3-NOx-HOx-VOC-aerosol chemistry scheme (described in Mao et al., 2010, and references therein). The standard chemistry scheme has 66 species and 236 chemical reactions. GEOS-Chem takes into account heterogeneous chemistry on aerosol and cloud particles (Mao et al., 2010), including the uptake of N2O5 on aerosols leading to nighttime HNO3 formation following the parametrization by Evans and Jacob (2005). Anthropogenic NOx emissions are from the global EDGAR 3.2FT2000 inventory (Olivier and Berdowski, 2001), but these are replaced by regional inventories over various continents. Other NOx emission sources in GEOS-Chem include soil, lightning, biomass burning, biofuel, aircraft, and ship, resulting in a global total source of 51.5 Tg N yr⁻¹ for 2006 (similar to TM5 with 53 Tg N yr⁻¹ for the same year). A 2-year spin-up was performed (2004–2005), and GEOS-Chem output was stored for the year 2006. For more details on the GEOS-Chem simulation, see Vinken et al. (2014).
GEOS-Chem simulations of tropospheric NO2 columns have been evaluated before by Lamsal et al. (2010) and Lin (2012), who found, similar to the TM5 evaluation discussed above, that the model underestimates tropospheric NO2 by 20–35 % (over China). Zhang et al. (2012), in a study targeting nitrogen deposition over the United States, found excellent agreement between the modelled and OMI-observed spatial distribution of tropospheric NO2, but underestimates of 10 % in the northeastern US, and 40 % locally in southern California, were also evident.

Horizontal representativeness errors
If the complete spatial extent of a model grid cell is covered with valid retrievals, a good comparison is straightforward because a spatially fully representative area average can be calculated. For partly covered cases, the difficulty lies in estimating the magnitude of the (horizontal representativeness) errors associated with limited coverage of a model grid cell. One way to calculate a representative grid cell average is by averaging all valid satellite observations that were taken within the boundaries of the grid cell within a given model time step, as in Eq. (4), with $w_i$ the fractional grid cell coverage defined as $A_{\mathrm{pixel}}/A_{\mathrm{cell}}$, with $A_{\mathrm{pixel}}$ the area (in km²) covered by the fraction of the satellite pixel that falls within the boundaries of the model grid cell with area $A_{\mathrm{cell}}$ (in km²). In this manner, one obtains a "superobservation" that may be considered as representative for the grid cell average (Dirksen et al., 2011; Miyazaki et al., 2012). In some model-satellite confrontations, the number of satellite retrievals is thinned out to one per grid cell, but we advise against such an approach in view of the strong sub-grid variations and the considerable errors in individual measurements. In many global applications, the spatial resolution of the model is coarser than the resolution of the satellite observations. We caution against applying additional weighting by the individual retrieval errors in Eq. (4). Because, by nature of the DOAS approach, retrieval errors are largest for large column values (see e.g. Boersma et al., 2004), error weighting would skew the average to the lower values in the distribution. The measurement error for superobservations can be calculated from area-weighting the individual pixel errors $\sigma_{o,i}$ to provide an area-weighted average (statistical) retrieval error $\bar{\sigma}$, and by accounting for a partial correlation in the errors between pixels as in Eskes et al. (2003) (see Appendix B for a derivation):

$$\sigma_o = \bar{\sigma}\,\sqrt{\frac{1 - c}{n} + c} \qquad (8)$$

with the second term on the right-hand side representing the error correlation ($c$) between the $n$ retrievals. Miyazaki et al. (2012) propose $c = 0.15$, based on the consideration that errors in clouds, albedo, a priori profile, and aerosol in retrievals are typically correlated in space, but they acknowledge that the exact number is difficult to estimate. Some studies take a different approach than the superobservations proposed in Eq. (4) and interpolate the model simulations to the centre of a satellite pixel, but the difficulty with this approach is the questionable spatial representativeness of the interpolated model value, especially if the model grid cells cover a larger area than the satellite pixels.
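The effect of the error correlation on the superobservation error can be illustrated with a short sketch. It assumes the form $\sigma_o = \bar{\sigma}\sqrt{(1 - c)/n + c}$, with $\bar{\sigma}$ the area-weighted mean of the pixel errors; this form, the function name, and the example numbers are assumptions made for illustration and should be checked against Appendix B before use.

```python
import math


def superobservation_error(pixel_errors, weights, c=0.15):
    """Assumed form: area-weighted mean error times sqrt((1 - c) / n + c)."""
    n = len(pixel_errors)
    mean_error = sum(w * s for w, s in zip(weights, pixel_errors)) / sum(weights)
    return mean_error * math.sqrt((1.0 - c) / n + c)


# Hypothetical pixel errors (1e15 molecules cm^-2) with equal area weights.
errors = [1.0, 1.2, 0.8, 1.0]
weights = [1.0, 1.0, 1.0, 1.0]

for c in (0.0, 0.15, 1.0):
    sigma = superobservation_error(errors, weights, c)
    print(f"c = {c:.2f}: superobservation error = {sigma:.3f}")
```

With uncorrelated errors the superobservation error falls off as one over the square root of $n$, whereas fully correlated errors do not average out at all. A small positive $c$ therefore sets a floor on how far averaging many pixels can reduce the error.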
Both individual pixel errors and representativeness errors contribute to the total error in the superobservation. Following Miyazaki et al. (2012), we calculate the horizontal representativeness error $\sigma_r$ as a function of the total fractional coverage achieved by all valid pixels by random reduction of the number of retrievals used to calculate the mean grid cell value. For homogeneous scenes with little variability of NO2, SO2, or HCHO, such errors will obviously be small. But for grid cells covering strong inhomogeneous sources of air pollution, such as megacities or coal plants, we may expect the area average to depend strongly on the spatial sampling. Figure 1 illustrates the horizontal representativeness error as a function of total fractional coverage for one polluted model grid cell, here taken over the eastern United States (greater New York City), at two resolutions, i.e. 3° × 2° (typical for a global CTM) and 0.5° × 0.5° (regional CTM). To calculate the horizontal representativeness error, we randomly reduced the number of pixels n in Eq. (4) first by 1, then by 2, and so on, until there was only one pixel left, to obtain new estimates $\hat{y}_o$. We repeated this 100 times and interpret the root mean squared difference with the original $\hat{y}_o$ as the horizontal representativeness error, which is zero in situations of full coverage. Complete coverage of the grid cell is typically achieved by more than 100 OMI pixels in the case of 3° × 2° resolution grid cells, and by approximately 5 pixels for 0.5° × 0.5°. The horizontal representativeness errors appear higher for the 0.5° × 0.5° than for the 3° × 2° grid cell, due to the smaller sample size (n = 5) and the strong spatial gradients over the central New York area for the higher resolution model. For models with higher spatial resolution (0.5° × 0.5°), there is less tolerance for reduced area coverage over strongly inhomogeneous areas such as central New York, as indicated by the steeper representativeness error increase with reduced cover (blue dashed line in Fig. 1). This reflects the more heterogeneous distribution of polluted NO2 column values for the high-resolution model with a small sample (five pixels) than for the coarse resolution with a large sample (> 100 pixels). The 3° × 2° case with complete area coverage by OMI NO2 pixels (on 17 July 2006) illustrates the potential for horizontal representativeness errors. For a fractional coverage of 0.5, the horizontal representativeness error increases to 10–15 %, which is still considerably smaller than the 20–30 % errors in the satellite measurements themselves. For fractional coverage of 0.1 however, the representativeness error increases to 35 %, a level that exceeds the theoretical NO2 retrieval error (Boersma et al., 2011) and NO2 validation errors (e.g. Irie et al., 2012). However, by averaging over multiple days, the representativeness error can be reduced further, depending on the day-to-day variability of the columns. Table S1 (in the Supplement) shows the statistics of a comparison between monthly mean observed and simulated columns over the greater eastern United States in July 2006, for different degrees of fractional coverage required.
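The random-reduction procedure described above could be sketched as follows. This is an illustrative implementation under our own assumptions, not the code behind Fig. 1: the helper names are hypothetical, pixels are drawn without replacement, and the error is expressed relative to the full-coverage mean.

```python
import math
import random


def weighted_mean(columns, weights):
    """Area-weighted mean column of a set of pixels."""
    return sum(w * y for w, y in zip(weights, columns)) / sum(weights)


def representativeness_error(columns, weights, n_keep, n_trials=100, seed=0):
    """Relative RMS difference between subsampled and full-coverage means.

    Hypothetical helper: `n_keep` pixels are drawn at random without
    replacement, `n_trials` times (100 in the text).
    """
    rng = random.Random(seed)
    full_mean = weighted_mean(columns, weights)
    squared_diffs = []
    for _ in range(n_trials):
        keep = rng.sample(range(len(columns)), n_keep)
        sub_mean = weighted_mean([columns[i] for i in keep],
                                 [weights[i] for i in keep])
        squared_diffs.append((sub_mean - full_mean) ** 2)
    return math.sqrt(sum(squared_diffs) / n_trials) / full_mean
```

If the weights are fractional coverages, the coverage associated with each subsample is simply the sum of the retained weights, so that looping `n_keep` from the full sample size down to 1 traces out an error-versus-coverage curve of the kind shown in Fig. 1.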
In data assimilation systems, any fractional coverage may be used as long as the horizontal representativeness error is well described and accounted for along with the observation error. This can be achieved by adding in quadrature the measurement error and representativeness error, $\sqrt{\sigma_{N,o}^2 + \sigma_{N,r}^2}$, to represent the overall superobservation error.

Temporal representativeness errors related to clouds
In the case where UV–Vis satellite retrievals of the tropospheric column are used for air pollution applications (taken under cloud-free situations, see e.g. Schaub et al., 2006; Millet et al., 2006; Geddes et al., 2012), both measurements and models should be sampled under similar clear-sky situations. As long as the model appropriately simulates the effects of clouds on photolysis rates, this ensures that measurement and model represent the trace gas concentrations under similar photochemical regimes. Failure to sample the model on clear-sky days only will introduce a bias in the modelled average. Short-lived trace gases may have a longer lifetime against photochemical loss in situations with overhead clouds (assuming they are represented well in models), when actinic fluxes and temperatures are lower and chemistry slower than in clear-sky situations. For trace gases whose emissions reflect distinct anthropogenic patterns, it is also necessary to sample the model according to the observations, in order to properly weigh well-documented weekend (e.g. Beirle et al., 2003; Boersma et al., 2009) and national holiday reductions (Lin and McElroy, 2011) when calculating the model average. We first evaluate the TM5 model's ability to simulate the effective cloud cover as observed by OMI at 13:30 local time. Cloud cover (and cloud optical thickness) data in TM5 are hourly interpolated from 3-hourly pre-processed ECMWF fields (Huijnen et al., 2010b). Since the OMI cloud retrieval reports effective cloud fractions, based on the assumption that clouds are optically thick (optical thickness of 40, with a corresponding cloud albedo of 0.8) (Acarreta et al., 2004; Stammes et al., 2008), we converted the TM5 geometrical cloud cover into an effective fraction comparable to the OMI observations. To do so, we used the maximum-random overlap assumption (Morcrette and Jakob, 2000) to compute the total geometrical cloud cover and total cloud optical thickness from the vertically resolved cloud cover and optical thickness in TM5. We used the modelled relationship between the total cloud optical thickness for a liquid water cloud and its spherical cloud albedo in Buriez et al. (2005) to calculate the effective cloud albedo associated with each grid cell's cloud cover. Finally, we weighted the total geometric cloud cover with the ratio of the effective cloud albedo to 0.8, the value assumed for all clouds in the OMI retrieval (Acarreta et al., 2004; Stammes et al., 2008). For more details we refer to Appendix C.
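The conversion described above could be sketched as follows. This is a simplified illustration with hypothetical function names, not the TM5 code. It assumes that contiguous cloudy layers overlap maximally and that blocks separated by at least one clear layer overlap randomly, as described in Appendix C. The cloud albedo is taken as an input, because the optical thickness to albedo relationship of Buriez et al. (2005) is not reproduced here.

```python
def total_cloud_cover(layer_cover, clear_threshold=0.001):
    """Column cloud cover under a maximum-random overlap assumption.

    Adjacent cloudy layers form a block whose cover is the block maximum;
    blocks separated by a clear layer are combined randomly.
    """
    clear_fraction = 1.0
    block_max = 0.0
    for cover in layer_cover:
        if cover > clear_threshold:
            block_max = max(block_max, cover)
        else:
            # A clear layer closes the current block of cloudy layers.
            clear_fraction *= 1.0 - block_max
            block_max = 0.0
    clear_fraction *= 1.0 - block_max
    return 1.0 - clear_fraction


def effective_cloud_fraction(geometric_cover, cloud_albedo, reference_albedo=0.8):
    """Scale geometric cover by the ratio of cloud albedo to the OMI value 0.8."""
    return geometric_cover * cloud_albedo / reference_albedo
```

For example, layers with cover 0.3 and 0.5 directly on top of each other, plus a separate layer with cover 0.2, give a column cover of 1 − (1 − 0.5)(1 − 0.2) = 0.6. With a cloud albedo of 0.4 this corresponds to an effective cloud fraction of 0.3.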
Figure 2 shows monthly mean effective cloud fractions as retrieved from OMI and simulated with TM5 for February and August 2006. The model was sampled within 30 min of the OMI overpass time of 13:30, and model and satellite were matched in space and time for further analysis. We see that TM5 captures the spatial patterns observed by OMI, with low cloud fractions over the subtropics, and high cloud fractions over the tropical ITCZ and the middle-to-high latitudes (> 40°). Largest differences occur at the edges of areas flagged as snow-covered in the OMI retrieval (February 2006), and over areas where TM5 predicts cloud optical thickness to exceed 40, such as over the tropics, where ice clouds often occur (and the relationship for water clouds from Buriez et al., 2005, is less valid).
To evaluate the simulated effective cloud fractions, we report the correlation coefficient, mean bias, and root mean square error relative to the OMI-observed cloud fractions over Europe for February and August 2006. Figure 2 shows significant positive correlation between TM5 and OMI effective cloud fractions over Europe both in February (r = 0.70, n = 3379) and August (r = 0.75, n = 4665). The mean bias between TM5 and OMI is −0.08 in February and +0.02 in August, and the root mean square error is 0.23 in February and 0.20 in August. The agreement between TM5 and OMI, while far from perfect, suggests that TM5 has some success in simulating the contrast between "cloud-free" ($f_\mathrm{OMI}$ < 0.2) and "cloudy sky" ($f_\mathrm{OMI}$ > 0.2) situations, i.e. the likelihood that OMI reports a clear-sky scene while TM5 simulates a cloudy sky, and vice versa, is < 20 and < 14 %, respectively.
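The statistics reported above could be computed from matched daily pairs along the following lines. This is a sketch with hypothetical names. The mean bias is taken as TM5 minus OMI, and the two mismatch rates are defined here as fractions of the OMI clear-sky and OMI cloudy samples, respectively, which is our assumption about how the likelihoods are normalized.

```python
import math


def cloud_fraction_statistics(omi, tm5, clear_threshold=0.2):
    """Pearson r, mean bias (TM5 - OMI), RMSE and clear/cloudy mismatch rates."""
    n = len(omi)
    mean_omi = sum(omi) / n
    mean_tm5 = sum(tm5) / n
    cov = sum((o - mean_omi) * (t - mean_tm5) for o, t in zip(omi, tm5))
    var_omi = sum((o - mean_omi) ** 2 for o in omi)
    var_tm5 = sum((t - mean_tm5) ** 2 for t in tm5)
    rmse = math.sqrt(sum((t - o) ** 2 for o, t in zip(omi, tm5)) / n)

    # TM5 values on days that OMI classifies as clear or cloudy.
    tm5_when_clear = [t for o, t in zip(omi, tm5) if o < clear_threshold]
    tm5_when_cloudy = [t for o, t in zip(omi, tm5) if o >= clear_threshold]

    def fraction(flags):
        flags = list(flags)
        return sum(flags) / len(flags) if flags else float("nan")

    return {
        "r": cov / math.sqrt(var_omi * var_tm5),
        "bias": mean_tm5 - mean_omi,
        "rmse": rmse,
        "omi_clear_tm5_cloudy": fraction(t >= clear_threshold for t in tm5_when_clear),
        "omi_cloudy_tm5_clear": fraction(t < clear_threshold for t in tm5_when_cloudy),
    }
```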
Figure 3 shows a box and whisker plot for OMI and TM5 effective cloud fractions over Europe in February and August 2006. The figure indicates that for OMI measurements of effective cloud fractions smaller than 0.2, TM5 reproduces similar small effective cloud fractions (February median OMI: 0.09, TM5: 0.06; August median OMI: 0.05, TM5: 0.04). For days and locations when OMI observes effective cloud fractions larger than 0.2 (February: 0.59, August: 0.47), TM5 simulates comparable high effective cloud fractions (February: 0.49, August: 0.45), providing some confidence in the ability of the TM5 model, driven by ECMWF meteorological fields, to capture the observed effective cloud fractions.
Figure 4a shows a comparison of average TM5 tropospheric NO2 columns simulated under clear-sky and cloudy situations over Europe in February and August 2006. TM5 was sampled for polluted situations (cells with monthly mean NO2 columns in excess of $1.0 \times 10^{15}$ molecules cm$^{-2}$) between 12:00 and 15:00 local time, on days with clear skies and on days with cloud cover. Under clear-sky situations, TM5 simulates tropospheric NO2 columns that are on average 15–20 % lower than under cloudy circumstances, in line with in situ observations reported by Boersma et al. (2009) and Geddes et al. (2012) over Israeli and Canadian cities, respectively. In both February and August, the clear-sky mean NO2 column is 12 % below the all-day monthly mean (28 days in February, 31 days in August). Although we cannot rule out that effects other than enhanced photochemical loss may have contributed to lower NO2 columns over the polluted grid cells on clear-sky days (e.g. increased ventilation or deposition), a comparison of NO2 columns for all European grid cells showed that the geometric mean of the local clear-sky to cloudy column ratios was 0.74 in February and 0.89 in August, suggesting that the reduced clear-sky NO2 columns presented in Fig. 4 are a robust effect.
The results for August 2006 indicate that clear-sky sampling of the model is also relevant for HCHO in the growing season (Fig. 4b). Average HCHO columns are 12 % higher under clear-sky situations than on cloudy days and the clear-sky mean HCHO column is 8 % higher than the all-sky monthly mean (August 2006). In winter, HCHO concentrations are generally low over Europe and differences between clear and cloudy sky are well below the detection limit of UV–Vis satellite sensors.
Exclusive sampling of the model on clear-sky days is important, because photolysis rates J[NO2] in the lower troposphere are significantly higher on those days and can be simulated well by TM5 (Williams et al., 2012), so that NO2 columns will be systematically lower. The differences between HCHO columns sampled on clear-sky and cloudy days are somewhat smaller than for NO2 columns because both the formation and destruction of HCHO are driven by photochemistry. Nevertheless, the stronger summertime production of HCHO from the (OH-driven) oxidation of methane and especially isoprene outpaces the increased loss of HCHO through photolysis and oxidation (Fried et al., 1997) on clear-sky days compared to cloudy days, in line with observations (e.g. Munger et al., 1995; Cerquiera et al., 2003).
To estimate the magnitude of the temporal representativeness errors arising from the particular choice of model sampling, we evaluated the satellite–model comparison results for different sampling strategies. Again, we use the averaged ratio of satellite measurements to model simulations ($\hat{y}_o/x_m$), and the spatio-temporal correlation coefficient, as appropriate indicators of representativeness errors. Since the model–measurement bias may well be due to unrelated systematic errors in either the CTM (emissions, chemistry) or the satellite retrievals, we are not concerned with the absolute value of the measurement-to-model ratio, but we are interested in the sensitivity of the ratio to various sampling strategies. We tested four strategies for comparing tropospheric NO2 over large polluted regions: (A) both OMI (for OMI effective cloud fraction) and TM5 (TM5 effective cloud fraction) collocated and sampled for mostly clear-sky scenes only at the OMI overpass time of 13:30, (B) OMI and TM5 collocated and co-sampled for situations with OMI effective cloud radiance fractions < 0.5 (see footnote 2), (C) OMI sampled for situations with OMI effective cloud radiance fractions < 0.5, but TM5 more loosely sampled for OMI effective cloud fractions < 0.6, and (D) OMI sampled for situations with OMI effective cloud radiance fractions < 0.5, but TM5 sampled for all days in the month (i.e. no temporal collocation except for appropriate overpass time). Strategy (A) is considered to be optimal, but to our knowledge has not been applied in studies to date. Strategy (B) has been followed in numerous studies, and relies on the assumption that CTMs capture the observed cloud cover well. In spite of its erroneous co-sampling with the satellite measurements, strategy (D) has also been used frequently, and therefore we tested its impact on the temporal representativeness errors. Finally, strategy (C) holds middle ground between (B) and (D). Figure 5 shows that the measurement-to-model ratio depends substantially on the comparison strategy, especially in winter. The differences between strategies (A) and (B) are negligible, but with strategy (D) the OMI / TM5 ratio drops more than 25 % below the values obtained by strategies (A) and (B). These results also demonstrate that strategy (D) leads to a reduced capacity of the model to explain the observed variability in the NO2 spatial patterns, with $R^2$ dropping by almost 0.1 (from 0.64 to 0.55 in winter and from 0.66 to 0.59 in summer). Analyses for other regions showed similar results as in Fig. 5. These results imply that for applications of satellite data such as emission estimates or model evaluations, substantial systematic errors may occur in the final estimate if sampling strategies such as (D) are used. We therefore strongly discourage the use of such comparison strategies, as they lead to considerable temporal representativeness errors, and, thus, systematic underestimations in measurement : model ratios.
Footnote 2: The cloud radiance fraction is defined as the relative contribution of top-of-atmosphere radiance received from the cloud part of the pixel. A cloud radiance fraction of 0.5 corresponds to a geometric cloud fraction of approximately 0.2.
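The difference between co-sampled and non-co-sampled comparisons can be made concrete with a small sketch of strategies (A), (B), and (D). Everything here is illustrative: the record layout, the function name, and the TM5 clear-sky threshold of 0.2 are assumptions, and the ratio is computed as a ratio of means rather than a mean of local ratios. Strategy (C) is omitted because it needs a second OMI cloud quantity.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def omi_to_model_ratio(days, strategy, crf_max=0.5, model_cf_max=0.2):
    """OMI / TM5 ratio for one grid cell under sampling strategy A, B or D.

    `days` is a list of dicts with hypothetical keys: 'omi' and 'tm5'
    (columns), 'omi_crf' (OMI cloud radiance fraction) and 'tm5_cf'
    (TM5 effective cloud fraction).
    """
    omi_clear = [d for d in days if d["omi_crf"] < crf_max]
    if strategy == "A":    # both OMI and TM5 report a mostly clear scene
        omi_days = model_days = [d for d in omi_clear if d["tm5_cf"] < model_cf_max]
    elif strategy == "B":  # TM5 co-sampled with the OMI cloud filter
        omi_days = model_days = omi_clear
    elif strategy == "D":  # TM5 sampled on all days of the month
        omi_days, model_days = omi_clear, days
    else:
        raise ValueError("strategy must be 'A', 'B' or 'D'")
    return mean(d["omi"] for d in omi_days) / mean(d["tm5"] for d in model_days)
```

In a toy month where the model column is higher on the cloudy day, strategy (D) lowers the OMI / TM5 ratio relative to (A) and (B), which is the direction of the effect reported in the text.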

Vertical representativeness errors
Here we evaluate the representativeness errors introduced in a satellite–model comparison if the averaging kernel is not accounted for. To illustrate the way the kernels work, Fig. 6 shows GEOS-Chem NO2 vertical profiles with and without the averaging kernel applied over the Beijing grid cell on clear-sky days with excellent spatial coverage (18 February and 23 August 2006). On both days, application of the kernel leads to a higher value for the model column, reflecting the relatively larger amounts of NO2 aloft in GEOS-Chem simulations compared to the a priori TM4 NO2 profiles. The lower panels show that on two other clear-sky days (17 February and 31 August 2006) the kernel has only little effect on the GEOS-Chem tropospheric NO2 column. On these days, the TM4 a priori and GEOS-Chem NO2 profiles show similar, less pronounced vertical distributions. Nevertheless, in Fig. 7 we see that, on average, for February and August 2006, the OMI averaging kernels result in increases in GEOS-Chem NO2 columns over Beijing of 15 % (February) and 8 % (August), and a closer agreement with OMI NO2 retrievals. This result can be understood from the stronger vertical mixing in the GEOS-Chem model compared to TM4, rather than from differences in NOx emissions or chemistry between models (NO2 amounts are quite similar between TM4 and GEOS-Chem over Beijing in 2006). The above finding does not have general validity in the sense that applying the kernel on any other model will also result in a tropospheric column increase. Applying the kernels to NO2 profiles from a model with weaker vertical mixing than TM4 (rather than generally stronger vertical mixing as in the case of GEOS-Chem) is likely to reduce those columns. Figure S1 in the Supplement shows as much for the North Sea grid cell in February 2006, when GEOS-Chem exceeds TM4 NO2 concentrations below 900 hPa, and for Siberia in August 2006, when GEOS-Chem simulates a substantially enhanced tropospheric NO2 column compared to TM4.
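Applying the kernel amounts to replacing the plain sum of the model partial columns by a kernel-weighted sum. The sketch below illustrates this with made-up three-layer profiles; the function names and numbers are hypothetical, and the kernel is assumed to be given on the model layers already (in practice it first has to be interpolated to the model's vertical grid).

```python
def column_without_kernel(partial_columns):
    """Plain model tropospheric column: sum of the layer partial columns."""
    return sum(partial_columns)


def column_with_kernel(partial_columns, averaging_kernel):
    """Kernel-smoothed column: sum over layers of A_l times x_l."""
    if len(partial_columns) != len(averaging_kernel):
        raise ValueError("kernel and profile must be on the same layers")
    return sum(a * x for a, x in zip(averaging_kernel, partial_columns))


# Illustrative kernel: below 1 near the surface, above 1 aloft.
kernel = [0.5, 1.0, 1.5]
surface_heavy = [3.0, 0.5, 0.5]  # most NO2 near the surface
aloft_heavy = [1.0, 1.0, 2.0]    # same plain column, more NO2 aloft
```

With these numbers both profiles have a plain column of 4.0, but the kernel gives 2.75 for the surface-heavy profile and 4.5 for the profile with more NO2 aloft. This mirrors the behaviour described above: the kernel raises the column for a model with more NO2 aloft and lowers it for a model whose profile is skewed towards the surface.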
We next compare the monthly averaged GEOS-Chem tropospheric NO2 column fields for February and August 2006 with and without the kernels applied. Figure 8 shows that applying the kernel leads to substantial increases of up to $2 \times 10^{15}$ molecules cm$^{-2}$ in the columns for the polluted source regions in the Northern Hemisphere (eastern USA, Europe, and China). At the periphery of these regions in wintertime, and over regions with possible biomass burning in summer, we see that the smoothed columns can be lower than the original columns, indicating that the GEOS-Chem vertical NO2 profile is more skewed towards the surface than the TM4 a priori in those situations, as confirmed by the profiles shown in Fig. S1.
Here we evaluate the level of agreement between the original GEOS-Chem and OMI NO2 columns, compared to the level of agreement between the kernel-based GEOS-Chem and OMI NO2 column for the polluted source regions in the Northern Hemisphere, as the differences provide a measure of the representativeness errors that can be avoided by using the averaging kernel. Figure 9 shows the agreement between OMI and the GEOS-Chem NO2 columns with and without kernel over Europe in February and August 2006. The upper panels indicate that the spatial correlation between the model and OMI tropospheric columns improves when the kernel is applied on the model NO2 profiles, especially in summer when differences between the TM4 a priori and GEOS-Chem NO2 profile shapes are strong. Application of the kernel also results in geometric mean OMI : GEOS-Chem ratios with smaller uncertainty intervals, at values of $1.15^{1.88}_{0.70}$ (February) and $1.24^{1.80}_{0.86}$ (August), compared to $1.13^{1.89}_{0.67}$ and $1.42^{2.21}_{0.91}$ without the kernel. We find similar results over the eastern United States and China (see Table 1). Figure 9d further supports the notion that application of the kernel allows for a better-constrained evaluation of the model, as witnessed by the more peaked and narrower histogram of satellite : model ratios. We conclude that sampling the model according to the averaging kernel is especially relevant in summer, and improves the satellite–model evaluation by removing differences between (TM4 a priori and GEOS-Chem) profile shapes contributing to the discrepancies (Boersma et al., 2004). Neglecting the kernels for GEOS-Chem would lead to up to 15 % stronger discrepancies between OMI and GEOS-Chem, and this portion could be wrongfully attributed in a model evaluation to e.g. too low NOx emissions, or too fast NO2 removal by chemistry or deposition. Appendix D presents an alternative to the application of the averaging kernel by providing a recipe to replace the a priori profile used in the retrieval by the profile from the CTM under evaluation. Such a recipe results in a modified retrieval that can be directly compared with the CTM under evaluation.
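The geometric mean ratios and intervals quoted above can be reproduced in form by the following sketch. It rests on an assumption: the quoted bounds are consistent with the geometric mean multiplied and divided by a geometric standard deviation (for example 1.15 × 1.64 ≈ 1.88 and 1.15 / 1.64 ≈ 0.70), so that convention is used here. The function name is hypothetical, and the population standard deviation of the log ratios is an arbitrary choice.

```python
import math
import statistics


def geometric_mean_interval(ratios):
    """Geometric mean of local ratios with a multiplicative 1-sigma interval.

    Returns (geometric mean, lower bound, upper bound), where the bounds are
    the geometric mean divided and multiplied by the geometric standard
    deviation (assumed convention).
    """
    logs = [math.log(r) for r in ratios]
    geo_mean = math.exp(statistics.fmean(logs))
    geo_sd = math.exp(statistics.pstdev(logs))
    return geo_mean, geo_mean / geo_sd, geo_mean * geo_sd
```

Working in log space treats a ratio of 2 and a ratio of 0.5 as equally large deviations from 1, which is a natural property for satellite-to-model ratios.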

Combined representativeness errors
To obtain an estimate of typical, overall representativeness errors in model evaluations with UV–Vis satellite measurements, we define three types of model evaluations, executed with increasing degree of detail. We again evaluate tropospheric NO2 from the GEOS-Chem model here (with OMI NO2 retrievals), as this model is sufficiently different from the TM4 model used to provide the a priori profiles in the OMI retrievals. The three types of evaluations can be characterized as advanced (A), common (B), and naïve (C). For evaluation (C), the model monthly average was based on samples from all days of the month (at OMI overpass time), irrespective of cloud coverage, and no kernel was applied (in other words a 31-day, all-sky, without-AK monthly mean). We first evaluate the (avoidable) representativeness errors by comparing local OMI : GEOS-Chem ratios evaluated with approaches (A) vs. (C), and approaches (A) vs. (B). Figure 10 shows the relative difference in the local OMI : GEOS-Chem ratios for February and August 2006. We see that the systematic, avoidable errors in the OMI : GEOS-Chem ratio are largest with evaluation approach (C). The blue colours in the upper panel of Fig. 10a indicate that, in winter, sampling the model on all (including cloudy-sky) days leads to OMI / GEOS-Chem ratios that are too low (by 15–20 %), reflecting the too high GEOS-Chem NO2 values resulting from temporal representativeness errors (cloudy-sky sampling, see Fig. 4).
The similarity between the panels of Fig. 10b shows that appropriate sampling is not as important in summer, a season with ample clear-sky days, and, consequently, a smaller sampling error. Figure 10b suggests that application of the averaging kernel when sampling the model is the most important step, with the red colours indicating that failure to apply the averaging kernel leads to OMI / GEOS-Chem NO2 ratios that are too high by up to 30 %. We conclude that appropriate clear-sky sampling is mainly important in winter, but vertical smoothing is less relevant in that season. The reverse holds in summer: with sufficient clear-sky days available, application of the averaging kernel becomes essential, reflecting the fact that NO2 vertical distributions are especially different between (the TM4 and GEOS-Chem) models in that season. Table 1 summarizes the results of the OMI / GEOS-Chem comparisons for the three specific regions of the United States, Europe, and China following the different evaluation approaches. In all cases, the spatial correlation between model and measurements within the regions is highest for evaluation approach (A), and generally lowest for approach (C). Wintertime OMI : GEOS-Chem ratios are too low by 15–20 % with approach (C) and too high by 5–10 % in summer. Using the common approach (B), OMI / GEOS-Chem ratios are primarily biased in summer, by +15–20 % for Europe and the United States, and by −5 % for China. The results in Table 1 and Fig. 9 also indicate that the spread of local OMI / GEOS-Chem ratios is ±30 % for approach (A), smaller than for approaches (B) and (C) with spreads of ±35 %, corroborating the fact that using the kernel results in a better-defined comparison between satellite measurements and model simulations.
We summarize the contribution of the model sampling errors to the overall representativeness errors for the evaluation of GEOS-Chem simulations with OMI NO2 in Table 2. The table should not be interpreted as a general recommendation for all applications, but rather as a recommendation for air pollution applications such as model evaluation and inversions to estimate emissions. For instance, for data assimilation and studies of the higher atmosphere, retrievals under cloudy situations can still be used, and the main recommendation there is to apply the averaging kernel. The table shows that naïve comparison strategies (C) that do not account for appropriate temporal or vertical sampling will result in a largely systematic representativeness error of up to 25 %. Following the motivated recommendations discussed

Discussion and conclusions
Evaluations of CTM simulations with UV–Vis satellite retrievals of short-lived gases, notably NO2 and HCHO, are strongly influenced by the exact comparison strategy. The characteristics of these satellite retrievals (ground pixels typically smaller than model grid cells, clear-sky sampling needed for air pollution applications, and reduced vertical sensitivity towards the lower troposphere) require that models and retrievals are sampled as consistently as possible. This pertains to consistent sampling in space (horizontally and vertically) and in time (day-of-week, clear-sky day, time-of-day). Of these aspects, appropriate horizontal sampling is a relatively minor, but unavoidable concern. In most model-to-satellite comparisons, we recommend using the concept of the superobservation, which has the distinct advantage of providing a grid cell average observed state along with a realistic measurement plus horizontal representativeness error. Depending on the model resolution and the satellite instrument resolution, users can impose a minimum fractional coverage (of the model grid cell area) by the ensemble of pixels to reduce horizontal representativeness errors down to levels where the measurement contribution becomes the dominant term in the superobservation error budget.
Recommendations on and error estimates of the fractional coverage requirement depend on the exact method of comparing model simulations and satellite retrievals and on the spatial variability of the species of interest. Generally speaking, fractional coverage requirements may be rather loose for comparisons over regions with little spatial variability in gas concentration, for coarse-resolution model simulations, and for temporal averages over multiple days (e.g. monthly means). In contrast, total fractional coverage requirements need to be strict for comparisons over regions with strong variability in gas concentrations (i.e. SO2 and NO2 source regions), and for high spatial resolution modelling with (regional) CTMs. In these situations we recommend limiting horizontal representativeness errors to within ±10 % because representativeness errors are then still considerably smaller than the satellite observation error $\sigma_{N,o}$.
A faithful comparison between satellite measurements and model simulations requires that models be sampled appropriately in time. Sampling models irrespective of photochemical regime (such as when calculating a 31-day monthly mean without collocating the model with individual measurements) gives rise to systematic temporal representativeness errors on the order of +12 % for NO2 and −8 % for HCHO. Such errors should (and can) be avoided, as they may misdirect interpretation of model–satellite differences, for instance by misinforming inversion studies by requiring changes in the rates of emissions or chemical reactions to better match the observations. Our comparison of OMI O2–O2 and co-sampled TM5 cloud information indicated that a strict requirement on the TM5 model to simulate a clear-sky scene along with a mostly clear-sky OMI superobservation has little effect compared with omitting such a filter. In the case of TM5, driven by ECMWF ERA-Interim meteorological fields, the model shows good correlation with OMI-observed cloud fractions, with little probability (< 15 %) of simulating false positives or negatives.
Larger systematic errors in model–satellite ratios will be introduced when model profiles are not sampled according to the averaging kernel associated with most UV–Vis satellite products. While the exact magnitude of the effect depends on the model under evaluation and on the a priori profiles and other assumptions used in the retrievals, our analysis showed that for a comparison between OMI and GEOS-Chem NO2, application of the averaging kernels results in up to 20 % lower satellite-to-model ratios, and more coherent values of these ratios within relevant regions such as the eastern United States and Europe. The effect of applying the kernel is most relevant in summer, when the vertical distribution of species like NO2 and HCHO is variable, and differences between the model profiles and the profiles used in the retrieval are most prominent. We strongly recommend using averaging kernels in satellite–model evaluations. Use of the averaging kernel allows for a better satellite-to-model comparison, by ensuring that the model is sampled in a manner consistent with the satellite retrievals because identical assumptions are made on vertical sensitivity, and differences between the model and satellite a priori vertical distribution cancel. Here we focused on an evaluation of tropospheric NO2 simulations from the GEOS-Chem model with retrievals of tropospheric NO2 columns with substantial vertical sensitivity down to the lower troposphere. However, application of the averaging kernel will be even more relevant for model evaluations of HCHO and SO2, since these retrievals are less sensitive to the lower troposphere. Recently, retrieval scientists have also made averaging kernel information available along with the HCHO and SO2 data products (e.g. González Abad et al., 2015; Theys et al., 2013).
For future evaluations of CTMs and data assimilation with UV–Vis satellite retrievals (of NO2, HCHO, CHOCHO, or SO2), we advocate the use of the recommendations laid out in this paper, especially with respect to the required clear-sky sampling and appropriate vertical smoothing.
(1) [...] opposite) corners to obtain estimates for the "base" and the "height" of the pixel. Then approximate the pixel as a parallelogram, to calculate the pixel area $A_i$ as base × height.
(2) Calculate the fractional coverage $f_\mathrm{cov}$ of all valid satellite pixels in the model grid cell as the ratio of the area covered by all n valid pixels to the complete area covered by the grid cell $A_\mathrm{cell}$:
$$f_\mathrm{cov} = \frac{\sum_{i=1}^{n} A_i}{A_\mathrm{cell}}$$
(3) Given the fractional coverage $f_\mathrm{cov}$, the horizontal representativeness error can be read off from Fig. 1 for models with 3° × 2° and 0.5° × 0.5° resolution. Figure 1b of Miyazaki et al. (2012) provides a similar figure for a model resolution of 2.5° × 2.5°. For example, a 0.6 fractional coverage for a 3° × 2° model grid cell corresponds to a horizontal representativeness error of ∼ 10 %. A 0.6 coverage for a 0.5° × 0.5° model corresponds to a representativeness error of ∼ 15 %.
Note that the recipe laid out above provides a horizontal representativeness error that is at the high end of the possible range. The variability in the complete ensemble of pixels will often be much smaller than the variability in the ensemble of pixels from Fig. 1 (over New York City) or Fig. 1b from Miyazaki et al. (2012) (which excluded situations with small NO2 columns).
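The area and coverage steps of the recipe could be sketched as follows. This is a minimal illustration with hypothetical function names. It assumes that the base and height of each pixel (in km) have already been estimated from the corner coordinates, and that only the part of each pixel inside the grid cell is counted.

```python
def pixel_area(base_km, height_km):
    """Pixel area in km^2, approximating the pixel as a parallelogram."""
    return base_km * height_km


def fractional_coverage(pixel_areas_km2, cell_area_km2):
    """Fraction of the grid cell area covered by all valid pixels."""
    if cell_area_km2 <= 0:
        raise ValueError("grid cell area must be positive")
    return sum(pixel_areas_km2) / cell_area_km2
```

For instance, five pixels of 13 km × 24 km in a grid cell of 3120 km² give a fractional coverage of 0.5, which can then be used to look up the representativeness error in Fig. 1.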

Appendix B: Derivation of the superobservation error
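If the retrieval errors within a superobservation grid cell have some degree of correlation, we cannot simply take the area-weighted average retrieval error $\bar{\sigma}$ (calculated as $\sum_{i=1}^{n} w_i \sigma_i / \sum_{i=1}^{n} w_i$) as representative for the superobservation error. The error expectation value for the ensemble of pixels composing a superobservation is written as
$$\sigma_N^2 = \Big\langle \Big( \sum_i w_i \epsilon_i \Big)^2 \Big\rangle = \sum_i \sum_j w_i w_j \langle \epsilon_i \epsilon_j \rangle, \qquad \mathrm{(B1)}$$
with $\epsilon_i$ the individual retrieval error in pixel i, the area weights now normalized ($\sum_i w_i = 1$) to facilitate notation. Now, for a partly correlated error between pixels i and j, we write
$$\langle \epsilon_i \epsilon_j \rangle = \begin{cases} \sigma_i^2 & \text{for } i = j \\ c\,\sigma_i \sigma_j & \text{for } i \neq j \end{cases} \qquad \mathrm{(B2)}$$
so that the superobservation error $\sigma_N^2$ can be written as follows:
$$\sigma_N^2 = \sum_i w_i^2 \sigma_i^2 + c \sum_i \sum_{j \neq i} w_i w_j \sigma_i \sigma_j. \qquad \mathrm{(B3)}$$
For $\sigma_i = \sigma$ and $w_i = 1/n$ this reduces to Eq. (6): $\sigma_N = \sigma \sqrt{\frac{1-c}{n} + c}$.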
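The general expression for unequal errors and weights can be checked numerically against the simplified Eq. (6). The sketch below uses a hypothetical function name and evaluates the double sum over all pixel pairs directly, with correlation 1 on the diagonal and c elsewhere.

```python
import math


def superobservation_variance(sigmas, weights, c):
    """Variance of the weighted mean for errors with pairwise correlation c."""
    total = sum(weights)
    w = [x / total for x in weights]  # normalise so that the weights sum to 1
    variance = 0.0
    for i, (wi, si) in enumerate(zip(w, sigmas)):
        for j, (wj, sj) in enumerate(zip(w, sigmas)):
            correlation = 1.0 if i == j else c
            variance += wi * wj * correlation * si * sj
    return variance


# Equal errors and weights: compare with sigma * sqrt((1 - c) / n + c).
n, sigma, c = 10, 2.0, 0.15
general = math.sqrt(superobservation_variance([sigma] * n, [1.0] * n, c))
simplified = sigma * math.sqrt((1.0 - c) / n + c)
```

In this setting `general` and `simplified` agree to rounding precision. With c = 0.15 the error cannot drop below about 39 % of the single-pixel error ($\sqrt{0.15} \approx 0.39$), however many pixels are averaged.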

Appendix C: Calculating CTM-simulated effective cloud fractions
We can express the modelled cloud properties as a quantity that is comparable to the effective cloud fraction provided by the OMI O2–O2 cloud retrieval, defined as the radiometric equivalent fraction of a viewing scene covered by a Lambertian reflector with an albedo of 0.8 (corresponding to a cloud with an optical thickness of ∼ 40) (Stammes et al., 2008). Some data products use cloud information retrieved with different approaches, but many UV–Vis trace gas retrievals use the effective cloud fraction approach. The TM5 cloud information (geometric cloud cover, and cloud optical thickness) was converted into an effective cloud fraction in a two-step approach. In the first step the maximum-random overlap assumption is used to calculate the one column-representative geometrical cloud cover $f_\mathrm{tm5,geo}$ following practical guidelines for similar model evaluations with MODIS clouds by Quaas (2011). The maximum-random overlap assumption implies maximum overlap for cloud cover in adjacent layers (one cloud layer is exactly on top of the other), and random overlap for (layers of) cloud cover $f_l$ separated by at least one clear-sky layer; a layer is considered to be cloud-free when its cloud cover is below a threshold value (here 0.001). In the second step the albedo of the cloud is determined based on the cloud optical thickness and the sensitivity of cloud spherical albedo to cloud optical thickness modelled by Buriez et al. (2005) for a liquid water cloud. The final step to obtain the effective (OMI equivalent) TM5 cloud fraction $f_\mathrm{tm5,eff}$ from the geometrical cloud fraction and the obtained cloud albedo $a_c$ proceeds as
$$f_\mathrm{tm5,eff} = f_\mathrm{tm5,geo}\,\frac{a_c}{0.8}.$$
The Supplement related to this article is available online at doi:10.5194/gmd-9-875-2016-supplement.
Author contributions. K. F. Boersma performed the research, drafted the manuscript, prepared the figures and developed the analysis methods. G. C. M. Vinken contributed to the development of the averaging kernel code. H. J. Eskes aided in drafting the manuscript and methods, and supported the development of the analysis methods and interpretation. All authors contributed to discussions of the results and preparation of the manuscript.

Figure 1. Relative horizontal representativeness errors as a function of the covered fraction of one model grid cell in the case of OMI tropospheric NO2 columns for polluted area(s) (mean column 5 × 10¹⁵ molecules cm⁻²). The black line indicates the error as a function of the fractional coverage for a 3° × 2° grid cell over the area of New York City on 1 day (17 July 2006, 114 OMI pixels). The blue asterisks indicate the mean error as a function of fractional coverage for various 0.5° × 0.5° grid cells on 17 July 2006.

Figure 2. Monthly average effective cloud fraction observed from OMI (upper panels) and simulated by TM5 based on ECMWF meteorological fields (middle panels) in February (left column) and August 2006 (right column). Cloud fractions have been selected only for those days and locations that had a successful OMI O2-O2 retrieval. Grey areas indicate fewer than three successful coincidences. Bottom panels: scatter plot of daily pairs of OMI (x axis) and TM5 cloud fractions (y axis) in February 2006 (left) and August 2006 (right) over Europe (10° W–30° E; 35–60° N). The colours indicate the number of times a particular grid cell has been filled, where light blue corresponds to 2×, green to 3×, yellow to 4×, orange to 5×, red to 6×, and magenta to 7× or more. TM5 effective cloud fractions can be expressed as −0.10 + 1.06 f_OMI (February) and −0.01 + 1.07 f_OMI (August).

Figure 3. Box and whisker plots for OMI (black) and TM5 (red) effective cloud fractions over Europe in February 2006 (left panel) and August 2006 (right panel). The two left boxes of each panel indicate the clear-sky situations when the OMI cloud fraction < 0.2. The centreline of each box indicates the median cloud fraction, the lower and upper edges indicate the 25th and 75th percentiles, and the lower and upper whiskers represent the minimum and maximum value in the sample. For February 2006, the sample consisted of 3379 pairs (737 clear sky, 2642 cloudy), and for August, the sample size was 4665 (1991 clear sky, 2674 cloudy).

Figure 4. (a) Monthly mean tropospheric NO2 columns simulated by TM5 for polluted grid cells (with all-sky monthly means > 1.0 × 10¹⁵ molecules cm⁻², n = 18 in February, n = 17 in August). The blue bars represent the average of the tropospheric NO2 column sampled on days when the OMI cloud fraction was smaller than 0.2. Light blue: average for columns sampled when OMI cloud fraction > 0.2. (b) Monthly mean TM5 HCHO columns for clear-sky and cloudy situations (n = 18 in February; for August: all-sky monthly mean > 7.5 × 10¹⁵ molecules cm⁻², n = 12).

Figure 5. Impact of sampling strategy on monthly averaged OMI:TM5 ratio of tropospheric NO2 columns (black dots) and on spatial correlation coefficient (R², blue dots) over the eastern United States (30–44° N, 90–72° W). Left panel: ratio and R² for February 2006 (n = 28). Right panel: August 2006 (n = 32). Grid cells were selected in the comparison when the covered fraction exceeded 0.5. The dashed black line shows the normalized OMI:TM5 ratio for strategy (A), and the dashed grey line shows the R² for strategy (A) as a guide to the eye.

Figure 7. Monthly mean averaging kernel (black dashed line) and NO2 profiles simulated by GEOS-Chem (blue), TM4 (red, a priori profiles in OMI NO2 retrieval), and GEOS-Chem convolved with the averaging kernel (purple) following Eq. (6). Left panel: February 2006; right panel: August 2006, over the Beijing grid cell (centred on 40° N, 116.25° E). The numbers given in blue, purple, and red indicate the tropospheric vertical NO2 columns in GEOS-Chem and TM4.

Figure 8. Difference between monthly mean GEOS-Chem tropospheric NO2 columns with AK (Eq. 6) and without AK for February 2006 (upper panel) and August 2006 (lower panel). Only grid cells with more than 3 days of better than 40 % coverage of clear-sky pixels have been selected.

Figure 9. Comparison between monthly average OMI and GEOS-Chem tropospheric NO2 columns over Europe in February 2006 (left panels) and August 2006 (right panels): (a) scatter diagram of monthly average GEOS-Chem with AK (black circles) and GEOS-Chem without AK (grey circles) vs. OMI tropospheric NO2 columns for February 2006. The black and grey lines indicate the geometric mean of the OMI:GEOS-Chem ratio; (b) as for (a) but for August 2006; (c) histogram of per-grid-cell OMI-to-GEOS-Chem with AK tropospheric NO2 column ratios (black bars) and OMI-to-GEOS-Chem without AK ratios (grey bars) for February 2006; (d) as for (c) but for August 2006. Only grid cells with more than 3 days of better than 40 % coverage of clear-sky pixels have been selected.

Table 1. Summary of tropospheric NO2 GEOS-Chem model evaluations following recipes (A), (B), and (C) with OMI NO2 retrievals for February and August 2006. n refers to the number of grid cells used in the comparison.

Table 2. Overview of magnitude and nature of various model sampling errors, their contribution to the overall comparison error budget, and ways to avoid them. Based on the GEOS-Chem evaluation with OMI NO2 retrievals for February and August 2006. Note that these recommendations hold for air pollution applications of UV-Vis satellite retrievals such as model evaluation and top-down emission estimates.