The Community Intercomparison Suite (CIS) is an easy-to-use command-line tool
which has been developed to allow the straightforward intercomparison of
remote sensing, in situ and model data. While there are a number of tools
available for working with climate model data, the large diversity of sources
(and formats) of remote sensing and in situ measurements necessitated a novel
software solution. Developed by a professional software company, CIS supports
a large number of gridded and ungridded data sources “out-of-the-box”,
including climate model output in NetCDF or the UK Met Office pp file format,
CloudSat, CALIOP (Cloud-Aerosol Lidar with Orthogonal Polarization), MODIS
(MODerate resolution Imaging Spectroradiometer), Cloud and Aerosol CCI
(Climate Change Initiative) level 2 satellite data and a number of in situ
aircraft and ground station data sets. The open-source architecture also
supports user-defined plugins to allow many other sources to be easily
added. Many of the key operations required when comparing heterogeneous
data sets are provided by CIS, including subsetting, aggregating, collocating
and plotting the data. Output data are written to CF-compliant NetCDF files to
ensure interoperability with other tools and systems. The latest
documentation, including a user manual and installation instructions, can be
found on our website (
Modern global climate models (GCMs) produce huge amounts of prognostic and
diagnostic data covering every aspect of the system being modelled. The
upcoming CMIP6 (Coupled Model Intercomparison Project Phase 6) is likely to
produce as much as 40 PB of data alone
Analysis of the data from these models forms the cornerstone of the
IPCC
Observational data can also be extremely voluminous. For example, modern Earth Observation (EO) satellites can easily produce petabytes of data over their lifetime. There are dozens of EO satellites being operated by the National Aeronautics and Space Administration (NASA), the European Space Agency (ESA) and other international space agencies. While modern missions use common data standards, there are many valuable data sets stored in unique formats and structures which were designed when storage was at more of a premium, and so are not particularly user-friendly. Ground-based EO sites and in situ measurement of atmospheric properties are also areas where many different groups and organizations produce data in a wide variety of formats.
The process of model evaluation typically involves a relatively small set of
common operations on the data: reading, subsetting, aggregating, analysis and
plotting. Many of these operations are currently written as a bespoke
analysis for each type of data being compared. This is time-consuming and
error-prone. While a number of tools currently support the comparison and
analysis of model data in standard formats, such as NetCDF Operators
(NCO)
Comparing global model data with observations that can be considered point
measurements may introduce substantial errors in any
analysis
In this paper, we first describe the development of this new tool
(Sect.
CIS has been developed by a professional software development consultancy
(Tessella Ltd.) working closely with the Centre for Environmental Data
Analysis (CEDA) and the Department of Physics at the University of Oxford to
ensure a high-quality tool which meets the needs of a broad range of users.
The use of modern development practices such as test-driven development (TDD)
The development was also carried out in an agile fashion, specifically using
Scrum
CIS is completely written in Python, which provides a good balance between
speed, versatility and maintainability, and allows easy installation across
many platforms (see Sect.
Much consideration was given to the need for parallelization and optimization
of the functions within CIS, particularly around collocation where long
runtimes for large data sets can be expected. Significant development time was
devoted to optimizations in these functions and many of the runtimes now
scale very well with the size of the data. However, we deemed it a lower priority
to devote development time to parallelizing these operations, as they are
usually trivially parallelized by the user by performing the operation on
each input file separately across the available compute nodes (using a batch
script, for example, and subsetting the data first as needed). Such a script
is pre-installed alongside CIS on the UK JASMIN big-data analysis cluster
All of the source code for CIS is freely available under the GNU Lesser General Public License v3, which is expected to promote widespread uptake of the tool and also encourage wider collaboration in its development.
One of the key features of CIS is its flexible and extensible architecture. From the outset it was clear that no single, unextendable tool could provide compatibility with the wide variety of data sources available and support all of the various analyses which would be performed on them. A modular design was therefore adopted, which allows user-defined components to be swapped in as easily as possible.
At the heart of the design is the CommonData interface layer which allows
each of the analysis routines and commands to work independently of the
actual data being provided, as shown in Fig.
An illustration of the architecture of CIS demonstrating the different components in the modular design.
There are an extensive number of data sources which are supported by CIS,
which can be broadly categorized as either gridded or ungridded data.
Gridded data are defined as any regularly gridded data set for which points
can be indexed using
In CIS, the gridded data type is really just a thin wrapper around the cube
provided by the Iris
In this section, we describe the core functionality of CIS. Each sub-section gives a brief description of an operation, the command line syntax and expected output, a formal algorithmic description of the operation (where appropriate) and a short example.
In order to keep the formal algorithmic descriptions concise without any loss
of accuracy, we adopt a mixture of set and vector notation and define that
notation here. It is useful to define vector inequalities as
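A conventional componentwise definition, which we assume throughout, is that for vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^n$:

$$\mathbf{a} \le \mathbf{b} \iff a_i \le b_i \quad \text{for all } i = 1, \dots, n,$$

with $\mathbf{a} < \mathbf{b}$ defined analogously using strict inequalities. Under this convention, a point $\mathbf{x}$ lies within a set of subset limits exactly when $\mathbf{l} \le \mathbf{x} \le \mathbf{u}$, where $\mathbf{l}$ and $\mathbf{u}$ are the vectors of lower and upper bounds.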
Although data reading is something a user is rarely aware of when using CIS,
the flexibility offered in this regard is an important distinguishing
feature. All of the functions described in the following sections are
possible with any of the supported data sets and any data sets supported by
user-written plugins (as described in Sect.
A list of the ungridded data sources supported by CIS out-of-the-box is
presented in Table
For all supported data sets any
A list of the ungridded data sources supported by CIS 1.4.0. The
file signature is used by CIS to automatically determine the correct product
to use for reading a particular set of data files, although this can easily
be overridden by the user. (Internally these signatures are represented as
Python regular expressions; here, they are shown as standard wildcards for ease
of reading.) The Global Aerosol Synthesis and Science Project (GASSP)
data sets are large collections of harmonized in situ aerosol observations
from groups around the world (
A list of the gridded data sources supported by CIS 1.4.0. The file signature is used by CIS to automatically determine the correct product to use for reading a particular set of data files. This can always be overridden by the user.
Subsetting allows the reduction of data by extracting variables and restricting them to user-specified ranges in one or more coordinates. Both gridded and ungridded data sets can be reduced in size by specifying the range over which the output data should be included, and points outside that range are removed.
The basic structure of the subset command is
The datagroup is a common concept across the various CIS commands. It
represents a collection of variables (from a collection of files) sharing the
same spatio-temporal coordinates, which takes the form
Here, the “variables” element specifies the variables to be operated on and
can be a single variable, a comma-separated list, a wildcarded variable name
or any combination thereof. The “filenames” element specifies the files to
read the variables from and can be a single filename, a directory of files to
read, a comma-separated list of files or directories, wildcarded filenames
or any combination thereof. The optional “product” element can be used to
manually specify the particular product to use for reading this collection of
data. See Tables
The “limits” are a comma-separated list of the upper and lower bounds to be
applied to specific dimensions of the data. The dimensions may be identified
using their variable names (e.g. latitude) or by choosing a shorthand
from “x”, “y”, “z”, “p” or “t” which refer to longitude, latitude,
altitude, pressure and time, respectively. The limits are then defined simply
using square brackets, e.g.
The detailed algorithm used for subsetting ungridded data is outlined in
Algorithm 1 and for gridded data in Algorithm 2. The algorithms use a mix of
pseudo-code and mathematical notation to try to present the operations in a
clear but accurate way. The operations themselves will involve other checks
and optimizations not shown in the algorithms, but the code is available for
those interested in its exact workings. See Sect.
For example, the following command would take the variable “aod550” from the
file “
The output file is stored as a CF-compliant NetCDF4 file.
CIS also has the ability to aggregate both gridded and ungridded data along
one or more coordinates. For example, it can aggregate a model data set over
the longitude coordinate to produce a zonal mean or aggregate satellite
imager data onto a 5
The aggregation command has the following syntax:
where aggregate is the sub-command; datagroup specifies the
variables and files to be aggregated (see Sect.
The optional arguments should be given as
The mandatory “grid” argument specifies the coordinates to aggregate over. The detail of this argument and the internal algorithms applied in each case are quite different when dealing with gridded and ungridded data so they will be described separately below. This difference arises primarily because gridded data can be completely averaged over one or more dimensions and also often require area weights to be taken into account.
In the case of the aggregation of ungridded data, the mandatory “grid”
argument specifies the structure of the binning to be performed for each
coordinate. The user can specify the start, end and step size of those bins
in the form
The output of an aggregation is always regularly gridded data, so CIS does not currently support the aggregation over only some coordinates. If a coordinate is not specified (or is specified, but without a step size) then that coordinate is completely collapsed. That is, we average over its whole range, so that the data are no longer a function of that coordinate. Specifically, one of the coordinates of the gridded output would have a length of one, with bounds reflecting the maximum and minimum values of the collapsed coordinate.
The algorithm used for the aggregation of ungridded data is identical to ungridded to gridded collocation (as this is essentially a collocation operation with the grid defined by the user) described in Algorithm 4.
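A minimal one-dimensional sketch of this binning (our own simplification, not the CIS implementation) using the default moments kernel, which returns the number of points, their mean and their standard deviation in each cell:

```python
import numpy as np

def aggregate_to_grid(lons, values, start, end, step):
    # Bin ungridded points onto a regular 1-D grid and apply the
    # "moments" kernel (count, mean, standard deviation) in each bin.
    edges = np.arange(start, end + step, step)
    idx = np.digitize(lons, edges) - 1          # bin index of each point
    n_bins = len(edges) - 1
    counts = np.zeros(n_bins)
    means = np.full(n_bins, np.nan)
    stds = np.full(n_bins, np.nan)
    for b in range(n_bins):
        in_bin = values[idx == b]
        counts[b] = in_bin.size
        if in_bin.size:
            means[b] = in_bin.mean()
            stds[b] = in_bin.std(ddof=1) if in_bin.size > 1 else 0.0
    return counts, means, stds

# Three points, two falling in the first 10-degree bin, one in the second
counts, means, stds = aggregate_to_grid(
    lons=np.array([1.0, 2.0, 11.0]),
    values=np.array([10.0, 20.0, 30.0]),
    start=0.0, end=20.0, step=10.0)
```

In CIS the same binning is performed over every requested coordinate simultaneously, and empty cells are masked in the output.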
An example of the aggregation of some satellite data which contain latitude,
longitude and time coordinates is shown below. In this case, we explicitly
provide a 1
For gridded data, the binning described above is not currently available; this is partly because there are cases where it is not clear how to apply area weighting. (The user would receive the following error message if they tried: “Aggregation using partial collapse of coordinates is not supported for GriddedData”.) The user is able to perform a complete collapse of any coordinate however, simply by providing the name of the coordinate(s) as a comma-separated list; e.g. “x,y” will aggregate data completely over both latitude and longitude, but not any other coordinates present in the file.
The algorithm used for this collapse of gridded dimensions is more
straightforward than that of the ungridded case. First, the area weights for
each cell are calculated and then the dimensions to be operated on are
averaged over simultaneously. That is, the different moments of the data in
all collapsed dimensions are calculated together, rather than independently
(using the Iris routines described here:
A full example of gridded aggregation, taking the time and zonal average of
total precipitation from the HadGEM3
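The essence of the area-weighted collapse can be sketched for a collapse over latitude on a regular grid, where weights proportional to the cosine of latitude approximate the relative cell areas (a simplification of the Iris area-weighting; the function name is our own):

```python
import numpy as np

def collapse_latitude(data, lats_deg):
    # Area-weighted mean over latitude: on a regular latitude-longitude
    # grid the cell area is proportional to cos(latitude) of the cell
    # centre, so high-latitude cells contribute less to the average.
    w = np.cos(np.deg2rad(lats_deg))
    return np.sum(data * w) / np.sum(w)

lats = np.array([-60.0, 0.0, 60.0])
field = np.array([0.0, 2.0, 0.0])
weighted = collapse_latitude(field, lats)
unweighted = field.mean()
```

Note how the equatorial cell, which covers the largest area, dominates the weighted mean (here 1.0, against an unweighted mean of 2/3).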
Point-wise quantitative inter-comparisons require the data to be mapped onto
a common set of coordinates before analysis, and CIS provides a number of
straightforward ways of doing this. One of the key features of CIS is the
ability to collocate one or more arbitrary data sets onto a common set of
coordinates, for example, collocating aircraft data onto hybrid-sigma model
levels or satellite data with ground station data. The options available
during collocation depend on the types of data being analysed as demonstrated
in Table
A plot of the zonal average of global rainfall, demonstrating the simple aggregation of global model outputs using CIS. See the text for the exact command used to produce this output.
An outline of the permutations of collocation types, as a function
of the structure of the data and sampling inputs; the default in each case is shown in bold. The available kernels are
described in Table
The basic structure of the collocation command is
The samplegroup has a slightly different format to the datagroup, as the
sample variable is optional, and all of the collocation options are specified
within this construct. It is of the format
A list of the different kernels available. Note that not all of the kernels are compatible with all of the collocators.
A full example would be:
There are also many other options and customizations available. For example,
by default all points in the sample data set are used for the mapping. However,
(as CIS provides the option of selecting a particular variable as the
sampling set) the user is able to disregard all sample points whose values
are masked (whose value is equal to the corresponding fill_value). The many
different options available for collocation, and for each collocator, can be
found in the user manual (see
In the following sections, we describe each mode of collocation in more detail, including algorithmic representations of the operations performed.
For a set of gridded data points which are to be mapped onto some other
gridded sample, the operation is essentially a re-gridding, and the user can
choose between linear interpolation (lin), where the data values at each
sample point are linearly interpolated across the cell in which the sample point
falls; nearest neighbour, for which the data cell nearest to the sample cell
is uniquely chosen in each dimension for every point; and box, for
which an arbitrary search area can be manually defined for the sampling using
Algorithm 3. The interpolations are carried out using the Iris interpolation
routines which are described in detail elsewhere (see
This schematic shows the collocation of gridded data onto a gridded sampling with differing dimensionality. The output dimensionality is always the same as that of the input data.
CIS can also collocate gridded data sets with differing dimensionality. Where
the sample array has dimensions that do not exist in the data, those
dimensions are ignored for the purposes of the collocation and will not be
present in the output. Where the data have dimensions that do not exist in the
sample array, those dimensions are ignored for the purposes of the
collocation and will be present in the output. Therefore, the output
dimensionality is always the same as that of the input data, as shown in
Fig.
CIS is also able to collocate ungridded data. For ungridded to ungridded
collocation the user is able to define a box to constrain the data points
which should be included for each sample point. The schematic in
Fig.
This schematic shows the components involved in the collocation of ungridded data onto an ungridded sampling. The user-defined box around each sampling point provides a selection of data points which are passed to the kernel. Note that the resampled data points lie exactly on top of the sample points (which are not visible).
The specific process is outlined in Algorithm 3. For simplicity, we have assumed the dimensionality of the data sets is the same; in reality this need not be the case. CIS will collocate two data sets as long as both have the coordinates necessary to perform the constraint and kernel operations. Note also that this algorithm only outlines the basic principles of the operations of the code; a number of optimizations are used in the code itself.
One particular optimization involves the use of kd trees
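The effect of this optimization can be sketched with SciPy's kd-tree implementation (shown as a generic illustration, not necessarily the exact library CIS uses internally): rather than comparing every sample point against every data point, the tree is built over the data points once, and each nearest-neighbour query then costs roughly O(log N) instead of O(N):

```python
import numpy as np
from scipy.spatial import cKDTree

# Build a kd tree over the data point coordinates once...
data_points = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
tree = cKDTree(data_points)

# ...then query it for the nearest data point to each sample point.
samples = np.array([[1.0, 1.0], [9.0, 1.0]])
dist, idx = tree.query(samples)
# idx gives, for each sample, the index of its nearest data point
```

For geographic coordinates a naive Euclidean distance is not appropriate near the poles or the dateline; CIS handles this internally, but a simple workaround in a sketch like this is to convert (latitude, longitude) to 3-D Cartesian coordinates before building the tree.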
For ungridded data points which are mapped onto a gridded sample, there are two options available. Either the ungridded data points can be binned into the bounds defined by each cell of the sample grid using the bin option, or the points can be constrained to an arbitrary area centred on the gridded sample point using the box option as described above. Either way, the moments kernel is used by default to return the number of points in each bin or box, the mean of their values and the standard deviation in the mean.
Algorithm 4 describes this process in more detail. As with Algorithm 3, we show here the operations performed, but not the exact code-path. In reality, a number of optimizations are made to ensure efficient calculations.
When mapping gridded data onto ungridded sample points, the options available are for the nearest neighbour value or a linearly interpolated value.
The methods used to perform the interpolation are provided by the SciPy
library
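For example, linear interpolation of a regularly gridded field at arbitrary sample points can be performed with SciPy's `RegularGridInterpolator` (a generic sketch, not the exact CIS code path):

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# A small regularly gridded field, indexed as field[lat, lon]
lats = np.array([0.0, 10.0, 20.0])
lons = np.array([0.0, 10.0])
field = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])

# Linear interpolation of the gridded field at ungridded sample points
interp = RegularGridInterpolator((lats, lons), field)
samples = np.array([[5.0, 5.0],    # between the four lower-left cells
                    [10.0, 0.0]])  # exactly on a grid point
vals = interp(samples)
```

The first sample lies midway between four grid points and receives their bilinear average (2.5); the second coincides with a grid point and simply returns its value (3.0).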
CIS also includes a comprehensive set of plotting capabilities, allowing the
analysis and comparison of the whole variety of data which can be read. This
includes plots of aircraft flight tracks (see, e.g. Fig.
This schematic shows the collocation of gridded data onto an ungridded sampling where the altitude component of the data is defined on a hybrid height grid. CIS will first collocate the data in the coordinate dimensions (latitude, longitude, etc.) to extract a single-altitude column and then perform a second interpolation on the altitude coordinate.
The plotting output is highly customizable, with more than 35 different
options available for specifying everything from the axes labels, to the
colour of the coastlines. The user is also able to output the plots directly
to screen for interactive visualization, including zooming and panning, or
straight to an image file (.png, .jpg, .eps or .pdf) for publication-ready plots.
A full description of the plotting syntax and available options
is provided in the user manual
(
In addition to standard analysis options as described above, CIS allows
general arithmetic operations to be performed between different variables
using the “eval” command. The two variables must be on the same
spatio-temporal sampling; CIS will check that the data have the same dimensions but
not that the points correspond to the same sampling. There are limitless
possibilities, but it enables, for example, the calculation of the difference
between two collocated variables as demonstrated in
Fig.
The basic structure of the eval command is as follows:
An example scatter plot from a particular aircraft measurement of
ambient temperature as a function of latitude (
This flexibility allows for some quite complex analysis. For example,
consider the case of calculating the Ångström exponent (
This can be straightforwardly calculated using the following CIS command:
An example plot showing the cloud liquid path over the Indian Ocean
just off Malaysia, retrieved by the ESA Cloud CCI product MODIS
Aqua
An example of plotting two collocated variables against one another as a scatter plot and also as a 2-D histogram. This can be useful for inspecting dense scatter plots.
Note that we have used the NumPy library to calculate the log of each of the
variable arrays and have used the (AOT) variable names in the file for
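For reference, the same Ångström exponent calculation can be written directly in Python (the function and argument names here are our own illustrations, not those used in any particular data file):

```python
import numpy as np

def angstrom_exponent(aot_1, aot_2, wl_1=550.0, wl_2=670.0):
    # Angstrom exponent from AOT at two wavelengths (nm):
    #   alpha = -ln(tau_1 / tau_2) / ln(lambda_1 / lambda_2)
    # Works element-wise on NumPy arrays as well as on scalars.
    return -np.log(aot_1 / aot_2) / np.log(wl_1 / wl_2)
```

Equal AOT values at the two wavelengths give an exponent of zero, while a spectrally decreasing AOT (typical of fine-mode aerosol) gives a positive exponent.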
Users are also able to perform a basic statistical analysis on two variables
using the stats command. This command has a very basic structure:
For example, the user might wish to examine the correlation between a model data variable and actual measurements or (as in the Ångström exponent example above) the correlation between a calculated and measured variable. The stats command will calculate the following:
the number of data points used in the analysis; the mean and standard deviation of each data set (separately); the mean and standard deviation of the absolute difference between them; the mean and standard deviation of the relative difference; the linear Pearson correlation coefficient; the Spearman rank correlation coefficient; and the coefficients of linear regression (i.e. the slope and intercept).
Many of these values are calculated using the SciPy
library
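The corresponding SciPy routines can be sketched as follows (with illustrative data values; this is not the CIS code itself):

```python
import numpy as np
from scipy import stats

# Two collocated variables to compare (illustrative values only)
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([1.1, 2.1, 2.9, 4.2])

n = a.size                                    # number of points
abs_diff = b - a                              # absolute difference
rel_diff = abs_diff / a                       # relative difference
pearson_r, _ = stats.pearsonr(a, b)           # linear correlation
spearman_r, _ = stats.spearmanr(a, b)         # rank correlation
slope, intercept, r_val, p_val, err = stats.linregress(a, b)
```

The means and standard deviations of `abs_diff` and `rel_diff`, together with the correlation and regression coefficients, are the quantities reported by the stats command.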
CIS was primarily designed as a command line tool, however it is also straightforward to use some of the power of CIS in other Python modules or scripts. In particular, CIS provides an interface for reading any of the data sets which CIS supports (either built-in or through user-supplied plugins). The data are returned in a well-documented data structure which provides straightforward access to the raw data, the coordinates and all associated metadata.
Further, because the data structures returned by these routines are built on
NumPy arrays, it is trivial to build these into existing Python-based data
analysis routines. There is also an option to return the data as a Pandas
(
Version 2.0 of CIS is planned to include full support for all of the main CIS
commands through the Python interface. For an outline of our future plans for
CIS, please see
Consider the comparison of a set of AERONET data from the Agoufou station
with model AOT data over a given time period, for example, in order to help
inform and constrain the approximations and assumptions used in the model.
For the sake of this example, we use ECHAM6-HAM2 (first described by
A comparison of annual average AOT at 550 nm between ECHAM and HadGEM3 across the globe.
As a first step it is often useful to inspect the contents of a data file to
determine which variables it contains. This is straightforward using the
“info” command:
This will return a list of the variables in the file, exactly as they should
be passed to other commands. Variables can also be specified in the usual way
to get more detailed information about any specific variables. Next, we might
plot each of the data sets in order to examine their spatio-temporal extents
and get a feel for the magnitudes of the AOT. We can plot the AERONET data
with the following command:
An example output plot is shown in Fig.
Next, we might decide to subset the AERONET data to cover the same temporal range as the model data (which is for 2007):
In order to quantitatively compare the values, we need to bring the model data
onto the AERONET spatio-temporal sampling; this is straightforward using the
collocation command:
Note that this will linearly interpolate model data values in both space and
time by default, though we could have chosen to use a nearest neighbour
algorithm instead. Once we have two collocated data sets, we can calculate the
point-wise difference between the observations and the collocated model data
using
We can also use the built-in analysis routines to give us an overview of the
correlations between the two data sets using the
This will print out to screen the mean and standard deviation in each
data set, the absolute and relative differences between them and the linear
Pearson and Spearman rank correlation coefficients, as described in
Sect.
This provides the average difference of the collocated data values.
Furthermore, because CIS commands can take multiple filenames as input, we can
easily extend this process for multiple AERONET stations to produce a plot of
the annual difference across the globe. Assuming we have performed the
collocation and differencing over all of the stations, the aggregation step
above is only slightly changed:
CIS plot with default options for AOT observed from a single AERONET station.
We have had to define a sufficiently fine spatial grid to maintain the
spatial component of the difference. It is anticipated that future versions
of CIS will support aggregation of ungridded data sets over only time, to
support exactly this kind of workflow without the need to define an arbitrary
grid (see
Note that all of the cells in the aggregation with no points (where there are no AERONET stations) are masked in the output and thus not shown in the scatter plot.
The intercomparison of observational and model data is a crucial aspect of modern climate science. There exist a few tools to work with gridded NetCDF data sets, but very few in support of process studies using assorted data sources, and none which allow generic intercomparison of multiple ungridded data sets, multiple gridded data sets or any combination thereof.
Here, we have demonstrated the power and use of CIS – a new universal tool for the inter-comparison of model, remote sensing and in situ climate data. The open and extensible nature of the tool allows for the easy and reproducible collocation, aggregation, subsetting and analysis of a huge variety of data sources on everything from laptops to large processing clusters. Further, the ability to extend the data sources compatible with CIS through user-developed plugins provides the opportunity for a shared tool to serve a diverse community.
The difference between the annual average AOT measured at AERONET stations around the world and ECHAM6-HAM2 modelled values. This plot only demonstrates a type of analysis which is easy to perform with CIS; no scientific critique of these differences is offered.
Further development of CIS is ongoing and we hope to include a number of new
features in the future, such as an extended Python interface, hybrid
gridded/ungridded data structures and improved time series analysis, as
outlined in our roadmap (
The CIS source code is available on GitHub at
CIS is a tool for working with a wide variety of data; however, none of the data sets used or described within this paper are supplied with the tool and should be obtained directly through their respective providers.
A table of terms in this paper.
In this section, we describe two specific ways that users are able to easily extend the functionality provided by CIS. The plugins are short pieces of Python code that users can write themselves and which CIS will then automatically incorporate. Our website offers functionality for users to upload new plugins to be shared with the wider CIS community. Submitted plugins will not be automatically included in the base CIS install, but can easily be downloaded and included by other users. If certain plugins prove popular then they will be tested, documented and included in the base install.
A detailed description of the development of CIS plugins and a number of
increasingly in-depth tutorials can be found in the CIS documentation
(
CIS uses the notion of a “data product” to encapsulate the information about
different types of data. Users can write their own products for reading in
different types of data, referred to as plugins. These products (or
plugins, if provided by the user) are concerned with interpreting the raw
data and their coordinates and producing a single self-describing data object
conforming to the CommonData interface (see Fig.
All plugins must subclass the
The underlying I/O layers are also available for the plugins to use (such as NetCDF reading) which ensures the writing of plugins is as straightforward as possible.
Users can also write their own plugins for performing the collocation of two data sets. There are three main objects used in the collocation which the user is free to override: the collocator, the constraint and the kernel. The basic design is that the collocator loops over each of the sample points, calls the relevant constraint to reduce the number of data points and then calls the kernel which returns a single value for the collocator to store.
The main plugin available is the collocation method itself. A new
one can be created by subclassing
The constraint object limits the data points for a given sample point
in some way. The user can also add a new constraint method by subclassing
Although we provide an outline here, please see the technical documentation
for more details
(
We would like to acknowledge the guidance and support of Stephen Pascoe through his role in CEDA during the first phases of development, and Caroline Poulsen (Remote Sensing Group, EOAS Division, RAL Space) who provided invaluable user feedback. The first phase of development was supported by e-infrastructure capital grants for JASMIN from the Science and Technology Facilities Council (ST/K000594/1). Subsequent development was supported by Natural Environment Research Council capital funding for JASMIN. Scientific support has been provided by the Global Aerosol Synthesis and Science Project (GASSP), Natural Environment Research Council (NE/J022624/1). The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007–2013)/ERC grant agreement no. FP7-280025. We thank the AERONET principal investigators (PIs) and their staff for establishing and maintaining the AERONET sites used in the examples here. We are also grateful to the ESA Cloud CCI project and to NASA for the underlying MODIS data sets which went into one of the examples used.
Edited by: F. O'Connor
Reviewed by: two anonymous referees