A flexible and highly extensible data assimilation testing suite, named DATeS, is described in this paper. DATeS aims to offer a unified testing environment that allows researchers to compare different data assimilation methodologies and understand their performance in various settings. The core of DATeS is implemented in Python and takes advantage of its object-oriented capabilities. The main components of the package (the numerical models, the data assimilation algorithms, the linear algebra solvers, and the time discretization routines) are independent of each other, which offers great flexibility to configure data assimilation applications. DATeS can interface easily with large third-party numerical models written in Fortran or in C, and with a plethora of external solvers.

Data assimilation (DA) refers to the fusion of information from different
sources, including priors, predictions of a numerical model, and snapshots of
reality, in order to produce an accurate description of the state of a physical
system of interest

Numerical experiments are an essential ingredient in the development of new DA algorithms. Implementing numerical experiments for DA involves linear algebra routines, a numerical model along with time integration routines, and an assimilation algorithm. Currently available testing environments for DA applications are either overly simplistic or very general; many are tied to specific models and are usually written entirely in a single language. A researcher who wants to test a new algorithm with different numerical models written in different languages might have to re-implement the algorithm to match the specific settings of each model. A unified testing environment for DA is important to enable researchers to explore different aspects of various filtering and smoothing algorithms with minimal coding effort.

The DA Research Section (DAReS) at the National Center for Atmospheric
Research (NCAR) provides Data Assimilation Research Testbed (DART)

Matlab programs are often used to test new algorithmic ideas, owing to the
ease of implementation that Matlab offers. A popular set of Matlab tools for ensemble-based DA
algorithms is provided by the Nansen Environmental and Remote Sensing Center
(NERSC), with the code available from

Python is a modern high-level programming language that supports the reuse of
existing code via inheritance, making Python code
highly extensible. Moreover, it is a powerful scripting tool for scientific
applications that can be used to glue legacy codes. This can be achieved by
writing wrappers that can act as interfaces. Building wrappers around
existing C and Fortran code is a common practice in scientific research.
Several automatic wrapper generation tools, such as
SWIG

This paper presents a highly extensible Python-based DA testing suite. The
package is named DATeS and is intended to be
an open-source, extendable package positioned between the simple typical
research-grade implementations and the professional implementation of DART
but with the capability to utilize large physical models. Researchers can use
it as an experimental testing pad where they can focus on coding only their
new ideas without worrying much about the other pieces of the DA process.
Moreover, DATeS can be effectively used for educational purposes where
students can use it as an interactive learning tool for DA applications. The
code developed by a researcher in the DATeS framework should fit with all
other pieces in the package with minimal to no effort, as long as the
programmer follows the “flexible” rules of DATeS. As an initial
illustration of its capabilities, DATeS has been used to implement and carry
out the numerical experiments in

The paper is structured as follows. Section

This section gives a brief overview of the basic discrete-time formulations
of both statistical and variational DA approaches. The formulation here is
far from conclusive and is intended only as a quick review. For detailed
discussions on the various DA mathematical formulations and algorithms, see,
e.g.,

The main goal of a DA algorithm is to give an accurate representation of the
“unknown” true state,

The model-based simulations, represented by the model states, are inaccurate
and must be corrected given noisy measurements

In the so-called “Gaussian framework”, the prior is assumed to be Gaussian

Consider assimilating information available about the system state at time
instant

Applying Eqs. (

The maximum a posteriori (MAP) estimate of the true state is the state that maximizes the posterior probability density function (PDF). Alternatively, the MAP estimate is the minimizer of the negative logarithm (negative log) of the posterior PDF. The MAP estimate can be obtained by solving the following optimization problem:

Assimilating several observations

The MAP estimate of the true state at the initial time of the assimilation
window can be obtained by solving the following optimization problem:

In idealized settings, where the model is linear, the observation operator is linear, and the underlying probability distributions are Gaussian, the posterior is also Gaussian; however, this is rarely the case in real applications. In nonlinear or non-Gaussian settings, the ultimate objective of a DA algorithm is to sample all probability modes of the posterior distribution, rather than just producing a single estimate of the true state. Algorithms capable of accommodating non-Gaussianity remain limited and have not been successfully tested in large-scale settings.

Particle filters
(PFs)

DATeS provides standard implementations of several flavors of the algorithms
mentioned here. One can easily explore, test, or modify the provided
implementations in DATeS, and add more methodologies. As discussed later, one
can use existing components of DATeS, such as the implemented numerical
models, or add new implementations to be used by other components of DATeS.
However, it is worth mentioning that the initial version of DATeS (v1.0) is
not meant to provide implementations of all state-of-the-art DA algorithms;
see, e.g.,

DATeS seeks to capture, in an abstract form, the common elements shared by most DA applications and solution methodologies. For example, the majority of the ensemble filtering methodologies share nearly all the steps of the forecast phase, and a considerable portion of the analysis step. Moreover, all the DA applications involve common essential components such as linear algebra routines, model discretization schemes, and analysis algorithms.

Existing DA solvers have been implemented in different languages. For example, high-performance languages such as Fortran and C have been (and are still being) extensively used to develop numerically efficient model implementations and linear algebra routines. Both Fortran and C allow for efficient parallelization, since they are supported by common libraries for distributed-memory systems, such as MPI, and by shared-memory libraries such as Pthreads and OpenMP. To make use of these available resources and implementations, one has to either rewrite all the different pieces in the same programming language or build proper interfaces between the new and existing implementations.

The philosophy behind the design of DATeS is that “a unified DA testing suite has to be open-source, easy to learn, and able to reuse and extend available code with minimal effort”. Such a suite should allow for easy interfacing with external third-party code written in various languages, e.g., linear algebra routines written in Fortran, analysis routines written in Matlab, or “forecast” models written in C. This should help the researchers to focus their energy on implementing and testing their own analysis algorithms. The next section details several key aspects of the DATeS implementation.

The DATeS architecture abstracts the four generic components of any DA system and provides a set of modules for each. These components are the linear algebra routines, a forecast computer model (including the discretization of the physical processes), the error models, and the analysis methodologies. In what follows, we discuss each of these building blocks in more detail, in the context of DATeS. We start with an abstract discussion of each of these components, followed by technical descriptions.

The linear algebra routines are responsible for handling the data structures representing essential entities such as model state vectors, observation vectors, and covariance matrices. This includes manipulating an instance of the corresponding data. For example, a model state vector should provide methods for accessing/slicing and updating entries of the state vector, a method for adding two state vector instances, and methods for applying specific scalar operations on all entries of the state vector such as evaluating the square root or the logarithm.
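The kind of interface described above can be sketched as a thin NumPy-backed wrapper; the class and method names below are hypothetical and only illustrate the pattern, not the exact DATeS API.

```python
import numpy as np

class StateVector:
    """Hypothetical sketch of a model state vector wrapper (illustrative names)."""

    def __init__(self, data):
        # The underlying reference is a one-dimensional NumPy array
        self._data = np.asarray(data, dtype=float)

    def __getitem__(self, idx):
        # Accessing/slicing entries of the state vector
        return self._data[idx]

    def __setitem__(self, idx, value):
        # Updating entries of the state vector
        self._data[idx] = value

    def add(self, other):
        # Adding two state vector instances
        return StateVector(self._data + other._data)

    def sqrt(self):
        # Applying a scalar operation to all entries
        return StateVector(np.sqrt(self._data))

    def log(self):
        return StateVector(np.log(self._data))
```

Once such an instance is created, all associated methods are reachable through the standard Python dot operator, e.g., `u.add(v).sqrt()`.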

The forecast computer model simulates a physical phenomenon of interest, such as the atmosphere, ocean dynamics, or volcanoes. This typically involves approximating the physical phenomenon using a gridded computer model. The implementation should provide methods for creating and manipulating state vectors and state-size matrices. The computer model should also provide methods for creating and manipulating observation vectors and observation-size matrices. The observation operator responsible for mapping state-size vectors into observation-size vectors should be part of the model implementation as well. Moreover, simulating the evolution of the computer model in time is carried out using numerical time integration schemes. The time integration scheme can be model-specific and is usually written in a high-performance language for efficiency.

It is common in DA applications to assume a perfect forecast model, a case where the model is deterministic rather than stochastic. However, the background and observation errors need to be treated explicitly, as they are essential in the formulation of nearly all DA methodologies. We refer to the DATeS entity responsible for managing and creating random vectors, sampled from a specific probability distribution function, as the “error model”. For example, a Gaussian error model would be completely set up by providing the first- and second-order moments of the probability distribution it represents.

Analysis algorithms manipulate model states and observations by applying widely used mathematical operations to perform inference operations. The popular DA algorithms can be classified into filtering and smoothing categories. An assimilation algorithm, a filter or a smoother, is implemented to carry out a single DA cycle. For example, in the filtering framework, an assimilation cycle refers to assimilating data at a single observation time by applying a forecast and an analysis step. On the other hand, in the smoothing context, several observations available at discrete time instances within an assimilation window are processed simultaneously in order to update the model state at a given time over that window; a smoother is designed to carry out the assimilation procedure over a single assimilation window. For example, EnKF and 3D-Var fall in the former category, while EnKS and 4D-Var fall in the latter.

In typical numerical experiments, a DA solver is applied for several consecutive cycles to assess its long-term performance. We refer to the procedure of applying the solver to several assimilation cycles as the “assimilation process”. The assimilation process involves carrying out the forecast and analysis cycles repeatedly, creating synthetic observations or retrieving real observations, updating the reference solution when available, and saving experimental results between consecutive assimilation cycles.

The design of DATeS takes into account the distinction between these
components and separates them following an object-oriented programming (OOP)
approach. A general description of the DATeS architecture is
given in Fig.

The enumeration in Fig.

All DATeS components are independent so as to maximize the flexibility in experimental design. However, each newly added component must comply with DATeS rules in order to guarantee interoperability with the other pieces in the package. DATeS provides base classes with definitions of the necessary methods. A new class added to DATeS, for example, to implement a specific new model, has to inherit the appropriate model base class and provide implementations of the inherited methods from that base class.

Diagram of the DATeS architecture.

In order to maximize both flexibility and generalizability, we opted to
handle configurations, inputs, and outputs of DATeS objects using
“configuration dictionaries”. Parameters passed to instantiate an
object are passed to the class constructor in the form of key-value pairs in
the dictionaries. See Sect.

The main linear algebra data structures essential for almost all DA aspects are (a) model state-size and observation-size vectors (also named state and observation vectors, respectively), and (b) state-size and observation-size matrices (also named state and observation matrices, respectively). A state matrix is a square matrix of order equal to the model state-space dimension. Similarly, an observation matrix is a square matrix of order equal to the model observation space dimension. DATeS makes a distinction between state and observation linear algebra data structures. It is important to recall here that, in large-scale applications, full state covariance matrices cannot be explicitly constructed in memory. Full state matrices should only be considered for relatively small problems and for experimental purposes. In large-scale settings, where building state matrices is infeasible, low-rank approximations or sparse representation of the covariance matrices could be incorporated. DATeS provides simple classes to construct sparse state and observation matrices for guidance.

DA filtering routines provided by the initial version of DATeS (v1.0).

Third-party linear algebra routines can have widely different interfaces and
underlying data structures. For reusability, DATeS provides unified
interfaces for accessing and manipulating these data structures using Python
classes. The linear algebra classes are implemented in Python. The
functionalities of the associated methods can be written either in Python or
in lower-level languages using proper wrappers. A class for a linear algebra
data structure enables updating, slicing, and manipulating an instance of the
corresponding data structures. For example, a model state vector class
provides methods that enable accessing/slicing and updating entries of the
state vector, a method for adding two state vector instances, and methods for
applying specific scalar operations on all entries of the state vector such
as evaluating the square root or the logarithm. Once an instance of a linear
algebra data structure is created, all its associated methods are accessible
via the standard Python dot operator. The linear algebra base classes
provided in DATeS are summarized in Table

Python implementation of state vector, observation vector, state matrix, and observation matrix data structures. Both dense and sparse state and observation matrices are provided.

Python special methods are provided in a linear algebra class to enable
iteration over the entries of a linear algebra data structure. Examples of these special
methods include

DATeS provides linear algebra data structures represented as NumPy ndarrays,
and a set of NumPy-based classes to manipulate them. Moreover, SciPy-based
implementation of sparse matrices is provided and can be used efficiently in
conjunction with both sparse and non-sparse data structures. These classes,
shown in Fig.


Each numerical model needs an associated class providing methods to access
its functionality. The unified forecast model class design in DATeS provides
the essential tasks that can be carried out by the model implementation. Each
model class in DATeS has to inherit the model base class:

While some linear algebra and time integration routines are
model-specific, DATeS also implements general-purpose linear algebra classes
and time integration routines that can be reused by newly created models. For
example, the general integration class

In many DA applications, the errors are additive and are modeled by random
variables normally distributed with zero mean and a given or an unknown
covariance matrix. DATeS implements NumPy-based functionality for
background, observation, and model errors as guidelines for more
sophisticated problem-dependent error models. The NumPy-based error models in
DATeS are implemented in the module
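A Gaussian error model of this kind can be sketched in plain NumPy as follows; the class and method names are assumptions for illustration, not the DATeS module's exact interface.

```python
import numpy as np

class GaussianErrorModel:
    """Hypothetical sketch of a Gaussian error model, fully specified by its
    first- and second-order moments (mean and covariance)."""

    def __init__(self, mean, covariance, random_seed=None):
        self.mean = np.asarray(mean, dtype=float)
        self.covariance = np.asarray(covariance, dtype=float)
        # Cholesky factor used to color white noise samples
        self._chol = np.linalg.cholesky(self.covariance)
        self._rng = np.random.default_rng(random_seed)

    def generate_noise_vector(self):
        # Sample e ~ N(mean, covariance)
        white = self._rng.standard_normal(self.mean.size)
        return self.mean + self._chol @ white
```

A zero-mean additive observation error model, for instance, would be instantiated with the observation error covariance matrix and queried once per assimilation cycle.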


Assimilation classes are responsible for carrying out a single assimilation
cycle (i.e., over one assimilation window) and optionally printing or writing
the results to files. For example, an EnKF object should be designed to carry
out one cycle consisting of the “forecast” and the “analysis” steps. The
basic assimilation objects in DATeS are a filtering object, a smoothing
object, and a hybrid object. DATeS provides the common functionalities for
filtering objects in the base class

A model object is passed to the assimilation object constructor via configuration dictionaries to give the assimilation object access to the model-based data structures and functionalities. The settings of the assimilation object, such as the observation time, the assimilation time, the observation vector, and the forecast state or ensemble, are also passed to the constructor upon instantiation and can be updated during runtime.

Table

Covariance inflation and localization are ubiquitously used in all
ensemble-based assimilation systems. These two methods are used to counteract
the effect of using ensembles of finite size. Specifically, covariance
inflation counteracts the loss of variance incurred in the analysis step and
works by inflating the ensemble members around their mean. This is carried
out by magnifying the spread of ensemble members around their mean by a
predefined inflation factor. The inflation factor could be a scalar, i.e.,
space–time independent, or even varied over space and/or time. Localization,
on the other hand, mitigates the accumulation of long-range spurious
correlations. Distance-based covariance localization is widely used in
geoscience applications, where correlations are damped out
with increasing distance between grid points. The performance of the
assimilation algorithm is critically dependent on tuning the parameters of
these techniques. DATeS provides basic utility functions (see
Sect.
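Both techniques can be sketched in a few lines of NumPy (illustrative only): inflation magnifies the ensemble anomalies around the mean by a factor, and a distance-based taper damps covariance entries with distance. A simple Gaussian taper is used below as a stand-in for the commonly used Gaspari-Cohn function.

```python
import numpy as np

def inflate_ensemble(ensemble, factor):
    """Inflate ensemble members around their mean by a predefined factor;
    `ensemble` holds one member per column."""
    mean = ensemble.mean(axis=1, keepdims=True)
    return mean + factor * (ensemble - mean)

def localize_covariance(cov, grid, radius):
    """Damp long-range covariances with a Gaussian taper (a stand-in for the
    Gaspari-Cohn function); `grid` holds 1-D grid-point coordinates."""
    dist = np.abs(grid[:, None] - grid[None, :])
    taper = np.exp(-0.5 * (dist / radius) ** 2)
    return cov * taper  # Schur (element-wise) product
```

Note that inflation leaves the ensemble mean unchanged and scales only the spread, while localization leaves the diagonal (the variances) intact and damps only the off-diagonal, long-range entries.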

The assimilation process in DATeS.

A common practice in sequential DA experimental settings is to repeat an
assimilation cycle over a given time span, with similar or different settings
at each assimilation window. For example, one may repeat a DA cycle on
several time intervals with different output settings, e.g., to save and print
results only every fixed number of iterations. Alternatively, the DA process
can be repeated over the same time interval with different assimilation
settings to test and compare results. We refer to this procedure as an
“assimilation process”. Examples of numerical comparisons, carried out
using DATeS, can be found in

The assimilation process object either retrieves real observations or
creates synthetic observations at the specified time instances of the
experiment. Figure

Utility modules provide additional functionality, such as the

A sample of the modules wrapped by the main utility module

Ensemble-based assimilation algorithms often require matrix representation of
ensembles of model states. In DATeS, ensembles are represented as lists of
states, rather than full matrices of size
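This design choice can be sketched as follows (plain NumPy, illustrative names): the ensemble lives as a Python list of state vectors, and a matrix or a sample covariance is formed only on demand.

```python
import numpy as np

def ensemble_to_matrix(ensemble):
    """Stack a list of state vectors into a matrix, one member per column."""
    return np.column_stack(ensemble)

def ensemble_covariance(ensemble):
    """Sample covariance B = X' X'^T / (Nens - 1), where the columns of X'
    are the ensemble anomalies (deviations from the ensemble mean)."""
    X = ensemble_to_matrix(ensemble)
    anomalies = X - X.mean(axis=1, keepdims=True)
    return anomalies @ anomalies.T / (X.shape[1] - 1)

# The ensemble is kept as a list of states rather than a full matrix:
ensemble = [np.array([1.0, 2.0]), np.array([3.0, 2.0]), np.array([5.0, 8.0])]
```

Keeping the ensemble as a list means individual members can live in model-specific data structures, and the expensive state-size matrix is never built unless an algorithm explicitly asks for it.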

The sequence of essential steps required in order to run a DA experiment in DATeS.

The module

Initializing the DATeS run.

The sequence of steps needed to run a DA experiment in DATeS is summarized in
Fig.

Initializing a DATeS run involves defining the root directory of DATeS as an
environment variable and adding the paths of DATeS source modules to the
system path. This can be done by executing the code snippet in
Fig.
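The two initialization steps can be sketched as below; the environment-variable name and the source-directory layout are assumptions for illustration, not the verbatim DATeS snippet.

```python
import os
import sys

# Step 1: define the root directory of DATeS as an environment variable
# (the variable name 'DATES_ROOT_PATH' is assumed here)
DATES_ROOT = os.path.abspath('.')  # replace with the actual DATeS root
os.environ['DATES_ROOT_PATH'] = DATES_ROOT

# Step 2: add the paths of the DATeS source modules to the system path
# (the subdirectory names are assumed)
for subdir in ('src', os.path.join('src', 'utility')):
    path = os.path.join(DATES_ROOT, subdir)
    if path not in sys.path:
        sys.path.insert(0, path)
```

After these steps, DATeS modules can be imported from any script regardless of the directory it is launched from.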

QG-1.5 is a nonlinear 1.5-layer reduced-gravity QG model with double-gyre
wind forcing and biharmonic friction

This model is a numerical approximation of the equations

Creating the QG model object.

We use a standard linear operator to observe

Creating a DEnKF filtering object.

One now proceeds to create an assimilation object. We consider a
deterministic implementation of EnKF (DEnKF) with ensemble size equal to

Ensemble inflation is applied to the analysis ensemble of anomalies at the
end of each assimilation cycle of DEnKF with an inflation factor of

Creating a filter process object to carry out DEnKF filtering using the QG model.

Most of the methods associated with the DEnKF object will raise exceptions if immediately invoked at this point. This is because several keys in the filter configuration dictionary, such as the observation, the forecast time, the analysis time, and the assimilation time, are not yet appropriately assigned. DATeS allows creating assimilation objects without these options to maximize flexibility. A convenient approach is to create an assimilation process object that, among other tasks, can properly update the filter configurations between consecutive assimilation cycles.
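The pattern described here can be sketched with stub classes (all names hypothetical): the process object fills in the time-dependent keys of the filter's configuration dictionary before invoking each assimilation cycle.

```python
class StubFilter:
    """Stand-in for a DATeS filter object configured via a dictionary."""

    def __init__(self, filter_configs):
        self.filter_configs = dict(filter_configs)
        self.cycles_run = 0

    def filtering_cycle(self):
        # Raise if the time-dependent keys were never assigned
        for key in ('observation', 'forecast_time', 'analysis_time'):
            if self.filter_configs.get(key) is None:
                raise ValueError("missing configuration: %s" % key)
        self.cycles_run += 1

def run_assimilation_process(filter_obj, observations, checkpoints):
    """Update the filter configurations between consecutive cycles."""
    windows = zip(checkpoints[:-1], checkpoints[1:])
    for (t0, t1), obs in zip(windows, observations):
        filter_obj.filter_configs.update(
            {'forecast_time': t1, 'analysis_time': t1, 'observation': obs})
        filter_obj.filtering_cycle()
```

Invoking `filtering_cycle()` directly on a freshly built filter raises an exception, exactly as described above, while driving it through the process loop succeeds because each cycle's configuration is completed first.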

Running the filtering experiment.

We now test DEnKF with the QG model by repeating the assimilation cycle over a
time span from

Here,

Finally, the assimilation experiment is executed by running the code snippet in
Fig.

The filtering results are printed to screen and are saved to files at the end
of each assimilation cycle as instructed by the

Figure

The true field, the forecast errors, and the DEnKF analyses errors at
different time instances are shown in
Fig.

Typical solution quality metrics in the ensemble-based DA literature include
RMSE plots and rank (Talagrand)
histograms

Upon termination of a DATeS run, executable files can be cleaned up by
calling the function

The QG-1.5 model. The truth (reference state) at the initial time
(

In linear settings, the performance of an ensemble-based DA filter can be judged by two factors: first, its convergence, explained by its ability to track the truth; and second, the quality of the flow-dependent covariance matrix generated from the analysis ensemble.

Data assimilation results. The reference field

Cleanup of DATeS executable files.

Data assimilation results. In panel

The convergence of the filter is monitored by inspecting the RMSE,
which represents an ensemble-based standard deviation of the difference between reality, or truth, and the model-based prediction.
In synthetic experiments, where the model representation of the truth is known, the RMSE reads
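The formula itself is cut off here; consistent with the description above, the standard ensemble-DA definition averages the squared difference over the state grid points, i.e., RMSE = sqrt((1/Nstate) * sum_i (x_i - x_i^true)^2). A minimal sketch (not necessarily the paper's exact notation):

```python
import numpy as np

def rmse(state, true_state):
    """Root mean squared difference between a model state (e.g., the
    analysis ensemble mean) and the true state, averaged over the
    Nstate grid points."""
    diff = np.asarray(state, dtype=float) - np.asarray(true_state, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```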

Figure

For benchmarking, one needs to generate scalar representations of the RMSE
and the uniformity of a rank histogram of a numerical experiment. The average
RMSE can be used to compare the accuracy of a group of filters. To generate a
scalar representation of the uniformity of a rank histogram, we fit a beta
distribution to the rank histogram, scaled to the interval
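One way to sketch this procedure (illustrative; the paper's exact estimator may differ): fit Beta(a, b) to the scaled ranks by moment matching, then measure non-uniformity as the Kullback-Leibler (KL) divergence from the fitted beta to the uniform distribution on (0, 1), approximated below by numerical quadrature. A perfectly uniform histogram yields a ≈ b ≈ 1 and a KL divergence near zero.

```python
import math
import numpy as np

def fit_beta_moments(samples):
    """Fit Beta(a, b) to samples in (0, 1) by matching mean and variance."""
    m, v = np.mean(samples), np.var(samples)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

def kl_beta_to_uniform(a, b, npoints=10001):
    """KL divergence D(Beta(a, b) || U(0, 1)) = integral of p(x) log p(x) dx,
    approximated with the trapezoidal rule on the open interval (0, 1)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    x = np.linspace(1e-6, 1.0 - 1e-6, npoints)
    log_p = (a - 1.0) * np.log(x) + (b - 1.0) * np.log(1.0 - x) - log_norm
    f = np.exp(log_p) * log_p
    return float(np.sum((f[1:] + f[:-1]) * 0.5 * np.diff(x)))
```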

Rank histograms with fitted beta distributions. The KL-divergence measure is indicated under each panel.

Measures of uniformity of the rank histograms shown in
Fig.

Figure

Data assimilation results with DEnKF applied to Lorenz-96 system.
RMSE results on a log scale are shown in panel

The architecture of DATeS makes it easy to generate benchmarks for a new experiment.
For example, one can write short scripts to iterate over a combination of settings of a filter to find the best possible results.
As an example, consider the standard

Figure

Data assimilation results with DEnKF applied to Lorenz-96 system.
The minimum average RMSE over the interval

Concluding the best inflation factor for a given ensemble size, based on Fig.

To answer the question about the ensemble size, we pick the ensemble size

Although generating a set of benchmarks is a relatively easy process, covering all possible combinations of numerical experiments is time-consuming and is better carried out by the DA community. Some example scripts for generating and plotting benchmarking results are included in the package for guidance.

Note that, when the Gaussian assumption is severely violated, standard benchmarking tools, such as RMSE and rank histograms,
should be replaced with, or at least supported by, tools capable of assessing ensemble coverage of the posterior distribution.
In such cases, MCMC methods, including those implemented in DATeS

DATeS aims at being a collaborative environment and is designed such that adding DA components to the package is as easy and flexible as possible. This section describes how new implementations of components such as numerical models and assimilation methodologies can be added to DATeS.

The most direct approach is to write the new implementation completely in Python. This, however, may sacrifice efficiency or may not be feasible when existing code in other languages needs to be reused. One of the main characteristics of DATeS is the possibility of incorporating code written in low-level languages. There are several strategies that can be followed to interface existing C or Fortran code with DATeS. Amongst the most popular tools are SWIG and F2Py for interfacing Python code with existing implementations written in C and Fortran, respectively.

Whether the new contribution is written in Python, in C, or in Fortran, an appropriate Python class that inherits the corresponding base class, or a class derived from it, has to be created. The goal is to design new classes that are conformable with the existing structure of DATeS and can interact appropriately with new as well as existing components.

Illustration of a numerical model class named

A new model class has to be created as a subclass of

The leading lines of an implementation of a class for the model

The first step is to grant the model object access to linear algebra data
structures and to error models. Appropriate classes should be imported in a
numerical model class:

Linear algebra includes state vector, state matrix, observation vector, and observation matrix.

Error models include background, model, and observation error models.

The next step is to create Python-based implementations for the model
functionalities. As shown in Fig.

As an example, suppose we want to create a model class named

Note that, in order to guarantee extensibility of the package, the naming of
the methods associated with linear algebra classes is fixed; even if only
binary files are provided, the Python-based linear algebra methods must be
implemented. If the model functionality is fully written in Python, the
implementation of the methods associated with a model class is
straightforward, as illustrated in
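The inheritance pattern can be sketched with stub names (the actual DATeS base class and method names may differ); the base class fixes the method names, and the concrete model provides the implementations.

```python
import numpy as np

class ModelsBase:
    """Stand-in for the DATeS model base class: it fixes the method names
    every concrete model must implement."""

    def state_vector(self):
        raise NotImplementedError

    def integrate_state(self, state, checkpoints):
        raise NotImplementedError

    def evaluate_theoretical_observation(self, state):
        raise NotImplementedError

class ToyLinearModel(ModelsBase):
    """Hypothetical toy model: dynamics x' = -x, observed at every other
    grid point, configured via a configuration dictionary."""

    def __init__(self, model_configs=None):
        configs = {'state_size': 4}
        configs.update(model_configs or {})
        self.model_configs = configs

    def state_vector(self):
        # Create an (initially zero) state vector of the right size
        return np.zeros(self.model_configs['state_size'])

    def integrate_state(self, state, checkpoints):
        # Exact solution of x' = -x, returned at each checkpoint
        return [state * np.exp(-(t - checkpoints[0])) for t in checkpoints]

    def evaluate_theoretical_observation(self, state):
        # Linear observation operator: observe every other entry
        return state[::2]
```

Because the subclass honors the fixed method names, any filter that drives a model through this interface can use it without knowing how the dynamics are implemented internally.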

The process of adding a new class for an assimilation methodology is similar to creating a class for a numerical model; however, it is expected to require less effort. For example, a class implementation of a filtering algorithm uses components and tools provided by the passed model and by the encapsulated linear algebra data structures and methods. Moreover, filtering algorithms belonging to the same family, such as different flavors of the well-known EnKF, are expected to share a considerable amount of infrastructure. Python inheritance enables the reuse of methods and variables from parent classes.

Illustration of a DA filtering class

To create a new class for DA filtering, one derives it from the base class

The leading lines of an implementation of a DA filter; the

Unlike the base class for numerical models (

Figure

The code snippet in Fig.

This work describes DATeS, a flexible and highly extensible package for solving data assimilation problems. DATeS seeks to provide a unified testing suite for data assimilation applications that allows researchers to easily compare different methodologies in different settings with minimal coding effort. The core of DATeS is written in Python. The main functionalities, such as model propagation, filtering, and smoothing code, can however be written in high-performance languages such as C or Fortran to attain high levels of computational efficiency.

While we introduced several assimilation schemes in this paper, the current version, DATeS v1.0, emphasizes statistical assimilation methods. DATeS provides the essential infrastructure required to combine elements of a variational assimilation algorithm with other parts of the package. The variational aspects of DATeS, however, require additional work that includes efficient evaluation of the adjoint model, checkpointing, and handling weak constraints. A new version of the package, under development, will carefully address these issues and will provide implementations of several variational schemes. The variational implementations will be derived from the 3D- and 4D-Var classes implemented in the current version (DATeS v1.0).

The current version of the package presented in this work, DATeS v1.0, can be situated between professional data assimilation packages such as DART and simplistic research-grade implementations. DATeS is well suited for educational purposes as a learning tool for students and newcomers to the data assimilation research field. It can also help data assimilation researchers develop specific components of the data assimilation process and easily use them with the existing elements of the package. For example, one can develop a new filter and interface it with an existing physical model and error models, without the need to understand how these components are implemented. This requires unifying the interfaces between the different components of the data assimilation process, which is an essential feature of DATeS. These features allow for optimal collaboration between teams working on different aspects of a data assimilation system.

To contribute to DATeS, by adding new implementations, one must comply with the
naming conventions given in the base classes. This requires building proper
Python interfaces for the implementations intended to be incorporated with
the package. Interfacing operational models, such as the Weather Research and
Forecasting (WRF) model

The authors plan to continue developing DATeS with the long-term goal of making it a complete data assimilation testing suite that includes support for variational methods, as well as interfaces with complex models such as quasi-geostrophic global circulation models. Parallelization of DATeS, and interfacing large-scale models such as the WRF model, will also be considered in the future.

The code of DATeS v1.0 is available at

AA developed the package and performed the numerical simulations. The two authors wrote the paper, and AS supervised the whole project.

The authors declare that they have no conflict of interest.

The authors would like to thank Mahesh Narayanamurthi, Paul Tranquilli, Ross Glandon, and Arash Sarshar from the Computational Science Laboratory (CSL) at Virginia Tech, and Vishwas Rao from the Argonne National Laboratory, for their contributions to an initial version of DATeS. This work has been supported in part by awards NSF CCF-1613905, NSF ACI–1709727, and AFOSR DDDAS 15RT1037, and by the CSL at Virginia Tech. Edited by: Ignacio Pisso Reviewed by: Kody Law and three anonymous referees