We present a new version of the Compact Modeling Framework (CMF3.0), developed to provide the software environment for stand-alone and coupled global geophysical fluid models. The CMF3.0 is designed for high- and ultrahigh-resolution models on massively parallel supercomputers.
The key features of the previous version, CMF2.0, are recalled to show the progress of our research. In CMF3.0, the pure message passing interface (MPI) communication scheme of CMF2.0, with its high-level abstract driver and optimized coupler interpolation and I/O algorithms, is replaced with a scheme based on the Partitioned Global Address Space (PGAS) paradigm, while the central-hub architecture evolves into a set of simultaneously working services. Performance tests for both versions are carried out. In addition, we present some information on the parallel realization of the Ensemble Optimal Interpolation (EnOI) data assimilation method and of the nesting technology, implemented as program services of the CMF3.0.
As it was stated at the World Modeling Summit for Climate Prediction
Along with the development of physical models of individual Earth system components, the role of the tools that organize their coordinated work (couplers and coupling frameworks) becomes increasingly important. The coupler architecture depends on the complexity of the models used, on the characteristics of the interconnections between the models, and on the hardware and software environment. Historically, the development of couplers has followed the development of coupled atmosphere–ocean models. At some level of complexity, the development of such software became a problem external to the development of the individual components of the coupled model.
The first coupled models used simple algorithms for coordination of
components through the file system. There was no separate coupler component,
and communication between models was realized as a set of model procedures
for input/output (I/O) and for interpolation between global model grids
(today this method is used, for example, in the INMCM4.0 climate model;
Coupling through a shared file or through a sequential hub is acceptable only
for models of relatively low resolution. The increase in array sizes and in the
number of model components in the system inevitably becomes a
bottleneck because of the memory and performance limitations of a single
processor core and also because of problems related to global network
communications. Therefore, it was quite natural that the next generation of
couplers introduced parallelism in their internal algorithms (Community Earth
System Model cpl6;
A new coupler architecture was introduced for the CESM1.0 model in 2012
The coupled system can also be launched as a single executable or as multiple
executables without a separate coupler; in this case, the coupler functions are
provided by a coupling library and performed in parallel on a subset of the cores of each model
component. Such a solution was proposed in OASIS3-MCT
Another important feature of the coupled model is the scheme of working with
the file system. In earlier versions, it was carried out independently by
each model component in a sequential way. Obviously, this master-process
scheme (used in CESM cpl6, OASIS3, OASIS3-MCT) was limited by the RAM of a
node. Increasing amounts of model data have led to the rapid development of parallel
I/O (PIO) algorithms. Since version 1.0, the CESM system utilizes the
PIO library
Thus, we can point out the necessary features of modern coupling frameworks,
which define their functionalities and characteristics.
– coupling architecture (serial, parallel, with a high-level driver or as a set of procedures): the design of the framework defines the complexity of development and maintenance of the coupled model and implicitly establishes performance limitations;
– I/O-module architecture (serial or parallel, synchronous or asynchronous): it should be considered as a balance between simplicity of algorithms and the necessary rate of I/O;
– ease of use: the level of system abstraction defines the convenience of the user's work and the transparency of the overall coupled model;
– performance: the choice of underlying algorithms defines the computational rate of the coupled model.
Our work began with the development of a parallel version of an ocean dynamics model. The aim at that time was to work out a high-resolution World Ocean model (WOM). We had to solve several problems, namely halo updates, mapping (interpolation) of external forcing data to the model grid, saving the solution to a file and gathering diagnostics. It was obvious that separating the numerical algorithms for solving the ocean dynamics equations from the low-level service procedures was necessary to write transparent code, which would allow us to develop the physical model and the service procedures independently.
This approach showed its advantages in coupling the atmosphere and ocean
general circulation models for medium- and long-term weather forecasts at the
Hydrometeorological Research Center of Russia. The purpose was to create
software capable of maintaining effective interaction of the high-resolution
(on the order of 0.1
At the beginning of our study in 2012 there were several solutions for
the creation of coupled models. It should be noted that state-of-the-art
couplers, such as that of CESM (with a coupler based on MCT;
The OASIS3 system was very successful and was widely used by many research
groups around the world. But, as it was pointed out, it contains a serial
coupler, which is an obvious performance bottleneck due to constraints on
memory and global communications. The new version, OASIS3-MCT
According to the analysis of the Coupling Technologies for Earth System
Modeling workshop
In CMF2.0, the framework for ocean–ice–atmosphere–land coupled modeling
on massively parallel architectures
In this paper we present two versions of the Compact Modeling Framework
(CMF), v. 2.0 and v. 3.0. As the CMF2.0 was published only in Russian
In the CMF3.0, the pure MPI approach is replaced with the Partitioned Global Address Space (PGAS) paradigm of communications, while the central hub architecture has evolved to a kind of service-oriented architecture (SOA) with a set of simultaneously working services and a common task queue.
Any coupled model under control of the CMF runs as a single executable, with
each model component and the coupler using distinct processor cores. At the
beginning, the global MPI communicator is split into appropriate groups
according to the requested communicator sizes of the model components and the
coupler, and then all groups work simultaneously. The coupler performs some
initialization routines and enters the time cycle of requests. Following the
That is, in order to add a physical model to the coupled system a user only has to define the physical model adapter (the required template is provided) and to realize its abstract interfaces (filling them with calls to the user's internal model subroutines). This approach allows one to generate different executables for different coupled model combinations (e.g., switch between ocean simulations with different sea-ice models) and restricts the user from any changes in the code outside of the user's adapter. Also, the addition or modification of components does not affect the main CMF code, because it implements the abstract driver and the abstract component, which do not mention any specific component names (ocean, atmosphere, ice, etc.).
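To make the adapter idea concrete, the following language-neutral sketch shows an abstract component interface and a user-written adapter as they might look in Python. The CMF itself is written in Fortran, and all names here (Component, OceanAdapter, driver) are illustrative assumptions rather than the framework's actual identifiers.

```python
# Illustrative sketch only: names are hypothetical, not the CMF Fortran interfaces.
from abc import ABC, abstractmethod

class Component(ABC):
    """Abstract component: the driver knows only this interface."""

    @abstractmethod
    def init(self, comm): ...      # set up grids, register arrays and events

    @abstractmethod
    def step(self, time): ...      # advance the physical model by one time step

    @abstractmethod
    def finalize(self): ...

class OceanAdapter(Component):
    """User-written adapter: fills the abstract interface with calls to the
    internal subroutines of a particular ocean model."""

    def init(self, comm):
        pass                       # e.g. call the model's own ocean_init(comm)

    def step(self, time):
        pass                       # e.g. call the model's own ocean_step(time)

    def finalize(self):
        pass                       # e.g. call the model's own ocean_finalize()

def driver(components, n_steps):
    """Abstract driver: works with any list of Components and never refers to
    specific component names such as 'ocean' or 'atmosphere'."""
    for c in components:
        c.init(comm=None)
    for t in range(n_steps):
        for c in components:
            c.step(t)
    for c in components:
        c.finalize()

driver([OceanAdapter()], n_steps=3)
```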
Architecture of the coupled model run under control of the CMF2.0. In this example there are three components (ocean, atmosphere, ice) connected by the 3-core coupler.
For any model component, its decomposition is generated by the CMF2.0 system
in such a way that each coupler core interacts only with a specific subset of
the component cores. This allows one to reduce the required number of
communication routes to the coupler for every component from
All events in the system are divided into a few classes (save diagnostics, save control point, read file data, send/receive mapping, etc.), defining different actions with data arrays. In the CMF2.0, we postulate that all events could be predefined before the start and occur with fixed periods. Thus, the coupler can take on the task of synchronizing models and avoiding deadlocks.
The sequence of events (time chain) is constructed in the main CMF program,
which is the entry point of the coupled model. Also, at the registration
stage, models provide the CMF system with pointers to the arrays that must be
processed in the events. So, during the system operation, events are
performed automatically and do not require explicit calls from the user. As
the information about the periods of all events is known at the registration
stage, the coupler can build a table of its actions. This allows one to exclude
parallel synchronization of the coupler cores, which otherwise would be
necessary when, for example, two components at the same time want to write
data to the file system. When a certain time moment arrives, the coupler
selects the next event from the chain and calls the appropriate handler
function based on the type of this event, while the model components
asynchronously send data. Moreover, it becomes possible to use persistent
MPI operations (combinations of MPI_Send_init/MPI_Recv_init and MPI_Start calls).
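Since the periods of all events are known at the registration stage, the coupler's table of actions can be built in advance. A minimal sketch of such a time chain is given below; the function name, the choice of hours as the time unit and the example periods are assumptions made only for this illustration.

```python
# Minimal sketch: precompute a time-ordered action table from known event periods.
import heapq

def build_time_chain(events, run_length):
    """events: list of (name, period) pairs; returns (time, name) pairs in time order."""
    chain = []
    for name, period in events:
        t = period
        while t <= run_length:
            heapq.heappush(chain, (t, name))
            t += period
    return [heapq.heappop(chain) for _ in range(len(chain))]

# e.g. mapping every 2 h, diagnostics every 24 h, control point every 120 h
chain = build_time_chain([("mapping", 2), ("diag", 24), ("cp", 120)], run_length=120)
print(chain[:6])   # [(2, 'mapping'), (4, 'mapping'), (6, 'mapping'), ...]
```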
The interpolation algorithm uses SCRIP-formatted weight files built at the
pre-run (off-line) stage by means of the Climate Data Operators (CDO) package.
The regridding process is performed in the coupler communicator and is
implemented as a sparse matrix–vector multiplication. It supports
logically rectangular grids. We implemented the “source” and
“destination” parallel mapping algorithms
The SCRIP format is used to organize the mapping process, i.e., SCRIP-type
links connect cells of destination and source grids with appropriate weights.
Since every coupler core works only with a subdomain of the global model
grid, it has only a part of the source grid data in memory. Other data should
be gathered from neighboring cores during every interpolation event, which is
functionally analogous to calling the “Rearranger” routines of
During the interpolation event, every coupler core first prepares and sends the source cells required by its neighbors. Then, while these data are being sent, it weights its local source cells. Finally, it receives the missing data and completes the weighted sums on the destination grid. It is worth noting that the data are not sent directly, but as sorted vectors of unique cells. This allows one to avoid sending duplicated data, which could otherwise happen when a source cell contributes to several destination cells. As a result, computations and communications overlap, which, in conjunction with persistent MPI transactions, determines the high efficiency of the algorithm.
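In serial form, the core of this regridding step is just a sparse matrix–vector product assembled from SCRIP-style links (destination cell, source cell, weight). The Python/SciPy sketch below shows only this algebraic view; the distribution of links between coupler cores and the overlap of communications with local weighting described above are deliberately omitted.

```python
# Serial sketch of regridding as a sparse matrix-vector product (illustration only).
import numpy as np
from scipy.sparse import csr_matrix

def build_remap_matrix(links, n_src, n_dst):
    """links: iterable of (dst_index, src_index, weight) triples, as in a SCRIP file."""
    dst, src, w = zip(*links)
    return csr_matrix((w, (dst, src)), shape=(n_dst, n_src))

# toy example: each destination cell averages two source cells
links = [(0, 0, 0.5), (0, 1, 0.5), (1, 1, 0.5), (1, 2, 0.5)]
R = build_remap_matrix(links, n_src=3, n_dst=2)
field_src = np.array([10.0, 20.0, 30.0])
print(R @ field_src)   # -> [15. 25.]
```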
The performance rate of the CMF2.0 interpolation system was evaluated in
several “ping-pong” tests, in which the coupler was ensuring
component–component exchanges of the INMIO-SLAV ocean–atmosphere model with
disabled solvers of physics equations (similarly to the ping-pong test of
OASIS3 in
In Test I, the ocean model sends three 2-D fields every 2 h to the
atmosphere model and receives nine 2-D fields every 1 h. The ocean model has
the
Results were obtained on four supercomputers: MVS-100k, MVS-10P, BlueGene/P
and BlueGene/Q (characteristics are provided in the Appendix). On all
supercomputers, the coupled system was compiled with a standard Intel Fortran
compiler. Timing results of the 10-day Test I on MVS supercomputers are
presented in Fig.
It is clear that 20–40 coupler cores provide a satisfactory speed for such
problems, because a cost of
Results of the same test for the BlueGene supercomputers are presented in
Fig.
Wall-clock time required for the 10-day ocean–atmosphere model run with disabled physics vs. number of coupler cores on MVS supercomputers (Test I for CMF2.0).
Wall-clock time required for the 10-day ocean–atmosphere model run with disabled physics vs. number of coupler cores on BlueGene supercomputers (Test I for CMF2.0).
Wall-clock time required for the 10-day ocean–atmosphere model run with disabled physics vs. number of coupler cores on BlueGene/Q supercomputer, for different decomposition sizes of the ocean and atmosphere models (Test II for CMF2.0).
The same data as in Fig.
– denotes not tested or unsupported configurations
Test II was conducted to estimate the increasing communication load
associated with the growth of the components' communicator sizes. The timing
still refers to the 10-day experiment with disabled physics, but the model
grids were decomposed into a much larger number of subdomains, increasing the
cost of the gather/distribute phase of the test (mapping process inside the
coupler communicator remains the same). The results are shown in
Fig.
The graph shows two interesting facts. Firstly, single-core coupler configurations do not work for Test II because of memory limitations. Secondly, the increasing communication load (i.e., gather/distribute) affects performance only for small numbers of coupler cores. For example, the test times for two coupler cores with model communicator sizes (8640, 3456) and (10 368, 4320) are 26 % and 42 % higher, respectively, than for the Test I communicator sizes (1152, 288). For eight coupler cores this difference shrinks to 12 % and 21 %, respectively. Since every coupler core communicates only with a subset of component cores, increasing the coupler communicator size both further decomposes the interpolation computations and decreases the component–coupler communication overhead, while only slightly increasing the intra-coupler rearrangement communications. As a result, even a few tens of coupler cores are sufficient to provide good performance of high-resolution mapping with very large model communicators.
Since I/O operations on supercomputers are often slow, writing
large amounts of data (such as control points that include several
3-D arrays) can take an unacceptably long time. In the case of frequent data dumps
or a slow file system, the I/O time can even become comparable to the
calculation time; thus it is very important to optimize interaction with the file
system. Its realization can be synchronous (blocking) or asynchronous
(non-blocking) (e.g., see
In the former case, I/O operations are performed by some subset of the processor cores of the physical model components, thus blocking the solution of the physical equations.
In the latter case, this blocking is avoided at the cost of allotting distinct cores to specific I/O services and providing procedures for data transfer between these services and the physical components. It is worth noting that increasing the number of writing cores does not always increase the recording speed; it often reduces it. The particular behavior is defined by the actually installed supercomputer hardware. For example, a single I/O channel for the whole machine can serialize the I/O, while, conversely, multiple dedicated I/O nodes may even provide some acceleration. But even in the case of slow hardware, the total time of the model experiment can remain equal to the time spent solving the physical equations, owing to the overlap of computations and I/O. The scheme allows a model to accelerate as long as the writing time is less than or equal to the time needed to perform the corresponding chunk of calculations. This limitation is controlled by the hardware bandwidth and by the realization of the model–service data transfer.
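The following minimal Python sketch illustrates the asynchronous pattern with a dedicated writer process: the "model" hands over a data chunk and immediately returns to its computations, blocking only when the writer falls too far behind. The file format and all names are placeholders and do not correspond to the CMF I/O code.

```python
# Hypothetical sketch of asynchronous output with a dedicated writer process.
import multiprocessing as mp
import numpy as np

def writer(queue):
    while True:
        item = queue.get()
        if item is None:                        # shutdown sentinel
            break
        step, data = item
        np.save(f"diag_{step:04d}.npy", data)   # stands in for NetCDF output

if __name__ == "__main__":
    q = mp.Queue(maxsize=2)                 # bounded: the model blocks only if I/O lags
    p = mp.Process(target=writer, args=(q,))
    p.start()
    field = np.zeros((100, 100))
    for step in range(5):
        field += 1.0                        # "physics" keeps running while data are written
        q.put((step, field.copy()))         # returns immediately unless the queue is full
    q.put(None)
    p.join()
```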
The asynchronous approach is generally more flexible, so it was chosen for
the CMF. In the CMF2.0, all I/O actions are performed by the coupler. The
realization is fully parallel, so one can work with any grid size by just
increasing the number of coupler cores. Test results of the CMF2.0 I/O system
for the case of writing a single-precision model array of
Wall-clock time of parallel writing of a model array of
It can be seen that the writing speed of MVS-10P is approximately constant. Moreover, the timing does not change when the writing cores are allotted on one or several nodes. The reason is that MVS-10P has only one I/O node and all file operations are performed through it. On the other hand, the BlueGene/P system has several I/O nodes, so a reduction in the writing time is obtained when increasing the number of coupler cores. Nevertheless, the main advantage of the CMF I/O system is its asynchrony and memory scalability. The acceleration obtained on the BlueGene/P system is a pleasant particular result rather than a guaranteed CMF feature. It draws attention to the need to develop the I/O infrastructure of supercomputer systems. It is obvious that scalability graphs of a future exaflop machine with millions of cores become very artificial if one has to work with the file system through a single channel.
Apart from the coupler, the framework also includes two helpful blocks. For the pre-run stage, the CMF2.0 includes the off-line block, which constructs SCRIP interpolation weights and prepares initial condition files. Like the run-stage CMF program, it is implemented in terms of abstract operations, which reduces all model configuration actions required from the user (e.g., grid definition) to the realization of a few abstract interfaces in a user-derived class.
At the run stage, the user can call various utility modules, such as the HaloUpdater, which is needed in finite-difference models. It implements a 4-neighbor update scheme for halos of any width, dimension and data type on latitude–longitude and tripolar grids, also handling diagonal cells. The impact of the HaloUpdater on the performance of the INMIO WOM is described later.
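For orientation, the mpi4py sketch below performs a basic 4-neighbor halo exchange on a 2-D Cartesian decomposition (run, e.g., with four MPI processes). It is a simplified stand-in for the HaloUpdater and does not reproduce its treatment of diagonal cells, tripolar-grid seams or arbitrary halo widths and data types.

```python
# Simplified 4-neighbor halo update on a periodic 2-D process grid (illustration only).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), 2)
cart = comm.Create_cart(dims, periods=[True, True])

n = 8                                         # local subdomain size without the halo
u = np.full((n + 2, n + 2), cart.Get_rank(), dtype=np.float64)

# exchange rows along axis 0
lo, hi = cart.Shift(0, 1)                     # neighbours in the -/+ direction
buf = np.empty(n + 2)
cart.Sendrecv(np.ascontiguousarray(u[-2, :]), dest=hi, recvbuf=buf, source=lo)
u[0, :] = buf                                 # halo row next to the lower neighbour
cart.Sendrecv(np.ascontiguousarray(u[1, :]), dest=lo, recvbuf=buf, source=hi)
u[-1, :] = buf                                # halo row next to the upper neighbour

# exchange columns along axis 1
lo, hi = cart.Shift(1, 1)
cart.Sendrecv(np.ascontiguousarray(u[:, -2]), dest=hi, recvbuf=buf, source=lo)
u[:, 0] = buf                                 # halo column next to the left neighbour
cart.Sendrecv(np.ascontiguousarray(u[:, 1]), dest=lo, recvbuf=buf, source=hi)
u[:, -1] = buf                                # halo column next to the right neighbour
```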
Also, the CMF2.0 provides helpful tools for automatic building of various model combinations, makefile and skeleton class generation, data preprocessing, and for other infrastructure actions.
The CMF2.0 has shown itself to be a suitable framework for high-resolution coupled modeling, allowing us to perform long-term experiments which would be impossible without it. But the CMF2.0 still has several points for improvement. First of all, although the pure MPI-based messaging is quite fast, it requires explicit work with sending and receiving buffers. Additionally, the development of nested regional models becomes quite difficult using only MPI routines. The CMF2.0 test results showed that we can easily sacrifice some performance and choose a better (though perhaps less computationally efficient) abstraction to simplify the messaging routines.
We have chosen the Global Arrays library (GA), which implements the
Partitioned Global Address Space (PGAS) paradigm of parallel communication
and provides an interface that allows one to distribute data while maintaining
the type of global index space and programming syntax similar to that
available when programming on a single processor
Development of this idea in the CMF3.0 has resulted in the class
Communicator_GA, which encapsulates the logic of working with the GA and
provides an interface for put/get operations of sections of global arrays
from different model components and services. Moreover, this interface could
be used not only for connections between models (including nested ones) but
also as a communication mechanism between the models and the coupler, because
it allows one to hide all decomposition-to-decomposition problems arising in
distributed-memory applications. In the CMF3.0, every array that
participates in inter-model communications has its “mirror” in the
corresponding virtual global array. When the model needs to perform some
action, it puts/gets data to/from the global array (this operation is local
since the global arrays' processor-wise allocation perfectly matches the
model decomposition) and continues calculations. Service components get the
array from
The architecture of the compact framework CMF3.0. There are four components in this example: ocean model (OCN), ice model (ICE), atmosphere model (ATM) and sea model (SEA). The components send requests to the common message queue, from where they are retrieved by the coupler (CPL), data assimilation (DAS), input and output data (IOD), and nesting (NST) services. The data itself is transferred through the mechanism of global arrays, which is also used for inter-processor communications in the components and services.
As the complexity of coupled models grows, we need an easy and convenient way of connecting model components together. The SOA, which was originally introduced for web applications, gives a good pattern for component interactions. In the CMF3.0, all model components send their requests to the common message queue. The service components only receive the messages they can process, then get data from the appropriate global arrays and perform the required actions. Such an architecture allows us to minimize dependencies between physical and service components and makes development much easier. Moreover, since all services in the CMF3.0 are based on the same template (they inherit the base class Service), it also allows the user to easily add new services to the system by filling in only a few abstract interfaces. Today, we have four completely independent services built into the CMF3.0: CPL (mapping), IOD (I/O service), NST (nesting service) and DAS (data assimilation service).
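A schematic, language-neutral sketch of this service pattern is given below. The class names and request kinds are invented for the illustration and do not correspond to the actual Fortran Service class; each service simply scans the requests that reach it and ignores the kinds it does not handle (cf. the event handling described later).

```python
# Schematic sketch of the service template; all names are hypothetical.
class Service:
    """Base template: concrete services differ only in the request kinds they
    accept and in the handler they implement."""
    handled = ()                               # request kinds this service accepts

    def run(self, incoming):
        for kind, payload in incoming:         # requests reaching this service
            if kind in self.handled:
                self.handle(kind, payload)
            # requests of other kinds are simply ignored

    def handle(self, kind, payload):
        raise NotImplementedError

class IOD(Service):
    handled = ("write_diag", "write_cp")
    def handle(self, kind, payload):
        print(f"IOD: writing {payload} ({kind})")

class CPL(Service):
    handled = ("map",)
    def handle(self, kind, payload):
        print(f"CPL: interpolating {payload}")

requests = [("write_diag", "ocn_sst"), ("map", "ocn_sst"), ("write_cp", "ocn_state")]
IOD().run(requests)    # processes the two write requests, ignores the mapping one
```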
The CPL service represents the coupler of the CMF2.0 and serves all mapping requests. It receives data through the Communicator_GA class routines, performs interpolation and pushes data to the destination global array (without a request from the receiving side). Although the central coupler architecture of CMF2.0 allows one to collect all service operations in one external component and to perform each of them in parallel, simultaneous requests can sometimes lead to inefficient usage of processor time. For example, the CMF2.0 coupler cannot perform parallel mapping and parallel I/O operations at the same time. This is a disadvantage of all data transfer schemes that combine two or more actions in one process.
In the CMF3.0, we decided to introduce a separate I/O service, responsible only for working with the file system. For example, when writing data to a file, the procedure on the component side is as follows: the model component waits until the corresponding GA array is free, puts the data into the GA array, marks the GA array as full and sends a request to the IOD service. On the IOD side, the request is read by the service, which then takes the requested data from the GA array, marks the array as free and calls NetCDF routines (the same as in the CMF2.0) for parallel writing to the file. This approach, though not expected to be faster than the CMF2.0 direct MPI messaging, provides flexible and fully asynchronous data writing, limited mainly by the bandwidth of the file system.
A performance test of the CMF3.0 I/O system in the INMIO World Ocean model of
0.1
Wall-clock time of an 8-day run of INMIO World Ocean model with 0.1
It should be noted that one external I/O service solves only part of the problem, because in the case of writing large CP files and dumping frequent lightweight diagnostics, the model would still be blocked by the former. Therefore, the service may be further split into two parts – fast and slow I/O devices. Owing to the abstract structure of the Service class, this separation can be done with a few lines of code.
Further development of the CMF has included data assimilation algorithms. For
the ocean model, we have added the new DAS service, which implements the
logic of parallel data assimilation
The Communicator_GA is a CMF3.0 system class that represents a kind of
facade for the GA library. That is, it defines a high-level interface and
hides some subtleties of the GA from the user. For example, the class allows
one to create an array that will be distributed on one component, but still
visible to another component. It can be a temperature array that is
physically distributed over the ocean's cores (and they can read and write
data to it), but, in addition, the CPL service can also work with this array,
although it does not store any part of it. Creation of such a global array in
CMF3.0 requires just a few subroutine calls (a schematic sketch is given after the list):
– request the CMF system for the component identifiers and process lists of the currently running ocean model and CPL service;
– register this joint group of processes, prescribing the ocean as the holder and CPL as the subscriber;
– request the system for the current ocean decomposition;
– register the array, specifying the ocean as the holder and passing its decomposition (so that GA distributes the mirror array in the same way as the model component does with the original one), and the coupler as the subscriber.
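The toy Python sketch below mirrors these four steps. All class and call names are invented for illustration and do not correspond to the Communicator_GA interface; its only purpose is to show that the mirror array is laid out block by block exactly like the holder's decomposition, so that a holder-side put or get is a purely local copy.

```python
# Toy illustration of holder/subscriber registration; names are hypothetical.
import numpy as np

class ToyGA:
    def __init__(self):
        self.arrays = {}

    def register_group(self, holder, subscriber):            # steps 1-2
        return {"holder": holder, "subscriber": subscriber}

    def register_array(self, name, group, decomposition):    # step 4
        # one block per holder rank, laid out exactly like the model decomposition
        self.arrays[name] = {rank: np.zeros((i1 - i0, j1 - j0))
                             for rank, (i0, i1, j0, j1) in decomposition.items()}
        return self.arrays[name]

ga = ToyGA()
ocn_decomp = {0: (0, 5, 0, 10), 1: (5, 10, 0, 10)}            # step 3: two ocean ranks
group = ga.register_group(holder="ocean", subscriber="CPL")
sst = ga.register_array("sst", group, ocn_decomp)
sst[0][...] = 17.0     # a "put" by ocean rank 0 touches only its own local block
```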
The GA put and get operations may now be called. For the holder side they
will be local due to consistency of the decomposition. One of the benefits of
this architecture is that now the model decomposition can be arbitrary. For
example, it becomes easy not to reserve processor cores for subdomains of an
ocean model that lie on land.
Every put/get operation must maintain explicit synchronization by setting the array status to “full” or “empty” accordingly. This is required since we are not allowed to “lose data”. That is, even if some component (e.g., the ocean model) is faster than another component (the atmosphere model, or IOD in the case of too frequent data dumps), we must not lose an array. Accumulating arrays in a queue would not help either, since the models usually work at constant speeds and, as a result, the queue would soon exhaust all available memory. So, if the “fast” model is ready to put/get data, but the GA array is still occupied/empty, the model is blocked.
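The single-node sketch below (threads standing in for MPI processes) illustrates this full/empty handshake: the producer blocks if its previous array has not yet been consumed, so nothing is lost and no unbounded queue can build up. The class and its fields are invented for the illustration; the real mechanism operates on GA mirror arrays across processes.

```python
# Toy full/empty handshake between a producing "model" and a consuming "service".
import threading
import numpy as np

class MirrorSlot:
    def __init__(self, shape):
        self.data = np.zeros(shape)
        self.full = threading.Event()          # "full" flag of the mirror array
        self.free = threading.Event()
        self.free.set()

    def put(self, data):                       # model side
        self.free.wait()                       # block while the slot is still occupied
        self.free.clear()
        self.data[...] = data
        self.full.set()

    def get(self):                             # service side
        self.full.wait()                       # block until the model has produced data
        self.full.clear()
        out = self.data.copy()
        self.free.set()
        return out

slot = MirrorSlot((4, 4))
model = threading.Thread(target=lambda: [slot.put(np.full((4, 4), t)) for t in range(3)])
model.start()
for _ in range(3):
    print(slot.get()[0, 0])                    # 0.0, 1.0, 2.0 in order, none lost
model.join()
```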
Since the logic of interpolation subroutines in the CMF3.0 remains the same
as in the CMF2.0, we can greatly simplify it by use of GA abstractions. Now,
all source data needed by the destination cell is collected directly by
Communicator_GA routines. The optimizations regarding repeated cells are
preserved. A disadvantage of using the GA is a decrease in performance, since
it cannot provide persistent operations or the overlap of computations and
communications within one service, and it obviously has its own overheads. We take
the same parameters and input files as in Test I to compare the CMF3.0
performance with that of the CMF2.0 (Fig.
Tests were conducted on the MVS-10P supercomputer configuration with 16 cores
per node. The graph shows that results are not as good as those for the
CMF2.0 (Fig.
Wall-clock time required for the 10-day ocean–atmosphere model run with disabled physics vs. number of coupler cores on the MVS-10P supercomputer (Test I for CMF2.0 and CMF3.0).
At any moment during the run time, the CMF3.0 services can respond to request
messages and trigger certain actions on data arrays. So, the model is allowed
to send such requests (i.e., raise events) unexpectedly, at any step of its
time cycle. Nevertheless, sometimes we may know in advance a schedule of
actions (e.g., sending mapping every 2 h, diagnostics every day and CP every month).
The CMF3.0 provides a simple mechanism for generation of
such scheduled actions in order to save the user from having to keep track of
time and send requests at the right moments. At present, we have two types of
event generators: NormalEvent, which represents uniform actions (like
diagnostics saving, etc.), and SyncVarEvent, which allows one to synchronize
with the time axis of a NetCDF file (it is useful for experiments with
prescribed forcing referenced to the real calendar, e.g., the Drakkar forcing
set;
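As a simple illustration of the scheduled-event idea, the sketch below shows a uniform generator in the spirit of NormalEvent. The class name, its fields and the use of seconds of model time are assumptions of this example and do not reproduce the CMF3.0 code.

```python
# Toy uniform event generator (illustration only).
class NormalEvent:
    """Raises a named request every 'period' seconds of model time."""
    def __init__(self, name, period):
        self.name, self.period, self.next_time = name, period, period

    def poll(self, model_time):
        """Return the event name if it is due at this step, otherwise None."""
        if model_time >= self.next_time:
            self.next_time += self.period
            return self.name
        return None

diag = NormalEvent("save_diag", period=86400)      # once per model day
for t in range(0, 3 * 86400 + 1, 3600):            # hourly time stepping
    if (event := diag.poll(t)) is not None:
        print(f"t = {t} s: raise event {event}")   # fires at days 1, 2 and 3
```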
In the case of unexpected behavior (such as exceptions in the model physics or changes in external data), the user can directly call the raise-event routine, e.g., for an emergency data dump or even to change the functioning of other model components by special messages.
The first to respond to an event is the model component itself – it looks at the event's type and determines what to do (e.g., in the case of saving diagnostics: put the data into the GA array, mark it as full, send a request to the services and continue running). Then, the event is packed into an MPI message and sent as a request to all services (if the model has decided to send it). The services unpack the event, look at its name, and either process it or ignore it.
Other parallel utilities implemented in the CMF3.0 include
– array operations, such as resizing, changing index order, converting to string and back, and searching for a particular element;
– calculating global sums and area integrals over a decomposed model field, which is important for maintaining conservation in geophysical fluid models (e.g., to correct the algebraic sum of precipitation, evaporation and runoff in stand-alone ocean simulations; a sketch is given after this list);
– memory usage monitoring.
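The following mpi4py sketch shows the kind of global sum used for such a conservation correction over a decomposed field; the variable names, the uniform cell areas and the random test data are assumptions for the illustration only.

```python
# Sketch of a conservation-style correction via a global sum over a decomposed field.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
local_flux = np.random.rand(50, 50)          # e.g. P - E + R on this rank's subdomain
local_area = np.ones_like(local_flux)        # cell areas of the subdomain

global_integral = comm.allreduce(float(np.sum(local_flux * local_area)), op=MPI.SUM)
global_area = comm.allreduce(float(np.sum(local_area)), op=MPI.SUM)

correction = global_integral / global_area   # mean imbalance to remove everywhere
local_flux -= correction                     # enforces a zero global algebraic sum
```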
In the CMF3.0, we included all pre- and post-processing utility modules
available in CMF2.0. It is not difficult to migrate from the CMF2.0 to the
CMF3.0. Only one adapter file, about 200 lines of code, should be rewritten.
It contains several procedures (
There are several examples of using CMF for various geophysical numerical
models:
– Eddy-resolving ocean dynamics modeling using the INMIO WOM with 0.1
– Data assimilation of satellite observations and ARGO float measurements using the DAS service in forecast and reanalysis experiments with the INMIO WOM governed by the CMF3.0
– A set of works with coupled atmosphere–ocean models for climate change research and numerical weather prediction, involving the SLAV global atmosphere model
– The nesting technology implemented in the CMF3.0 NST service has been tested for the local INMIO-based model of the Barents Sea with a resolution of 0.1
– First results of the seasonal variability simulation for the Arctic and North Atlantic ocean waters and ice by the coupled INMIO WOM and a sea-ice CICE5.1
As it was mentioned, one of the goals of the CMF is to provide tools for effective parallel calculations of stand-alone models. Historically, it was developed to provide efficient support for the INMIO WOM. This model utilizes a 2-D decomposition of the tripolar grid. Increasing the number of cores decreases (almost proportionally) the number of performed operations for each process, since the model uses explicit time schemes for horizontal operators, which require only local halo updates. Therefore, limitations in scalability can only be associated with halo update routines and external blocks (e.g., in the I/O system).
The latest version of INMIO WOM is distributed in an integrated package
together with the CMF2.0 and 3.0, all necessary libraries and a standardized
folder structure facilitating the addition of new model components (including
adapter files for the Los Alamos Community Ice CodE (CICE) sea-ice model). At present,
the INMIO code consists of the hydrodynamical solver, atmospheric boundary
layer bulk formulae, the built-in thermodynamic ice model of
Wall-clock time of the 0.1
Scalability of the INMIO WOM of 0.1
The second application of the framework was the numerical experiment with the
global coupled INMIO ocean
Prognostic coupled model calculations were carried out with a time step of 6 min for the oceanic component and 3.6 min for the atmospheric one. The initial state of the ocean was obtained by a spin-up of the stand-alone ocean model driven by the ERA-Interim atmospheric forcing. The atmosphere started from the objective analysis of the Hydrometcenter of Russia. Every 72 min, nine 2-D arrays were transferred from the atmosphere to the ocean (components of wind stress, short- and long-wave radiation, fluxes of sensible and latent heat, precipitation, evaporation and air temperature at 2 m). Conversely, every 144 min three 2-D arrays were transferred from the ocean to the atmosphere (upper grid-box temperature, temperature and concentration of sea ice). The sea ice was simulated by the INMIO built-in ice thermodynamics model, while the land processes were incorporated into the SLAV atmosphere model. The coupled model works stably and, along with the intra-annual distribution characteristics of the monthly data fields, reproduces rather fine-scale features of the atmospheric and oceanic circulation.
The model throughput on the MVS-10P supercomputer was equal to 0.75 simulated years per day (SYPD) for the configuration ocean (1152 cores) – atmosphere (288 cores) – coupler (16 cores). At that time, the maximal communicator size available for the atmosphere model was limited by the one-dimensional latitudinal grid decomposition.
Like any other service of the CMF3.0, the data assimilation is performed on separate processor cores. This allows us to structure the Earth modeling system better, so that each software component solves its own problem. The ocean model itself does not take part in the data assimilation; only the results of the ocean modeling, in the form of ensemble vector elements, are used, and the covariance matrices are approximated on their basis. The data from the ocean model are sent to the service (usually
once a modeling day) without using the file system (through the cluster
interconnect). Moreover, all matrix–vector operations are calculated in
parallel (on shared memory) using BLAS and LAPACK functions through the
Global Arrays (GA) toolkit
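For orientation, a standard form of the EnOI analysis step is given below; the exact formulation implemented in the DAS service (e.g., the localization or the scaling used) may differ, and the symbols are the conventional ones rather than those of the CMF3.0 code:

x^a = x^b + α B H^T (α H B H^T + R)^(-1) (y − H x^b),   with B ≈ A' A'^T / (m − 1),

where x^b is the background model state, y the vector of observations, H the observation operator, R the observation-error covariance matrix, A' the matrix of ensemble anomalies (the ensemble vector elements mentioned above), m the ensemble size and α a scaling factor. The matrix–vector products entering this expression are the BLAS and LAPACK operations referred to above.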
Due to the effective implementation of the EnOI method as the DAS parallel
software service, the data assimilation problem scales almost linearly
(Fig.
Scalability of the EnOI method in the context of the CMF3.0 DAS service
at the assimilation of
We present an original modeling framework, CMF, developed as our first step towards high-resolution modeling. Its key part, the coupler, in the initial version CMF2.0 has a rather small code size for programs of this kind (about 5000 lines of code including unit tests) and is able to manage the main parallel problems of coupled modeling – synchronization, regridding and I/O. The coupled model follows the single-executable design, with the main program independent of the components' code and the coupler dealing with all service operations. The new version, CMF3.0, utilizes the SOA design, which allows one to divide the coupler responsibilities into small separate services, easily plug/unplug them and add new ones to the system, thus further generalizing the coupling interface. The PGAS messaging greatly simplifies the implementation of all low-level interprocess communications of the model.
Tests for CMF2.0 parallel mapping efficiency were carried out on four modern
supercomputer architectures. They show a nearly linear scalability of the
overall communication system and the regridding procedure. Satisfactory speed
results could be achieved already on 20–40 coupler cores even dealing with
grids of high resolution (0.1
Originally designed to support the INMIO World Ocean model, the CMF has developed into a flexible and extensible instrument providing means for high-resolution, resource-demanding simulations in regional to global, stand-alone or coupled, and forecast and climate problems.
The code of the CMF3.0 and CMF2.0 (distributed under GPLv2
licence) is available on
The MVS-100k and MVS-10P systems are installed at the Joint Supercomputer
Center of the Russian Academy of Sciences (
Supercomputer BlueGene/P is located at the Faculty of Computational
Mathematics and Cybernetics, Moscow State University, and consists of
2048 computing nodes. Each node has four PowerPC 450 cores (850 MHz) and
2 GB of RAM. Nodes are networked with the 3-D torus topology
(5.1 GB s
Supercomputer BlueGene/Q is located at the IBM Thomas J. Watson Research
Center and consists of several racks. Every two racks contain 2048 computational
nodes, each with 16 PowerPC A2 cores (1.6 GHz) and 16 GB of RAM.
Nodes are networked with the 5-D torus topology (40 GB s
Supercomputer Lomonosov is located at the Lomonosov Moscow State University
and consists of more than 50 000 cores. We have used the partition with
eight-core nodes (
VK was responsible for most aspects of the CMF2.0 and CMF3.0 development, including design, code writing, and testing. RI and KU designed and developed the INMIO model and physical coupling algorithms. MK was the CMF3.0 co-developer and the author of the Data Assimilation Service. VK wrote the first draft of the article, and all co-authors contributed to software validation, performance tests, and the final version of the article.
The authors declare that they have no conflict of interest.
The research of Sections 1–3, 5.1 and 5.2 was supported by the Russian Science Foundation (project no. 14-37-00053) and performed at the Hydrometeorological Research Center of the Russian Federation. The research of Sections 4 and 5.3 was supported by the Russian Science Foundation (project no. 17-77-30001) and performed at the Federal State Budget Scientific Institution “Marine Hydrophysical Institute of RAS”. Edited by: Steve Easterbrook Reviewed by: Sophie Valcke and one anonymous referee