Rapidly evolving computational techniques are making a large gap between scientific aspiration and code implementation in climate modeling. In this work, we design a simple computing library to bridge the gap and decouple the work of ocean modeling from parallel computing. This library provides 12 basic operators that feature user-friendly interfaces, effective programming, and implicit parallelism. Several state-of-the-art computing techniques, including computing graph and just-in-time compiling, are employed to parallelize the seemingly serial code and speed up the ocean models. These operator interfaces are designed using native Fortran programming language to smooth the learning curve. We further implement a highly readable and efficient ocean model that contains only 1860 lines of code but achieves a 91 % parallel efficiency in strong scaling and 99 % parallel efficiency in weak scaling with 4096 Intel CPU cores. This ocean model also exhibits excellent scalability on the heterogeneous Sunway TaihuLight supercomputer. This work presents a promising alternative tool for the development of ocean models.
Many earth system models have been developed in the past several decades to improve the predictive understanding of the earth system (Bonan and Doney, 2018; Collins et al., 2018; Taylor et al., 2012). These models are becoming increasingly complicated, and the amount of code has expanded from a few thousand lines to tens of thousands or even millions of lines. In terms of software engineering, an increase in code causes the models to be more difficult to develop and maintain.
The complexity of these models mainly originates from three aspects. First, more model components and physical processes have been embedded into the earth system models, leading to a 10-fold increase in the amount of code (e.g., Alexander and Easterbrook, 2015). Second, some heterogeneous and advanced computing platforms (e.g., Lawrence et al., 2018) have been widely used by the climate modeling community, resulting in a 5-fold increase in the amount of code (e.g., Xu et al., 2015). Last, most of the model programs need to be rewritten due to the continual development of novel numerical methods and meshes. The promotion of novel numerical methods and technologies produced in the fields of computational mathematics and computer science have been limited in climate science because of the extremely heavy burden caused by program rewriting and migration.
Over the next few decades, tremendous computing capacities will be accompanied by more heterogeneous architectures which are equipped with two or more kinds of cores or processing elements (Shan, 2006), thus making a much more sophisticated computing environment for climate modelers than ever before (National Research Council, 2012). Clearly, transiting the current earth system models to the next generation of computing environments will be highly challenging and disruptive. Overall, complex codes in earth system models combined with rapidly evolving computational techniques create a very large gap between scientific aspiration and code implementation in the climate modeling community.
To reduce the complexity of earth system models and bridge this gap, a universal and productive computing library is a promising solution. Through establishing an implicit parallel and platform-independent computing library, the complex models can be simplified and will no longer need explicit parallelism and transiting, thus effectively decoupling the development of ocean models from complicated parallel computing techniques and diverse heterogeneous computing platforms.
Many efforts have been made to address the complexity of parallel programming for numerical simulations, such as operator overloading, source-to-source translator, and domain-specific language (DSL). Operator overloading supports the customized data type and provides simple operators and function interfaces to implement the model algorithm. This technique is widely used because the implementation is straightforward and easy to understand (Corliss and Griewank, 1994; Walther et al., 2003). However, it is prone to work inefficiently because overloading execution induces numerous unnecessary intermediate variables, consuming valuable memory bandwidth resources. Using a source-to-source translator offers another solution. As indicated by the name, this method converts one language, which is usually strictly constrained by self-defined rules, to another (Bae et al., 2013; Lidman et al., 2012). It requires tremendous work to develop and maintain a robust source-to-source compiler. Furthermore, DSLs can provide high-level abstraction interfaces that use mathematical notations similar to those used by domain scientists so that they can write much more concise and more straightforward code. Some outstanding DSLs, such as ATMOL (van Engelen, 2001), ICON DSL (Torres et al., 2013), STELLA (Gysi et al., 2015), and ATLAS (Deconinck et al., 2017), are used by the numerical model community. Although they seem to be source-to-source techniques, DSLs are newly defined languages and produce executable programs instead of target languages. Therefore, the new syntax makes it difficult for the modelers to master the DSLs. In addition, most DSLs are not yet supported by robust compilers due to their relatively short history. Most of the source-to-source translators and DSLs still do not support the rapidly evolving heterogeneous computing platforms, such as the Chinese Sunway TaihuLight supercomputer which is based on the homegrown Sunway heterogeneous many-core processors and located at the National Supercomputing Center in Wuxi.
Other methods such as COARRAY Fortran and CPP templates provide alternative ways. Using COARRAY Fortran, a modeler has to control the reading and writing operation of each image (Mellor-Crummey et al., 2009). In a sense, one has to manipulate the images in parallel instead of writing serial code. In term of CPP templates, it is usually suitable for small amounts of code and difficult for debugging (Porkoláb et al., 2007).
Inspired by the philosophy of operator overloading, source-to-source
translating and DSLs, we integrated the advantages of these three methods
into a simple computing library which is called OpenArray. The main
contributions of OpenArray are as follows:
Furthermore, we developed a numerical ocean model based on the Princeton
Ocean Model (POM; Blumberg and
Mellor, 1987) to test the capability and efficiency of OpenArray. The new
model is called the Generalized Operator Model of the Ocean (GOMO). Because
the parallel computing details are completely hidden, GOMO consists of only
1860 lines of Fortran code and is more easily understood and maintained than
the original POM. Moreover, GOMO exhibits excellent scalability and
portability on both central processing unit (CPU) and Sunway platforms.
The remainder of this paper is organized as follows. Section 2 introduces some concepts and presents the detailed mathematical descriptions of formulating the PDEs into operator expressions. Section 3 describes the detailed design and optimization techniques of OpenArray. The implementation of GOMO is described in Sect. 4. Section 5 evaluates the performances of OpenArray and GOMO. Finally, discussion and conclusion are given in Sects. 6 and 7, respectively.
In this section, we introduce three important concepts in OpenArray:
To achieve this simplicity, we designed a derived data type,
To illustrate the concept of an operator, we first take a 2-dimensional (2-D)
continuous equation solving sea surface elevation as an example:
Arrangement of variables in the staggered Arakawa C grid.
In this way, the above continuous equation can be discretized into the
following form.
Table 1 lists the detailed definitions of the 12 basic operators. The
term var denotes a 3-D model variable. All 12 operators for the finite
difference calculations are named using three letters in the form [
Definitions of the 12 basic operators.
Using the basic operators, Eq. (4) is expressed as
Implementation of Eq. (6) by basic operators. The elf and elb variables are the
surface elevations at times (
Next, we will use the operators in shallow water equations, which are more
complicated than those in the previous case. Assuming that the flow is in
hydrostatic balance and that the density and viscosity coefficients are
constant, and neglecting the molecular friction, the shallow water equations
are
Implementation of the shallow water equations by basic operators.
elf, el, and elb denote sea surface elevations at times (
Most ocean models are implemented based on the staggered Arakawa grids
(Arakawa and Lamb, 1981; Griffies et
al., 2000). The variables in ocean models are allocated at different grid
points. The calculations that use these variables are performed after
several reasonable interpolations or differences. When we call the
differential operations on a staggered grid, the difference value between
adjacent points should be divided by the grid increment to obtain the final
result. Setting the correct grid increment for modelers is troublesome work
that is extremely prone to error, especially when the grid is nonuniform.
Therefore, we propose an abstract staggered grid to support flexible
switching of operator calculations among different staggered grids. When the
grid information is provided at the initialization phase of OpenArray, a set
of grid increments, including horizontal increments (
As shown in Fig. 4, the cubes in the (a), (b), (c), and (d) panels are the
minimum abstract grid accounting for
The schematic diagram of the relative positions of the variables on the abstract staggered grid and the jumping procedures among the grid points.
When we call the average and differential operators mentioned in Table 1,
e.g., on the abstract Arakawa C grid, the position of
The jumping rules of different operators are given in Table 2. Due to the design of the abstract staggered grids, the jumping rules for the Arakawa A, B, C, and D grids are fixed. A change in the position of an array is determined only by the direction of a certain operator acting on that array.
The jumping rules of an operator acting on an
The grid type can be changed conveniently. For instance, if one would like to do so, only the following steps need to be taken. First the position information of
each physical variable needs to be reset (shown in Fig. 4). Then the discrete
form of each equation needs to be redesigned. We take the Eq. (1) switching
from Arakawa C grid to Arakawa B grid as an example. The positions of the
horizontal velocity
The position information and jumping rules are used to implicitly check
whether the discrete form of an equation is correct. The grid increments are
hidden by all the differential operators, thus it makes the code simple and
clean. In addition, since the rules are suitable for multiple staggered
Arakawa grids, the modelers can flexibly switch the ocean model between
different Arakawa grids. Notably, the users of OpenArray should input the
correct positions of each array in the initialization phase. The value of
the position is an input parameter when declaring an
Although most of the existing ocean models use finite difference or finite volume methods on structured or semi-structured meshes (e.g., Blumberg and Mellor, 1987; Shchepetkin and McWilliams, 2005), there are still some ocean models using unstructured meshes (e.g., Chen et al., 2003; Korn, 2017), and even the spectral element method (e.g., Levin et al., 2000). In our current work, we design the basic operators only for finite difference and finite volume methods with structured grids. More customized operators for the other numerical methods and meshes will be implemented in our future work.
Through the above operator notations in Table 1, ocean modelers can quickly convert the discrete PDEs into the corresponding operator expression forms. The main purpose of OpenArray is to make complex parallel programming transparent to the modelers. As illustrated in Fig. 5, we use a computation graph as an intermediate representation, meaning that the operator expression forms written in Fortran will be translated into a computation graph with a particular data structure. In addition, OpenArray will use the intermediate computation graph to analyze the dependency of the distributed data and produce the underlying parallel code. Finally, we use stable and mature compilers, such as the GNU Compiler Collection (GCC), Intel compiler (ICC), and Sunway compiler (SWACC), to generate the executable programs according to different back-end platforms. These four steps and some related techniques are described in detail in this section.
The workflow of OpenArray.
Although the basic generalized operators listed in Table 1 are only suitable
to execute 1st-order difference, other high-order difference or even more
complicated operations can be combined by these basic operators. For
example, a 2nd-order difference operation can be expressed as
Using these operators to express the discrete PDE, the code and formula are very similar. We call this effect “the self-documenting code is the formula”. Figure 6 shows the one-to-one correspondence of each item in the code and the items in the sea surface elevation equation. The code is very easy to program and understand. Clearly, the basic operators and the combined operators greatly simplify the development and maintenance of ocean models. The complicated parallel and optimization techniques are hidden behind these operators. Modelers no longer need to care about details and can escape from the “parallelism swamp”, and can therefore concentrate on the scientific issues.
The effect of “the self-documenting code is the formula” illustrated by the sea surface elevation equation.
Considering the example mentioned in Fig. 6, if one needs to compute the
term
To solve this problem, we convert an operator expression form into a
directed and acyclic graph, which consists of basic data and function nodes,
to implement a so-called lazy expression evaluation
(Bloss et al., 1988; Reynolds, 1999). Unlike the
traditional operator overloading method, we overload all arithmetic
functions to generate an intermediate computation graph rather than to
obtain the result of each function. This method is widely used in deep-learning frameworks, e.g., TensorFlow (Abadi et al., 2016)
and Theano (Bastien et al., 2012), to improve computing
efficiency. Figure 7 shows the procedure of parsing the operator expression
form of the sea level elevation equation into a computation graph. The input
variables in the square boxes include the sea surface elevation (elb), the
zonal velocity (
Parsing the operator expression form into the computation graph.
Given a computation graph, we design a lightweight engine to generate the corresponding source code (Fig. 8). Each operator node in the computation graph is called a kernel. The sequence of all kernels in a graph is usually fused into a large kernel function. Therefore, the underlying engine schedules and executes the fused kernel once and obtains the final result directly without any auxiliary or temporary variables. Simultaneously, the scheduling overhead of the computation graph and the startup overhead of the basic kernels can be reduced.
The schematic diagram of kernel fusion.
Most of the scientific computational applications are limited by the memory
bandwidth and cannot fully exploit the computing power of a processor.
Fortunately, kernel fusion is an effective optimization method to improve
memory locality. When two kernels need to process some data, their fusion
holds shared data in the memory. Prior to the kernel fusion, the computation
graph is analyzed to find the operator nodes that can be fused, and the
analysis results are stored in several subgraphs. Users can access any
individual subgraph by assigning the subgraph to an intermediate variable
for diagnostic purposes. After being given a series of subgraphs, the
underlying engine dynamically generates the corresponding kernel function in
C
Notably, the time to compile a single kernel function is short but practical applications usually need to be run for thousands of time steps and the overhead of generating and compiling the kernel functions for the computation graph is extremely high. Therefore, we generate a fusion kernel function only once for each subgraph, and put it into a function pool. Later, when facing the same computation subgraph, we fetch the corresponding fusion kernel function directly from the pool.
Since the arrays in OpenArray are distributed among different processing units, and the operator needs to use the data in the neighboring points, to ensure the correctness it is necessary to check the data consistency before fusion. The use of different data-splitting methods for distributed arrays can greatly affect computing performance. The current data-splitting method in OpenArray is the widely used block-based strategy. Solving PDEs on structured grids often divides the simulated domain into blocks that are distributed to different processing units. However, the differential and average operators always require their neighboring points to perform array computations. Clearly, ocean modelers have to frequently call corresponding functions to carefully control the communication of the local boundary region.
Therefore, we implemented a general boundary management module to implicitly maintain and update the local boundary information so that the modelers no longer need to address the message communication. The boundary management module uses asynchronous communication to update and maintain the data of the boundary region, which is useful for simultaneous computing and communication. These procedures of asynchronous communication are implicitly invoked when calling the basic kernel or the fused kernel to ensure that the parallel details are completely transparent to the modelers. For the global boundary conditions of the limited physical domains, the values at the physical border are always set to zero within the operators and operator expressions. In realistic cases, the global boundary conditions are set by a series of functions (e.g., radiation, wall) provided by OpenArray.
With the help of dynamic code generation and JIT compilation technology,
OpenArray can be migrated to different back-end platforms. Several basic
libraries, including Boost C
According to the TOP500 list released in November 2018, the Sunway TaihuLight is ranked third in the world, with a LINPACK benchmark rating of 93 petaflops provided by Sunway many-core processors (or Sunway CPUs). As shown in Fig. 9, every Sunway CPU includes 260 processing elements (or cores) that are divided into four core groups. Each core group consists of 64 computing processing elements (CPEs) and a management processing element (MPE) (Qiao et al., 2017). CPEs handle large-scale computing tasks and the MPE is responsible for the task scheduling and communication. The relationship between MPE and CPE is like that between CPU and many-core accelerator, except that they are fused into a single Sunway processor sharing a unified memory space. To make the most of the computing resources of the Sunway TaihuLight, we generate kernel functions for the MPE, which is responsible for the thread control, and CPE, which performs the computations. The kernel functions are fully optimized with several code optimization techniques (Pugh, 1991) such as loop tiling, loop aligning, single-instruction multiple-date (SIMD) vectorization, and function inline. In addition, due to the high memory access latency of CPEs, we accelerate data access by providing instructions for direct memory access in the kernel to transfer data between the main memory and local memory (Fu et al., 2017).
The MPE–CPE hybrid architecture of the Sunway processor. Every Sunway processor includes 4 core groups (CGs) connected by the network-on-chip (NoC). Each CG consists of a management processing element (MPE), 64 computing processing elements (CPEs) and a memory controller (MC). The Sunway processor uses the system interface (SI) to connect with outside devices.
In this section, we introduce how to implement a numerical ocean model using OpenArray. The most important step is to derive the primitive discrete governing equations in operator expression form, and then the following work is completed by OpenArray.
The fundamental equations of GOMO are derived from POM. GOMO features a bottom-following free-surface staggered Arakawa C grid. To effectively evolve the rapid surface fluctuations, GOMO uses the mode-splitting algorithm inherited from POM to address the fast-propagating surface gravity waves and slow-propagating internal waves in barotropic (external) and baroclinic (internal) modes, respectively. The details of the continuous governing equations, the corresponding operator expression form, and the descriptions of all the variables used in GOMO are listed in the appendices A–C, respectively.
Figure 10 shows the basic flow diagram of GOMO. At the beginning, we
initialize OpenArray to make all operators suitable for GOMO. After loading
the initial values and the model parameters, the distance information is
input into the differential operators through grid binding. In the external
mode, the main consumption is computing the 2-D sea surface elevation
Flow diagram of GOMO.
Comparing GOMO with several variations in the POM.
When the user dives into the GOMO code, the main time stepping loop in GOMO
appears to run on a single processor. However, as described above, implicit
parallelism is the most prominent feature of the program using OpenArray.
The operators in OpenArray, not only the difference and average operators,
but also the “
Because the complicated parallel optimization and tuning processes are decoupled from the ocean modeling, we completely implemented GOMO based on OpenArray in only 4 weeks, whereas implementation may take several months or even longer when using the MPI or CUDA library.
In comparison with the existing POM and its multiple variations, including (to name a few) Stony Brook Parallel Ocean Model (sbPOM), mpiPOM, and POMgpu, GOMO has less code but is more powerful in terms of compatibility. As shown in Table 3, the serial version of POM (POM2k) contains 3521 lines of code. Models sbPOM and mpiPOM are parallelized using MPI, while POMgpu is based on MPI and CUDA-C. The codes of sbPOM, mpiPOM, and POMgpu are extended to 4801, 9680, and 30 443 lines, respectively. In contrast, the code of GOMO is decreased to 1860 lines. Moreover, GOMO completes the same function as the other approaches while using the least amount of code (Table 4), since the complexity has been transferred to OpenArray, which includes about 11 800 lines of codes.
Comparison of the amount of code for different functions.
In addition, poor portability considerably restricts the use of advanced hardware in oceanography. With the advantages of OpenArray, GOMO is adaptable to different hardware architectures, such as the Sunway processor. The modelers do not need to modify any code when changing platforms, eliminating the heavy burden of transmitting code. As computing platforms become increasingly diverse and complex, GOMO becomes more powerful and attractive than the machine-dependent models.
In this section, we first evaluate the basic performance of OpenArray using benchmark tests on a single CPU platform. After checking the correctness of GOMO through an ideal seamount test case, we use GOMO to further test the scalability and efficiency of OpenArray.
We choose two typical PDEs and their implementations from Rodinia v3.1, which is a benchmark suite for heterogeneous computing (Che et al., 2009), as the original version. For comparison, we re-implement these two PDEs using OpenArray. In addition, we added two other test cases. As shown in Table 5, the 2-D continuity equation is used to solve sea surface height, and its continuous form is shown in Eq. (1). The 2-D heat diffusion equation is a parabolic PDE that describes the distribution of heat over time in a given region. Hotspot is a thermal simulation used for estimating processor temperature on structured grids (Che et al., 2009; Huang et al., 2006). We tested one 2-D case (Hotspot2D) and one 3-D case (Hotspot3D) of this program. The average runtime for 100 iterations is taken as the performance metric. All tests are executed on a single workstation with an Intel Xeon E5-2650 CPU. The experimental results show that the performance of OpenArray versions is comparable to the original versions.
Four benchmark tests.
The seamount problem proposed by Beckmann and Haidvogel is a widely used
ideal test case for regional ocean models (Beckmann and
Haidvogel, 1993). It is a stratified Taylor column problem which simulates
the flow over an isolated seamount with a constant salinity and a reference
vertical temperature stratification. An eastward horizontal current of 0.1 m s
Using the seamount test case, we compare GOMO and sbPOM results. The configurations of both models are exactly the same. Figure 11 shows that GOMO and sbPOM both capture the anticyclonic circulation at 3500 m depth. The shaded plot shows the surface elevation, and the array plot shows the current at 3500 m. Panels 11a, b, and c show the results of GOMO, sbPOM, and the difference (GOMO-sbPOM), respectively. The differences in the surface elevation and deep currents between the two models are negligible (Fig. 11c).
Comparison of the surface elevation (shaded) and currents at 3500 m depth (vector) between GOMO and sbPOM on the 4th model day.
The seamount test case is used to compare the performance of sbPOM and GOMO
in a parallel environment. We use the X86 cluster at the National Supercomputing
Center in Wuxi, China, which provides 5000 Intel Xeon E5-2650 v2 CPUs for
our account at most. Figure 12a shows the result of a strong scaling
evaluation, in which the model size is fixed at
Performance comparison between sbPOM and GOMO on the X86 cluster.
Panels
We also test the scalability of GOMO on the Sunway platform. Supposing that
the baseline is the runtime of GOMO at 10 000 Sunway cores with a grid size
of
Parallel efficiency of GOMO on the Sunway platform.
As we mentioned in Sect. 1, the advantages of OpenArray are its ease of use, high efficiency, and portability. Using OpenArray, modelers without any parallel computing skill or experience can write simple operator expressions in Fortran to implement complex ocean models. The ocean models can be run on any CPU or Sunway platforms which have deployed the OpenArray library. We call this effect “write once, run everywhere”. Other similar libraries (e.g., ATMOL, ICON DSL, STELLA, and COARRAY) require users to manually control the boundary communication and task scheduling to some extent. In contrast, OpenArray implements completely implicit parallelism with user-friendly interfaces and programming languages.
However, there are still several problems to be solved in the development of OpenArray. The first issue is computational efficiency. Once a variable is in one of the processor registers or in the highest speed cache, it should be used as much as possible before being replaced. In fact, we should never move variables more than once each time step. The memory consumption brought by overloading techniques is usually high due to the unnecessary variable moving and unavoidable cache missing. The current efficiency and scalability of GOMO are close to sbPOM since we have adopted a series of optimization methods, such as memory pool, graph computing, JIT compilation, and vectorization, to alleviate the requirement of memory bandwidth. However, we have to admit that we cannot fully solve the memory bandwidth limit problem at present. We think that time skewing is a cache-oblivious algorithm for stencil computations (Frigo and Strumpen, 2005) since it can exploit temporal locality optimally throughout the entire memory hierarchy. In addition, the polyhedral model may be another potential approach, which uses an abstract mathematical representation based on integer polyhedral to analyze and optimize the memory access pattern of a program.
The second issue is that the current OpenArray version cannot support customized operators. When modelers try out another higher-order advection or any other numerical scheme, the 12 basic operators provided by OpenArray are not abundant. We considered using a template mechanism to support the customized operators. The rules of operations are defined in a template file, where the calculation form of each customized operator is described by a regular expression. If users want to add a customized operator, they only need to append a regular expression into the template file.
OpenArray and GOMO will continue to be developed, and the following three key improvements are planned for the following years.
First, we are developing the GPU version of OpenArray. During the development, the principle is to keep hot data in GPU memory or directly swapping between GPUs and avoid returning data to the main CPU memory. NVLink provides high bandwidth and outstanding scalability for GPU-to-CPU or GPU-to-GPU communication and addresses the interconnect issue for multi-GPU and multi-GPU/CPU systems.
Second, the data input/output is becoming a bottleneck of earth system models as the resolution increases rapidly. At present we encapsulate the PnetCDF library to provide simple I/O interfaces, such as load operation and store operation. A climate fast input/output (CFIO) library (Huang et al., 2014) will be implemented into OpenArray in the next few years. The performance of CFIO is approximately 220 % faster than PnetCDF because of the overlapping of I/O and computing. CFIO will be merged into the future version of OpenArray and the performance is expected to be further improved.
Finally, as for most of the ocean models, GOMO also faces the load imbalance issue. We are adding the more effective load balance schemes, including space-filling curve (Dennis, 2007) and curvilinear orthogonal grids, into OpenArray in order to reduce the computational cost on land points.
OpenArray is a product of the collaboration between oceanographers and computer scientists. It plays an important role in simplify the porting work on the Sunway TaihuLight supercomputer. We believe that OpenArray and GOMO will continue to be maintained and upgraded. We aim to promote it to the model community as a development tool for future numerical models.
In this paper, we design a simple computing library (OpenArray) to decouple ocean modeling and parallel computing. OpenArray provides 12 basic operators that are abstracted from PDEs and extended to ocean model governing equations. These operators feature user-friendly interfaces and an implicit parallelization ability. Furthermore, some state-of-the-art optimization mechanisms, including computation graphing, kernel fusion, dynamic source code generation, and JIT compiling, are applied to boost the performance. The experimental results prove that the performance of a program using OpenArray is comparable to that of well-designed programs using Fortran. Based on OpenArray, we implement a numerical ocean model (GOMO) with high productivity, enhanced readability, and excellent scalable performance. Moreover, GOMO shows high scalability on both CPU systems and the Sunway platform. Although more realistic tests are needed, OpenArray may signal the beginning of a new frontier in future ocean modeling through ingesting basic operators and cutting-edge computing techniques.
The source codes of OpenArray v1.0 are available at
The equations governing the baroclinic (internal) mode in GOMO are the
3-dimensional hydrostatic primitive equations.
The discrete governing equations of baroclinic (internal) mode expressed by
operators are shown as below:
The description of each symbol in the governing equations is listed below.
Descriptions of symbols.
XiaH led the project of OpenArray and the writing of this paper. DW, QW, SZ, and XinH designed OpenArray. XinH, DW, QW, SZ, MW, YG, and QT implemented and tested GOMO. All coauthors contributed to the writing of this paper.
The authors declare that they have no conflict of interest.
Xiaomeng Huang is supported by a grant from the National Key R&D Program of China (2016YFB0201100), the National Natural Science Foundation of China (41776010), and the Center for High Performance Computing and System Simulation of the Pilot National Laboratory for Marine Science and Technology (Qingdao). Xing Huang is supported by a grant from the National Key R&D Program of China (2018YFB0505000). Shixun Zhang is supported by a grant from the National Key R&D Program of China (2017YFC1502200) and Qingdao National Laboratory for Marine Science and Technology (QNLM2016ORP0108). Zhenya Song is supported by National Natural Science Foundation of China (U1806205) and AoShan Talents Cultivation Excellent Scholar Program supported by Qingdao National Laboratory for Marine Science and Technology (2017ASTCP-ES04).
This research has been supported by the National Key R&D Program of China (grant nos. 2016YFB0201100, 2018YFB0505000, and 2017YFC1502200), the National Natural Science Foundation of China (grant nos. 41776010 and U1806205), and the Qingdao National Laboratory for Marine Science and Technology (grant nos. QNLM2016ORP0108 and 2017ASTCP-ES04).
This paper was edited by Steven Phipps and reviewed by David Webb and two anonymous referees.