Improving data transfer for model coupling

Introduction
Climate System Models (CSMs) and Earth System Models (ESMs) are fundamental tools for simulating, predicting and projecting the climate. A CSM or an ESM generally integrates several component models, such as an atmosphere model, a land surface model, an ocean model, and a sea-ice model, into a coupled system, to simulate the behaviors of and interactions between components of the climate system. More and more ESMs have sprung up around the world. For example, the number of coupled model versions in the Coupled Model Intercomparison Project (CMIP) has increased from fewer than 30 (used for CMIP3) to more than 50 (used for CMIP5). High-performance computing is an essential technical support for model development, especially as model resolutions become higher and higher. Modern high-performance computers integrate an increasing number of processor cores for ever higher computation performance. Therefore, efficient parallelization, which enables a model to utilize more processor cores for acceleration, has become a technical focus in model development, and a number of efficiently parallelized component models have emerged. For example, the Community Ice CodE (CICE; Hunke et al., 2008, 2013) at 0.1° horizontal resolution can scale to 30 000 processor cores on the IBM Blue Gene/L (Dennis et al., 2008); the Parallel Ocean Program (POP; Kerbyson, 2005; Smith et al., 2010) at 0.1° horizontal resolution can also scale to 30 000 processor cores on the IBM Blue Gene/L and to 10 000 processor cores on a Cray XT3 (Dennis, 2007); the Community Atmosphere Model (CAM; Morrison et al., 2008; Neale et al., 2010, 2012) with the spectral element dynamical core (CAM-SE) at 0.25° horizontal resolution can scale to 86 000 processor cores on a Cray XT5 (Dennis et al., 2012). To achieve an efficient parallelization of a coupled model, each component model needs to be efficiently parallelized.
A coupler is an important component in a coupled system. It links component models together to construct a coupled model, and controls the integration of the whole coupled model. A number of couplers are now available for model coupling, e.g., the Model Coupling Toolkit (MCT; Jacob et al., 2005), the Ocean Atmosphere Sea Ice Soil coupling software (OASIS) coupler (Redler et al., 2010; Valcke, 2013), the Earth System Modelling Framework (ESMF; Hill et al., 2004), the CPL6 coupler (Craig et al., 2005), the CPL7 coupler (Craig et al., 2012), the Flexible Modelling System (FMS) coupler (Balaji et al., 2006), the Bespoke Framework Generator (BFG; Ford et al., 2006; Armstrong et al., 2009), and the community coupler version 1 (C-Coupler1; Liu et al., 2014), among others. Most of the existing couplers provide fundamental coupling functions that include data transfer between component models and data interpolation between different model grids (Valcke et al., 2012).
A coupler generally has much smaller overhead than the component models. However, it is potentially a time-consuming component of future ESMs, because more and more component models (such as a land-ice model, a chemistry model and a biogeochemical model) will be coupled into an ESM, and the coupling between component models will become more and more frequent. Data transfer is a fundamental and the most frequently used operation in a coupler. It is responsible for transferring data fields between the processes of different component models, and for rearranging data fields among the processes of the same component model for parallel data interpolation.
A coupler may become a bottleneck for the efficient parallelization of future coupled models. The most obvious reason is that the current implementation of data transfer in state-of-the-art couplers is not efficient enough. For example, the data transfer from a component with a logically rectangular grid (of 1021 × 1442 grid points) to a component with a Gaussian reduced T799 grid (with 843 000 grid points) can only scale to about 100 processor cores when using OASIS3 (Valcke, 2013) and to about 1000 processor cores when using OASIS3-MCT (Valcke et al., 2013); the data transfer from a component model with a horizontal grid of 576 × 384 grid points to another component model with a horizontal grid of 3600 × 2400 grid points can only scale to about 500 processor cores when using the CPL7 coupler (Craig et al., 2012). Therefore, it is highly desirable to improve the parallelization of couplers.
In this study, we propose a butterfly implementation of data transfer and then develop an adaptive data transfer library that is open to the public. Performance evaluation demonstrates that this library significantly improves the performance of data transfer in most cases and does not decrease the performance in any case. The library has been integrated into C-Coupler1 with only slight code modifications, and we believe it can be easily integrated into other couplers for better data transfer performance.

The remainder of this paper is organized as follows. We briefly introduce the implementation of data transfer in existing couplers in Sect. 2 and analyze the performance bottlenecks of the existing implementation in Sect. 3. Details of the butterfly implementation and the adaptive data transfer library are presented in Sects. 4 and 5, respectively. The performance of the butterfly implementation and the adaptive data transfer library is evaluated in Sect. 6. Conclusions are given in Sect. 7.

MCT
MCT works as a library for model coupling. It can be directly used to construct a coupled model with different component models, and can also be used to develop other couplers, such as OASIS3-MCT, the CPL6 coupler and the CPL7 coupler. It provides fundamental coupling functions, i.e., data transfer and data interpolation, in parallel. To achieve a parallel data transfer, MCT first generates a communication router (known as the data mapping between processes) according to the parallel decompositions of the two component models, and then uses the point-to-point (P2P) communication of the Message Passing Interface (MPI) to transfer data. A data field is transferred from a process of the source component model to a process of the target component model only when the two processes have common grid points. A data transfer can serve multiple data fields, which will be packed into one MPI message for better communication performance.
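For illustration, the following simplified Fortran sketch (our own, not MCT's actual code; the subroutine and argument names are hypothetical) shows the send side of such a router-driven transfer; the receive side mirrors it with MPI_Irecv:

```fortran
! Sketch of the send side of a router-driven P2P transfer. The router is
! assumed precomputed: for each remote process sharing grid points with
! this one, it stores the remote rank and the local indices of the shared
! points (packed per rank, with offsets in router_off).
subroutine router_p2p_send(field, router_ranks, router_idx, router_off, comm)
  use mpi
  implicit none
  real(8), intent(in) :: field(:)        ! local segment of the coupling field
  integer, intent(in) :: router_ranks(:) ! remote ranks sharing grid points
  integer, intent(in) :: router_idx(:)   ! local indices, packed per remote rank
  integer, intent(in) :: router_off(:)   ! offsets into router_idx, size(ranks)+1
  integer, intent(in) :: comm
  real(8), allocatable :: sendbuf(:)
  integer, allocatable :: reqs(:)
  integer :: i, j, n, ierr

  allocate(reqs(size(router_ranks)))
  allocate(sendbuf(router_off(size(router_off)) - 1))
  do i = 1, size(router_ranks)
     n = router_off(i+1) - router_off(i)
     do j = 0, n - 1   ! pack the shared points into one contiguous message
        sendbuf(router_off(i) + j) = field(router_idx(router_off(i) + j))
     end do
     call MPI_Isend(sendbuf(router_off(i)), n, MPI_DOUBLE_PRECISION, &
                    router_ranks(i), 0, comm, reqs(i), ierr)
  end do
  call MPI_Waitall(size(router_ranks), reqs, MPI_STATUSES_IGNORE, ierr)
  deallocate(sendbuf, reqs)
end subroutine router_p2p_send
```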
On the other hand, parallel interpolation also introduces data exchange among processes of the same component model. Interpolation is generally performed as a matrix-vector multiplication. To parallelize the interpolation efficiently, MCT can rearrange the layout of a data field among processes, so that the matrix-vector multiplication can be performed locally on each process. This data rearrangement is essentially a data transfer.
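Once the rearrangement is done, each process applies its own rows of the sparse interpolation matrix locally. A minimal sketch (our own, assuming a CSR-like layout of the weights rather than MCT's actual data structures) is:

```fortran
! Locally applied interpolation after the data rearrangement: each process
! holds the rows of the sparse interpolation matrix for its own target
! points, stored in a CSR-like layout (an assumption for this sketch).
subroutine apply_interp(src, dst, row_off, col_idx, weights)
  implicit none
  real(8), intent(in)  :: src(:)      ! rearranged source values, now local
  real(8), intent(out) :: dst(:)      ! interpolated values on target points
  integer, intent(in)  :: row_off(:)  ! size(dst)+1 row offsets
  integer, intent(in)  :: col_idx(:)  ! local source index for each weight
  real(8), intent(in)  :: weights(:)  ! interpolation weights
  integer :: i, k
  do i = 1, size(dst)
     dst(i) = 0.0d0
     do k = row_off(i), row_off(i+1) - 1
        dst(i) = dst(i) + weights(k) * src(col_idx(k))
     end do
  end do
end subroutine apply_interp
```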

The OASIS coupler
The OASIS coupler has been developed, mainly by the European Centre for Research and Advanced Training in Scientific Computing (CERFACS), since 1991. OASIS3 (Valcke, 2013a) is a 2-D version of the OASIS coupler with broad usage. To transfer a field from one component model to another, a process of OASIS3 first gathers the field from the processes of the source component model and then scatters the field to the processes of the target component model. Each process of OASIS3 can transfer one model field, so multiple model fields can be transferred in parallel. However, the parallelism of such an implementation is limited by the number of coupling fields. To overcome this limitation, MCT has been used to develop the latest version of the OASIS coupler, OASIS3-MCT.
OASIS4 is a 3-D version of the OASIS coupler. The data exchange library in the PRISM System Model Interface Library (PSMILe; Redler et al., 2010), which performs communication with MPI, is used to perform the data transfer in OASIS4. Similar to MCT, each process only needs to send or receive the data of its local decomposition. In OASIS3, the interpolation of a field is carried out by only one process: like the implementation of data transfer in OASIS3, the data to be interpolated are gathered from all processes of the corresponding component model before the interpolation, and scattered back to all processes afterwards. In OASIS4 and OASIS3-MCT, the interpolation is performed in parallel, where all processes of the corresponding component model cooperatively perform the interpolation at the same time. The data rearrangement for the parallel interpolation is implemented by PSMILe in OASIS4 and by MCT in OASIS3-MCT.

ESMF
The Earth System Modeling Framework (ESMF) is a widely used software framework for model development. It defines a superstructure for the architecture of component models and an infrastructure with common coupling functions for model coupling. In ESMF, coupler components are responsible for regridding and transferring data among component models. A coupler component builds the correspondence between the data of the source model and the data of the target model according to their parallel decompositions; the data are then transferred in parallel according to this correspondence.

The FMS coupler
FMS is a software framework developed by the Geophysical Fluid Dynamics Laboratory (GFDL). It supports the development, construction, execution, and scientific interpretation of models. The FMS coupler deploys an exchange grid to perform the coupling.
Given the grids of two component models, their exchange grid is generated from all the vertices of the two grids. The coupling fields from a source component model to a target component model are first interpolated onto the exchange grid, and then averaged onto the target grid. Data transfer among different processes is performed with MPI P2P communications.
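As a rough illustration of the second step, the following sketch (our own simplification, not FMS code) performs area-weighted averaging from exchange-grid cells onto target-grid cells, assuming each exchange cell records the index of the target cell containing it:

```fortran
! Area-weighted averaging of exchange-grid values onto the target grid.
! The mapping xtarget and the cell areas are assumed precomputed.
subroutine exchange_to_target(xval, xarea, xtarget, tval)
  implicit none
  real(8), intent(in)  :: xval(:)    ! field on exchange-grid cells
  real(8), intent(in)  :: xarea(:)   ! area of each exchange-grid cell
  integer, intent(in)  :: xtarget(:) ! target cell containing each exchange cell
  real(8), intent(out) :: tval(:)    ! area-averaged field on target cells
  real(8) :: tarea(size(tval))
  integer :: i, t
  tval = 0.0d0
  tarea = 0.0d0
  do i = 1, size(xval)
     t = xtarget(i)
     tval(t) = tval(t) + xarea(i) * xval(i)   ! accumulate area-weighted values
     tarea(t) = tarea(t) + xarea(i)           ! accumulate overlapping areas
  end do
  where (tarea > 0.0d0) tval = tval / tarea   ! normalize by total area
end subroutine exchange_to_target
```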

The CPL6 coupler
The CPL6 coupler is a centralized coupler for the Community Climate System Model version 3 (CCSM3; Collins et al., 2006) developed at the National Center for Atmospheric Research (NCAR); the data transfer between component models must go through the coupler. The CPL6 coupler integrates MCT for data transfer and data interpolation. Therefore, the data transfer between component models is processed in parallel with MPI P2P communications and can serve multiple model fields at the same time for better communication performance.

The CPL7 coupler
The CPL7 coupler is the latest coupler version from NCAR. It has been used for the ESMs of the Community Climate System Model version 4 (CCSM4; Gent et al., 2011) and the Community Earth System Model (CESM; Hurrell et al., 2013). Similar to the CPL6 coupler, the CPL7 coupler is a centralized coupler, where the data transfer between component models must go through the coupler. The CPL7 coupler also integrates MCT for data transfer and data interpolation. Moreover, it supports the ESMF coupling interface and can use the coupling functions in ESMF for data transfer and data interpolation.

C-Coupler1
C-Coupler1 is a Chinese community coupler for Earth system modeling. It achieves 3-D coupling with flexible 3-D interpolation, and supports direct coupling without a specific coupler component to improve parallel performance. Its implementation of data transfer is derived from the corresponding implementation in MCT. In other words, C-Coupler1 first generates a communication router according to the parallel decompositions of the component models, and then uses MPI P2P communication to transfer the coupling fields in parallel. To further improve communication performance, model fields with different data types, different model grids, or different parallel decompositions can be served by the same data transfer.

Performance bottlenecks of existing implementations
The implementations of data transfer in state-of-the-art couplers are similar: they can be summarized as MPI P2P communication that transfers data among processes according to the two corresponding parallel decompositions. In the following, we call such an implementation the "P2P implementation" for short. To reveal why the P2P implementation is inefficient, we first derive a benchmark from a real coupled model version, GAMIL2-CLM3, where GAMIL2 (Li et al., 2013) is an atmosphere model and CLM3 (Oleson et al., 2004) is a land surface model. GAMIL2 and CLM3 share the same horizontal grid of 7680 (128 × 60) grid points. In this benchmark, there is only the data transfer with the P2P implementation between two data models using the same grid as the horizontal grid of GAMIL2-CLM3. The parallel decompositions of the source and target data models are the same as those of CLM3 and GAMIL2, respectively. A high-performance computer named Tansuo100 at Tsinghua University, China, is used for the performance testing. It has 700 computing nodes, each of which contains two six-core Intel Xeon X5670 CPUs and 32 GB of main memory. All computing nodes are connected by a high-speed InfiniBand network with a peak communication bandwidth of 5 GB s⁻¹.
To evaluate the parallel performance of the P2P implementation, 14 2-D coupling fields are transferred between the two data models. In each test, the two data models use the same number of processes. As there are 12 CPU cores on each computing node, the number of processes is generally set to an integral multiple of 12; when the process number is less than 12, the two data models are still placed on two different computing nodes. The two data models never share the same computing node, so the communication of the P2P implementation must go through the InfiniBand network.
Figure 1 demonstrates the poor performance of the P2P implementation. It is well known that communication performance heavily depends on message size. As shown in Fig. 2, the achieved communication bandwidth generally increases with message size; when the message size is small (for example, smaller than 4 KB), the achieved communication bandwidth is very low. The message size in the P2P implementation decreases as the process number of the models increases (Fig. 3), indicating that the achieved communication bandwidth drops as more processes are used. The performance of a data transfer also heavily depends on the number of MPI messages. As shown in Fig. 4, the number of MPI messages in the P2P implementation increases with the process number. We may therefore conclude that the decrease of message size and the increase of the number of MPI messages are the primary reasons for the poor performance of the P2P implementation when increasing the process number. However, the ideal performance shown in Fig. 5 is much better than the actual performance, and the ratio between them increases significantly with the processor number. This significant gap is due to jams in the network communication: for example, when multiple P2P communications share the same source or target process, they must wait in order.

Butterfly implementation for better performance of data transfer
To improve the performance of data transfer, a new implementation should overcome the drawbacks of the P2P implementation, which can be summarized as low communication bandwidth due to small message sizes, a variable and large number of MPI messages, and jams in communications. We therefore propose a new implementation called the butterfly implementation. As shown in Fig. 6, it is similar to the butterfly diagram in the Fast Fourier Transform (FFT; Heckbert, 1995). The most significant challenge for the butterfly implementation is that its process number must be 2^n, where n is a non-negative integer, while the process number of a data transfer can generally be any positive integer. To resolve this challenge, we investigated how to efficiently map processes between the butterfly implementation and the sender/receiver. Next, we introduce the butterfly implementation and the process mapping.

The butterfly implementation
The butterfly implementation aims to rearrange the data from the source parallel decomposition to the target parallel decomposition. As shown in Fig. 6, there are multiple stages in the butterfly implementation. Given the process number N = 2^n, the number of stages is n + 1. Each stage has a unique parallel decomposition. The parallel decompositions of the first and last stages are determined by the source and target parallel decompositions, respectively, while the parallel decompositions of the other stages are determined by the first and last stages. Between any two successive stages, all processes are split into a number of pairs, and the two processes of each pair exchange data according to the corresponding parallel decompositions using MPI P2P communication.
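The exchange pattern can be sketched as follows (our own simplified illustration for N = 2^n processes; pack_for and unpack_from are hypothetical helpers that select and merge data according to the intermediate parallel decompositions, and equal message sizes are assumed for brevity):

```fortran
! Butterfly exchange for N = 2**n processes: at step k, each process
! exchanges one message with the partner whose rank differs in bit k,
! so every process sends and receives exactly n = log2(N) messages.
subroutine butterfly_transfer(data, nwords, nprocs, myrank, comm)
  use mpi
  implicit none
  integer, intent(in) :: nwords, nprocs, myrank, comm
  real(8), intent(inout) :: data(nwords)   ! this process's share of the fields
  real(8) :: sendbuf(nwords), recvbuf(nwords)
  integer :: step, partner, ierr

  step = 1
  do while (step < nprocs)        ! log2(nprocs) exchange steps in total
     partner = ieor(myrank, step) ! partner differs from myrank in one bit
     ! Select the local data owned by the partner's side of the next
     ! intermediate decomposition (hypothetical helper).
     call pack_for(partner, data, sendbuf)
     call MPI_Sendrecv(sendbuf, nwords, MPI_DOUBLE_PRECISION, partner, 0, &
                       recvbuf, nwords, MPI_DOUBLE_PRECISION, partner, 0, &
                       comm, MPI_STATUS_IGNORE, ierr)
     ! Merge the received data into the local share (hypothetical helper).
     call unpack_from(partner, recvbuf, data)
     step = step * 2
  end do
end subroutine butterfly_transfer
```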
Compared to the existing implementations of data transfer, the butterfly implementation has the following advantages (a worked example follows this list):

1. Bigger message size for better communication bandwidth. The message size is M/(2N) on average, where M is the total size of the data to be transferred and N is the process number.

2. Balanced number of MPI messages among processes. Each process performs log2 N MPI communications.

3. Ordered communications among processes and fewer communications operating concurrently. Jams in the network communication can be dramatically reduced.
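For illustration (the numbers here are our own, not measured): with N = 256 processes and M = 100 MB of data in total, each butterfly process sends log2 256 = 8 messages of roughly M/(2N) ≈ 200 KB each, whereas a P2P implementation in which every source process overlaps with every target process would send messages of only about M/N² ≈ 1.5 KB, well below the 4 KB threshold at which the bandwidth in Fig. 2 collapses.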

Process mapping
The process number of the butterfly kernel must be 2^n, where n is a non-negative integer, while the process number of the sender or receiver can be any positive integer. The first question is how to decide the number of processes of the butterfly kernel. Any process of the sender or receiver can be used as a process of the butterfly kernel. Given that the total number of unique processes of the sender and receiver is N_T, the process number of the butterfly kernel (N_B) can be any power of 2 that is no larger than N_T. For example, we can select the maximum such number to utilize as many resources as possible. When N_B < N_T, we prefer to pick processes first from the sender, and then from the receiver if the sender does not have enough processes, in order to save the overhead of the process mapping from the sender to the butterfly kernel.
The second question is how to decide the process mapping from the sender to the butterfly kernel and from the butterfly kernel to the receiver. To minimize the overhead of the process mapping from the butterfly kernel to the receiver, we map one or multiple processes of the butterfly kernel to a process of the receiver if the butterfly kernel has more processes than the receiver; otherwise, we map a process of the butterfly kernel to one or multiple processes of the receiver. In other words, there is no multiple-to-multiple process mapping between the butterfly kernel and the receiver. Similarly, there is no multiple-to-multiple process mapping between the sender and the butterfly kernel. Processes of the sender or receiver may be unbalanced in terms of the size of the data transferred, which may result in unbalanced communications between processes of the butterfly kernel.
As mentioned in Sect. 4.1, at each stage of the butterfly kernel, all processes are split into a number of pairs, each of which is involved in P2P communications. To improve the balance of communications among the processes, one solution is to make the process pairs at each stage more balanced in terms of the data size of the P2P communications. To achieve balanced data sizes among process pairs, we propose to take the sorted order of the processes in terms of data size into consideration. For example, among the remaining processes that have not yet been paired, we can pair the process with the largest data size with the process with the smallest data size. The pairing of the processes is conducted iteratively over the stages of the butterfly kernel: all processes are taken as the input for the first stage, while the output of the pairing for one stage is the input for the next stage. After finishing the iterative pairing through all stages, all processes of the sender or receiver are reordered.

The iterative pairing also requires the number of processes to be a power of 2. Given that the number of processes of the sender (or receiver) is N_C and the process number of the butterfly kernel is N_B, we first pad empty processes (with data size 0) before the iterative pairing, to make the number of processes of the sender (or receiver) a power of 2 (denoted N_P) that is no smaller than N_B. The reordered N_P processes after the iterative pairing can then be divided into N_B groups, each of which contains N_P/N_B processes with consecutive reordered indexes and maps to a unique process of the butterfly kernel. Figure 7 shows an example that further illustrates the process mapping.
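One pairing step of this scheme can be sketched as follows (our own illustration; the input is assumed to be already padded with empty, size-0 processes up to a power of 2):

```fortran
! One pairing step: sort the (padded) processes by data size, then pair
! the largest with the smallest, the second largest with the second
! smallest, and so on.
subroutine pair_by_size(nproc, datasize, pair_a, pair_b)
  implicit none
  integer, intent(in)  :: nproc              ! padded count, a power of 2
  real(8), intent(in)  :: datasize(nproc)    ! data size per process (0 = padding)
  integer, intent(out) :: pair_a(nproc/2), pair_b(nproc/2)
  integer :: order(nproc), i, j, tmp

  do i = 1, nproc                 ! start from the identity ordering
     order(i) = i
  end do
  do i = 2, nproc                 ! insertion sort, ascending by data size
     j = i
     do while (j > 1)
        if (datasize(order(j-1)) <= datasize(order(j))) exit
        tmp = order(j-1); order(j-1) = order(j); order(j) = tmp
        j = j - 1
     end do
  end do
  do i = 1, nproc / 2             ! pair smallest with largest, and so on
     pair_a(i) = order(i)
     pair_b(i) = order(nproc - i + 1)
  end do
end subroutine pair_by_size
```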

Adaptive data transfer library
We now have two implementations of data transfer: the P2P implementation and the butterfly implementation. Although the butterfly implementation can effectively improve the performance of data transfer, it still has some drawbacks: (1) it generally has a larger total message size than the P2P implementation; (2) its stage number is log2 N (where N is the number of processes of the butterfly kernel), which may be larger than the average number of MPI messages per process in the P2P implementation. Therefore, the P2P implementation may outperform the butterfly implementation in some cases (examples are given in Sect. 6). To achieve optimal performance for data transfer, we propose an adaptive data transfer library that keeps the advantages of both implementations in all cases.
As introduced in Sect. 4, the butterfly implementation is divided into multiple stages, each with a unique intermediate parallel decomposition. The data transfer between two successive stages can be viewed as a P2P implementation with only one MPI message per process. Inspired by this fact, we design an adaptive approach that combines the butterfly and P2P implementations, where some stages of the butterfly implementation are skipped and replaced by P2P implementations with more MPI messages per process. Figure 8 shows an example of the adaptive data transfer library with 8 processes, where Stage 1 of the butterfly implementation is skipped and replaced by a P2P implementation with 3 MPI messages per process. The most significant challenge for such an adaptive approach is determining which stage(s) of the butterfly implementation should be skipped. Our first idea was to design a cost model that accurately predicts the performance of data transfer under the various implementations. We eventually gave up this idea because it is almost impossible to accurately predict communication performance on a high-performance computer, especially when many users share the computer to run various applications. Profiling, i.e., directly measuring the performance of data transfer, is more practical for determining an appropriate implementation, because an Earth system simulation always takes a long time to run. To obtain an appropriate implementation in the adaptive data transfer library, we successively try to skip the stages of the butterfly implementation: if skipping a stage achieves better performance, the stage is skipped; otherwise, it is kept. Figure 9 shows a flowchart for determining an appropriate implementation in the adaptive data transfer library. In the algorithm, a stage mask array (Stage_mask in the flowchart) specifies which stages are skipped: each array element corresponds to a stage of the butterfly implementation; if the value of an element is false, the corresponding stage is skipped and handled by a P2P implementation; otherwise, the stage is kept.
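The profiling-driven choice of the stage mask can be sketched as follows (our own condensed rendering of the flowchart in Fig. 9; run_transfer is a hypothetical helper that executes the transfer with the given mask and returns the measured wall-clock time):

```fortran
! Choose which butterfly stages to skip by direct profiling.
subroutine choose_stage_mask(nstages, stage_mask)
  implicit none
  integer, intent(in)  :: nstages
  logical, intent(out) :: stage_mask(nstages)  ! .false. means the stage is skipped
  real(8), external :: run_transfer            ! hypothetical profiling helper
  real(8) :: best, trial
  integer :: s

  stage_mask = .true.              ! start from the pure butterfly implementation
  best = run_transfer(stage_mask)  ! measure the baseline
  do s = 1, nstages
     stage_mask(s) = .false.       ! tentatively skip stage s
     trial = run_transfer(stage_mask)
     if (trial < best) then
        best = trial               ! skipping helped: keep stage s skipped
     else
        stage_mask(s) = .true.     ! skipping hurt: restore stage s
     end if
  end do
end subroutine choose_stage_mask
```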
The source code of the adaptive data transfer library is mainly written in C++, while the application programming interfaces (APIs) are provided in Fortran, because most couplers and models are programmed in Fortran. Table 1 lists the APIs, and Fig. 10 shows an example of how to use them. The adaptive data transfer library can transfer 2-D and 3-D fields at the same time. It is publicly available online (see the code availability section).
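A condensed usage sketch based on the APIs in Table 1 is given below; the field names, sizes, and calling order here are our assumptions (Fig. 10 shows a complete example):

```fortran
! Typical life cycle of one data transfer instance (a sketch; names and
! ordering are placeholders, not the authoritative usage).
subroutine couple_with_adaptive_transfer(instance_id, nsteps)
  implicit none
  integer, intent(in) :: instance_id   ! assumed obtained when the instance was
                                       ! created (creation not shown in Table 1)
  integer, intent(in) :: nsteps
  real(8) :: sst(7680), taux(7680)     ! placeholder 2-D fields, flattened
  logical :: mask(2)
  integer :: step

  ! Register one outgoing and one incoming coupling field.
  call data_transfer_register_field(instance_id, sst, .false.)  ! output field
  call data_transfer_register_field(instance_id, taux, .true.)  ! input field

  ! Optionally control which fields are active at a given coupling step.
  mask = (/ .true., .true. /)
  call data_transfer_register_mask(instance_id, mask)

  ! Initialize once (the profiling happens here), execute every coupling
  ! step, and finalize at the end of the run.
  call data_transfer_init_instance(instance_id)
  do step = 1, nsteps
     call data_transfer_exec_instance(instance_id)
  end do
  call data_transfer_final_instance(instance_id)
end subroutine couple_with_adaptive_transfer
```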

Performance evaluation
To improve the performance of data transfer for model coupling, we have proposed the butterfly implementation and an adaptive data transfer library that combines the butterfly implementation with the traditional P2P implementation. In this section, we empirically evaluate the adaptive data transfer library by comparing it with the butterfly implementation and the P2P implementation. Both toy models and realistic models (GAMIL2-CLM3 and CESM) are used for the performance evaluation. GAMIL2-CLM3 was introduced in Sect. 3; CESM is a state-of-the-art ESM developed by NCAR. All experiments are run on the high-performance computer Tansuo100 introduced in Sect. 3.
In the following, we evaluate in turn the overhead of initialization, the performance of data transfer, and the performance of data rearrangement for interpolation.

Overhead of initialization
We first evaluate the initialization overhead of the different implementations of data transfer. As shown in Fig. 11, the initialization overheads of all three implementations increase with the core number. The initialization overhead of the butterfly implementation is a little higher than that of the P2P implementation, while the initialization overhead of the adaptive data transfer library is 4-5 times that of the P2P implementation, because the adaptive data transfer library spends extra time on performance profiling. Considering that a data transfer instance is initialized only once at the beginning and executed many times in an ESM, we conclude that the initialization overhead of the adaptive data transfer library is reasonable, especially for long simulations.


Performance of data transfer between toy models
In this subsection, we evaluate the performance of data transfer (excluding the initialization overhead) with two toy models that use the same logically rectangular grid (of 192 × 96 grid points). In each test, the two toy models have the same process number and each process has the same MPI message number. The MPI message number of a process can be modified by adjusting the parallel decompositions of the toy models. The factors that impact the performance of a data transfer implementation include the number of MPI messages, the size of the data to be transferred (expressed as the number of fields in this evaluation) and the number of processes. Next, we evaluate the performance of data transfer by varying these factors.
Given a fixed process number of 192 and a fixed number of 10 2-D coupling fields, Fig. 12 shows the execution time of one data transfer under the different implementations when varying the MPI message number of each process from 1 to 96. The P2P implementation outperforms the butterfly implementation when the MPI message number is small (smaller than 12 in Fig. 12), while the butterfly implementation outperforms the P2P implementation when the MPI message number is big (bigger than 12 in Fig. 12). Our adaptive data transfer library always keeps the best performance of the P2P and butterfly implementations. Moreover, it further improves on the butterfly implementation when the MPI message number is big, because some butterfly stages in the adaptive data transfer library are skipped and replaced by the P2P implementation. When the MPI message number per process is 96, the adaptive data transfer library achieves a 13.9-fold speedup over the P2P implementation.
Given different numbers of processes and different numbers of MPI messages per process, Fig. 13 shows the execution time of one data transfer under the different implementations when varying the number of 2-D coupling fields to be transferred. The results show that the execution time of each implementation increases with data size. When the MPI message number per process is small (Fig. 13a and b), the performance of the butterfly implementation is poorer than that of the P2P implementation, especially as the number of 2-D coupling fields gets bigger; however, the adaptive data transfer library achieves performance similar to the P2P implementation. When the MPI message number per process is big (Fig. 13c and d), both the butterfly implementation and the adaptive data transfer library significantly outperform the P2P implementation, and the adaptive data transfer library further outperforms the butterfly implementation. Given a fixed MPI message number per process of 24 and a fixed number of 10 2-D coupling fields, Fig. 14 shows the execution time of one data transfer under the different implementations when varying the number of cores. The results show that both the butterfly implementation and the adaptive data transfer library achieve better parallel scalability than the P2P implementation: the execution time of the P2P implementation slightly increases with the number of cores used, while the execution times of the butterfly implementation and the adaptive data transfer library slightly decrease. The butterfly implementation outperforms the P2P implementation, and the adaptive data transfer library performs better still.

Performance of data transfer between realistic models
The previous evaluation with toy models reveals that the adaptive data transfer library achieves the best performance among the different implementations. In this subsection, we evaluate the performance with two realistic models: GAMIL2-CLM3 (horizontal resolution of 2.8° × 2.8°) and CESM (resolution 1.9 × 2.5_gx1v6). For CESM, we use the data transfer between the coupler CPL7 (Craig et al., 2012) and the land surface model CLM4 (Oleson et al., 2004), where 32 2-D coupling fields on the CLM4 horizontal grid (grid size 144 × 96 = 13 824) are transferred. Figure 15 shows the performance of one data transfer under the different implementations when increasing the process number of both CPL7 and CLM4 from 6 to 192. When the process number is small (smaller than 24 in Fig. 15), the butterfly implementation performs much worse than the P2P implementation, and the adaptive data transfer library achieves performance similar to the P2P implementation. However, when the process number gets bigger (larger than 24 in Fig. 15), the adaptive data transfer library dramatically outperforms the P2P implementation, with increasing speedup, and also outperforms the butterfly implementation. When each component uses 192 cores, the adaptive data transfer library is 4.01 times faster than the P2P implementation.

For GAMIL2-CLM3, we use the data transfer from CLM3 to GAMIL2, where 14 2-D coupling fields on the GAMIL2 horizontal grid (grid size 128 × 60 = 7680) are transferred. Figure 16 shows the execution time of one data transfer under each implementation when increasing the process number of both GAMIL2 and CLM3 from 6 to 192. The results in Fig. 16 confirm that the adaptive data transfer library consistently keeps the best performance among the different implementations. Compared to the P2P implementation, the adaptive data transfer library achieves an 11.68-fold speedup when the process number is 96, but a much lower speedup (3.48-fold) when the process number is 192. This is because the average MPI message number per process drops from 32 to 18 when the number of processes increases from 96 to 192.

Performance of data rearrangement for interpolation
For model coupling, besides the data transfer between different component models, there is another kind of data transfer that rearranges data inside a model for the parallel interpolation of fields between different grids. Here, we use the data rearrangement for the parallel interpolation from the atmosphere grid (grid size 144 × 96 = 13 824) to the ocean grid (grid size 320 × 384 = 122 880) in the coupled model CESM for further evaluation. The results show that the butterfly implementation performs much worse than the P2P implementation (Fig. 17), because the MPI message number is very small for this data rearrangement (for example, the average MPI message number per process is only 6.49 when each model uses 96 cores). As a result, the adaptive data transfer library achieves almost the same performance as the P2P implementation.

Conclusion
Data transfer is a fundamental and the most frequently used operation in a coupler. This paper demonstrated that the current implementation of data transfer in most state-of-the-art couplers (named the P2P implementation in this paper) is not efficient. To improve the parallel performance of data transfer, we proposed a butterfly implementation. However, the butterfly implementation has both advantages and disadvantages compared with the P2P implementation, and the evaluation results showed that it does not always outperform the P2P implementation. To achieve better parallel performance of data transfer in all cases, we built an adaptive data transfer library that combines the advantages of the butterfly and P2P implementations. The evaluation results demonstrated that the adaptive data transfer library always achieves the best performance compared with the butterfly and P2P implementations. In other words, the adaptive data transfer library can effectively improve the performance of data transfer in model coupling.

call data_transfer_register_field (instance_id, data_buf, input)
This API registers a coupling field, enabling a data transfer instance to access the memory space of this field. One data transfer instance can register multiple coupling fields. It takes as input the instance index instance_id, the memory space of the field data_buf, and the action of the field input (true stands for an input field; false stands for an output field).

call data_transfer_register_mask (instance_id, mask_array)
This API registers a mask array, enabling a data transfer instance to transfer different coupling fields at different coupling steps. It takes as input the instance index instance_id and the mask array mask_array.

call data_transfer_init_instance (instance_id)
This API initializes a data transfer instance. It takes as input the instance index instance_id.

call data_transfer_exec_instance (instance_id)
This API executes a data transfer instance. It takes as input the instance index instance_id.

call data_transfer_final_instance (instance_id)
This API finalizes a data transfer instance. It takes as input the instance index instance_id.

Figure 7. An example of process mapping, given that the sender has 5 processes (S0-S4), the receiver has 10 processes (R0-R9) (there is no common process between the sender and receiver), and the butterfly kernel contains 8 processes (B0-B7). Panels (a) and (b) show how to iteratively pair the processes of the sender and receiver, respectively. There are multiple stages in the iterative pairing; in each stage, the processes in the same color are grouped into one pair. Panel (c) shows how to map the reordered processes of the sender and receiver to processes of the butterfly kernel. All 5 processes of the sender are used for the butterfly kernel. Each process of the sender is mapped to a process of the butterfly kernel, while every two processes of the receiver are mapped to one process of the butterfly kernel.

Figure 1. Average execution time of the P2P implementation when transferring 14 2-D fields from CLM3 to GAMIL2. In each test, the atmosphere model GAMIL2 and the land surface model CLM3 use the same number of cores and do not share the same computing node. The horizontal grid of the 14 2-D fields contains 7680 (128 × 60) grid points.

Figure 2. Variation of the bandwidth (y axis) of an MPI P2P communication with increasing message size. The two processes of the P2P communication run on two different computing nodes.

Figure 8. An example of the adaptive data transfer library given 8 processes, where Stage 1 of the butterfly implementation is skipped and replaced by a P2P implementation with 3 MPI messages per process.

Table 1. The application programming interfaces (APIs) of the adaptive data transfer library. Couplers or component models can improve the performance of data transfer by calling these APIs.