A new adaptive data transfer library for model coupling

Data transfer means transferring data fields from a sender to a receiver. It is a fundamental and frequently used operation of a coupler. Most state-of-the-art couplers currently use an implementation based on the point-to-point (P2P) communication of the message passing interface (MPI) (referred to as the "P2P implementation" hereafter). In this paper, we reveal the drawbacks of the P2P implementation when the parallel decompositions of the sender and the receiver are different, including low communication bandwidth due to small message sizes, a variable and high number of MPI messages, and network contention. To overcome these drawbacks, we propose a butterfly implementation for data transfer. Although the butterfly implementation outperforms the P2P implementation in many cases, it degrades the performance when the sender and the receiver have similar parallel decompositions or when the number of processes used for running models is small. To ensure data transfer with optimal performance, we therefore design and implement an adaptive data transfer library that combines the advantages of both the butterfly implementation and the P2P implementation. As the adaptive data transfer library automatically uses the better implementation for each data transfer, it outperforms the P2P implementation in many cases while not decreasing the performance in any case. The adaptive data transfer library is now publicly available and has been imported into the C-Coupler1 coupler to improve its data transfer performance. We believe that other couplers can also benefit from it.


Introduction
Climate system models (CSMs) and Earth system models (ESMs) are fundamental tools for simulating, predicting, and projecting climate. A CSM or an ESM generally integrates several component models, such as an atmosphere model, a land surface model, an ocean model, and a sea-ice model, into a coupled system that simulates the behaviour of the climate system, including the interactions between its components. The number of coupled models worldwide keeps growing; for example, the number of coupled model configurations in the Coupled Model Intercomparison Project (CMIP) has increased from fewer than 30 (used for CMIP3) to more than 50 (used for CMIP5).
High-performance computing is an essential technical support for model development, especially as model resolutions keep increasing. Modern high-performance computers integrate an increasing number of processor cores to provide ever higher computational performance. Efficient parallelization, which enables a model to utilize more processor cores for acceleration, has therefore become a technical focus in model development, and a number of component models with efficient parallelization have emerged. For example, the Community Ice CodE (CICE; Hunke and Lipscomb, 2008; Humpe et al., 2013) at 0.1° horizontal resolution can scale to 30 000 processor cores on the IBM Blue Gene/L (Dennis and Tufo, 2008); the Parallel Ocean Program (POP; Kerbyson and Jones, 2005; Smith et al., 2010) at 0.1° horizontal resolution can also scale to 30 000 processor cores on the IBM Blue Gene/L and 10 000 processor cores on a Cray XT3 (Dennis, 2007); and the Community Atmosphere Model (CAM; Morrison and Gettelman, 2008; Neale et al., 2010, 2012) with a spectral element dynamical core (CAM-SE) at 0.25° horizontal resolution can scale to 86 000 processor cores on a Cray XT5 (Dennis et al., 2012).
A coupler generally has a much smaller overhead than the component models in current coupled systems. However, it could become a time-consuming component in future coupled models, because more and more component models (such as land-ice, chemistry and biogeochemistry models) will be coupled into a coupled model, and the coupling frequency between component models will keep increasing. Data transfer is a fundamental and frequently used operation in a coupler. It is responsible for transferring data fields between the processes of two component models and for rearranging data fields among processes of the same component model for parallel data interpolation.
A coupler may thus become a bottleneck for the efficient parallelization of future coupled models. The most obvious reason is that the current implementation of data transfer in state-of-the-art couplers may not be efficient enough. For example, due to the low efficiency of data transfer, the coupling from a component model with a horizontal grid of 576 × 384 grid points to another component model with a different horizontal grid of 3600 × 2400 grid points can only scale to about 500 processor cores when using the CPL7 coupler (Craig et al., 2012). It is therefore highly desirable to improve the parallel data transfer of couplers.
In this study, we first propose a butterfly implementation of data transfer. Since the point-to-point (P2P) communication of the message passing interface (MPI) (referred to as the "P2P implementation" hereafter) and the butterfly implementation can outperform each other in different cases (Sect. 5), we then develop an adaptive data transfer library that includes both implementations and can adaptively apply the better one for a given data transfer. Performance evaluation demonstrates that such a library significantly outperforms the P2P implementation in most cases and does not degrade the performance in any case. The library has been imported into C-Coupler1 with only slight code modifications. We believe that other couplers can also benefit from it. The remainder of this paper is organized as follows. We briefly introduce the implementation of data transfer in existing couplers in Sect. 2. Details of the butterfly implementation and the adaptive data transfer library are presented in Sects. 3 and 4, respectively. The performance of the data transfer implementations is evaluated in Sect. 5. Conclusions are given in Sect. 6.

P2P implementation
Almost all state-of-the-art couplers use a similar implementation for data transfer. To achieve parallel data transfer, MCT first generates a communication router (i.e. the data mapping between processes) according to the parallel decompositions (the distribution of grid points among the processes) of the sender and the receiver, and then uses the P2P communication of the MPI to transfer the data. A data field is transferred from a process of the sender to a process of the receiver only when the two processes have common grid points, hence "P2P implementation" for short.
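To make this concrete, the following sketch builds a communication router from two parallel decompositions. The function name and data layout are illustrative assumptions, not MCT's actual API or data structures.

```python
# Minimal sketch (hypothetical, not MCT's code) of a communication router: the set of
# grid points each sender process must send to each receiver process.  decomp_* maps a
# process rank to the set of global grid-point indices owned by that process.

def build_router(decomp_send, decomp_recv):
    """Return {(src_rank, dst_rank): sorted shared grid points}; one P2P message is
    posted for every non-empty overlap."""
    router = {}
    for src, src_points in decomp_send.items():
        for dst, dst_points in decomp_recv.items():
            common = src_points & dst_points
            if common:
                router[(src, dst)] = sorted(common)
    return router

# Toy example: 8 grid points, 2 sender processes (block split) and 4 receiver
# processes (round-robin split); every sender overlaps every receiver here.
decomp_send = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}
decomp_recv = {r: {p for p in range(8) if p % 4 == r} for r in range(4)}
print(build_router(decomp_send, decomp_recv))
```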
Since MCT has already been imported into OASIS3-MCT, the CPL6 coupler, and the CPL7 coupler, these couplers also use the P2P implementation for data transfer. Although other couplers, such as ESMF, OASIS4, the FMS coupler, and C-Coupler1, do not directly import MCT, they also use the P2P implementation for data transfer.

Performance bottlenecks of the P2P implementation
In this work, we first investigate the performance characteristics of the P2P implementation. To this end, we derive a benchmark from a real coupled model, GAMIL2 (Grid-Point Atmospheric Model of IAP LASG, version 2)-CLM3 (Community Land Model version 3), which consists of the atmosphere model GAMIL2 (Li et al., 2013) and the land surface model CLM3 (Oleson et al., 2004; Dickinson et al., 2006). GAMIL2 and CLM3 share the same horizontal grid of 7680 (128 × 60) grid points but have different parallel decompositions: GAMIL2 uses a regular two-dimensional (2-D) parallel decomposition, while CLM3 uses an irregular 2-D parallel decomposition in which the grid points are assigned to the processes in a round-robin fashion. The benchmark performs only the data transfer with the P2P implementation between a sender and a receiver that share the same horizontal grid as GAMIL2-CLM3; the parallel decomposition of the sender is derived from CLM3, and that of the receiver is derived from GAMIL2. A high-performance computer called Tansuo100 at Tsinghua University, China, is used for the performance tests. It has 700 computing nodes, each of which contains two six-core Intel Xeon X5670 CPUs and 32 GB of main memory. All computing nodes are connected by a high-speed InfiniBand network with a peak communication bandwidth of 5 GB s⁻¹.
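To illustrate why such mismatched decompositions generate many P2P messages, the following sketch builds two simplified stand-in decompositions (a round-robin split for the CLM3-like sender and a latitude-block split standing in for the regular 2-D GAMIL2-like receiver; these shapes are assumptions, not the models' exact layouts) and counts how many receiver processes each sender process overlaps.

```python
# Rough reconstruction (assumed layouts) of the benchmark's decompositions on the
# shared 128 x 60 grid, counting how many P2P messages each sender process would post.

NLON, NLAT, NPROCS = 128, 60, 48          # same process count on both sides per test

def round_robin(nprocs):
    # CLM3-like: grid points dealt out to processes in a round-robin fashion
    return [{p for p in range(NLON * NLAT) if p % nprocs == r} for r in range(nprocs)]

def latitude_blocks(nprocs):
    # GAMIL2-like stand-in: whole latitude rows assigned in contiguous blocks
    decomp = [set() for _ in range(nprocs)]
    for j in range(NLAT):
        owner = min(j * nprocs // NLAT, nprocs - 1)
        decomp[owner].update(j * NLON + i for i in range(NLON))
    return decomp

sender, receiver = round_robin(NPROCS), latitude_blocks(NPROCS)
msgs = [sum(1 for dst in receiver if src & dst) for src in sender]
print(min(msgs), sum(msgs) / len(msgs), max(msgs))  # every sender overlaps nearly all receivers
```

Because a round-robin sender process owns points scattered over every latitude row, it overlaps essentially every receiver process, so each sender posts many small messages.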
To evaluate the parallel performance of the P2P implementation, 14 2-D coupling fields are transferred between the sender and the receiver. In each test, the sender and the receiver use the same number of processes. Since there are 12 processor cores on each computing node, the number of processes is set to an integral multiple of 12. The sender and the receiver are placed on different computing nodes, so the communication of the P2P implementation must go through the InfiniBand network.
Figure 1 demonstrates the poor parallel scalability of the P2P implementation when the parallel decompositions of the sender and the receiver are different. It is well known that communication performance depends heavily on message size. As shown in Fig. 2, the achieved P2P communication bandwidth generally increases with message size, so when the message size is small (for example, smaller than 4 KB), the achieved communication bandwidth is very low. The message size in the P2P implementation decreases when the number of model processes increases (Fig. 3), indicating that the communication bandwidth becomes lower as more processes are used. The performance of data transfer also depends heavily on the number of MPI messages. As shown in Fig. 4, the variation of the average number of MPI messages in the P2P implementation is consistent with the variation of the execution time in Fig. 1: both increase with the number of processes from 6 to 48, and decrease with the number of processes from 96 to 192. A lower execution time of the P2P implementation would be obtained if more processes were used, since the average number of MPI messages would decrease further (the maximum number of processes in both Figs. 1 and 4 is limited to 192 because GAMIL2-CLM3 is not further accelerated when using more processes).
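The dependence of achieved bandwidth on message size in Fig. 2 can be illustrated with a standard latency-bandwidth (alpha-beta) communication model; this is only an illustration and is not part of the paper's library (indeed, Sect. 4 argues against relying on such cost models for performance prediction):

$$t(s) = \alpha + \frac{s}{\beta}, \qquad B_\mathrm{eff}(s) = \frac{s}{t(s)} = \frac{\beta}{1 + \alpha\beta/s},$$

where $s$ is the message size, $\alpha$ the per-message latency, and $\beta$ the asymptotic link bandwidth. The effective bandwidth approaches $\beta$ only when $s \gg \alpha\beta$; small messages are latency-dominated, which is consistent with the low bandwidth observed below about 4 KB.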
To further reveal possible reasons for the poor parallel scalability, we compare the ideal performance and the actual performance in Fig. 5. The ideal performance is much better than the actual performance, and the ratio between the two increases significantly with the number of processes. This significant gap is due to network contention: for example, when multiple P2P communications share the same sender process or receiver process, they must wait for one another.

Butterfly implementation for better performance of data transfer
The drawbacks of the P2P implementation when the sender and the receiver use different parallel decompositions can thus be identified as low communication bandwidth due to small message sizes, a variable and high number of MPI messages, and network contention. To overcome these drawbacks, a prospective solution is to organize the data transfer with a better algorithm, e.g. the butterfly algorithm (Fig. 6), which has already been studied in computer science (Chong and Brewer, 1994; Foster, 1995; Heckbert, 1995; Hemmert and Underwood, 2005; Kim et al., 2007; Jan et al., 2013; Petagon and Werapun, 2016). With respect to hardware, the traditional butterfly algorithm and its transformations have been used to design networks (Chong and Brewer, 1994; Kim et al., 2007); with respect to software, the butterfly algorithm has been used to improve parallel algorithms with all-to-all communications (Foster, 1995), e.g. the fast Fourier transform (FFT; Heckbert, 1995; Hemmert and Underwood, 2005), matrix transposition (Petagon and Werapun, 2016), and sorting (Jan et al., 2013).
Unfortunately, the classical butterfly algorithm cannot be used as is to improve data transfer, because it requires that each process communicate with every other process, that the communication load among processes be balanced, and that the number of processes be a power of 2. In practice, data transfer for model coupling has different characteristics: one process needs to communicate with only a subset of the other processes, the communication load among processes is generally unbalanced, and the number of processes cannot be restricted to a power of 2. We therefore propose a new implementation of data transfer that introduces an additional butterfly kernel to transfer data from the sender with the source parallel decomposition to the receiver with the target parallel decomposition. As the number of processes of the butterfly kernel must be a power of 2, while the numbers of processes of the sender and the receiver are not necessarily so, the butterfly kernel has its own source and target parallel decompositions, and process mappings are required from the sender onto the butterfly kernel and from the butterfly kernel onto the receiver (see Fig. 7). Next, we present the butterfly kernel and the process mappings.

Butterfly kernel
The first question for the butterfly kernel is how to decide its number of processes. Any process of the sender or receiver can be used as a process of the butterfly kernel. Given that the total number of unique processes of the sender and receiver is N_T, the number of processes of the butterfly kernel (N_B) can be any power of 2 that is no larger than N_T. We propose to select the maximum such number in order to maximize the utilization of resources, and we prefer to pick unique processes first from the sender, and then from the receiver if the sender does not have enough processes. The butterfly kernel is responsible for rearranging the distribution of data among the processes from the source parallel decomposition to the target parallel decomposition. Given the number of processes N = 2^n, there are n stages in the butterfly kernel. In each stage, all processes are divided into a number of pairs, and the two processes of a pair use MPI P2P communication to exchange data. After each stage, the number of butterfly kernel processes that may hold data that will finally belong to any one process on the target parallel decomposition is halved. Figure 6 gives an example for further illustration, where D_ij means the data are originally in process P_i according to the source parallel decomposition and finally in process P_j according to the target parallel decomposition. Before the first stage, all processes (P0-P7) may have the data of P0 on the target parallel decomposition. After the first stage, only four processes (P0, P2, P4, and P6) may have that data; after the second stage, only two processes (P0 and P4) may have it.

Figure 6. An example of the butterfly kernel with eight processes. Each coloured row stands for one process (P0-P7). There are multiple stages in the butterfly kernel (each column of arrows represents a stage, from stage 1 to stage 3). Each arrow stands for an MPI P2P communication from one process to another. D_ij means the data are originally in process P_i according to the source parallel decomposition and finally in process P_j according to the target parallel decomposition.

Figure 7. The butterfly implementation, which is composed of three parts: the butterfly kernel, the process mapping from the sender to the butterfly kernel, and the process mapping from the butterfly kernel to the receiver.
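The stage structure described above can be sketched as follows. This is a sequential simulation written purely for illustration (the real kernel issues one MPI P2P message per process per stage); all names are hypothetical.

```python
# Sequential simulation of a butterfly kernel with N = 2**n processes.  buffers[p]
# maps a target rank on the target decomposition to the list of data items currently
# held by process p for that target.

def butterfly_kernel(buffers):
    n_procs = len(buffers)                        # must be a power of 2
    n_stages = n_procs.bit_length() - 1           # log2(N) stages
    for stage in range(n_stages):                 # stage k exchanges along bit k
        bit = 1 << stage
        new_buffers = [{t: list(v) for t, v in b.items()} for b in buffers]
        for p in range(n_procs):
            partner = p ^ bit                     # the pair partner at this stage
            for target, data in buffers[p].items():
                if (target & bit) != (p & bit):   # data belongs in the partner's half
                    new_buffers[partner].setdefault(target, []).extend(data)
                    del new_buffers[p][target]
        buffers = new_buffers
    return buffers

# 8 processes; initially every process p holds one item D_pt for every target rank t.
start = [{t: [f"D{p}{t}"] for t in range(8)} for p in range(8)]
end = butterfly_kernel(start)
assert all(set(end[p]) == {p} for p in range(8))  # all data reached its target process
```

After stage 1 (bit 0) the data destined for P0 survives only on the even-ranked processes, after stage 2 only on P0 and P4, and after stage 3 only on P0, matching the example of Fig. 6.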
To reveal the advantages and disadvantages of the two implementations, we measure their characteristics based on the benchmark introduced in Sect. 2.2. The results show that the total amount of data transferred by the butterfly implementation is larger than that transferred by the P2P implementation (Fig. 8), which is the major disadvantage of the butterfly implementation. Meanwhile, compared with the P2P implementation, the butterfly implementation has the following advantages:
1. a bigger message size, which yields better communication bandwidth (Fig. 9);
2. a balanced and smaller number of MPI messages among processes (Fig. 10);
3. ordered communications among processes and fewer communications operating concurrently (Fig. 10), which can dramatically reduce network contention.

Process mapping
In this subsection, we introduce the process mappings from the sender to the butterfly kernel and from the butterfly kernel to the receiver. To minimize the overhead of the process mapping from the butterfly kernel to the receiver, we map one or multiple processes of the butterfly kernel onto a process of the receiver if the butterfly kernel has more processes than the receiver; otherwise, we map a process of the butterfly kernel onto one or multiple processes of the receiver. In other words, there is no multiple-to-multiple process mapping between the butterfly kernel and the receiver. Similarly, there is no multiple-to-multiple process mapping between the sender and the butterfly kernel.
The processes of the sender or the receiver may be unbalanced in terms of the data size they hold, which can result in unbalanced communications among the processes of the butterfly kernel. As mentioned in Sect. 3.1, at each stage of the butterfly kernel, all processes are divided into a number of pairs, each of which performs a P2P communication. To improve the balance of communications among the processes in the butterfly kernel, one solution is to make the process pairs at each stage more balanced in terms of the data size of the P2P communications, so we propose to reorder the processes of the sender or the receiver according to data size. At the first stage, we repeatedly pick the process with the largest data size and the process with the smallest data size from the processes that have not yet been paired, and combine them into a process group. At the next stage, the groups produced by the previous stage are paired into bigger groups in a similar way. After the iterative pairing has finished across all stages, all processes of the sender or the receiver have been reordered.
The iterative pairing also requires the number of processes to be a power of 2. Given that the number of processes of the sender (or receiver) is N_C and the number of processes of the butterfly kernel is N_B, we first pad empty processes (whose data size is zero) before the iterative pairing so that the number of processes of the sender (or receiver) becomes a power of 2 (denoted N_P) that is no smaller than N_B. The reordered N_P processes after the iterative pairing can therefore be divided into N_B groups, each of which contains N_P/N_B processes with consecutive reordered indexes and maps onto a unique process of the butterfly kernel.
Figure 11 shows an example of the process mapping, where the sender has 5 processes (S0-S4 in Fig. 11a), the receiver has 10 processes (R0-R9 in Fig. 11b), and the butterfly kernel uses 8 processes (B0-B7 in Fig. 11c). First, empty processes are padded to the sender (S5-S7 in Fig. 11a) and the receiver (R10-R15 in Fig. 11b). Next, the iterative pairing is conducted for the sender and the receiver. The iterative pairing has three stages for the sender. At the first stage, the eight processes of the sender are divided into four groups {S1, S7}, {S0, S6}, {S2, S5}, and {S4, S3} (Fig. 11a), according to the data size corresponding to each process. These four process groups are combined into two bigger groups {{S4, S3}, {S2, S5}} and {{S1, S7}, {S0, S6}} at the second stage (Fig. 11a). Finally, one process group {{{S4, S3}, {S2, S5}}, {{S1, S7}, {S0, S6}}} is obtained at the third stage (Fig. 11a), and the eight processes of the sender are reordered as S4, S3, S2, S5, S1, S7, S0, and S6, each one being mapped onto one process of the butterfly kernel.

Figure 11. An example of the process mappings, given that the sender has 5 processes (S0-S4), the receiver has 10 processes (R0-R9) (there is no common process between the sender and the receiver), and the butterfly kernel contains 8 processes (B0-B7). Panels (a) and (b) show how the processes of the sender and the receiver, respectively, are iteratively paired over multiple stages; in each stage, the processes shown in the same colour are grouped into one process pair. Panel (c) shows how the reordered processes of the sender and the receiver are mapped onto the processes of the butterfly kernel.
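A possible sketch of the padding and iterative pairing described above is given below. The function names, the ordering inside a pair, and the rule of always pairing the largest remaining group with the smallest are assumptions for illustration rather than the library's exact rules.

```python
# Sketch of the iterative pairing used to reorder sender/receiver processes before
# mapping them onto the butterfly kernel.  sizes[i] is the data size of process i;
# empty processes (size 0) are padded so the count becomes a power of 2 >= n_butterfly.

def next_pow2(x):
    return 1 << max(x - 1, 0).bit_length()

def iterative_pairing(sizes, n_butterfly):
    n_padded = max(next_pow2(len(sizes)), n_butterfly)
    items = [([i], s) for i, s in enumerate(sizes)]
    items += [([len(sizes) + k], 0) for k in range(n_padded - len(sizes))]  # padding
    while len(items) > 1:
        items.sort(key=lambda group: group[1])     # sort groups by total data size
        paired = []
        while items:
            small = items.pop(0)                   # smallest remaining group
            large = items.pop(-1)                  # largest remaining group
            paired.append((large[0] + small[0], large[1] + small[1]))
        items = paired
    order = items[0][0]                            # reordered process indexes
    per_proc = len(order) // n_butterfly           # N_P / N_B consecutive processes
    return [order[i * per_proc:(i + 1) * per_proc] for i in range(n_butterfly)]

# Sender with 5 processes of unequal data sizes, butterfly kernel with 8 processes:
print(iterative_pairing([40, 90, 30, 60, 70], n_butterfly=8))
```

Pairing the heaviest group with the lightest at every stage keeps the total data size of the two members of each butterfly-kernel pair as even as possible, which is the balancing goal stated above.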

Adaptive data transfer library
We now have two implementations of data transfer: the P2P implementation and the butterfly implementation. Although the butterfly implementation can effectively improve the performance of data transfer in many cases (examples are given in Sect. 5), it has some drawbacks: (1) it generally transfers a larger total amount of data than the P2P implementation; (2) its number of stages is log2(N) (where N is the number of processes of the butterfly kernel) (Foster, 1995), which may be larger than the average number of MPI messages per process in the P2P implementation in some cases (for example, when the sender and the receiver use similar parallel decompositions). It is therefore possible for the P2P implementation to outperform the butterfly implementation in some cases. To achieve optimal performance for data transfer, we propose an adaptive data transfer library that takes advantage of both implementations in all cases.
As introduced in Sect. 3.1, the butterfly implementation is divided into multiple stages. The data transfer in one stage can actually be viewed as a P2P implementation with only one MPI message per process. Inspired by this fact, we design an adaptive approach that combines the butterfly and P2P implementations, in which some stages of the butterfly implementation are skipped and replaced by P2P communication with more MPI messages per process. When all stages of the butterfly implementation are skipped, the adaptive data transfer library reduces to the original P2P implementation. In other words, the adaptive data transfer library can adaptively choose the optimal implementation between the P2P implementation and the butterfly implementation. Figure 12 shows an example of the adaptive data transfer library with eight processes, where stage 2 of the butterfly implementation is skipped and replaced by P2P communication with three MPI messages per process.
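One way to read the stage-skipping idea is sketched below, under the assumption (ours, not stated explicitly in the text) that the data the skipped stages would have routed is instead delivered by direct P2P messages to the processes that still need it. Under this reading, the Fig. 12 example of three MPI messages per process is reproduced.

```python
# Illustrative sketch (an assumed scheme, not the library's code): run the first
# `kept` butterfly stages normally, then deliver the remaining data with direct P2P
# messages.  With N = 2**n processes, that replacement phase needs at most
# 2**(n - kept) - 1 messages per process.

def final_phase_partners(rank, n_procs, kept):
    """Partners a process must message once the remaining stages are replaced by
    direct P2P delivery after `kept` butterfly stages."""
    low_mask = (1 << kept) - 1                    # bits already routed by kept stages
    return [target for target in range(n_procs)
            if target != rank and (target & low_mask) == (rank & low_mask)]

print(final_phase_partners(0, 8, kept=1))         # [2, 4, 6] -> 3 messages per process
```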
The most significant challenge of such an adaptive approach is to determine which stage(s) of the butterfly implementation should be skipped. Our first attempt was to design a cost model that could accurately predict the performance of data transfer for the various implementations. We eventually gave up this approach because it is almost impossible to accurately predict communication performance on a high-performance computer, especially when many users share the computer to run various applications. Performance profiling, i.e. directly measuring the performance of data transfer, is a more practical way to determine an appropriate implementation, because an Earth system simulation generally runs for a long time. Figure 13 shows the flow chart of how the adaptive data transfer library determines an appropriate implementation. It consists of an initialization segment and a profiling segment. The initialization segment generates the process mappings and a candidate implementation, which is a butterfly implementation with no skipped stages. The profiling segment iterates through each stage of the butterfly implementation to determine whether the current stage should be skipped or kept. In an iteration, the profiling segment first generates a temporary implementation based on the candidate implementation with the current stage skipped, and then runs the temporary implementation to measure the time the data transfer takes. If the temporary implementation is more efficient than the candidate implementation, the current stage is skipped and the temporary implementation replaces the candidate implementation. When the profiling segment finishes, the appropriate implementation is set to the candidate implementation. To reduce the overhead introduced by the adaptive data transfer library, the profiling segment truly transfers the data for model coupling; in other words, the data is already being transferred by the profiling segment before the optimal implementation has been determined.
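The profiling segment of Fig. 13 can be summarised by the following sketch. The function run_transfer and the toy timing stand-in are hypothetical; the real library times genuine coupling transfers rather than sleeps.

```python
# Sketch (assumed control flow following Fig. 13, not the library's code) of the
# profiling segment: each candidate transfer is timed on real coupling data, and a
# stage is skipped only if doing so makes the measured transfer faster.
import time

def timed(run_transfer, skipped):
    start = time.perf_counter()
    run_transfer(skipped)                        # this call still moves real data
    return time.perf_counter() - start

def choose_implementation(run_transfer, n_stages):
    skipped = set()                              # candidate: butterfly, nothing skipped
    best_time = timed(run_transfer, skipped)
    for stage in range(n_stages):
        trial = skipped | {stage}                # temporary implementation
        if timed(run_transfer, trial) < best_time:
            skipped = trial
            best_time = timed(run_transfer, trial)
    return skipped                               # skipping every stage == pure P2P

# Toy stand-in: pretend each kept stage costs 1 ms and each skipped stage 0.4 ms.
demo = lambda skipped: time.sleep(0.001 * (3 - len(skipped)) + 0.0004 * len(skipped))
print(choose_implementation(demo, n_stages=3))
```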

Performance evaluation
In this section, we empirically evaluate the adaptive data transfer library by comparing it with the P2P implementation and the butterfly implementation. Both toy models and realistic models (GAMIL2-CLM3 and CESM, the Community Earth System Model) are used for the performance evaluation. GAMIL2-CLM3 was introduced in Sect. 2.2. CESM (Hurrell et al., 2013) is a state-of-the-art ESM developed by the National Center for Atmospheric Research (NCAR). All experiments are run on the high-performance computer Tansuo100.
Next, we will evaluate the overhead of initialization, the performance of transferring data fields between two toy models and between different realistic component models, and the performance of rearranging data fields within a component model for parallel interpolation.

Overhead of initialization
We first evaluate the initialization overhead of the data transfer implementations. As shown in Fig. 14, the initialization overhead of each implementation increases when increasing the number of processes. The initialization overhead of the butterfly implementation is slightly higher than that of the P2P implementation, while the initialization overhead of the adaptive data transfer library is 2-3-fold higher than that of the P2P implementation, because the adaptive data transfer library spends extra time on the performance profiling (see Sect. 4). Considering that one data transfer instance only needs to be initialized once at the beginning but is executed many times in a coupled model, we regard the initialization overhead of the adaptive data transfer library as reasonable, especially when the simulation is executed for a very long time.

Performance of data transfer between toy models
The factors that can impact the performance of a data transfer implementation generally include the number of MPI messages, the size of the data to be transferred (also referred to as the number of fields in this evaluation) and the number of processes used. In this subsection, we evaluate the impact of each factor on the performance of data transfer for the different implementations. We first build two toy models that both use the same logically rectangular grid of 192 × 480 grid points. Coupling fields are transferred between the two toy models, and in any test the two toy models use the same number of processes. We then evaluate the performance of data transfer by varying one factor while fixing the other two. In the first experiment, we fix the number of processes at 1024 and the number of coupling fields at 10, while varying the number of MPI messages in the P2P implementation. In each test, all processes of the sender have the same number of MPI messages. As the number of MPI messages is determined by the parallel decompositions of the sender and the receiver, we design an algorithm (Algorithm 1) that generates the parallel decompositions of the two toy models according to the average number of MPI messages of the sender in the P2P implementation.

Figure 15 shows the execution time of one data transfer with the different implementations when increasing the number of MPI messages per sender process in the P2P implementation from 1 to 90. The P2P implementation can outperform the butterfly implementation when the number of MPI messages is small (e.g. smaller than 12 in Fig. 15), while the butterfly implementation can outperform the P2P implementation when the number of MPI messages is big (e.g. bigger than 12 in Fig. 15). The adaptive data transfer library can adaptively choose the optimal implementation from the P2P implementation and the butterfly implementation and, moreover, it improves the performance beyond the butterfly implementation when the number of MPI messages is big, since some stages of the butterfly implementation are skipped. When the number of MPI messages is 90, the adaptive data transfer library achieves a 19.2-fold performance speed-up compared to the P2P implementation.
In the second experiment, we fix the number of processes and the number of MPI messages per sender process in the P2P implementation, and vary the number of coupling fields transferred. Figure 16 shows the execution time of one data transfer with the different implementations in this experiment. The results show that the execution time of each implementation increases with the data size. When the number of MPI messages per sender process in the P2P implementation is small (Fig. 16a, b), the performance of the butterfly implementation is poorer than that of the P2P implementation, especially when the number of 2-D coupling fields gets bigger. When the number of MPI messages per sender process in the P2P implementation is big (Fig. 16c, d), the butterfly implementation significantly outperforms the P2P implementation; however, the advantage of the butterfly implementation decreases when the number of coupling fields increases. The results also demonstrate that the adaptive data transfer library can adaptively choose the optimal implementation from the P2P implementation and the butterfly implementation, and can further improve the performance beyond the butterfly implementation.

Figure 16. Average execution time (y axis) of one data transfer between two toy models with the same rectangular grid (of 192 × 480 grid points) when varying the number of coupling fields transferred (x axis). There are four simulation tests for the evaluation. In simulation (a), each toy model is run with 256 processes, and the number of MPI messages per sender process in the P2P implementation is 12. In simulation (b), each toy model is run with 1024 processes, and the number of MPI messages per sender process in the P2P implementation is 12. In simulation (c), each toy model is run with 256 processes, and the number of MPI messages per sender process in the P2P implementation is 48. In simulation (d), each toy model is run with 1024 processes, and the number of MPI messages per sender process in the P2P implementation is 48.
In the third experiment, we fix the number of MPI messages per sender process in the P2P implementation at 24 and the number of coupling fields transferred at 10, and vary the number of processes used. Figure 17 shows the execution time of one data transfer with the different implementations when varying the number of processes. The P2P implementation outperforms the butterfly implementation when a small number of processes is used (e.g. fewer than 256 in Fig. 17), while the butterfly implementation outperforms the P2P implementation when a large number of processes is used (e.g. more than 256 in Fig. 17). As in the two previous experiments, the adaptive data transfer library adaptively chooses the optimal implementation from the P2P implementation and the butterfly implementation.
Model resolutions keep increasing. How do the data transfer implementations perform when the model resolution becomes higher? A higher model resolution means that a model will use more processes to accelerate a simulation, while the average number of grid points per process can remain constant. Considering that the numbers of grid points are always balanced among the processes of a model, we make each process (which runs on a unique processor core) of the toy models hold around 96 grid points in this evaluation, while allowing the processes to have different numbers of MPI messages and different message sizes (the average number of MPI messages of the sender in the P2P implementation is 34). As shown in Fig. 18, although the execution times of all data transfer implementations increase with the number of processes (from 64 to 1024), the butterfly implementation significantly outperforms the P2P implementation. The adaptive data transfer library therefore adaptively chooses the butterfly implementation, and even slightly outperforms it when each model uses more than 512 processes, because some butterfly stages are skipped.

Performance of data transfer between realistic models
In this subsection, we evaluate the performance using two realistic models: GAMIL2-CLM3 (with a horizontal resolution of about 2.8°) and CESM. For CESM, we use the data transfer between the coupler CPL7 (Craig et al., 2012) and the land surface model CLM4 (Oleson et al., 2004), where 32 2-D coupling fields on the CLM4 horizontal grid (the grid size is 144 × 96 = 13 824) are transferred. Figure 19 shows the performance of one data transfer with the different implementations when increasing the number of processes of both CPL7 and CLM4 from 6 to 192. When the number of processes is small (e.g. smaller than 24 in Fig. 19), the butterfly implementation is much poorer than the P2P implementation; in this case, the adaptive data transfer library chooses the P2P implementation as the optimal implementation. However, when the number of processes gets bigger (e.g. larger than 24 in Fig. 19), the butterfly implementation outperforms the P2P implementation; in this case, the adaptive data transfer library, based on the butterfly implementation, skips some stages and thereby outperforms the butterfly implementation. Figure 19 also shows that the butterfly implementation and the adaptive data transfer library seem to converge when increasing the number of processes per model. When each model uses 192 processes, the adaptive data transfer library is 4.01 times faster than the P2P implementation.
For GAMIL2-CLM3, we use the data transfer from CLM3 to GAMIL2, where 14 2-D coupling fields on the GAMIL2 horizontal grid (whose grid size is 128 × 60 = 7680) are transferred. Figure 20 shows the execution time of one data transfer with each implementation when increasing the number of processes of both GAMIL2 and CLM3 from 6 to 192. The results in Fig. 20 confirm that the adaptive data transfer library can adaptively choose the optimal implementation from the P2P implementation and the butterfly implementation. Compared to the P2P implementation, the adaptive data transfer library achieves an 11.68-fold performance speed-up when the number of processes is 96, but a much lower speed-up (only 3.48-fold) when the number of processes is 192. This is because the average number of MPI messages per process in the P2P implementation decreases from 32 to 18 when the number of processes increases from 96 to 192.

Performance of data rearrangement for interpolation
Besides the data transfer between different component models, there is another kind of data transfer in model coupling, which rearranges data inside a model for the parallel interpolation of fields between different grids. Here, we use the data rearrangement for the parallel interpolation from the atmosphere grid (whose grid size is 144 × 96 = 13 824) to the ocean grid (whose grid size is 320 × 384 = 122 880) in the coupled model CESM for further evaluation. As shown in Fig. 21, the P2P implementation significantly outperforms the butterfly implementation. This is because the parallel decompositions before and after the data rearrangement are generally similar, which leads to a small number of MPI messages. For example, the average number of MPI messages in the P2P implementation corresponding to Fig. 21 is only 6.49 when the model uses 96 processes. In this case, the P2P implementation is chosen as the optimal implementation by the adaptive data transfer library, so the library provides no real benefit compared to the P2P implementation.

Performance improvement for a coupled model
With the performance improvement of data transfer, we expect the adaptive data transfer library to improve the performance of coupled models. For this evaluation, we imported the adaptive data transfer library into C-Coupler1, used it in the coupled model GAMIL2-CLM3, and measured the performance. As shown in Fig. 22, the adaptive data transfer library achieves a higher speed-up with respect to the whole model time (with the P2P implementation as the baseline) for GAMIL2-CLM3 when using more than 16 processes. When each component model uses 128 processes, the butterfly implementation achieves a ~4.6 % performance improvement, and the adaptive data transfer library achieves a ~6.9 % performance improvement. The adaptive data transfer library therefore improves the performance of data transfer, and in turn the performance of the whole coupled model.

Conclusions
Data transfer is a fundamental and frequently used operation in a coupler. This paper showed that the P2P implementation currently used in most state-of-the-art couplers for data transfer is inefficient when the parallel decompositions of the sender and the receiver are different, and revealed the corresponding performance bottlenecks. We showed that the butterfly implementation can outperform the P2P implementation in many cases but degrades the performance in some cases, for example when a small number of processes is used to run the models or when the parallel decompositions of the sender and the receiver are similar. We therefore designed and implemented an adaptive data transfer library that automatically chooses the optimal implementation between the P2P implementation and the butterfly implementation and further improves the performance of the butterfly implementation by skipping some butterfly stages. Compared to the P2P implementation, the adaptive data transfer library can improve the performance of data transfer when the parallel decompositions of the sender and the receiver are different.
The initialization overhead of the adaptive data transfer library can become expensive when a large number of processes is used. In a future version, the adaptive data transfer library will allow users to record the results of the performance profiling offline, in order to save the time spent on performance profiling in the next run of the same coupled model.

Figure 1. Average execution time of the P2P implementation when transferring 14 2-D fields from CLM3 to GAMIL2. In each test, the atmosphere model GAMIL2 and the land surface model CLM3 have the same number of processes; they do not share the same computing nodes. The horizontal grid of the 14 2-D fields contains 7680 (128 × 60) grid points.

Figure 2. Variation of the bandwidth (y axis) of an MPI P2P communication with respect to the message size (x axis). The results are generated from our benchmark, in which one process sends messages of different sizes to another process. The two processes of the P2P communication run on two different computing nodes of Tansuo100.

Figure 5. Ideal and actual bandwidths of the P2P implementation (y axis) in GAMIL2-CLM3 when gradually increasing the number of processes per model (x axis). The experimental set-up is similar to that shown in Fig. 1. The ideal bandwidth is calculated from the message size and the MPI bandwidth measured in Fig. 2; the actual bandwidth is calculated from Fig. 1.

Figure 8. Total amount of data transferred by the P2P implementation and the butterfly implementation (y axis) in GAMIL2-CLM3 when varying the number of processes per model (x axis). The experimental set-up is similar to that shown in Fig. 1.

Figure 12. An example of the adaptive data transfer library with eight processes, where stage 2 of the butterfly implementation is skipped and replaced by P2P communication of three MPI messages per process.

Figure 13. A flow chart for determining an appropriate implementation in the adaptive data transfer library.

Figure 14. Initialization time (y axis) of one data transfer between two toy models using a rectangular grid (of 192 × 96 grid points) when varying the number of processes per model (x axis). There are 10 2-D coupling fields transferred from the source toy model to the target toy model. In each test, all processes of the sender in the P2P implementation have the same number of MPI messages. If the number of processes per model is less than 24, the number of MPI messages per sender process in the P2P implementation is equal to the number of processes per model; otherwise, the number of MPI messages per sender process in the P2P implementation is 24. The parallel decompositions of the sender and the receiver for a given average number of MPI messages are generated by Algorithm 1.

Algorithm 1. Generating the parallel decompositions of the sender and the receiver according to an average number of MPI messages of the sender in the P2P implementation.
Input:
  Number of processes of the sender: M
  Number of processes of the receiver: N
  Number of points in the grid: Grid_pnts
  Average number of MPI messages per process of the sender in the P2P implementation: Avg_send_msgs (Avg_send_msgs ≤ N)
  The flag that specifies whether the numbers of MPI messages among processes are the same: Is_balanced
Output:
  Parallel decomposition of the sender
  Parallel decomposition of the receiver
1 Determine the parallel decomposition of the sender. Considering that the numbers of grid points are always balanced among the processes of a model, assign around Grid_pnts/M grid points to each process of the sender.
2 Determine the number of MPI messages of each process of the sender.
  2.1 If the flag Is_balanced is set to true, set the number of MPI messages of each process of the sender to be Avg_send_msgs;
  2.2 Otherwise, randomly determine the number of MPI messages of each process of the sender:
    2.2.1 Initialize the number of MPI messages of each process of the sender to be 1.
    2.2.2 Randomly select a process of the sender whose number of MPI messages does not exceed N and Grid_pnts/M, and then increase its number of MPI messages by 1, until the average number of MPI messages of all processes of the sender reaches Avg_send_msgs.
3 Determine the grid points of each MPI message. For each process of the sender, assign the corresponding grid points to all MPI messages of this process (a grid point belongs to only one MPI message):
  3.1 If the flag Is_balanced is set to true, assign the grid points to all MPI messages evenly;
  3.2 Otherwise, assign the grid points to each MPI message randomly:
    3.2.1 Assign one grid point to each MPI message.
    3.2.2 For each of the remaining grid points, randomly select an MPI message for it.
4 Determine the parallel decomposition of the receiver through assigning the grid points in each MPI message to a process of the receiver. For each process of the sender, assign the grid points in each of its MPI messages to a distinct receiver process; to keep the numbers of grid points balanced among the processes of the receiver in the final parallel decomposition, an MPI message with a larger number of grid points is assigned to the receiver process with the smallest total number of grid points assigned to it so far.
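A minimal sketch of Algorithm 1, covering only the balanced branch (Is_balanced = true) and using assumed helper names, could look as follows.

```python
# Sketch (an interpretation of Algorithm 1, balanced branch only, not the authors'
# code): build sender and receiver decompositions that give a prescribed average
# number of MPI messages per sender process in the P2P implementation.

def split_even(points, k):
    # split a list into k nearly equal chunks (empty chunks dropped)
    q, r = divmod(len(points), k)
    chunks, start = [], 0
    for i in range(k):
        size = q + (1 if i < r else 0)
        chunks.append(points[start:start + size])
        start += size
    return [c for c in chunks if c]

def generate_decompositions(m_send, n_recv, grid_pnts, avg_send_msgs):
    assert avg_send_msgs <= n_recv
    # step 1: balanced sender decomposition, ~grid_pnts/m_send points per process
    sender = [list(range(p, grid_pnts, m_send)) for p in range(m_send)]
    receiver = [[] for _ in range(n_recv)]
    for points in sender:
        # steps 2-3 (balanced case): avg_send_msgs messages with evenly split points
        msgs = sorted(split_even(points, avg_send_msgs), key=len, reverse=True)
        used = set()
        # step 4: each message goes to a distinct receiver process, bigger messages
        # to the receiver process with the fewest grid points assigned so far
        for msg in msgs:
            target = min((r for r in range(n_recv) if r not in used),
                         key=lambda r: len(receiver[r]))
            receiver[target] += msg
            used.add(target)
    return sender, receiver

snd, rcv = generate_decompositions(m_send=4, n_recv=8, grid_pnts=64, avg_send_msgs=6)
print([len(p) for p in snd], [len(p) for p in rcv])
```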

Figure 20. Average execution time (y axis) of one data transfer between the atmosphere model GAMIL2 and the land surface model CLM3 in GAMIL2-CLM3 when varying the number of processes per model (x axis): 14 coupling fields on the GAMIL2 horizontal grid (the grid size is 128 × 60 = 7680) are transferred from the land surface model CLM3 to the atmosphere model GAMIL2.

Figure 21. Average execution time (y axis) of one data rearrangement for the parallel interpolation from the atmosphere grid (the grid size is 144 × 96 = 13 824) to the ocean grid (the grid size is 320 × 384 = 122 880) in CESM when varying the number of processes per model (x axis).

Figure 10. Maximum, average, and minimum numbers of MPI messages in the P2P implementation and the butterfly implementation (y axis) when varying the number of processes per model (x axis) in GAMIL2-CLM3. The experimental set-up is similar to that shown in Fig. 1.