Data-intensive simulations are often limited by their I/O (input/output) performance, and novel techniques need to be developed to overcome this limitation. The software package pnetCDF (parallel network Common Data Form), which works with parallel file systems, was developed to address this issue by providing parallel I/O capability. This study examines the performance of an application-level data aggregation approach that aggregates data along either the row or the column dimension of MPI (Message Passing Interface) processes on a spatially decomposed domain and then applies the pnetCDF parallel I/O paradigm. Tests were performed with three domain sizes, representing small, moderately large, and large data domains, using a small-scale mock-up code of the Community Multiscale Air Quality (CMAQ) model. I/O performance was compared among the traditional serial I/O technique, straight application of pnetCDF, and data aggregation along the row or column dimension before applying pnetCDF. From this comparison, "optimal" I/O configurations of the application-level data aggregation approach were quantified. Data aggregation along the row dimension (pnetCDFcr) works better than aggregation along the column dimension (pnetCDFcc), although it may perform slightly worse than the straight pnetCDF method with a small number of processors. When the number of processors becomes larger, pnetCDFcr outperforms pnetCDF significantly. If the number of processors keeps increasing, pnetCDF reaches a point where its performance is even worse than that of the serial I/O technique. The new technique has also been tested in a real application, where it performs two times better than the straight pnetCDF paradigm.
The Community Multiscale Air Quality (CMAQ) model (Byun and Schere, 2006) is
a regional air quality model which is widely used in air quality research
and regulatory applications (e.g., Fu et al., 2012). This model was developed
in the 1990s by the US Environmental Protection Agency (US EPA) and it has
continued to evolve. Recently, CMAQ was combined with the WRF (Weather Research and Forecasting model) to form a WRF-CMAQ
two-way coupled model (Wong et al., 2012) with direct aerosol effects on
radiation. CMAQ has been and continues to be extensively used to provide
guidance in rule making such as CSAPR (Cross-State Air Pollution Rule,
CMAQ uses IOAPI (Input/Output Applications Programming Interface) to read and write data in netCDF format.
Independent I/O and collective I/O are the two most common I/O strategies in parallel applications. A shortcoming of the independent I/O approach is that the I/O requests of each process are serviced individually (Chen et al., 2010). Collective I/O provides a better solution for managing non-contiguous portions of a file interleaved across multiple processes (Thakur et al., 1999). Several collective I/O techniques have hence been developed to improve parallel I/O performance at various levels by enabling the compute nodes to cooperate in efficient parallel access to the storage system. Examples include two-phase I/O (del Rosario et al., 1993), data sieving (Thakur et al., 1999), and collective buffering (Nitzberg and Lo, 1997).
To optimize I/O performance, software is designed to access non-contiguous patterns through the implementation of collective I/O. Data is rearranged and aggregated in memory prior to being written to files, which reduces the number of disk accesses and the seek-time overhead caused by large numbers of non-contiguous write requests. Improved I/O efficiency has been observed through split writing and hierarchical striping of data (Yu et al., 2007).
The benefits of utilizing the improved parallel I/O techniques on
applications in various research areas have been recognized (Li et al.,
2003; Kordenbrock and Oldfield, 2006; Huang et al., 2014). The approach to
parallelize I/O by using the network Common Data Form (netCDF) file format led to the development of parallel netCDF (pnetCDF; Li et al., 2003).
File data striping on parallel file systems also influences I/O performance. Data is distributed with a fixed block size in a round-robin manner among the available I/O servers and disks, based on a simple striping data distribution function. An optimal striping setup on a parallel file system can significantly reduce the I/O time (Nisar et al., 2012), while inappropriate settings may incur striping overhead for both metadata and file read/write operations (Yu et al., 2007). Previous work has shown degradation of parallel I/O efficiency when large numbers of processors are applied to scientific applications such as CMAQ (Kordenbrock and Oldfield, 2006). To overcome these shortcomings, we re-engineered the current CMAQ I/O module to better utilize more processors on high-performance computing machines and quantified the optimal data-striping setup on Lustre file systems.
The Community Multiscale Air Quality (CMAQ) modeling system, an active open-source development project of the US Environmental Protection Agency, is an air quality model for regulatory and policy analysis. The interactions of atmospheric chemistry and physics are studied through this three-dimensional Eulerian atmospheric chemistry and transport modeling system. The primary goal for CMAQ is to simulate ozone, particulate matter, toxic airborne pollutants, visibility, and acidic and nutrient pollutant species within the troposphere and across spatial scales ranging from local to hemispheric.
IOAPI, a third-party software package, was created concurrently with the initial development of the CMAQ model. It provides a simple interface to read and write data in netCDF format in CMAQ. It originally operated in serial mode and was later extended to run on SMP (symmetric multiprocessing) machines using OpenMP, but it has never been implemented with the capability to run on a distributed-memory system.
Conceptual diagrams of the four I/O modules: (a) serial I/O used in the current CMAQ with the netCDF data format; the remaining panels show straight pnetCDF and pnetCDF with data aggregation along the column (pnetCDFcc) or row (pnetCDFcr) dimension.
When CMAQ was parallelized in late 1998, a "pseudo" parallel I/O library, PARIO, was created to enable CMAQ to run on a distributed system. PARIO was built on top of the IOAPI library to handle regular data operations (read and write) from each MPI (Message Passing Interface) process. Each individual processor reads its subdomain portion of data directly from the input files. For output, however, PARIO requires all processors to send their portions of data to a designated processor, i.e., processor 0, which stitches the data together and writes it en masse to the output file (Fig. 1a). This strategy has a few clear shortcomings: (1) as the number of processors increases, the network is flooded with more MPI messages and a longer synchronization time is needed to accomplish an output task; (2) if the domain size remains the same but the number of processors increases, the output data size in each processor decreases, which lowers the I/O efficiency; and (3) extra memory is needed on processor 0 to hold the entire data set before writing it to the file.
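This gather-to-one output pattern can be sketched as follows (a minimal illustration, not the actual PARIO source; the routine and variable names, such as write_serial_io and local_data, are hypothetical, and each subdomain is flattened to a 1-D buffer for brevity):

subroutine write_serial_io(local_data, local_n, comm)
  use mpi
  implicit none
  real,    intent(in) :: local_data(*)
  integer, intent(in) :: local_n, comm
  integer :: rank, nprocs, ierr, i, nbuf(1)
  integer, allocatable :: counts(:), displs(:)
  real,    allocatable :: gathered(:)

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, nprocs, ierr)
  allocate(counts(nprocs), displs(nprocs))

  ! Each process reports its subdomain size to processor 0
  nbuf(1) = local_n
  call MPI_Gather(nbuf, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, 0, comm, ierr)

  if (rank == 0) then
     displs(1) = 0
     do i = 2, nprocs
        displs(i) = displs(i-1) + counts(i-1)
     end do
     allocate(gathered(sum(counts)))  ! whole domain held on rank 0 (shortcoming 3)
  else
     allocate(gathered(1))            ! placeholder on non-root ranks
  end if

  ! All processors funnel their data to processor 0 (shortcomings 1 and 2)
  call MPI_Gatherv(local_data, local_n, MPI_REAL, &
                   gathered, counts, displs, MPI_REAL, 0, comm, ierr)

  if (rank == 0) then
     ! stitch the subdomains into the global layout, then write the
     ! whole array through IOAPI/netCDF as one serial operation
  end if
  deallocate(counts, displs, gathered)
end subroutine write_serial_io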
Besides the shortcomings mentioned in Sect. 3, IOAPI has another major drawback: it does not take advantage of existing technological advancements such as parallel file systems and parallel I/O frameworks, for example, pnetCDF. Kordenbrock and Oldfield (2006) have shown an enhancement of model I/O performance with the adoption of pnetCDF. Our new approach not only utilizes advanced parallel I/O technology, it also addresses all of the shortcomings directly. The new approach performs I/O through pnetCDF using a collective parallel netCDF API (applications programming interface) on a parallel file system, thus eliminating the first and third shortcomings discussed above.
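A minimal sketch of such a collective write is given below, assuming each process owns a ncols_l x nrows_l patch of a global ncols x nrows field; the file, dimension, and variable names are illustrative, not those used by CMAQ:

subroutine write_pnetcdf(patch, col0, row0, ncols_l, nrows_l, ncols, nrows, comm)
  use mpi
  implicit none
  include 'pnetcdf.inc'
  real,    intent(in) :: patch(ncols_l, nrows_l)
  integer, intent(in) :: col0, row0, ncols_l, nrows_l, ncols, nrows, comm
  integer :: ierr, ncid, dimids(2), varid
  integer(kind=MPI_OFFSET_KIND) :: start(2), count(2), gcols, grows

  gcols = ncols
  grows = nrows
  ierr = nfmpi_create(comm, 'conc.nc', NF_CLOBBER, MPI_INFO_NULL, ncid)
  ierr = nfmpi_def_dim(ncid, 'COL', gcols, dimids(1))
  ierr = nfmpi_def_dim(ncid, 'ROW', grows, dimids(2))
  ierr = nfmpi_def_var(ncid, 'CONC', NF_REAL, 2, dimids, varid)
  ierr = nfmpi_enddef(ncid)

  start(1) = col0                  ! 1-based global offsets of this patch
  start(2) = row0
  count(1) = ncols_l
  count(2) = nrows_l
  ! Collective call: every process in comm writes its patch in one
  ! coordinated I/O operation, with no gather onto a single processor
  ierr = nfmpi_put_vara_real_all(ncid, varid, start, count, patch)
  ierr = nfmpi_close(ncid)
end subroutine write_pnetcdf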
Spatial domain decomposition is widely used to parallelize scientific models such as CMAQ. The key characteristic of the new technique is data aggregation, which can be considered a mitigation of the second shortcoming described above. Generally speaking, data can be aggregated along either the row or the column dimension of a rectangular horizontal grid to enhance I/O efficiency. The aggregation introduces a small amount of localized MPI communication, which does not diminish the overall benefit of the technique.
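The communicator setup behind row-wise aggregation (pnetCDFcr) can be sketched as follows, assuming an nprow x npcol logical process grid; the names my_row, my_col, and build_row_aggregation are hypothetical:

subroutine build_row_aggregation(my_row, my_col, comm, row_comm, io_comm)
  use mpi
  implicit none
  integer, intent(in)  :: my_row, my_col, comm
  integer, intent(out) :: row_comm, io_comm
  integer :: ierr, color

  ! Processes sharing a row index form one communicator; within it the
  ! process with my_col == 0 acts as the aggregator for that row
  call MPI_Comm_split(comm, my_row, my_col, row_comm, ierr)

  ! The aggregators (one per row) form the I/O communicator that later
  ! issues the collective pnetCDF write with row-sized data chunks
  if (my_col == 0) then
     color = 0
  else
     color = MPI_UNDEFINED            ! non-aggregators get MPI_COMM_NULL
  end if
  call MPI_Comm_split(comm, color, my_row, io_comm, ierr)

  ! Within each row_comm, an MPI_Gatherv onto the aggregator assembles
  ! the row's subdomains into one contiguous strip before the write
end subroutine build_row_aggregation

Aggregation along the column dimension (pnetCDFcc) is constructed in the same way, with the roles of my_row and my_col exchanged.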
In order to determine the performance of this new technique, a small-scale code was devised. This code, designed to mimic the CMAQ model, contains a time-step loop with an artificial workload, and data is output at the end of each time step. The small-scale code was run with three time steps on two different machines. The following two subsections provide brief information about the machines and about how the tests were set up.
The experiments were performed on two HPC systems to examine the CMAQ I/O performance with the various methods: (1) Edison, a Cray XC30 system with a peak performance of 236 Tflops, and (2) Kraken, a Cray XT5 system operated by the National Institute for Computational Sciences (NICS).
The file system on both HPC systems is managed by Lustre, a massively parallel distributed file system that can distribute the segments of a single file across multiple object storage targets (OSTs). A striping technique is applied when a file, a linear sequence of bytes, is separated into small chunks. With this technique, both the bandwidth of accessing the file and the available disk space for storing it increase, because read and write operations can access multiple OSTs concurrently. The default setting on both Kraken and Edison is a stripe count of 1 OST and a stripe size of 1 MB.
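Striping can be set per file or per directory with the Lustre lfs setstripe command; it can also be requested at file-creation time through MPI-IO hints, which pnetCDF forwards to the MPI-IO layer. A minimal sketch follows, using the standard ROMIO hint names striping_factor and striping_unit; the file name is illustrative, and the hints take effect only when the file is first created:

program set_striping
  use mpi
  implicit none
  include 'pnetcdf.inc'
  integer :: info, ierr, ncid

  call MPI_Init(ierr)
  call MPI_Info_create(info, ierr)
  call MPI_Info_set(info, 'striping_factor', '11', ierr)    ! stripe count (number of OSTs)
  call MPI_Info_set(info, 'striping_unit', '2097152', ierr) ! stripe size: 2 MB, in bytes
  ierr = nfmpi_create(MPI_COMM_WORLD, 'out.nc', NF_CLOBBER, info, ncid)
  ierr = nfmpi_close(ncid)
  call MPI_Info_free(info, ierr)
  call MPI_Finalize(ierr)
end program set_striping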
To examine the I/O performance of each module, a small-scale model (a pseudo-code I/O module) written in Fortran90 was tested. It contains the basic functions, reading data, writing data, and performing arithmetic between the read and write operations, to imitate the complex CMAQ model with the emphasis on I/O behavior. The code cycles three times to represent three time steps, as in regular CMAQ simulations. The pseudo code of this small-scale model (the full code is available on request) looks like the following sketch, in which setup_decomposition, read_input, compute, and write_output are placeholders for the corresponding blocks of the real code:
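program pseudo_cmaq
  use mpi
  implicit none
  integer :: ierr, tstep
  integer, parameter :: nsteps = 3   ! three time steps, as in the tests

  call MPI_Init(ierr)
  call setup_decomposition()         ! spatial domain decomposition over MPI
  do tstep = 1, nsteps
     call read_input(tstep)          ! each process reads its subdomain
     call compute(tstep)             ! artificial arithmetic workload
     call write_output(tstep)        ! serial, pnetCDF, pnetCDFcc, or pnetCDFcr
  end do
  call MPI_Finalize(ierr)
end program pseudo_cmaq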
Three domain sizes were chosen to represent typical 12 km resolution settings in the CMAQ community: a small domain that covers the entire State of California and its vicinity (CA), a moderately large domain that covers the eastern United States (EUS), and a large domain that covers the continental United States (CONUS).
The results provided by the small-scale model serve as a basis to determine
the optimal striping information (count and size) for further experiments
with the pre-released CMAQ version 5.0.2: 1-day CMAQ simulations on the 4 km resolution EUS domain.
Regional representation of the CA (blue box), EUS (red box) and CONUS domains.
In Figs. 3–8 and 10, a relative performance measure, formula (1), is plotted against the stripe counts and sizes.
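Relative performance is assumed here to be the percentage reduction in write time of a tested scheme with respect to a baseline scheme, so that positive values indicate improvement and negative values degradation (consistent with the color convention in Fig. 10):

\[ \text{relative performance} = \frac{t_{\mathrm{baseline}} - t_{\mathrm{tested}}}{t_{\mathrm{baseline}}} \times 100\,\% \tag{1} \]

where the baseline is rnetCDF when pnetCDF is evaluated, and pnetCDF when pnetCDFcc or pnetCDFcr is evaluated, following the pairings named in the figure captions.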
The CA case, which represents a relatively small domain, shows a general trend in which performance degrades as the stripe count and/or the stripe size increases. For this case, pnetCDF performance can be worse than the serial approach using regular netCDF (Fig. 4). With the data aggregation technique, aggregation along the row dimension is better, and overall pnetCDFcr outperforms pnetCDF. Setting the stripe count to 5 and the stripe size to 1 MB seems to be the "optimal" setting on both machines and among all processor configurations. Furthermore, as the number of processors increases, the relative performance of pnetCDF drops.
Relative I/O performance of pnetCDF to rnetCDF, and of pnetCDFcc and pnetCDFcr to pnetCDF, on the CA domain (Figs. 3 and 4).
Relative I/O performance of pnetCDF to rnetCDF, and of pnetCDFcc and pnetCDFcr to pnetCDF, on the EUS domain (Figs. 5 and 6).
Relative I/O performance of pnetCDF to rnetCDF, and of pnetCDFcc and pnetCDFcr to pnetCDF, on the CONUS domain (Figs. 7 and 8).
The impact of stripe count and size on parallel netCDF I/O performance on the CONUS domain. Left: various stripe sizes with a fixed stripe count of 11. Right: various stripe counts with a fixed stripe size of 2 MB.
Relative performance of pnetCDF and pnetCDFcr with respect to rnetCDF in a large-number-of-processors scenario on Kraken. Red denotes positive relative performance; blue denotes negative relative performance.
The EUS case, which represents a moderately large domain, shows similar results to the CA case. The relative performance of aggregation along the row dimension is much better than that along the column dimension (Figs. 5, 6). With a small number of processors, pnetCDFcr may perform slightly worse than straight pnetCDF.
The CONUS case, which represents a relatively large domain, shows similar results to the CA and EUS cases (Figs. 7, 8). When the number of processors increases, the relative performance of pnetCDF decreases.
The maximum data size (left panel) and I/O rate (right panel) among all I/O processors in the CA (top), EUS (middle), and CONUS (bottom) domains in the pseudo-code experiment running on Edison.
Stripe size and stripe count are two of the key factors that affect I/O performance, as shown in Figs. 3–8. The CONUS domain is chosen here, with various stripe counts (2, 5, and 11) and stripe sizes (1, 2, 4, 8, and 16 MB), to summarize these effects (Fig. 9). Among all stripe counts, the cases using a stripe count of 11 demonstrated the best performance, and the differences became more significant as more processors were used. Among the stripe sizes, the 2 MB cases were better than the others: as more processors were applied, larger stripe sizes resulted in decreasing write performance, while the 2 MB cases held up relatively well. Shorter writing times were found when fewer processors were requested.
Section 5.1 showed that pnetCDF performance decreases as the number of processors increases. When the number of processors continues to increase, the performance of pnetCDF reaches a point where it is worse than that of the serial I/O scheme (Fig. 10). In contrast, the pnetCDFcr scheme continues to improve significantly as the number of processors increases.
Total write time in a 1-day CMAQ simulation with different I/O approaches on the 4 km EUS domain, with a stripe size of 2 MB and a stripe count of 11 (left: Kraken; right: Edison).
I/O efficiency is defined as the rate at which data is output. In parallel applications with a spatial domain decomposition strategy, the domain size in each processor becomes smaller as the number of processors increases (Fig. 11, left panel). It is known that the I/O rate is higher when a large chunk of data is output, and Fig. 11 (right panel), tested on Kraken, reflects this assertion. When the data is aggregated, whether along the row or the column dimension, the data size in the processor responsible for the I/O increases, as clearly shown in the left panel of Fig. 11. With data aggregation (pnetCDFcc or pnetCDFcr), the per-processor data size decreases more slowly than with the pnetCDF approach as the number of processors increases. This translates into a higher I/O rate for the aggregated schemes than for pnetCDF at the same number of processors. pnetCDFcc is worse than pnetCDFcr because of the data alignment in the internal netCDF format (row major).
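To make the size argument concrete, let D denote the total output volume on an assumed nprow x npcol process grid. Straight pnetCDF writes

\[ \frac{D}{n_{\mathrm{prow}} \, n_{\mathrm{pcol}}} \]

per process, whereas each row aggregator in pnetCDFcr writes \( D / n_{\mathrm{prow}} \); on an 8 x 8 grid, for example, the aggregated chunks are 8 times larger, which sustains a higher I/O rate.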
Based on this small-scale code experiment, the setting of an 11-stripe count and a 2 MB stripe size was selected for a real CMAQ application: a 1-day simulation on the 4 km resolution EUS domain.
We performed a series of experiments with four different I/O modules to examine their I/O efficiencies in CMAQ. First, a small-scale code was tested on three different domains, CA, EUS, and CONUS, which represent small, moderately large, and large data sizes of CMAQ outputs. The I/O modules include the serial mode currently used in CMAQ, direct application of parallel netCDF (pnetCDF), and a new technique based on data aggregation along the row or column dimension (pnetCDFcr and pnetCDFcc) before the parallel netCDF technique is applied. The experimental results show that: (1) pnetCDFcr performs better than pnetCDFcc; (2) pnetCDF performance deteriorates as the number of processors increases and becomes worse than the serial mode once a sufficiently large number of processors is used; and (3) even though pnetCDFcr does not perform as well as pnetCDF in small-number-of-processors scenarios, it outperforms pnetCDF once the number of processors becomes larger. In addition, overall "optimal" settings emerged from the experiments: a 5-stripe count and 1 MB stripe size for the small domain; an 11-stripe count and 2 MB stripe size, or a 5-stripe count and 2 MB stripe size, for the moderately large domain; and an 11-stripe count and 2 MB stripe size for the large domain.
This data aggregation I/O module was also tested in a 1-day CMAQ simulation on the 4 km EUS domain against the serial I/O mode currently implemented in CMAQ and against the conventional parallel netCDF method. The results show a significant reduction in write time when the new data-aggregated pnetCDF technique (pnetCDFcr) is used, compared with both the serial I/O approach and straight pnetCDF. With this finding, the overall runtime of scientific applications with substantial I/O requirements can be significantly reduced. A more important implication is that users can run applications on a large number of processors and still maintain a reasonable parallel speedup, thereby deferring the speedup degradation governed by Amdahl's law. Furthermore, the technique can be transferred to other environmental models that have large I/O burdens.
Yang Gao was partly supported by the Office of Science of the US Department of
Energy as part of the Regional and Global Climate Modeling Program. The
Pacific Northwest National Laboratory is operated for DOE by the Battelle
Memorial Institute (DE-AC05-76RL01830). Kraken is a supercomputing facility provided through National Science Foundation TeraGrid resources by the National Institute for Computational Sciences (NICS) under grant numbers TG-ATM110009 and UT-TENN0006.