A new ensemble-based consistency test for the Community Earth System Model (pyCECT v1.0)



Introduction
The Community Earth System Model (CESM) is a state-of-the-art, fully coupled, global climate model whose development is centered at the National Center for Atmospheric Research (NCAR) (Hurrell et al., 2013). Earth's global climate is complex, and CESM is widely used by scientists around the world to further our understanding of the past, present, and future states of the climate system. For large simulation models such as CESM, verification and validation are critical to establishing and maintaining a model's credibility, particularly when the model is used to make decisions (e.g., Carson II, 2002). Note that scientific communities interpret the terms verification and validation differently (e.g., Oberkampf and Roy, 2010), and "evaluation" has been advocated as a more appropriate term than "validation" in some literature (e.g., Oreskes et al., 1994; Oreskes, 1998). Generally, though, validation focuses on how well the model represents the real-world phenomena being modeled, and verification involves determining whether the implementation of a model is correct and matches the intended description and assumptions for the model (see, e.g., Carson II, 2002; Sargent, 2011; Whitner and Balci, 1989; Oberkampf and Roy, 2010; Goosse et al., 2014).
Software verification necessarily requires the detection and reduction of errors, or "quality assurance" (Oberkampf and Roy, 2010), and we focus on this component of verification for CESM. As with many scientific codes, development of CESM is ongoing: features are continually added, improvements are made, and software and hardware environments change. The primary motivation for this work is to ensure that changes during the development life cycle of CESM do not adversely affect the simulation. In particular, changes during CESM development that result in simulation output that is no longer bit-for-bit (BFB) identical to previous output data require attention to ensure that the output still produces the same climate (i.e., an error has not been introduced). Note that CESM simulations are expected to produce BFB reproducible output on the same machine and processor counts when the CESM version and parameters are "identical". The approach to detecting potential errors in CESM has historically been a cumbersome process at best.

A. H. Baker et al.: A new ensemble-based consistency test for the Community Earth System Model
For example, porting the CESM code to a new machine architecture results in non-BFB model output, and the current approach is as follows. First, a climate simulation of several hundred years (typically 400) is run on the new machine. Next, data from the new simulation are analyzed and compared to data from the same simulation run on a "trusted" machine. Lastly, all results are given to a senior climate scientist for approval. This informal process is not overly rigorous and relies largely on subjective evaluations. Further, running a simulation for hundreds of years is resource intensive, and this expense is exacerbated as the model grows larger and more complicated. Clearly a more rapid, objective, and accessible solution is needed, particularly because a port of CESM to a new machine is just one example of a non-BFB change that requires quality assurance testing. Other common situations that can lead to non-BFB results include experiments with new compiler versions or optimizations, code modifications that are not expected to be climate-changing, and many new exascale-computing technologies. The lack of a straightforward metric for assessing the quality of the simulation output has limited the ability of CESM users and developers to introduce potential code modifications and performance improvements that result in non-BFB reproducibility. The need for a more quantitative solution for ensuring code quality prompted our development of a new tool for assessing the impact of non-BFB changes in CESM. While verification always involves some degree of subjectivity and one cannot absolutely prove correctness (Carson II, 2002; Oberkampf and Roy, 2010), we aim to facilitate the detection of hardware, software, or human errors introduced into the simulation.
The quality assurance component of code verification implies that a degree of consistency must exist (Oberkampf and Roy, 2010). Our new method evaluates climate consistency in CESM via an ensemble-based approach that simplifies and formalizes the quality assurance piece of the current verification process. In particular, the goal of our new CESM ensemble consistency test tool, referred to as CESM-ECT, is to easily determine whether or not a change in a CESM simulation is statistically significant. The ability of this simple tool to quickly assess changes in simulation output is a significant step forward in the pursuit of more quantitative metrics for the climate modeling community. The tool has already proven invaluable in terms of providing more feedback to model developers and increasing confidence in new CESM releases. Note that we do not discuss verification of the underlying numerical model in this work, which is considered at other stages in the development of individual CESM components. Further, we do not address model validation, but mention that it is primarily conducted via hindcasts and comparisons to real-world data; e.g., the Intergovernmental Panel on Climate Change Data Distribution Centre has a large collection of observed data (IPCC Data Distribution Centre, 2015).
This paper is organized as follows. We give additional background information in Sect. 2. We describe the new CESM-ECT tool in Sect. 3. In Sect. 4, we provide results from experiments with CESM-ECT, and in Sect. 5 we give examples of the utility of the new tool in practice. Finally, we give concluding remarks and discuss future work in Sect. 6.

Background
Climate science has a strong computational component, and the climate codes used in this discipline are typically complex and large in size (e.g., Easterbrook et al., 2011; Pipitone and Easterbrook, 2012), making the thorough evaluation of climate model software quite challenging (Clune and Rood, 2011). In particular, the CESM code base, which has been developed over the last 20 years, currently contains about one and a half million lines of code. CESM consists of multiple geophysical component models of the atmosphere, ocean, land, sea ice, land ice, and rivers. These components can all run on different grid resolutions, exchanging boundary data with each other through a central coupler. Because CESM supports a variety of spatial resolutions and timescales, simulations can be run on state-of-the-art supercomputers as well as on an individual scientist's laptop. The myriad of model configurations available to the user contributes to the difficulty of exhaustive software testing (Clune and Rood, 2011; Pipitone and Easterbrook, 2012). A particularly fascinating and in-depth description of the challenges of scientific software in general, and climate modeling software in particular, can be found in Easterbrook and Johns (2009). Furthermore, the societal importance of better understanding Earth's climate is such that every effort must be made to verify climate codes as well as possible (e.g., Easterbrook et al., 2011).
In general, scientific codes are often in a near-constant state of development as new science capabilities are added and requirements change, and this is certainly true for CESM and other global climate models. However, despite the complexity of climate software, both the constant enrichment of the code base and the manner in which it has evolved over time have resulted in an overall quality of software superior to that of other open-source projects (Pipitone and Easterbrook, 2012). Yet the pace of evolution of the code requires that issues of correctness, reproducibility, and software quality are frequently addressed. Coarse-grained testing is a common practice in climate modeling, and this global approach is useful for detecting the existence of errors in the software or input stack or the software and hardware environment (Clune and Rood, 2011). This approach does not offer information as to the source of the error but rather as to whether or not one may exist. The goal of coarse-grained testing is not to prove correctness but to point out potential incorrectness. Fine-grained testing is needed to identify the source of errors and typically occurs within the individual CESM component models. Our focus in this work is on a coarse-grained approach to software quality assurance and, for climate models, this global approach typically takes the form of analysis of simulation output (Easterbrook and Johns, 2009). Visualizations of model output are commonly examined by climate scientists, and achieving BFB identical results has been quite important to the climate community (Easterbrook and Johns, 2009; Pipitone and Easterbrook, 2012). If changes in the source code or software and hardware environment yield results that are BFB identical to those of the previous version, then this verification step is trivial. However, depending on the nature of the change, achieving BFB results from one run to the next is not always possible. For example, in the context of porting the code to a new machine architecture, machine-rounding level changes can propagate rapidly in a climate model (Rosinski and Williamson, 1997). In fact, changes in hardware, software stack, compiler version, and CESM source code can all cause round-off level or larger changes in the model simulation results, and the emergence of some heterogeneous computing technologies inhibits BFB reproducibility as well.
Some of the difficulties caused by differences due to truncation and rounding in climate codes that result in non-BFB simulation data are discussed in Clune and Rood (2011). In particular, the authors cite the need for determining acceptable error tolerances and the concern that seemingly minor software changes can result in a different climate if the simulation is not run for a sufficient amount of time. The work in Rosinski and Williamson (1997) is also of interest and aims to determine the validity of a simulation when migrating to a new architecture. They minimize the computational expense of a long run by setting tolerances for rounding accumulation growth based on the growth of a small perturbation in the atmospheric temperature after several days. However, this test is no longer applicable to the atmospheric component of CESM, called the Community Atmosphere Model (CAM), because the parameterizations in CAM5 are ill-conditioned in the sense that small perturbations in the input produce large perturbations in the output. The result is that the tolerances for rounding accumulation growth are exceeded within the first few time steps. Our work builds on this idea of gauging the effects of a small temperature perturbation on the simulation, though improvements in software and hardware allow us to extend the simulation duration well beyond several days. Further, by looking only at climate signals, we relax the restriction on how the parameterizations respond.

A new method for evaluating consistency
In this section, we present and discuss a new ensemble consistency test for CESM, called CESM-ECT. We first give a broad overview, followed by more details in the subsequent subsections. As noted, CESM's evolving code base and the demand to run on new machine architectures often result in data that are not BFB identical to previous data. Therefore, our new tool for CESM must determine whether or not the new configuration (e.g., code generated with a different compiler option, on a new architecture, or after a non-climate-changing code modification) should be accepted. For our purposes, we accept the new configuration if its output data is statistically indistinguishable from the original data, where the original data refers to data generated on a trusted machine with an accepted version of the software stack. Our tool must
- determine whether or not data from a new configuration is consistent with the original data,
- indicate the level of confidence in its determination (e.g., false positive rate), and
- be user-friendly in terms of ease of use and minimal computational requirements for the end-user.
Note that this new tool takes a coarse-grained approach to detecting statistical differences. Its purpose is not to isolate the source of an inconsistency but rather to indicate the likelihood that one exists. To this end, the CESM-ECT tool works as follows. The first step requires the creation of an ensemble of simulations in an accepted environment representing the original data. The second step uses the ensemble data to determine the statistical distributions that describe the original data. Next, several simulations representing the new data are obtained. Finally, a determination is made as to whether the new data are statistically similar to the original ensemble data.

Preliminaries
CESM data are written to "history" files in time slices in NetCDF format for post-processing analysis. Data in history files are of single precision (by default). For this initial work, we focus on history data from the CAM component in CESM, which is actively developed at NCAR. We chose to begin with CAM because the timescales for changes propagating through the atmosphere are relatively short compared to the longer timescales of other components, such as the ocean, ice, or land models. Further, the set of CAM global output variables is diverse, and the default number for our CESM configuration (detailed in the next section) is on the order of 130. An error in CAM would certainly affect the other model components in fully coupled CESM situations; however, we cannot assume that CAM data passing CESM-ECT implies that the remaining components would also pass. Data from other components (e.g., ocean, ice, and land) will be addressed in future work, though we give an example in Sect. 5 of detecting errors stemming from the ice component with CESM-ECT.

An ensemble method
The development of a tool like CESM-ECT necessitates the determination of error tolerances that can be used to evaluate whether differences in climate data are significant. Requiring that the difference be less than the natural variability of the climate system makes sense intuitively and is along the lines of Condition 2 in Rosinski and Williamson (1997). However, characterizing the natural variability is difficult with a single run of the original simulation. Therefore, we extend the sampling of the original data to an ensemble from which we can obtain a statistical distribution. An ensemble refers to a collection of multiple realizations of the same model simulation, generated to represent possible states of the system (e.g., Dai et al., 2001). Generally, small perturbations in the initial conditions are used to generate the ensemble members, and the idea is to characterize the climate system with a representative distribution (as opposed to a single run). Ensembles are commonly used in climate modeling and weather forecasting (see, e.g., Dai et al., 2001; Zhu and Toth, 2008; von Storch and Zwiers, 2013; Zhu, 2005; Sansom et al., 2013) to enhance model confidence, indicate uncertainty, and improve predictions. For example, the ensemble in Kay et al. (2015) was created by small perturbations to the initial temperature condition in CAM and is being used to study internal climate variability.
We generate our ensemble for CESM-ECT by running simulations that differ only in a random perturbation of the initial atmospheric temperature field of O(10^-14). These perturbations grow to the size of NWP (numerical weather prediction) analysis errors in a few hours. Each simulation is 1 year in length, which is short enough to be computationally reasonable, yet of sufficient length to allow the effects of the perturbation to propagate through the system. A perturbation of this size should not be climate-changing and, while 1 year is inadequate to establish a climate, it is sufficient for generating the statistical distribution that we need. In particular, while the trajectories of the ensemble members will rapidly diverge due to the chaotic nonlinearity of the model, the statistical properties of the ensemble members are expected to be the same. Determining the appropriate number of ensemble members requires a balance between computational and storage costs and the quality of the distribution. The lower bound on the size is constrained by our use of principal component analysis (PCA), which is described in the next subsection. PCA requires that the number of ensemble members be larger than the number of CAM variables. We chose an initial ensemble size, denoted by N_ens, of 151 for CESM-ECT. At this size, the coefficient of variation for each CAM variable is well under 5 %, save for two variables that are known to have large distributions across the ensemble (meridional surface stress and meridional flux of zonal momentum). The cost to generate the ensemble is reasonable because all N_ens members can be run in parallel, resulting in a much faster turnaround time than for a single multi-century run (a single 1-year simulation can run in a couple of hours on fewer than a thousand cores). Note that, as explained further in Sect. 3.5, an ensemble is only generated for the control and not for the code to be tested. Hence, the ensemble creation does not impact the CESM-ECT user.
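As an illustration of what an O(10^-14) perturbation looks like numerically, the following minimal Python sketch perturbs a made-up temperature field. It is not the CESM mechanism: the function name, the `pertlim` parameter name, and the choice of a multiplicative (relative) perturbation drawn uniformly from [-1, 1] are all assumptions made for this example.

```python
import random

def perturb_temperature(temps, pertlim=1.0e-14, seed=0):
    """Return a copy of `temps` with a random relative perturbation.

    Assumption: the perturbation is multiplicative, t * (1 + pertlim * u)
    with u uniform in [-1, 1]. `pertlim` is a hypothetical parameter name.
    """
    rng = random.Random(seed)
    return [t * (1.0 + pertlim * rng.uniform(-1.0, 1.0)) for t in temps]

# Made-up initial temperatures in kelvin (not CESM data).
base = [250.0 + 0.5 * i for i in range(10)]

# Each ensemble member uses a different seed, so members differ only
# in the tiny perturbation of the initial temperature field.
members = [perturb_temperature(base, seed=m) for m in range(1, 4)]

# Largest relative change across all members and grid points.
max_rel = max(abs(p - b) / b for mem in members for p, b in zip(mem, base))
```

A relative perturbation is used here because an absolute change of 10^-14 K would be smaller than the double-precision spacing near 250 K and would be lost to rounding.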
In summary, the CESM-ECT ensemble consists of N_ens = 151 1-year climate simulations, denoted by E = {E_1, E_2, ..., E_{N_ens}}, and is produced on a trusted machine with an accepted version, model, and configuration of the climate code. The data for these 1-year ensemble runs consist of annual temporal averages at each grid point for the selected grid resolution for all N_var variables, which are either two- or three-dimensional. Retaining only the annual temporal averages for each variable helps to reduce the cost of storing the ensemble simulation output and has proved sufficient for our purposes. We denote the data set for a variable X as X = {x_1, x_2, ..., x_{N_X}}, where x_i is a scalar that represents the annual (temporal) average at grid point i and N_X is the total number of grid points in X (determined by whether X is a 2-D or 3-D variable).

Characterizing the ensemble data
The next stage in our process is the creation of the statistical distributions that describe the ensemble data. In particular, information collected from the ensemble simulations helps to characterize the internal variability of the climate model system. Results from new simulations (resulting from a non-BFB change) can then be compared to the ensemble distribution to determine consistency.
First, based on the ensemble simulation output, CESM-ECT calculates the global area-weighted mean distributions, providing climate scientists with an indication of the average state and variability across the control ensemble for each variable. However, determining whether or not the climate in the new run is consistent with the ensemble data based on the number of variables that fall within the global mean distribution (or other specified tolerance) is difficult without a linearly independent set of variables. For the CESM 1.3.x series, 134 variables are output by default for CAM. We exclude several redundant variables as well as those with zero variance across the ensemble (e.g., specified variables common to all ensemble runs) from our analysis, resulting in N_var = 120 variables total (see Appendix A for more detail). A correlation analysis shows that many of these variables are highly correlated (> 0.9). In fact, 52 variables are highly correlated in the global mean. Determining objective and statistically motivated criteria (such as false positive rates) necessitated a transformation of our variable-based data to a linearly independent data space. We use PCA, a popular tool in data analysis, to determine the orthogonal transform needed to convert the ensemble variable values into a set of principal component scores. The principal components are orthogonal and indicate the directions in which there is the most variance, i.e., in which the data is the most "spread out", thereby exposing underlying structure in the data that might otherwise be overlooked (e.g., Shlens, 2014). A second well-known advantage of PCA is that most of the variance in the system ends up being represented by many fewer components than the original number of variables, which simplifies analysis, particularly when there is a large number of variables.
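The global area-weighted mean that underlies this analysis can be sketched in a few lines of Python. This is an illustrative sketch rather than the pyCECT implementation: the function name, the array layout (levels by columns), and the equal weighting of vertical levels for 3-D variables are assumptions made for this example.

```python
import numpy as np

def global_mean(field, area):
    """Area-weighted global mean of one annually averaged variable.

    field: shape (ncol,) for a 2-D variable or (nlev, ncol) for a 3-D one.
    area:  shape (ncol,), the area weight of each horizontal grid point.
    Assumption: vertical levels of a 3-D variable are weighted equally.
    """
    field = np.asarray(field, dtype=np.float64)
    # Collapse the vertical dimension first for 3-D variables.
    col_mean = field.mean(axis=0) if field.ndim == 2 else field
    return float(np.average(col_mean, weights=area))

# Toy example with two grid points whose areas are in a 1:3 ratio.
area = np.array([1.0, 3.0])
gm_2d = global_mean([10.0, 20.0], area)                    # (10 + 3*20) / 4
gm_3d = global_mean([[10.0, 20.0], [30.0, 40.0]], area)    # (20 + 3*30) / 4
```

Applying such a function to every variable in every ensemble member yields one number per variable and member, which is the input to the PCA step.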
CESM-ECT applies PCA-based testing to the global mean data, and the implementation of the PCA-based testing strategy in our tool entails the following steps. First, for each ensemble member m, the global area-weighted mean is calculated for each variable X across all grid points i and denoted by X_m. Next, we standardize the N_var × N_ens matrix containing the global means for each variable in each ensemble member and denote the result by V_gm. Note that N_var = 120 and N_var < N_ens. Standardization of the data involves subtracting the ensemble mean and dividing by the ensemble standard deviation for each variable and is important because the CAM variables have vastly different units and magnitudes. Next, we calculate the transformation matrix, or "loadings", that projects the variable space V_gm into principal component (PC) space. The loading matrix P_gm has the size N_var × N_var and corresponds to the eigenvector decomposition of the covariance of V_gm, ordered such that the first PC corresponds to the largest eigenvalue and decreasing from there. Finally, we apply the transformation to V_gm to obtain the PC scores, S_gm, for our ensemble:

S_gm = P_gm^T V_gm.

Now instead of using a distribution of variable global means to represent our ensemble, the N_var × N_ens matrix S_gm forms a distribution of PC scores that represents the variance structure in the data. These scores have a mean of zero, so we only need to calculate the standard deviation of the ensemble scores in S_gm, which we denote by σ_{S_gm}. To summarize, this first stage computes the following data, which are stored in the ensemble summary file:
- the ensemble mean and standard deviation of the global means for each variable, µ_{V_gm} and σ_{V_gm};
- the loading matrix P_gm;
- the standard deviation of the ensemble PC scores, σ_{S_gm}.
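The standardization and PCA steps described above can be sketched with NumPy as follows. This is a minimal illustration on random toy data, not the pyEnsSum code: the function and variable names are invented for the example, and the use of the sample standard deviation (`ddof=1`) and of eigenvectors stored as columns of the loading matrix are assumptions.

```python
import numpy as np

def ensemble_summary(gm):
    """Summarize an (n_var, n_ens) array of global means.

    Returns the per-variable mean and standard deviation, the loading
    matrix (eigenvectors as columns, largest eigenvalue first), and the
    standard deviation of the ensemble PC scores.
    """
    mu = gm.mean(axis=1, keepdims=True)
    sigma = gm.std(axis=1, ddof=1, keepdims=True)
    v = (gm - mu) / sigma                    # standardized global means
    cov = np.cov(v)                          # (n_var, n_var); rows are variables
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending variance
    loadings = eigvecs[:, order]
    scores = loadings.T @ v                  # PC scores of the ensemble
    sigma_scores = scores.std(axis=1, ddof=1)
    return mu.ravel(), sigma.ravel(), loadings, sigma_scores

# Toy data: 5 "variables" with very different magnitudes, 40 "members".
rng = np.random.default_rng(0)
scales = np.array([[1.0], [10.0], [100.0], [0.1], [5.0]])
gm = rng.normal(size=(5, 40)) * scales
mu, sigma, loadings, sigma_scores = ensemble_summary(gm)
```

Because the variables are standardized first, the squared score standard deviations sum to the number of variables, and no single variable dominates merely because of its units.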

Determining a pass or fail
The last step in the CESM-ECT procedure evaluates whether the new output data that has resulted from the non-BFB change is statistically distinguishable from the original ensemble data, as represented by the ensemble summary file. For simplicity of discussion, assume that we want to evaluate whether the results obtained on a new machine, Yosemite, are consistent (i.e., not statistically distinguishable) with those on Yellowstone. To do this, we collect data from a small number (N_new) of randomly selected ensemble runs on Yosemite. Variables in the new data sets are denoted by X, where X = {x_1, x_2, ..., x_{N_X}}. The CESM-ECT tool then decides whether or not the output data from simulations on Yosemite are consistent with the ensemble data and issues an overall pass or fail result.
CESM-ECT determines an overall pass or fail in the following manner. First, the area-weighted global means for each variable X in all N_new runs are calculated, X_k (k = 1, ..., N_new). These new variable means are then standardized using the mean and standard deviation of the control ensemble given in the summary file (µ_{V_gm} and σ_{V_gm}). Second, the standardized means are converted to scores via the loading matrix P_gm from the summary file. Next, we determine whether the first N_PC scores of the new runs are within m_σ standard deviations of the mean, using the standard deviation of the zero-mean scores for the ensemble in the summary file (σ_{S_gm}). Then, for each of the N_new Yosemite simulations, the PC scores that fall outside the m_σ confidence interval are tagged as a "fail" for that particular run. Finally, CESM-ECT decides whether the simulations on Yosemite are consistent with those on Yellowstone by counting the number of PCs that failed in at least N_runFails runs, where N_runFails ≤ N_new. If at least N_pcFails PCs fail at least N_runFails runs, then CESM-ECT returns an overall "failure".
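The counting rule in the final step can be written compactly. The sketch below is an illustration in plain Python and not the pyCECT source: the function name, argument names, and the toy scores are invented for the example, and the PC scores of the new runs are assumed to have been computed already.

```python
def ect_decision(new_scores, sigma_scores, n_pc=50, m_sigma=2.0,
                 n_run_fails=2, n_pc_fails=3):
    """Apply the pass/fail counting rule to already-computed PC scores.

    new_scores:   one sequence of PC scores per new run (N_new sequences).
    sigma_scores: ensemble standard deviation of each PC score.
    Returns (passed, failing_pcs).
    """
    failing_pcs = []
    for pc in range(n_pc):
        # Ensemble scores have zero mean, so the interval is centered on 0.
        runs_failed = sum(abs(run[pc]) > m_sigma * sigma_scores[pc]
                          for run in new_scores)
        if runs_failed >= n_run_fails:
            failing_pcs.append(pc)
    # Overall failure when at least n_pc_fails PCs fail enough runs.
    return len(failing_pcs) < n_pc_fails, failing_pcs

# Toy example with 5 PCs and unit standard deviations (made-up scores).
sigma = [1.0] * 5
bad_runs = [[3, 0, 3, 3, 0], [3, 0, 0, 3, 0], [0, 0, 3, 3, 0]]
ok_runs = [[3, 0, 0, 0, 0], [3, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
bad_result = ect_decision(bad_runs, sigma, n_pc=5)
ok_result = ect_decision(ok_runs, sigma, n_pc=5)
```

In the first toy case PCs 0, 2, and 3 each fail at least two of the three runs, so the overall result is a failure; in the second only PC 0 does, so the configuration passes.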
In typical applications, PC scores with small contributions to the total variability are neglected, and one only examines the first N_PC components in an analysis. However, in the context of detecting errors in the hardware or software system, the PCs that are responsible for the most variability are not necessarily the most relevant. Recalling that each PC is a linear combination of all of the variables, we use a value for N_PC that both contains sufficient information to detect errors in any of the variables and allows for a low false positive rate. Our extensive testing indicates that N_PC = 50 is sufficient to detect errors for our particular setup.
The parameters m_σ, N_new, N_pcFails, and N_runFails are also chosen to obtain a desired false positive rate. We performed an empirical simulation study and tested a variety of combinations of parameters. We found that choosing m_σ = 2 (which corresponds to the 95 % confidence level), N_new = 3, N_pcFails = 3, and N_runFails = 2 yields our desired false positive rate of 0.5 %. To summarize, we run three simulations on Yosemite, and if at least three of the same PCs fail for at least two of these runs, then CESM-ECT issues a "failure". We intentionally err on the conservative side by choosing a low false positive rate, hedging against the possibility that our ensemble may not be capturing all the variability that we want to accept. Also note that while perturbing the initial temperature condition is a common method of ensemble creation for studying climate variability, other possibilities exist, and we are currently conducting further research on the initial ensemble composition and its representation of the range of variability, particularly in regard to compilers and machine modifications.
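As a rough plausibility check on these parameter choices, one can compute the false positive rate that the counting rule would have under strongly idealized assumptions: every PC score is Gaussian, and all PCs and runs are statistically independent. The 0.5 % quoted above comes from an empirical study, not from this calculation, and the function below is a hypothetical helper written for this illustration.

```python
import math

def idealized_false_positive_rate(n_pc=50, m_sigma=2.0, n_new=3,
                                  n_run_fails=2, n_pc_fails=3):
    """False positive rate assuming independent Gaussian PC scores."""
    # Probability that one score lies outside +/- m_sigma standard deviations.
    p_run = 1.0 - math.erf(m_sigma / math.sqrt(2.0))
    # Probability that one PC fails at least n_run_fails of the n_new runs.
    p_pc = sum(math.comb(n_new, k) * p_run**k * (1.0 - p_run)**(n_new - k)
               for k in range(n_run_fails, n_new + 1))
    # Probability that fewer than n_pc_fails of the n_pc PCs fail (a pass).
    p_pass = sum(math.comb(n_pc, k) * p_pc**k * (1.0 - p_pc)**(n_pc - k)
                 for k in range(n_pc_fails))
    return 1.0 - p_pass

rate = idealized_false_positive_rate()
```

With the default parameters this idealized rate is about 0.35 %, the same order of magnitude as the empirically determined 0.5 %. Real ensemble scores need not be exactly Gaussian, so the two numbers are not expected to agree closely.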

CESM-ECT software tools
Finally, we further discuss the software tools needed to test for ensemble consistency that are included in the CESM public releases (see Sect. 6 for details). Generating the ensemble simulation data by setting up and running the N_ens = 151 1-year simulations is the most compute-intensive step in this ensemble consistency-testing process. The CESM Software Engineering group generates ensembles as needed. For example, generating new ensemble simulation data is now routine when a CESM software tag is created that contains a scientific change known to alter the climate from the previous tag. (The frequency of such tag creation varies, but is several times a year on average.) While the utility used to generate the ensemble runs is included in CESM releases, the typical end-user does not need to generate their own ensembles. Note that our consistency-testing methodology can be extended to other simulation models and, in that case, an application-specific tool to facilitate the generation of N_ens simulations would be needed for the new application.
Whenever a new ensemble of simulations is generated, a summary file (as described in Sect. 3.3) must be created for the ensemble. The ensemble summary utility (pyEnsSum), written in parallel Python, creates a NetCDF summary for any specified number (N_ens) of output files. This step requires far less time than it takes to run the simulations themselves. As an example, generating the summary file for 151 ensemble members on 42 cores of Yellowstone takes about 20 min (we chose the number of cores to be equal to the number of 3-D variables). Note that the summary creation takes less than a minute when we only compute the information needed for the PCA test (i.e., exclude optional calculations of quantities such as the root-mean-squared Z scores). Each CESM software tag now includes the corresponding ensemble summary file. Including the summary file in the CESM releases facilitates tracking data changes in the software life cycle and enables CESM users to run CESM-ECT without creating an ensemble of simulations themselves. Note that the storage cost for a single summary file is minor compared to the cost of storing the simulation output for the entire ensemble.
In addition to an ensemble summary file, our Python tool CESM-ECT (pyCECT) requires N_new = 3 1-year simulations from the configuration that is to be tested. For a CESM developer or advanced user, this may mean using a development version of code with a modification that needs to be tested. For a basic CESM user, this may mean verifying that the user's installation of CESM on their personal machine is acceptable. In either case, a simple shell script that creates 1-year CESM run cases (with random initial perturbations) for this purpose is also included in CESM releases, though advanced users can certainly generate more custom simulations if desired. Regardless, after the N_new simulations have been completed, pyCECT determines whether results from the new configuration are consistent with the original ensemble data based on the supplied new CAM output files and specified ensemble summary file. Then pyCECT reports whether the new configuration has passed or failed the consistency test, as well as which PCs in particular have passed or failed each of the N_new simulations contributing to the overall pass/fail rating. In addition, the user may assign values to the pyCECT parameters m_σ, N_new, N_pcFails, N_runFails, and N_PC via input parameters if the defaults are not desired.
For clarity, Fig. 2 illustrates the workflow for the CESM-ECT process. The two Python tools are indicated by green circles. The dashed blue box delineates the work done pre-release by the CESM software engineers. If a CESM user wants to evaluate a new configuration, the user simply executes the steps in the dashed red box.

Experimental studies
As noted in Sect. 1, a verification process necessarily includes some degree of subjectivity. The decision to designate our initial ensemble distribution as "accepted" is critical to our methodology and yet, despite ongoing research, we cannot (ever) be absolutely sure that this distribution is "correct" in terms of capturing all signatures that lead to the same climate. Our confidence in this initial ensemble distribution is due, in part, to the vast experience and intuition of the CESM climate scientists. However, we gain further confidence with a series of tests of trusted scenarios (i.e., scenarios that we expect to produce the same climate) and verify that those scenarios pass the CESM-ECT. Similarly, we sample scenarios that we expect to be climate-changing and should, therefore, fail.

Preliminaries
We obtained the results in this work from the 1.3 release series of CESM, using a present-day F compset (active atmosphere and land, data ocean, and prescribed ice concentration) and CAM5 physics. We examine 120 (out of a possible 134) variables from the CAM history files, as redundant variables and those with no variance are excluded. Of the 120 variables, 78 are two-dimensional and 42 are three-dimensional. This spectral-element version of CAM uses a ne = 30 resolution ("ne" refers to the number of elements on the edge of the cube), which corresponds approximately to a 1° global grid containing a total of 48 602 horizontal grid points and 30 vertical levels. Unless otherwise noted, simulations were run with 900 MPI (Message Passing Interface) tasks and two OpenMP (Open Multi-Processing) threads per task on the Yellowstone machine at NCAR. The default compiler on Yellowstone for our CESM version is Intel 13.1.2 with -O2 optimization.

Non-climate changing modifications
First we look at modifications that lead to non-BFB results but are not expected to be climate-changing. Such modifications include equivalent code formulations that result in the reordering of floating-point arithmetic operations, thus affecting the rounding error. Two common CESM configuration changes that induce reordering of arithmetic operations are removing thread-level parallelism from the model and certain compiler changes. We expect that the following tests on Yellowstone will not be climate-changing and, thus, will be consistent with our initial ensemble distribution.
-INTEL-15: changing the Intel compiler version to 15.0.0.
These five scenarios differ from the control run used to generate the ensemble only in the single aspect listed above. We first generate $N_{\mathrm{new}} = 3$ simulations on Yellowstone corresponding to each test scenario, where each simulation is given a perturbation selected at random from the perturbations used to create the initial ensemble. Table 2 lists the pass/fail result from pyCECT and indicates that none of these modifications caused a failure. Recall that our criterion for failure in pyCECT is that at least three PCs must fail at least two of the runs. Table 2 shows that at most two PCs failed two runs for these particular test scenarios.
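The failure criterion stated above can be sketched in a few lines of Python. This is an illustrative sketch, not the pyCECT source: the function name `ect_decision` and its arguments are hypothetical, and the input is assumed to be a boolean table that already records, for each new run and each PC, whether that PC score was judged to fail (how an individual PC score is judged is outside the scope of this sketch).

```python
import numpy as np

def ect_decision(pc_fails, min_runs=2, min_pcs=3):
    """Apply the overall pass/fail rule to a table of per-run PC failures.

    pc_fails : boolean array of shape (n_runs, n_pcs); True where a PC
               score of a new run was judged to fail (assumed precomputed).
    min_runs : a PC counts against the test if it fails at least this many runs.
    min_pcs  : the test fails if at least this many PCs count against it.
    """
    pc_fails = np.asarray(pc_fails, dtype=bool)
    # For each PC, count how many of the new runs it failed.
    runs_failed_per_pc = pc_fails.sum(axis=0)
    # Number of PCs that failed in at least `min_runs` runs.
    n_failing_pcs = int((runs_failed_per_pc >= min_runs).sum())
    verdict = "FAIL" if n_failing_pcs >= min_pcs else "PASS"
    return verdict, n_failing_pcs

# Three new runs, five PCs (made-up values for illustration only).
borderline = [[1, 1, 0, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 0, 0, 0]]
print(ect_decision(borderline))  # two PCs fail two runs, so the test passes
```

With the default thresholds, a scenario in which only two PCs fail two runs still passes, which corresponds to the borderline situation described for Table 2.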

CAM climate-changing parameter modifications
CESM-ECT must also successfully detect changes to the simulation results that are known to be climate-changing and return a failure. To this end, climate scientists provided a list of CAM input parameters thought to affect the climate in a non-trivial manner. Parameter values were modified to be those intended for use with different CAM configurations (e.g., high-resolution and finite volume). We ran the following test scenarios, which were identical to the default ensemble case with the exception of the noted CAM parameter change (the name of the CAM parameter is indicated in italics and its original default value in parentheses).
Table 1 shows that most of these tests fail by far more than three PCs, indicating that the new simulation data are quite different from the original ensemble data. However, contrary to our initial expectations, one scenario was found to be consistent and passed. Upon further investigation, we concluded that the change caused by NU likely did affect some aspects of the climate, but in a way that would not be detected by the test. The issue is that modifications to NU cause changes at the small scales (but not to the mean of the field the diffusion is applied to) and generally affect the extremes of climate variables (such as precipitation).
Because CESM-ECT looks at variable annual global means, the "pass" result is not entirely surprising, as errors in small-scale behavior are unlikely to be detected in a yearly global mean. Developing the capability to detect the influence of small-scale events is a subject for future work.
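The insensitivity of a global mean to small-scale changes can be illustrated with a synthetic example. The sketch below is not the NU experiment: the field is random data, the grid weights are assumed to be equal (the real spectral-element grid is not equal-area), and all variable names are hypothetical. It only shows that a perturbation with zero weighted mean leaves the global mean unchanged to rounding error while still altering the extremes.

```python
import numpy as np

rng = np.random.default_rng(0)

n_points = 48602                             # horizontal grid points at ne = 30
weights = np.full(n_points, 1.0 / n_points)  # assumption: equal-area weights

# Synthetic, skewed "precipitation-like" field (illustrative values only).
field = rng.gamma(shape=2.0, scale=1.5, size=n_points)

# Small-scale perturbation with its weighted mean removed, so that it
# changes local values and extremes but not the global mean.
noise = rng.normal(0.0, 1.0, size=n_points)
noise -= np.sum(weights * noise)
perturbed = field + noise

mean_shift = abs(np.sum(weights * perturbed) - np.sum(weights * field))
max_shift = abs(perturbed.max() - field.max())

print(f"change in global mean:    {mean_shift:.3e}")
print(f"change in global maximum: {max_shift:.3e}")
```

A test statistic built only on the global mean cannot distinguish `field` from `perturbed`, whereas a statistic based on extremes could.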

Modifications with unknown outcome
Now we present results for simulations in which we had less confidence in the expected outcome. These include running our default CESM simulation on other CESM-supported machines as well as changing to a higher level of optimization on Yellowstone (−O3). We expected that the tests on other machines supported by CESM would pass, and, for each machine, we list the machine name and location below (and give the processor and compiler type in parentheses). The effect of −O3 compiler options was not known as the CESM code base is large and level-three optimizations can be quite aggressive. The following simulations were performed.
Note that we use the CESM-specified default compiler option for each CESM-supported machine. Table 3 indicates that most of the CESM-supported machine configurations pass (the nine test scenarios above the horizontal line), and the few that fail are all near the pass/fail threshold. In other words, these machine failures are in contrast to the more egregious failures obtained by changing CAM parameters as in Table 1.
However, ideally all CESM-supported machines would pass our test (assuming the absence of error in their hardware and software environments), so a better understanding of the variability introduced by the environments of other machines (i.e., not Yellowstone) is needed. Therefore, as a first step, we ran additional tests on Mira and Bluewaters with the goal of better understanding (and substantiating) the failures in Table 3. For each machine, we ran seven more sets of three randomly perturbed simulations. Thus, we have a total of eight experiments each for Mira and Bluewaters, counting the original in Table 3. Furthermore, we created three additional ensembles of 151 simulations based on the PGI, GNU, and NO-OPT scenarios listed in Sect. 4.2 and created a summary file for each. Thus, we can test the eight new cases for consistency on both machines against a total of four ensembles to better understand the effect of the compiler on the consistency assessment. Results from these experiments are shown for Bluewaters and Mira in Figs. 3 and 4, respectively. Note that the Intel ensemble is the default "accepted" ensemble that we have used thus far in our experiments and the No-Opt option is also the Intel compiler (with −O0).
The results in Figs. 3 and 4 indicate that the compiler choice for the control ensemble on Yellowstone results in differences in the numbers of PC scores that fail each individual test case. However, the overall outcome from all four control ensembles is similar in that the test results are split in terms of passes and fails, indicating that these are in fact borderline cases for CESM-ECT with the current failure criteria, which require at least three PCs to fail at least two runs. Test scenarios that very nearly pass or fail, such as these for Bluewaters and Mira, underscore the difficulty in distinguishing a bug in the hardware or software from the natural variability present in the climate system. Certainly we do not expect to perfect CESM-ECT to the point where a pass or fail is a definitive indication of the absence or presence of a problem, though we have obtained a large amount of data to date that we will explore in detail to better characterize the effects of compiler and architecture differences on the variability. We expect to report on our further analysis in future work. Finally, another difficulty for our tool is that while PCA will indicate the existence of different signatures of variability between new simulations and the ensemble, the differences detected may not necessarily be important in terms of the produced climate and the decision on whether to accept or reject that climate (e.g., because the definition of climate requires more than 1 year and involves spatial distributions).
The last three experiments listed above and in Table 3 involve either modifying the optimization to a more aggressive level (INTEL13-O3) or additionally upgrading the compiler version (INTEL14-O3 and INTEL15-O3). Our results for INTEL15-O3 suggest that there is an issue with that version of the compiler. Note that because of the size of the CESM code base, pinpointing a problem with a specific compiler version is time intensive, and we find it more productive not to use that compiler.

CESM-ECT in practice
CESM-ECT has already been successfully integrated into the CESM software engineering workflow. In particular, the creation of a new beta release tag in the CESM development trunk (that is not BFB with the previous tag) requires that CESM-ECT be run for the new tag on all CESM-supported platforms (e.g., the machines listed in Sect. 4.4) and the supported compilers on those platforms (e.g., Intel, GNU and PGI, all with −O2, on Yellowstone). Results from these tests are kept in the CESM testing database. Failure on one or more of the test platforms signals that an error may exist in the new tag or on a particular machine, spawning an investigation and delay of the beta tag release. CESM-ECT has proven its utility on numerous occasions, and we now provide several specific examples of the success of this consistency-testing methodology in practice.
The first example concerns an early success for our ensemble-based testing methodology. The consistency test for a CESM 1.2 series beta tag on the Mira machine failed decisively, while the consistency tests on all other platforms passed. The CESM-ECT failure prompted an extensive investigation of the Mira simulation data, which resulted in the discovery that the CAM energy balance was incorrect. Eventually an error was discovered in the stochastic cloud generator code that only manifested itself on big-endian systems (Mira was the only big-endian machine in the group of CESM-supported machines). Because this particular success occurred early in the research and development stages of CESM-ECT (when we were initially looking at root-mean-squared Z scores), it provided the impetus to move forward and further refine our ensemble-based consistency-testing strategy.
A second, more recent success for CESM-ECT was the detection of errors in a new version of the Community Ice Code (CICE). In particular, CICE5 replaced CICE4 in the CESM 1.3 series development trunk, and this upgrade was purported to not change the climate. However, when the software tag with CICE5 was tested with CESM-ECT, failures occurred on all of the CESM-supported platforms. Recall that CESM-ECT uses an F compset (see Sect. 4.1), which means that CICE runs in prescribed mode. Prescribed mode is intended for atmospheric experiments and uses the thermodynamics in the sea ice model (the dynamics are deactivated) with a pre-specified ice distribution. The CESM-ECT failures for the new development tag raised a red flag that resulted in the detection and correction of a number of errors and necessary tuning parameter changes in the CICE5 prescribed mode. Pre-integration component-level testing for stand-alone CICE, however, allowed errors to go undetected in prescribed mode until run with CESM-ECT. Table 4 lists the results of CESM-ECT for three test scenarios on Yellowstone (Intel, GNU, and PGI compilers) with CICE5 and CICE4, showing that the difference was quite significant.
Finally, CESM-ECT has been essential in the evaluation of lossy compression schemes for CESM climate data. Lossy compression schemes result in data loss when the compressed data are reconstructed (i.e., uncompressed). Evaluating the impact of the loss in precision and/or accuracy in the reconstructed data is critical to the adoption of lossy compression methods in the climate modeling community. In particular, we advocate for compression levels that result in reconstructed data that are not statistically distinguishable from the original data. The CESM ensemble-consistency methodology has been invaluable in making this determination (e.g., Baker et al., 2014).

Conclusions and future work
Software quality assurance is critical for building (and retaining) confidence in widely used scientific codes such as the Community Earth System Model. The size of the code, diversity of both the user and developer base, societal impact, and near-constant state of development for CESM require a verification technique that is easy to use and has minimal computational requirements. Further, the increasing difficulty in achieving BFB identical results due to differences across hardware and software environments dictates that a verification tool determine acceptable error tolerances. This manuscript presents an ensemble-based consistency test that evaluates whether a new CESM configuration (e.g., resulting from a code modification, compiler change, or new hardware platform) is consistent with the original "accepted" (or control) configuration. The original configuration is represented by an ensemble that captures the natural variability in the modeled climate system. CESM-ECT has already been effectively incorporated into the CESM software development workflow. Our many experiments and its successes in practice have increased our confidence in this methodology for detecting and reducing errors in CESM. Furthermore, the utility of CESM-ECT in a number of scenarios has become apparent:
- port-verification (new CESM-supported machines);
- quality assurance for software release tags;
- exploration of new algorithms, solvers, compiler options;
- feedback for model developers;
- detection of errors in the software or hardware environment; and
- assessment of the effects of lossy data compression.
Despite our successes with this new consistency-testing methodology, the natural variability present in the climate system makes the detection of subtle errors in CESM challenging. While no verification tool can be absolutely correct, we consider CESM-ECT in its current form to be preliminary work, as many avenues remain to be explored. We are currently conducting a more detailed analysis of large ensembles from different compilers and machines in an attempt to better characterize the effects of those types of perturbations.
We have also begun to evaluate spatial patterns in addition to global (spatial) means, as these patterns may be revealing in such contexts as boundaries between ocean and land, and less chaotic systems like the coarse-resolution ocean. In addition, we are interested in other important climate statistics like extremes. Finally, we intend to evaluate relationships between variables in cross-covariance studies.
- $N_{\mathrm{var}} \times N_{\mathrm{ens}}$ global means
- $N_{\mathrm{var}}$ means of ensemble global mean values ($\mu_{V_{gm}}$)
- $N_{\mathrm{var}}$ standard deviations of ensemble global mean values ($\sigma_{V_{gm}}$)
- $N_{\mathrm{var}} \times N_{\mathrm{var}}$ loadings ($P_{gm}$)
- $N_{\mathrm{var}}$ standard deviations of ensemble global mean scores ($\sigma_{S_{gm}}$),
which are written to the CESM-ECT ensemble summary file. This summary file (in NetCDF format) is generated for each CESM software tag on the Yellowstone machine at NCAR with the default compiler options (more details follow in Sect. 3.5). The distribution of global mean scores from the ensemble, represented by the standard deviations in $\sigma_{S_{gm}}$, can be used to evaluate data from a new simulation. Note that most of the variance in the climate data is now largely represented by a few PCs. In fact, the coefficients on the first PC explain about 21 % of the variance and the coefficients on the second explain about 17 % of the variance, as shown in Fig. 1.
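The quantities listed above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the pyCECT implementation: the function names `ensemble_summary` and `new_run_scores` are hypothetical, the PCA is computed here by a singular value decomposition of the standardized global means, and the input is assumed to contain no zero-variance variables (such variables are excluded, as noted in Sect. 4.1). Writing the results to a NetCDF summary file is omitted.

```python
import numpy as np

def ensemble_summary(global_means):
    """Summarize an ensemble of global means of shape (n_var, n_ens)."""
    gm = np.asarray(global_means, dtype=float)
    mu = gm.mean(axis=1)              # per-variable ensemble mean
    sigma = gm.std(axis=1, ddof=1)    # per-variable ensemble standard deviation
    # Standardize each variable so that all variables are weighted comparably.
    standardized = (gm - mu[:, None]) / sigma[:, None]
    # PCA via SVD: the columns of `u` are the loadings (principal directions).
    u, s, _ = np.linalg.svd(standardized, full_matrices=False)
    loadings = u
    # Project every ensemble member onto the loadings to obtain PC scores.
    scores = loadings.T @ standardized
    sigma_scores = scores.std(axis=1, ddof=1)
    # Fraction of total variance explained by each PC.
    explained = s**2 / np.sum(s**2)
    return mu, sigma, loadings, sigma_scores, explained

def new_run_scores(new_global_means, mu, sigma, loadings):
    """Standardize a new run with the ensemble statistics and project it."""
    z = (np.asarray(new_global_means, dtype=float) - mu) / sigma
    return loadings.T @ z
```

The scores of a new run returned by `new_run_scores` can then be compared with the ensemble score distribution, for example by expressing each score in units of the corresponding entry of `sigma_scores`. With more ensemble members than variables, as in the 151-member ensembles used here, `loadings` is a square matrix with one column per variable.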

Figure 1. Percentage of variability explained for global mean by component scores.

Figure 3. Additional CESM-ECT results on Bluewaters, comparing against four different ensemble distributions. Bars extending above the dashed line indicate an overall failure.

Table 1. CESM modifications expected to change the climate.

Table 2. CESM modifications expected to produce the same climate.

Table 3. CESM modifications with unknown outcomes.

Table 4. CESM development tag with two versions of the CICE component run with different compilers on Yellowstone.