Best practice regarding the three P's: profiling, portability and provenance when running HPC geoscientific applications

Abstract

Geoscientific modeling is constantly evolving, with next-generation geoscientific models and applications placing high demands on high performance computing (HPC) resources. These demands are being met by new developments in HPC architectures, software libraries, and infrastructures. New HPC developments require new programming paradigms, leading to substantial investment in model porting, tuning, and refactoring of complicated legacy code in order to use these resources effectively. In addition to the challenge of new massively parallel HPC systems, reproducibility of simulation and analysis results is of great concern, as next-generation geoscientific models are based on complex model implementations and profiling, modeling, and data processing workflows. Thus, in order to reduce both the duration and the cost of code migration and to aid in the development of new models or model components, while ensuring reproducibility and sustainability over the complete data life cycle, a streamlined approach to profiling, porting, and provenance tracking is necessary. To address these issues, we propose a run control framework (RCF) integrated with a workflow engine which encompasses all stages of the modeling chain: 1. preprocess input, 2. compilation of code (including code instrumentation with performance analysis tools), 3. simulation run, 4. postprocess and analysis. Within this RCF, the workflow engine is used to create and manage benchmark or simulation parameter combinations and performs the documentation and data organization for reproducibility. This approach automates the process of porting and tuning, profiling, testing, and running a geoscientific model. We show that, using our run control framework, testing, benchmarking, profiling, and running models is less time consuming and more robust, resulting in more efficient use of HPC resources, more strategic code development, and enhanced data integrity and reproducibility.


Introduction
Geoscientific modeling is constantly evolving, leading to higher demands on HPC resources. We distinguish four main developments which increase HPC demands, among them (i) higher spatial resolution, where the added value inherent to simulations at high spatial resolutions has been shown, for example, in many studies of regional convection-permitting climate simulations (e.g., Kendon et al., 2017; Prein et al., 2015; Eyring et al., 2015; Heinzeller et al., 2016) and continental hyper-resolution hydrological modeling.

In this article, we present a streamlined run control framework (RCF) for porting, profiling, and documenting legacy code, using the script-based benchmarking framework JUBE (Lührs et al., 2016) as a workflow engine. We developed profiling, run control, and testing frameworks which are dynamically built with user input and run using JUBE. While the use case for this portable run control, profiling, and testing workflow engine system discussed in this paper is the software application ParFlow, an integrated parallel watershed model, the RCF is generic and can be applied to any other simulation software. In the remainder of this paper we outline this approach and highlight the subsequent developments scheduled for implementation, born out of the extensive profiling of ParFlow. Additionally, we highlight other uses of a workflow engine which have enabled us to streamline the run control process for model runs.
RCF approach to porting, profiling, and provenance tracking

In this section, the run control framework, which could be described as a run harness, is introduced along with JUBE, where a harness in this case describes the framework of scripts and other supporting tools that are required to execute a workflow. This is followed by a description of the standard profiling toolset which is currently built into our RCF to aid in ParFlow hydrological model development and production simulation run control. The numerical experiment test case, which is used as a baseline for the profiling and model improvement, is presented alongside the hardware characteristics of the supercomputers used in this study. The RCF for the case study is given in Appendix B.

JUBE as a workflow engine
Benchmarking scientific code can assess the impact of changes in the underlying HPC software stack (e.g., compiler or library upgrades) and hardware (e.g., interconnect upgrades), aid in testing as part of software engineering and code refactoring, and help find optimal numerical model configurations. Benchmarking a numerical model system usually involves several runs with different configurations (compiler, domain, physics parameterizations, solver settings, load balancing), including compilation, instrumentation (i.e., the injection of special monitoring "hooks" into the program to enable profiling and/or event tracing), various simulations, profiling, result verification, and analysis. However, with increasing model and HPC environment complexity, the parameter space for benchmarking can be large. To avoid errors in managing benchmark parameter combinations, to reduce the overall temporal effort, and to ensure reproducibility and comparability, benchmarking must be automated. This task can be accomplished using a workflow engine, which is an application for workflow automation, like the JUBE benchmarking environment (Lührs et al., 2016).
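The combinatorial growth of such configurations is what makes automation essential. A minimal sketch of the kind of parameter sweep a workflow engine like JUBE manages is given below; the parameter names and values are illustrative examples, not JUBE's actual configuration schema:

```python
from itertools import product

# Illustrative benchmark parameter space (hypothetical names and values).
parameter_space = {
    "compiler": ["gcc", "xlc"],
    "optimization": ["-O2", "-O3 -qhot"],
    "mpi_tasks": [1024, 4096, 16384],
    "solver_tolerance": [1e-5, 1e-6],
}

def expand_parameter_space(space):
    """Return every parameter combination as a list of dictionaries,
    one per benchmark run -- the cartesian product a workflow engine
    iterates over automatically."""
    keys = list(space)
    return [dict(zip(keys, values))
            for values in product(*(space[k] for k in keys))]

runs = expand_parameter_space(parameter_space)
print(len(runs))  # 2 * 2 * 3 * 2 = 24 distinct benchmark configurations
```

Even this small example yields 24 runs to compile, submit, verify, and document, which quickly becomes error-prone when done by hand.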
JUBE is a script-based framework designed to efficiently and systematically define, set up, run, and analyze benchmarks and production simulations. The current JUBE v2.1.4 is a Python-based implementation released under GNU GPLv3, actively developed at the Jülich Supercomputing Centre (JSC, https://www.fz-juelich.de/ias/jsc/EN). JUBE allows users to easily define benchmark sets in which the workflow and parameter sweeps are specified by an XML configuration file. JUBE controls the automatic execution of the designed workflow and takes care of the underlying file structure to allow an individual execution per run.
Automatic bookkeeping separates the different runs and parameter combinations and allows reproducible executions. To generate an overview of the overall workflow execution, the user can configure how JUBE analyzes the different output files to extract information such as the overall runtime or other application-specific data. This allows the system to create a combined overview of the underlying parametrization and the application outputs.
Leveraging the generic JUBE framework, we developed a run control framework (RCF), suitable for a typical geoscience model, from a series of XML files integrated with Python scripts to be executed with JUBE (see Figure 1 and Appendix B).
These jobs are usually run with the following modeling chain: 1. preprocess input, 2. compilation of code (including code instrumentation with profiling tools), 3. simulation run, 4. postprocess and analysis. The current run control framework is under version control and can be cloned from GitLab (Hethey, 2013). Details about the individual steps in this modeling chain are outlined in the following subsections.

Code portability
Within the RCF, we have separated all the information pertinent to the existing compilers, required environment modules, and workload manager job submission specifics for a given system into a single XML file (see Figure 1, platform.xml, and Appendix B2.2). This platform.xml file can easily be extended to include any new system. When compilers, environment modules, and workload managers are updated, or new features or functionality are added, the platform.xml can easily be altered to include these updates. For example, as new C and Fortran compiler versions are released with improved code generation and potentially new optimization flags, it is useful to reassess which compilation flags give the best run time without compromising the accuracy of the result. In the case of our RCF, this is as simple as altering the compiler flags parameter (see Appendix B2.2). To ensure accuracy is not compromised, we built a result comparison test into the postprocessing and analysis step to compare the result with previously generated output.
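The idea behind platform.xml, i.e., keeping every system-specific setting in one place, can be sketched as a lookup table. The entries below are hypothetical examples (the JUQUEEN flags echo Table 1, but the table structure and second platform are invented for illustration), not the actual file contents:

```python
# Illustrative platform table in the spirit of platform.xml: all
# system-specific build and submission settings live in one structure.
PLATFORMS = {
    "juqueen": {
        "cc": "bgxlc_r",
        "cflags": "-O3 -qhot -qarch=qp -qtune=qp",
        "modules": ["scorep", "darshan"],
        "submit": "llsubmit",
    },
    "generic-x86": {
        "cc": "mpicc",
        "cflags": "-O2",
        "modules": [],
        "submit": "sbatch",
    },
}

def compile_command(platform, source="parflow.c"):
    """Assemble the compile line for a given platform entry."""
    p = PLATFORMS[platform]
    return f"{p['cc']} {p['cflags']} {source}"

print(compile_command("juqueen"))
```

Porting to a new machine then means adding one entry, and a compiler upgrade means editing one string, while the rest of the workflow stays untouched.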

Code profiling
In order to analyze ParFlow's runtime behavior, determine optimal runtime settings, and identify performance bottlenecks during model development, we use several complementary performance analysis tools. Setup, compilation wrappers, and profiling analysis steps were built into our RCF (Figure 1: profiler.xml, Appendix A) with support for the following tools: Score-P v3.1 (Knüpfer et al., 2012) and Scalasca v2.3.1 (Geimer et al., 2010; Zhukov et al., 2015), where results collected with Score-P and Scalasca can be examined using the interactive analysis report explorer Cube v4.3.5 (Saviankou et al., 2015); Allinea Performance Reports v7.0.4 (January et al., 2015); Extrae v3.4.3 (Alonso et al., 2012); Paraver v4.6.3 (Labarta et al., 2006); Intel Vectorization Advisor 2015 (Rane et al., 2015); and Darshan v3.0.0 (Carns et al., 2011) (see Table A1 in Appendix A for a more detailed description of each performance analysis tool listed above). The modeling chain for the profiling workflow is as follows: 1. prepare the input data; 2. load environment modules and set up performance analysis tool specific parameters; 3. compile or link ParFlow using scripts and wrappers, depending on what is required by the profiling tool; e.g., Score-P requires compilation and linking using wrappers, whereas Darshan requires only linking after compilation; 4. execute ParFlow with the necessary tool flags (e.g., Scalasca has various runtime measurement, collection, and analysis flags which can be turned on or off); 5. parse and analyze the results interactively (e.g., using interactive visual explorers like Paraver, Cube, etc.) or generate a performance metric report via a postprocessing step.
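The tool-dependent build step (step 3 above) can be sketched as a small dispatch: Score-P wraps both compilation and linking with its compiler wrapper, while Darshan only intervenes at link time. The dispatch logic, file names, and the Darshan link flag shown here are illustrative stand-ins for the RCF scripts, not the tools' actual integration mechanics:

```python
# Sketch of tool-dependent build wrapping. "scorep" as a compiler
# prefix is Score-P's real wrapper style; the "-ldarshan" flag is an
# illustrative placeholder for Darshan's link-time integration.
def build_commands(tool, cc="mpicc", objs="parflow.o", exe="parflow"):
    if tool == "scorep":
        # Score-P wraps both the compile and the link step.
        return [f"scorep {cc} -c parflow.c",
                f"scorep {cc} {objs} -o {exe}"]
    if tool == "darshan":
        # Darshan only needs to be brought in after a normal compilation.
        return [f"{cc} -c parflow.c",
                f"{cc} {objs} -o {exe} -ldarshan"]
    # Uninstrumented reference build.
    return [f"{cc} -c parflow.c", f"{cc} {objs} -o {exe}"]

print(build_commands("darshan")[1])
```

Encoding this choice once in the run harness means the user selects a tool name and never touches the build lines by hand.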
Note that code instrumentation with performance analysis tools, that is, the insertion of tool-specific measurement calls into the application code which are executed at relevant points (events) during runtime, can introduce significant overhead, which can be assessed by comparing against an uninstrumented reference run. If the runtime of the instrumented version of the code under inspection is much longer than the reference run (more than 10-15%), it is recommended to reduce the instrumentation overhead, as the measurement may no longer reflect the actual runtime behavior of the application. Typical measures to reduce the runtime overhead include turning off automatic compiler instrumentation, filtering out short but frequently called functions, and applying manual instrumentation using specific APIs provided by the tools.
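The overhead check described above is a simple ratio; a minimal sketch (the timings are hypothetical):

```python
def instrumentation_overhead(t_reference, t_instrumented):
    """Relative runtime overhead of the instrumented binary versus
    the uninstrumented reference run."""
    return (t_instrumented - t_reference) / t_reference

def needs_filtering(t_reference, t_instrumented, threshold=0.15):
    """Apply the 10-15% rule of thumb from the text: above the
    threshold, instrumentation should be reduced (e.g., by filtering
    short, frequently called functions)."""
    return instrumentation_overhead(t_reference, t_instrumented) > threshold

print(needs_filtering(100.0, 112.0))  # 12% overhead: measurement acceptable
print(needs_filtering(100.0, 140.0))  # 40% overhead: reduce instrumentation
```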
A typical workflow when performing an initial "health examination" on a scientific code can be described as follows:

1. "Which function(s) or code region(s) in my program consume(s) the most wallclock time?" This question can usually be answered using a flat profile, which breaks down the application code into separate functions or manually annotated regions. This ascertains the area(s) of interest in order to streamline performance analysis efforts.

Typical diagnosis tools: Allinea Performance Reports, Score-P + Cube, Extrae + Paraver.

2. "Does my application scale as expected?" Typically all scientific applications aim to perform well at scale. To address this question, profiles need to be collected with varying numbers of processes, and the scalability of functions/code regions within the areas of interest can be examined.
There are two types of scaling: strong and weak. In the case of strong scaling, the overall problem size (workload) stays fixed while the number of processes increases. Here, the runtime is expected to decrease with an increasing number of processes. By contrast, in the case of weak scaling, the workload assigned to each process remains constant with an increase in processes and, thus, the runtime is ideally expected to remain constant as well.
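These two definitions translate directly into efficiency formulas; a short sketch with hypothetical timings:

```python
def strong_scaling_efficiency(t1, tn, n):
    """Strong scaling: ideal runtime drops as 1/n, so
    efficiency = t1 / (n * tn)."""
    return t1 / (n * tn)

def weak_scaling_efficiency(t1, tn):
    """Weak scaling: ideal runtime stays constant, so
    efficiency = t1 / tn."""
    return t1 / tn

# Hypothetical timings for illustration.
print(round(strong_scaling_efficiency(t1=100.0, tn=30.0, n=4), 2))  # 0.83
print(round(weak_scaling_efficiency(t1=100.0, tn=125.0), 2))        # 0.8
```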
Typical diagnosis tools: Allinea Performance Reports, Score-P + Cube, Extrae + Paraver.

3. "Does my program suffer from load imbalance?" If this is the case, some processes will be performing significantly more or less work than the others. Load balance is an indication of how well the load is distributed across processors. If a code is not well balanced, HPC resources will be used inefficiently, as imbalances usually materialize as wait states in communication/synchronization operations between processes/threads. Thus, this may be an area in which to concentrate code refactoring efforts.

Note that load imbalance can either be static or dynamic. While the former can usually be easily identified in profiles, pinpointing the latter may require more heavyweight measurement and analysis techniques such as event tracing, as imbalances may cancel each other out in aggregated profile data.
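A simple static load-imbalance indicator can be computed from per-process computation times as one would read them out of a profile; the timings below are hypothetical:

```python
def load_balance(times):
    """Ratio of average to maximum computation time across processes:
    1.0 means perfectly balanced; the smaller the value, the more time
    is lost in wait states."""
    return sum(times) / len(times) / max(times)

balanced   = [10.0, 10.1, 9.9, 10.0]
imbalanced = [10.0, 10.0, 10.0, 20.0]  # one straggler process
print(round(load_balance(balanced), 2))
print(round(load_balance(imbalanced), 2))
```

In the second case a quarter of the processes do double the work, and the others spend the difference waiting, which is exactly the kind of pattern a profile breakdown exposes.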
Typical diagnosis tools: Score-P + Cube, Score-P + Scalasca + Cube, Extrae + Paraver.

4. "Is there a disproportionate time spent in communication or synchronization?" Communication and synchronization overheads can be caused by network latencies (e.g., due to an inefficient process placement onto the compute resources), or wait states and other inefficiency patterns (e.g., caused by load or communication imbalances). If these overheads rise significantly with an increase in resources, this can be a further barrier to scalability.

Typical diagnosis tools: Allinea Performance Reports, Score-P + Scalasca + Cube, Extrae + Paraver.

5. "Is my application limited by resource bounds?" There are several bounds one can reach, such as (a) CPU bound, i.e., the rate at which processes operate is limited by the speed of the CPU. For example, a tight loop that can be vectorized and operates only on a few values that can be held in CPU registers is likely to be CPU bound. (b) Cache bound, i.e., the simulation is limited by the amount and the speed of the cache available. For example, a kernel operating on more data than can be held in registers but which fits into cache is likely to be cache bound. (c) Memory bound, i.e., the simulation is limited by the amount of memory available and/or the memory access bandwidth. For example, a kernel operating on more data than fits into cache is likely to be memory bound. (d) I/O bound, i.e., the simulation is limited by the speed of the I/O subsystem. For example, counting the number of lines in a file is likely to be I/O bound.

Typical diagnosis tools: Score-P + PAPI + Cube, Extrae + Paraver, Intel Vectorization Advisor, Darshan.

6. There are additional questions one can add to the survey, for example: "How many pipeline stalls, cache misses, and mis-predicted branches are occurring?", "What is the Instructions Per Cycle (IPC) ratio?", etc.
The typical diagnosis tools mentioned in the above section are implemented in the RCF for various HPC platforms (see Appendix A and B for more details).
The results shown in this study are obtained using the diagnosis tools available on the JUQUEEN platform, namely: 1. Score-P profile measurements, including hardware performance counters collected via PAPI (Moore et al.); 2. Score-P trace measurements followed by a subsequent automatic Scalasca trace analysis; 3. manual analysis of measurements from the abovementioned steps with an interactive visual browser, i.e., Cube. Additionally, we used Darshan for I/O profiling. As Allinea Performance Reports and Intel Vectorization Advisor are not available for the Blue Gene/Q platform, it was not possible to use them in the analysis. However, where available, it is recommended to use them, as they can provide additional valuable insight into the potential bottlenecks of the application.
In order to track the health of ParFlow with increasing simulation time, we developed an automated performance metric extraction workflow to obtain key performance indicators such as MPI wait time, memory footprint, cache intensity, etc. in order to quickly assess whether new developments or additions to the code improve or degrade the overall performance.An example of such a workflow output is given in Figure 2.
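Such a metric extraction step amounts to a small parser over the profiling output; the report format, metric names, and values below are hypothetical stand-ins for the tool output the actual workflow parses:

```python
import re

# Hypothetical plain-text performance report, mimicking the kind of
# key/value metrics shown in Figure 2.
sample_report = """
time[s]              = 512.4
time_mpi[s]          = 97.1
memory_footprint[kB] = 97280
IPC                  = 0.31
"""

def extract_metrics(text):
    """Collect 'name = number' pairs from a textual report into a
    dictionary of key performance indicators."""
    metrics = {}
    for key, value in re.findall(r"^\s*(\S+)\s*=\s*([0-9.]+)", text, re.M):
        metrics[key] = float(value)
    return metrics

kpis = extract_metrics(sample_report)
print(kpis["IPC"])  # 0.31
```

Running such an extraction after every simulation lets a declining IPC or a growing MPI share be spotted as soon as a code change introduces it.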

Provenance tracking
JUBE has many provenance tracking features and tools. JUBE automatically stores the benchmark suite data for each workflow execution, which can be parsed by JUBE's analysis tools. Workflow metadata is automatically parsed by JUBE and then compiled into a report detailing the run and the settings used for each suite. Subsequent analysis procedures can be predefined, added, or altered by the user after the experiment to automate data processing. These features and tools are designed to facilitate documentation and archiving. Additionally, JUBE's workflow execution directory structure allows for run time provenance tracking. JUBE's workflow management system automatically creates a suite of the parameter sets and steps for each workflow. JUBE then creates a unique execution unit, or work package, for a specific step and parameter combination.
Each workflow execution has its own directory named by a unique numeric identifier, which is incremented for subsequent runs. Inside this directory, JUBE handles the workflow execution's metadata and creates a directory for each separate work package.

We added extra provenance tracking features to ParFlow simulation runs, including converting unannotated ParFlow binary model simulation output to a more portable NetCDF output containing standardized metadata enrichment for data provenance tracking, using the CMOR and CF standards, and incorporating all ParFlow model settings (Eaton et al., 2003). This was developed in accordance with state-of-the-art data lifecycle management and to maintain interoperability (Stodden et al., 2016).
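The metadata enrichment step can be sketched as attaching CF-style attributes, plus the full model configuration, to each output variable. The attribute names follow the CF conventions, but the values, the variable name, and the in-memory representation are illustrative (a real implementation would write them through a NetCDF library such as netCDF4-python):

```python
# Sketch of CF-style metadata enrichment for provenance tracking when
# converting raw binary output to NetCDF. Values are illustrative.
def cf_enrich(standard_name, units, model_settings):
    """Build the attribute dictionary for one output variable,
    embedding the complete model configuration for provenance."""
    return {
        "Conventions": "CF-1.6",
        "standard_name": standard_name,
        "units": units,
        # Full model settings travel with the data.
        "parflow_settings": dict(model_settings),
    }

attrs = cf_enrich("soil_moisture_content", "kg m-2",
                  {"solver": "Richards", "dx": 1.0, "dz": 0.5})
print(attrs["units"])
```

Carrying the settings inside the file means the output remains self-describing even when separated from its run directory.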
In addition, the postprocess and analysis step we developed contains an archive process at the end of the modeling chain, which documents and collates the model input, model simulation scripts, model submission scripts, postprocessed output, and application code in such a fashion that the archived output can be downloaded and rerun following the instructions in the simulation documentation, without need for any additional input, a practice recommended by Hutton et al. (2016).

Test case experimental design
In order to demonstrate the applicability of the RCF, a weak scaling demonstration study with an idealized overland flow test case was set up for ParFlow (Maxwell et al., 2015). ParFlow (v3.2, https://github.com/parflow) is a massively parallel, physics-based integrated watershed model, which simulates fully coupled, dynamic 2D/3D hydrological, groundwater, and land-surface processes suitable for large scale problems. ParFlow is used extensively in research on the water cycle in idealized and real data setups as part of process studies, forecasts, data assimilation experiments, and hind-casts, as well as regional climate change studies from the plot scale to the continent, ranging from days to years. Saturated and variably saturated subsurface flow in heterogeneous porous media is simulated in three spatial dimensions using a Newton-Krylov nonlinear solver (Ashby and Falgout, 1996; Jones and Woodward, 2001; Maxwell, 2013) and multigrid preconditioners, where the three-dimensional Richards equation is discretized based on cell-centered finite differences. ParFlow also features coupled surface-subsurface flow, which allows for hillslope runoff and channel routing (Kollet and Maxwell, 2006). Because it is fully coupled to the Common Land Model (CLM), a land surface model, ParFlow can incorporate exchange processes at the land surface including the effects of vegetation (Maxwell and Miller, 2005; Kollet and Maxwell, 2008). Other features include a parallel data assimilation scheme using the Parallel Data Assimilation Framework (PDAF) from Nerger and Hiller (2013), with an ensemble Kalman filter, allowing observations to be ingested into the model to improve forecasts (Kurtz et al., 2016). An octree space partitioning algorithm is used to depict complex structures in three-dimensional space, such as topography, different hydrologic facies, and watershed boundaries. ParFlow parallel I/O is via task-local and shared files in a binary format for each time step. ParFlow is also part of fully
coupled model systems such as the Terrestrial Systems Modeling Platform (TerrSysMP) (Shrestha et al., 2014) or PF.WRF (Maxwell et al., 2011), which can reproduce the water cycle from deep aquifers into the atmosphere.
A three-dimensional sinusoidal topography as shown in Figure 3 was used as the computational domain, with a lateral spatial discretization of ∆x = ∆y = 1 m and a vertical grid spacing of ∆z = 0.5 m; the grid size, n, was set to nx = ny = 50 and nz = 40, resulting in 100,000 unknowns per CPU core, with one MPI task per core. In order to simulate surface runoff from the high to the low topographic regions with subsequent water pooling and infiltration, a constant precipitation flux of 10 mm/hour was applied. The water table was implemented as a constant head boundary condition at the bottom of the domain.

In this section, we present the different steps and results of the benchmarking and profiling study, which demonstrate the usefulness of our RCF. For this study we use the highly scalable IBM Blue Gene/Q system JUQUEEN. JUQUEEN features a total of 458 752 cores from 1 024 PowerPC A2 16-core, four-way simultaneous multithreading CPUs in each of the 28 racks and a total of 448 TB main memory, with a Linpack performance of 5.0 Petaflops. ParFlow runs under Linux microkernels on compute nodes using IBM XL compilers, a proprietary MPI library, and a GPFS filesystem. See Appendix A for an overview of the RCF used for this case study, including excerpts of the XML files used.

Porting ParFlow to JUQUEEN
When porting ParFlow to JUQUEEN, we found the optimal commonly used compilation flags for ParFlow with the IBM XL C compiler (v12.1), which did not compromise accuracy, to be -O3 -qhot -qarch=qp -qtune=qp (see Table 1).
Accuracy was determined to six significant figures. This is in agreement with the results with the fully coupled TerrSysMP.

Table 1. Time taken to run the ParFlow test case on JUQUEEN (IBM Blue Gene/Q) with 16 MPI processes using three different commonly used compiler flag optimizations.

Profiling results
The following section describes the results from the demonstration scaling study, following the code performance "health check" protocol given in section 2.1.2.

Analysis of time spent in ParFlow functions
As a first step, a breakdown of the time spent in each annotated region of ParFlow was obtained via internal timings in ParFlow and a Score-P profile measurement (see Appendix B2.1 for Score-P profiling settings), visualized as a bar chart in Figure 4. From the breakdown it is clear that the core component of ParFlow is the computation of the solution to a system of nonlinear equations, reflected in Figure 4, where most of the wallclock time is spent in the blue regions, which make up the time spent obtaining a solution via a nonlinear solve step. A large part of the nonlinear solver's workflow can be summarized in two steps: the initialization of the problem for the specific input and the actual solver loop. These two steps reside in the KINSol routine, which is a component of the SUNDIALS solver library (Hindmarsh et al., 2005). Therefore, the nonlinear solve routine and its components are the focus of interest for reducing ParFlow's runtime. The nonlinear solve loop performs the computational process of computing an approximate linear solution (KINSpgmrSolve), where the intermediate solution is updated every iteration until the desired convergence or tolerance is reached. The two aforementioned steps within the nonlinear solve loop are manually annotated in the source code and we will focus on these for our results. For simplicity, we have shortened these two steps to setup_solver and solver_loop, respectively, and will refer to them by this nomenclature henceforth.

Scalability
Our scalability analysis of ParFlow is again based on the Score-P profile measurement. Figure 5 shows a plot of the execution time versus the number of MPI processes when running the weak scaling experiment as outlined in Section 2.2, broken down into the two regions of interest: setup_solver and solver_loop. Both regions show an increase of execution time with an increase in the number of processes, though the setup_solver region shows better performance in comparison to the solver_loop. However, the strong scaling efficiency profile is comparable to similar codes (Mills et al., 2007). At 32 768 MPI processes, the weak scaling efficiency, E_ws, which is computed as the ratio of the time to complete a work unit with one process, T_1, to the time to complete N of the same work units with N processes, T_N (Equation 1: E_ws = T_1 / T_N), drops to approximately 21%. In order to characterize parallel applications, Rosas et al. (2014) developed auxiliary efficiency metrics, i.e., load balance efficiency, communication efficiency, and parallel efficiency, which are described in more detail below. Using Cube's derived metric feature (Zhukov et al., 2015), we can derive these efficiency metrics from the Score-P profile data automatically.
Load balance efficiency, E_lb, is defined as the ratio between the average computation time, T_avg, and the maximal computation time, T_max:

E_lb = T_avg / T_max.

Communication efficiency, E_com, is defined as the ratio between T_max and the total execution time, T_tot:

E_com = T_max / T_tot.

Parallel efficiency, E_par, is defined as the product of the load balance and communication efficiencies:

E_par = E_lb * E_com.

Thus, for 32 768 MPI processes, the parallel efficiency drops to a reasonable value of 63% (Table 2). Further inspection of the breakdown between computation and communication (Figure 6) shows that total communication increases considerably at scale, whereas the setup_solver and solver_loop communication times do not contribute much to it. The slight increase in communication time in those two routines could be attributed to MPI_Allreduce in setup_solver and MPI_Waitall in solver_loop by further breaking down the communication routines (see Figure 7); however, the main scalability breaker is outside of those code sections, e.g., in the initialization phase or the preconditioner. This is an example where more in-depth analysis is needed after the initial health examination to clarify which part of the code must be improved.
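The three metrics of Rosas et al. (2014) follow directly from per-process computation times and the total execution time; a short sketch with hypothetical numbers:

```python
def efficiencies(comp_times, t_total):
    """Compute load balance, communication, and parallel efficiency
    from per-process computation times and total execution time."""
    t_avg = sum(comp_times) / len(comp_times)
    t_max = max(comp_times)
    e_lb = t_avg / t_max       # load balance efficiency
    e_com = t_max / t_total    # communication efficiency
    return e_lb, e_com, e_lb * e_com  # parallel efficiency

# Hypothetical per-process computation times and total runtime.
e_lb, e_com, e_par = efficiencies([80.0, 90.0, 100.0, 90.0], t_total=120.0)
print(round(e_par, 2))  # 0.75
```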

Communication
Time spent in communication grows with an increasing number of processes (see Figures 6 and 7). For example, the communication time constitutes 37% of the total time when running the test case with 32 768 MPI processes. The main contributors to communication time within the regions of interest are MPI_Allreduce in setup_solver and MPI_Waitall in solver_loop. However, the main communication problem is outside of setup_solver and solver_loop, in the initialization phase. A trace analysis using Scalasca identified a costly wait-state pattern, constituting 23% of total time in the initialization phase, occurring in MPI_Allreduce in the preconditioner of the HYPRE v2.10.1 library.
Serial performance

A Score-P profile measurement with hardware performance counters was used to inspect serial performance. To describe serial performance, we use the instructions per cycle (IPC) metric, i.e., the ratio of the total instructions executed to the total number of cycles. The serial performance for the test case (Section 2.2) with 1 024 MPI processes shows lower than ideal values of IPC. For example, solver_loop has an IPC value of 0.31 out of 2. Potential reasons for low IPC are pipeline stalls, cache misses, and mis-predicted branches (John and Rubio, 2008). Therefore, additional measurements with hardware counters were collected. Analysis of these counters shows that a significant number of cache misses in L1, stalls, and mis-predicted branches occur in the following routines: RichardsJacobianEval, PhaseRelPerm, Saturation, and NlFunctionEval. Since JUQUEEN is based on an in-order instruction execution model, meaning instructions are fetched, executed, and committed in compiler-generated order, in case of an instruction stall all ensuing instructions will stall as well. Branching on JUQUEEN is therefore very expensive and can cause pipeline stalls. Thus, the aforementioned routines may account for the low IPC values.

Memory
Using the idealized weak scaling test case described in Section 2.2, Score-P was used to track memory usage of the test case on JUQUEEN. Due to the idealized behaviour of the test case, all MPI ranks needed roughly the same amount of memory. For example, at 1 024 processes each rank needed roughly 95 MB, and at 32 768 processes roughly 325 MB. Memory usage per MPI rank increases with scale because the entire grid information is redundantly stored on each MPI rank. This becomes a scalability breaker for ParFlow, as we can see from Figure 8; there is a point at which the memory required will eclipse the memory available (at around 64 000 cores for this test case). In ParFlow, the most memory consuming routines are GetGridNeighbors, PFMGInitInstanceXtra, KinSolPC, and AllocateVectorData.
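The scalability breaker can be illustrated with a toy model: a per-rank memory cost that grows with the total rank count, because the global grid layout is replicated on every rank, must eventually exceed the memory available per rank. The coefficients below are illustrative, chosen only so the line passes roughly through the two values quoted above; they are not measured or fitted data:

```python
# Toy model of per-rank memory growing linearly with total rank count
# due to redundantly stored global grid information. Coefficients are
# illustrative, anchored to ~95 MB at 1 024 ranks and ~325 MB at
# 32 768 ranks; the real growth curve may differ.
def memory_per_rank_mb(n_ranks, base=87.6,
                       per_rank=230.0 / (32768 - 1024)):
    return base + per_rank * n_ranks

print(round(memory_per_rank_mb(1024)))   # ~95
print(round(memory_per_rank_mb(32768)))  # ~325
```

Whatever the exact shape of the curve, any per-rank cost that grows with the global rank count places a hard ceiling on scale-out, which is what motivates the AMR work described in the outcomes section.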

Reproducibility
All simulation runs in the scaling study are separated into different subdirectories, one for each simulation run. Each subdirectory includes the XML scripts used by JUBE, the job submission scripts, the job logs, the model scripts, and the postprocessing analysis, such that the model can be rerun without using any other external tools or files. After the simulation is run and the postprocessing step has been executed, the directory is automatically archived for long term storage.

Outcomes of profiling case study
The detailed profiling work illuminated the main bottlenecks to scalability: memory use, time spent in communication, and time spent in acquiring the solution for each time step. In addition, for most geoscientific software including ParFlow, big data is also a barrier to scalability, with costly time spent converting data from the endian-sensitive binary format to the highly portable NetCDF format, along with time spent in data analysis where raw model output is processed to produce useful field variables. To tackle these barriers, the following development tasks have been initiated (Figure 9). To reduce the memory used, to reduce the time spent in communication, and to address load imbalances due to inactive regions in model domains, Adaptive Mesh Refinement (AMR) is currently being implemented in ParFlow. As it stands, all cells store information about every other cell. This is reduced to neighboring cells under AMR, which results in a decrease in memory use and a decrease in time spent in communication (Burstedde et al., 2017). To improve solution time and to enable the use of heterogeneous architectures, different accelerator-enabled numerical libraries are being tested using our run harness, for a simplified version of ParFlow, across different HPC architectures. To reduce time spent in preprocessing model input and postprocessing model output, a NetCDF reader and writer is under development. To reduce the amount of output needed to be written to disk, in-situ visualization is under development, where raw output can be processed on-the-fly into useful field variables. There is still room for improvement with regard to serial performance. However, to tackle this problem effectively, more in-depth profiling is required before significant refactoring can occur. Naturally, we will use our run harness to validate the effectiveness of these new developments and tuning efforts.

Conclusions
Adapting to new developments in HPC architectures, software libraries, and infrastructures while ensuring reproducibility of simulation and analysis results has become challenging in the field of geoscience.Next generation massively parallel HPC systems require new coding paradigms and next generation geoscientific models are based on complex model implementations, and profiling, modeling and data processing workflows.Thus there is a strong need for a streamlined approach to model simulation runs, including profiling, porting, and provenance tracking.
In this article, we presented our streamlined approach, using JUBE as a workflow engine, so that a workflow can be executed and reproduced in a formalized way. We chose JUBE as it is lightweight and written in Python, and thus is easily portable to any supercomputer platform. Implementing a run control framework using a workflow engine for the complete modeling chain, consisting of preprocessing, simulation run, and postprocessing, leads to code that can be ported easily and tuned to any platform and combination of compilers, including dependencies. Each simulation is automatically documented, accounting for provenance tracking, which leads to better supplementary code sharing and ultimately reproducibility. The relevant profiling toolset can be applied on any platform, leading to identification of bottlenecks, code tuning, refactoring, and ultimately more

B2 templates
Template scripts are used in our run harness to capture default settings for HPC platform and model related parameters. Specific default settings can be overwritten when specified in the custom job XML script. In the case study, the template scripts used contain HPC platform related default settings for profiling tools, compilers, and job submission. In addition, there are templates for each type of ParFlow model, where the idealized overland flow model is called terrain.sine (see Section 2.2), and for the type of scaling typically used in a benchmark suite, i.e., weak scaling and strong scaling.

B2.1 templates/profiler.xml:scorep
This template XML script sets the default settings for each profiling tool; the following XML excerpt depicts the settings for the profiling tool Score-P on JUQUEEN.
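The original excerpt is not reproduced here; the following is a minimal sketch of what such a profiler template might look like, assuming JUBE 2's XML syntax. The parameterset and parameter names are illustrative, and the Score-P measurement variables shown are examples rather than the exact values used in the case study.

```xml
<!-- Hypothetical sketch of a profiler template: default Score-P
     settings expressed as a JUBE parameterset. -->
<parameterset name="scorep_pset">
  <!-- Wrapper prepended to the compiler command for instrumentation -->
  <parameter name="instrument">scorep --mpp=mpi</parameter>
  <!-- Score-P measurement settings, exported in the job script -->
  <parameter name="profiler_env">
    export SCOREP_ENABLE_PROFILING=true;
    export SCOREP_ENABLE_TRACING=false;
    export SCOREP_TOTAL_MEMORY=500M
  </parameter>
</parameterset>
```

A custom job XML script can then override individual parameters, e.g. enabling tracing for a selected run, without touching the platform defaults.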

Figure 1. Schematic overview of the modeling chain as supported by our JUBE-based run harness. Each step is annotated with a brief description (top) as well as the respective RCF infrastructure (XML files and scripts, bottom).

Figure 2. Example output from a performance metric extraction workflow, where:
- time[s]: application total wallclock time; considered the reference run
- time_io[s]: average time spent in input/output operations for each rank
- time_mpi[s]: average time spent in MPI for each rank
- mem_vs_comp: memory vs. compute bound (close to 1.0 means strongly compute bound, close to 2.0 means strongly memory bound)
- load_imbalance: ratio of the load imbalance overhead to the critical path duration
- io_volume[MB]: total amount of data in I/O
- io_calls[nb]: total number of I/O calls
- io_throughput[MB/s]: speed of I/O
- avg_io_ac_size[kB]: average amount of data per I/O call
- num_p2p_calls[nb]: average number of point-to-point MPI operations per MPI rank
- p2p_comm_time[s]: average time spent in point-to-point MPI operations per MPI rank
- p2p_message_size[kB]: average size of point-to-point MPI messages per MPI rank
- num_coll_calls[nb]: average number of collective MPI operations per MPI rank
- coll_comm_time[s]: average time spent in collective MPI operations per MPI rank
- delay_mpi[s]: total amount of MPI time spent waiting due to inefficient communication patterns
- delay_mpi_ratio: ratio of waiting time caused by MPI to total time spent in MPI
- time_omp[s]: total time spent in OpenMP parallel regions
- delay_omp[s]: total amount of OpenMP synchronization overhead
- delay_omp_ratio: ratio of synchronization overhead time in OpenMP to total time spent in OpenMP
- memory_footprint[kB]: average memory footprint per MPI rank
- cache_usage_intensity: ratio of total number of cache hits to total number of cache accesses
- IPC: ratio of total instructions executed to total number of cycles
- time_no_vec[s]: wall clock time without compiler vectorization
- vec_eff: ratio of total wall time of the reference run to total wall time without vectorization
- time_no_fma[s]: total wall time with FMA disabled
- fma_eff: ratio of total wall time of the reference run to total wall time without FMA

Figure 3. Model setup, showing the cross-sectional domain and sinusoidal topography variation from the top of the model (z=20) for each processor, following the model system on JUQUEEN in Gasper et al. (2014).

Using the -O3 compilation flag resulted in a speedup of close to a factor of 2 when running with 16 MPI tasks on 16 CPU cores (one compute node on Blue Gene/Q, no multithreading). The timing results were compiled using JUBE's result parser functionality, which was run as a postprocessing step.

IBM XL compiler flags              Time [s]
-O1 -qhot -qarch=qp -qtune=qp      203
-O2 -qhot -qarch=qp -qtune=qp      203

Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2017-242. Manuscript under review for journal Geosci. Model Dev. Discussion started: 27 October 2017. © Author(s) 2017. CC BY 4.0 License.

Figure 4. Time spent in ParFlow functions or routines, where the functions/routines can be divided into four categories: set up, clean up, I/O, and solve. The functions in the category "set up" are depicted in green: SubsrfSim (setting up the domain), Solver setup (initializing the solver). The function in the category "clean up" is depicted in yellow: Solver cleanup (finalizing the solver). The function in the category "I/O" is depicted in orange: PFB I/O (ParFlow binary I/O). The functions in the category "solve" are depicted in blue: Porosity (calculation of the porosity matrix), Geometries (calculation of the simulation domain), MatVec (matrix and vector operations), PFMG (geometric multigrid preconditioner from HYPRE), Solver functions (miscellaneous functions), HYPRE_Copies (copying data within HYPRE), NL_F_Eval (setting up the physics and field variables for the next iteration), PhaseRelPerm (setting up the permeability matrix), KINSol functions (nonlinear solver functions from SUNDIALS).

Figure 5. Execution time versus the number of MPI processes for the regions of interest, running the weak scaling experiment up to 32 768 MPI processes.

Figure 6. Wallclock time for communication versus computation for the regions of interest for the weak scaling experiment, using the test case described in Section 2.2.

Figure 7.

Figure 8. Memory usage of the weak scaling experiment described in Section 2.2 versus the total amount of memory available.

Figure 9. Outcome of the case study: current developments addressing (i) portability and big data (ParFlow with NetCDF4 parallel I/O), (ii) memory bottleneck and scientific load balance (ParFlow with p4est), and (iii) next generation HPC architectures (ParFlow with GPU/MIC-enabled solver).

Table 2. Weak scaling efficiency, load balance efficiency, communication efficiency, and parallel efficiency, running the weak scaling experiment up to 32 768 MPI processes.

Table A1. Description of the performance analysis tools supported by our run harness infrastructure.

Paraver: Paraver is a flexible and configurable performance analysis tool based on traces collected by the Extrae measurement system. It supports time-line views as well as histogram/statistics views on the trace data. https://tools.bsc.es/paraver

PAPI: http://icl.utk.edu/papi