An Approach to Computing Discrete Adjoints for MPI-Parallelized Models Applied to the Ice Sheet System Model

Abstract. Within the framework of sea-level rise projections, there is a strong need for hindcast validation of the evolution of polar ice sheets in a way that tightly matches observational records (from radar, gravity, and altimetry observations mainly). However, the computational requirements for making hindcast reconstructions possible are severe and rely mainly on the evaluation of the adjoint state of transient ice-flow models. Here, we look at the computation of adjoints in the context of the NASA/JPL/UCI Ice Sheet System Model (ISSM), written in C++ and designed for parallel execution with MPI. We present the adaptations required in the way the software is designed and written, but also generic adaptations in the tools facilitating the adjoint computations. We concentrate on the use of operator overloading coupled with the AdjoinableMPI library to achieve the adjoint computation of ISSM. We present a comprehensive approach to 1) carry out type changing through ISSM, hence facilitating operator overloading, 2) bind to external solvers such as MUMPS and GSL-LU, and 3) handle MPI-based parallelism to scale the capability. We demonstrate the success of the approach by computing sensitivities of hindcast metrics such as the misfit to observed records of surface altimetry on the North-East Greenland Ice Stream, or the misfit to observed records of surface velocities on Upernavik Glacier, Central West Greenland. We also provide metrics for the scalability of the approach, and the expected performance. This approach has the potential of enabling a new generation of hindcast-validated projections that make full use of the wealth of datasets currently being collected, or already collected, in Greenland and Antarctica.

1 Introduction

Constant monitoring of polar ice sheets through remote sensing, in particular since the advent of altimeter, radar, and gravity sensors such as ICESat-1, CryoSat, RADARSAT-1, ERS-1 and ERS-2, Envisat, and GRACE, has created a large amount of data that has yet to find its way into Ice Sheet Models (ISMs) and hindcast reconstructions of polar ice sheet evolution. In particular, as evidenced by the wide discrepancy between ISMs involved in the SeaRISE and Ice2Sea projects (Nowicki et al., 2013; Bindschadler et al., 2013), significant improvements in modeled projections of the mass balance of polar ice sheets and their contribution to future sea-level rise have not resulted from the increase in availability of data, but rather from improvements in the type of physics captured in forward models. One reason for this is the lack of data assimilation capabilities embedded in the current generation of state-of-the-art ISMs. In the past 10 years, great strides have been made in improving model initialization by using steady-state model inversions of basal friction (MacAyeal, 1993; Morlighem et al., 2010; Larour et al., 2012; Price et al., 2011; Arthern and Gudmundsson, 2010), ice rheology (Rommelaere and MacAyeal, 1997; Larour et al., 2005; Khazendar et al., 2007) and bedrock elevation (Morlighem et al., 2014), among others. However, these approaches aim at improving our knowledge of poorly constrained input parameters and boundary conditions only as long as the ice-flow regime is captured in a steady-state configuration. These inversions rely on analytically derived adjoint states of forward stress-balance or mass-transport models, but do not extend to transient regimes of ice flow.
Applications to transient models and long temporal time series, such as the ICESat/CryoSat continuous altimetry record from 2003 to present-day, have been much rarer and, to our knowledge, are limited to a few studies such as Heimbach and Bugnion (2009), Goldberg and Sergienko (2011), Goldberg and Heimbach (2013), Larour et al. (2014) and Goldberg et al. (2015), among others. The main issue precluding widespread application of transient data assimilation lies in the difficulty of deriving temporal adjoints of transient models. Indeed, in many cases, a manual derivation of the adjoint state of a forward model is not possible, especially where ice-flow physics are not differentiable. This is the case for example in thermal transient models, where the melting point is a physical constraint (of the threshold type) imposed on temperature. This numerical issue can be mitigated by adopting different approaches, such as: 1) ensemble runs, as in Applegate et al. (2012), where model runs compatible with observations are selected; 2) methods similar to the flux-correction methods implemented in Aschwanden et al. (2013) and Price et al. (2011), where boundary conditions are corrected in order to match time series of observations (tuning approach); 3) quasi-static approaches, where snapshot inversions are carried out in time, as in Habermann et al. (2012, 2013); and 4) sampling methods, which have the main drawback of being computationally very expensive (each sample at the cost of one forward run). Though this is not an exhaustive list of all available methods, the main advantage of adjoint-driven inversions is that they rely on the exact sensitivity of a forward model to its inputs, hence ensuring a physically meaningful inversion.
Understanding sensitivities of a forward ice-flow model, which is needed to physically constrain a temporal inversion, requires computation of derivatives of model outputs with respect to model inputs. If such derivatives are approximated by finite-difference schemes, they are subject to the tradeoff between approximation and truncation errors for the perturbation, which is aggravated for higher-order derivatives. If the derivatives are computed using algorithmic differentiation (AD) (Griewank and Walther, 2008), also known as automatic differentiation, then one can attain derivatives with machine precision, provided the underlying program implementation of the numerical model is amenable to the application of an AD tool. In particular, this approach does not depend on the type of physics relied upon, and it is transparent to the model equations, provided each step of the overall software is differentiable. Indeed, the AD method assumes a numerical model $M$ that is a function $y = f(x)$, $f : \mathbb{R}^n \to \mathbb{R}^m$, implemented as a computer program. The execution of the program implies the execution of a sequence of arithmetic operators such as $+$, $-$, $*$ and intrinsics $\sin$, $e^x$, and so forth, to which the chain rule can be applied. For each such elemental operation $r = \phi(a, b, \ldots)$ with result $r$ and arguments $a, b, \ldots$ (in practice most $\phi$ are uni- or bivariate) we can write the total derivative as:

$$\dot{r} = \frac{\partial \phi}{\partial a}\,\dot{a} + \frac{\partial \phi}{\partial b}\,\dot{b} + \ldots$$

For example, if $\phi$ is the multiplication operator $*$, i.e. $r = ab$, then one will get the product rule $\dot{r} = b\dot{a} + a\dot{b}$. Applying the above rule to each elemental operation in the sequence gives a method to compute $\dot{y} = J\dot{x}$ with the Jacobian $J = \left[\partial f_i/\partial x_j\right]$, $i = 1 \ldots m$, $j = 1 \ldots n$, without explicitly forming the Jacobian. This method applies the chain rule in the computation order of the values in the program and is known as forward-mode AD. The opposite order of applying the chain rule to the elemental operations, known as reverse- or adjoint-mode AD, yields projections $\bar{x} = J^T \bar{y}$. This is achieved by applying to the elemental operations $\phi$, in reverse order of their original execution, the rule:

$$\bar{a} = \bar{a} + \frac{\partial \phi}{\partial a}\,\bar{r}; \quad \bar{b} = \bar{b} + \frac{\partial \phi}{\partial b}\,\bar{r}; \quad \ldots; \quad \bar{r} = 0$$

where the bar operator denotes the corresponding adjoint variable. For the multiplication example one therefore gets $\bar{a} = \bar{a} + b\bar{r}$; $\bar{b} = \bar{b} + a\bar{r}$; $\bar{r} = 0$. In particular, for applications in which $m \ll n$, the
reverse mode is advantageous because its computational cost does not depend on $n$. Typical problems in Cryosphere science involve computations of diagnostics which are scalar-valued cost functions ($m = 1$), such as for example the spatio-temporally averaged misfit between modeled surface elevation and observed surface topography (Larour et al., 2014). For these cases, one can compute the gradient $\nabla f = J^T \bar{y}$ with $\bar{y} = 1$ as a single projection. Thus, for high-resolution models implying very large $n$, the reverse mode is an enabling and potentially very efficient technique. This significant capability is what makes the application of AD to data assimilation so efficient. Instead of evaluating a Jacobian for each one of the outputs of the forward model with respect to each input, which would be computationally expensive as it scales in $n^2$, the reverse mode evaluates the gradient of one specific output of interest with respect to the model inputs in only one sweep that scales in $n$.
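To make the overloading idea concrete, the following is a minimal, self-contained C++ sketch of forward-mode AD on a value/derivative pair. It is illustrative only and not ISSM code: ISSM relies on ADOL-C's adouble active type, which records a trace for the reverse mode rather than propagating dotted values.

```cpp
#include <cmath>
#include <cstdio>

// A value v paired with its directional derivative d (the "dot" component).
struct Dual {
  double v;
  double d;
};

// Product rule from the text: r = a*b  =>  rdot = b*adot + a*bdot.
static Dual operator*(Dual a, Dual b) { return {a.v * b.v, b.v * a.d + a.v * b.d}; }

// Chain rule for an intrinsic: r = sin(a)  =>  rdot = cos(a)*adot.
static Dual sin(Dual a) { return {std::sin(a.v), std::cos(a.v) * a.d}; }

int main() {
  Dual x = {2.0, 1.0};  // seed xdot = 1 to obtain df/dx alongside f
  Dual y = x * sin(x);  // f(x) = x*sin(x)
  std::printf("f = %g, df/dx = %g\n", y.v, y.d);  // df/dx = sin(x) + x*cos(x)
  return 0;
}
```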
Applying AD to large-scale frameworks such as ISSM (Larour et al., 2012), MITgcm (Heimbach, 2008), SICOPOLIS (Heimbach and Bugnion, 2009) or DassFlow (Monnier, 2010) is a difficult proposition, but one which enables significant improvements in the way models can be initialized (Heimbach and Bugnion, 2009), hindcast-validated (Larour et al., 2012), and calibrated (Goldberg et al., 2015) towards better projections. Traditional approaches relying on source-to-source transformation have been developed, but for frameworks such as ISSM, which are C++ oriented and highly parallelized, this type of approach breaks down. Our goal here is to demonstrate how the so-called operator-overloading AD approach can be implemented and validated for a framework such as ISSM, and what developments were necessary to make this capability operational. Our approach is discussed in section 2 of this manuscript, with section 3 describing the method validation as well as applications with ISSM. We discuss and conclude in the last section on the applicability of such an approach to other frameworks, and on the opportunities these new developments afford for Cryosphere Science, and data assimilation of remote sensing data in particular.

Type Change to Enable Overloaded Operators
Changing to the aforementioned active type is a significant effort to be undertaken in the model code. Among the choices to effect this type change, one should select one that is transparent to the model development process, is maintainable, and minimizes the manual effort.

Interpreting the trace by $R(T)$ means that the space holding the data for the variables $r, a, b \in V^*$ (the set of instantiated program variables at run time) occurring in each operation $r = \phi(a, b)$ must be represented by some mapping $\omega : V^* \to \mathbb{N}^+$ to pseudo addresses. In ADOL-C these addresses are called locations and represent indices in a work array held by $R(T)$. The pseudo addresses must be managed through the TA constructor and destructor in a fashion similar to the memory management of the actual program variables themselves, i.e. pseudo addresses are assigned from and returned to a pool $\Omega$ of available addresses. However, no distinction between heap and stack variables is made, and generally data locality will not be preserved. On the other hand, for special operations $\phi_A$ with array arguments of size $s$, that is, calls to external solvers (see section 2.3) or MPI routines (see section 2.4), it would be counterproductive to record in $T$ the pseudo addresses of each of the $s$ array elements rather than a consecutive range. The latter, however, imposes that the pool be primed by some call $\Omega(s)$ such that $\omega$ returns consecutive pseudo addresses when called by the constructor for each element in the array. In ADOL-C this is done by calling ensureContiguousLocations immediately before an active array is instantiated. Avoiding copying of array values in $M$ as an efficiency measure means that any given array has a good chance of being used in $\phi_A$, and to avoid littering the code with preprocessor-guarded calls to $\Omega(s)$, we decided in ISSM to instead add a call to ensureContiguousLocations in the TA specialization of xNew (a sketch is given at the end of this section).

External functions $f_e$ that had been supported by ADOL-C had the signature $f_e(l_x, x, l_y, y)$ with inputs $x$, outputs $y$ for the original call, $l_x, l_y$ their respective array lengths, and $\bar{f}_e(l_y, \bar{y}, l_x, \bar{x})$ for the adjoint counterpart. In the case of a linear system with $A \in \mathcal{A}$, the input $x$ is packed with both $A$ and $r$, while $y$ contains the solution $s$ of the system on return. This, however, was insufficient for binding to solvers from the GNU Scientific Library (GSL) (Galassi, 2009).

Regarding Q1, we know that in ISSM $A \in \mathcal{A}$; regarding Q2, the parameters passed to $f_e$ have no other uses and therefore, using the controls (E3), we avoid (re)storing their values. The direct solver from the GSL used here had no API control to back-solve with the factors for the transpose, and we did not want to reverse engineer the permutation representation. Hence the refactoring was done as a matter of convenience for the sequential reference case, requiring E1. The parallel, and therefore practically more efficient, MUMPS solver operates on sparse, distributed $A$, therefore requiring E2. MUMPS offers both the ability to store the factors to file and to perform the back-solve for the transpose. However, the MUMPS portion of the runtime is comparatively small (see section 3).

Consequently, considering the factor data size after fill-in, the overhead for the file I/O is not expected to yield much practical benefit in this context, answering Q3. Finally, for Q4, the extension E4 is exploited because transient runs of ISSM need to account for changes in the system size, and a preallocation with the maximal buffer sizes therefore avoids some of the memory management overhead.
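For context, the transposed back-solve mentioned above follows from the standard reverse-mode rule for a linear solve; this is a textbook identity, not something specific to ISSM. If the forward operation is $s = A^{-1} r$, the adjoint sweep performs

$$\bar{r} = \bar{r} + A^{-T}\,\bar{s}, \qquad \bar{A} = \bar{A} - \left(A^{-T}\,\bar{s}\right) s^T, \qquad \bar{s} = 0,$$

so that a single factorization of $A$ can serve both the forward solve and, if the solver exposes a back-solve with the transposed factors, the adjoint solve.

The xNew specialization referred to earlier can be sketched as follows; this is a simplified illustration that assumes ADOL-C's ensureContiguousLocations and elides the rest of ISSM's allocator interface:

```cpp
#include <adolc/adolc.h>
#include <cstddef>

// Generic allocation: passive types need no location management.
template <typename T>
T* xNew(std::size_t size) {
  return new T[size];
}

// Specialization for the active type: prime ADOL-C's location pool so that
// the constructors of the array elements draw consecutive pseudo addresses,
// allowing operations on the array to be recorded as one contiguous range.
template <>
adouble* xNew<adouble>(std::size_t size) {
  ensureContiguousLocations(size);
  return new adouble[size];
}
```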

Handling Parallelism with the AdjoinableMPI Library
As is the case with ISSM, practically relevant science problems incur a computational complexity that necessitates execution on parallel hardware, often using MPI as the paradigm of choice. Sending data with MPI from a source buffer s to a target buffer t can be interpreted as a simple assignment t = s. This implies for the adjoint an increment $\bar{s} = \bar{s} + \bar{t}$, that is, the adjoint is a communication of the adjoint data in the opposite direction, from the target back to the source, followed by an increment. The AdjoinableMPI (AMPI) library provides wrapper routines that replace the original MPI calls, are prefixed AMPI, and have few additional parameters where needed to enable the adjoint functionality. AMPI also provides additional types and predefined symbols. Discussing the internal design of AMPI is outside the scope of this paper. However, the application to ISSM is the first large-scale practical use of AMPI, and in the following we will discuss the steps taken to use it in the ISSM code base.
For testing and small-scale experiments, the ISSM code, as many other models, treats MPI parallelization as a compile-time option controlled by preprocessor macros. Furthermore, turning the adjoint capability on and off as suggested in section 2.2 would imply additional switching between the original MPI and AMPI, with code duplication and the potential for errors. To avoid these undesirable consequences, we decided to introduce in ISSM another wrapper layer (prefixed ISSM_MPI) of calls and definitions to encapsulate completely the four functionality variants listed in Table 1.
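As a purely hypothetical illustration of this wrapper idea (the macro name and the assumption that AMPI_Reduce mirrors the MPI_Reduce signature are ours; the actual ISSM implementation is the one shown in Fig. 1):

```cpp
#ifdef _HAVE_AMPI_       // hypothetical compile-time switch
#include <ampi/ampi.h>   // AdjoinableMPI wrappers and predefined symbols
#else
#include <mpi.h>
#endif

// A single call site in the model code; the functionality variant is
// selected at compile time instead of duplicating calls in the model.
int ISSM_MPI_Reduce(void* sendbuf, void* recvbuf, int count,
                    MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm) {
#ifdef _HAVE_AMPI_
  // Active buffers are communicated with AMPI's datatype for the active
  // type (e.g. AMPI_ADOUBLE), so the reduction is recorded for the reverse sweep.
  return AMPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm);
#else
  return MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm);
#endif
}
```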

The approach is shown for MPI_Reduce in Fig. 1.

The approach is shown in Fig. 2. Assuming many AMPI calls in foo, this reduces the count of code locations where type errors may be introduced to the template instantiations. Finally, a practical concern for using the parallelized adjoint is the handling of sensitivities to quantities that are uniformly initialized across ranks, such as parameters. Frequently, as was the case within ISSM, these quantities are initialized from files, or otherwise per process, in the parallel case the same way as in the sequential case. In the parallel case that implies a replication of the same quantity across ranks. However, to obtain the correct sensitivities, the quantity q in question should be unique; in other words, that quantity must be initialized at one root rank only and then broadcast to the other ranks. Otherwise, for r ranks, one would obtain at each rank only a part $\bar{q}_i$ of the total $\bar{q}$ and would have to "manually" sum up $\bar{q} = \sum_{i=1}^{r} \bar{q}_i$. With an initial broadcast of q, however, the corresponding adjoint provided through AMPI by using AMPI_Bcast is exactly that sum reduction, and $R(T)$ yields the correct adjoint at the broadcast root. This notion similarly applies to any situation where a conceptually unique quantity of active type is implicitly replicated on some ranks.
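A minimal, self-contained sketch of this initialization pattern is given below, with plain MPI for brevity; in ISSM the call goes through the ISSM_MPI layer and q is of the active type, so that the reverse sweep of the broadcast performs the sum reduction at the root:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Initialize the conceptually unique parameter q at the root rank only,
  // e.g. by reading it from an input file there.
  double q = 0.0;
  if (rank == 0) q = 2.5;

  // Broadcast instead of re-initializing q identically on every rank; the
  // adjoint of this broadcast (AMPI_Bcast in AD mode) sums the per-rank
  // adjoints into the total adjoint of q at the root.
  MPI_Bcast(&q, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  std::printf("rank %d: q = %g\n", rank, q);
  MPI_Finalize();
  return 0;
}
```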

Validation
ISSM is validated in AD mode by continuously running a test suite within the Jenkins (Smart, 2011) integration and development framework (available at http://issm.jpl.nasa.gov/developers/). A detailed description of the suite of benchmarks is given in Table 3. The aim is to 1) compare forward runs in ISSM with their counterparts when overloaded operators are switched on, where the results should be identical within double precision tolerances; and 2) compare forward and reverse runs carried out with ISSM AD on and off, using the GSL and MUMPS solvers. Comparisons of gradients computed in AD mode with standard forward-difference methods are also carried out, to make sure the gradients computed in AD mode (essentially in reverse scalar mode, which is the preferred mode in ISSM for data assimilation) are accurate.
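The gradient comparison can be pictured with the following self-contained toy sketch; it is not the actual test harness, and the exact gradient of a toy misfit stands in for the gradient obtained from a reverse AD sweep:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Toy scalar cost function f(x) = sum_i x_i^2, a stand-in for an ISSM misfit.
static double f(const std::vector<double>& x) {
  double s = 0.0;
  for (double xi : x) s += xi * xi;
  return s;
}

// Exact gradient 2x, standing in for the gradient from a reverse AD sweep.
static std::vector<double> grad(const std::vector<double>& x) {
  std::vector<double> g(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) g[i] = 2.0 * x[i];
  return g;
}

int main() {
  const std::vector<double> x = {1.0, -2.0, 0.5};
  const double h = 1e-6;  // FD step: subject to the truncation/roundoff tradeoff
  const std::vector<double> g = grad(x);
  for (std::size_t i = 0; i < x.size(); ++i) {
    std::vector<double> xp = x;
    xp[i] += h;
    const double fd = (f(xp) - f(x)) / h;  // first-order forward difference
    std::printf("i=%zu  FD=% .8f  AD=% .8f  diff=%.2e\n", i, fd, g[i], fd - g[i]);
  }
  return 0;
}
```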

Application
The ongoing use of adjoint computations includes sensitivity studies and state estimation problems for transient model runs. Because this paper concentrates on technical aspects, we show, only as exemplary evidence of the practical usefulness, some sensitivities of cost functions calibrated for two sensors commonly encountered in Cryosphere Science: altimeters (that measure surface elevation) and radars (that measure surface displacement, or velocity); such sensitivities are shown in Fig. 3.

Table 3. ISSM AD validation suite integrated within Jenkins for continuous integration and delivery (Smart, 2011). Tests 3001 to 3010 and 3101 to 3110 test the repeatability of forward runs with and without ADOL-C compiled, but with no AD drivers specifically called. The forward runs involved are the standard stress balance, mass transport and thermal solutions, with 2D SSA (Shelfy-Stream Approximation, MacAyeal (1989)), 3D SSA, 3D HO (Higher-Order; Blatter (1995); Pattyn (1996)) and 3D Full-Stokes (Stokes, 1845) formulations.

Run times for ISSM compiled with default optimization (-O2) are shown in Fig. 5 (upper frame). While this plot indicates a small overhead factor of ≈ 4.5, in particular for the largest mesh case (distance 12.5 km), the reason for this becomes apparent in the plot on the lower frame: it shows that the majority of the run time is consumed by the GSL solver (libgsl), completely overshadowing any of the overhead caused by the adjoint for the largest mesh. We want to emphasize that GSL was chosen not for its efficiency, but for the simplicity of the setup, which quickly enabled adjoint computations. Introducing AMPI, and thereby moving to a more appropriate solver (MUMPS), causes the adjoint overhead to become more prominent.

Conclusions

We presented an approach to computing the adjoint of ISSM based on operator overloading coupled with the AdjoinableMPI library, which is to our knowledge the first time this type of approach has been systematically applied to a software framework of this size and complexity. Despite the difficulties encountered rewriting the software, the overloaded approach is transparent to the user, which is critical given the size of the larger Cryosphere Science community that is not familiar with the adjoint work, and for which classic approaches such as source-to-source transformation have proven to be overly cumbersome. The flexibility of this approach allows in particular for quick turn-around in developing adjoint models of new parameterizations which are not easily hand-derived. This is a major advantage in that it opens this approach to the wider community. This, given the large amount of remote sensing data currently being collected and under-utilized, could prove paramount if we are to hindcast-validate projections of sea-level rise. Further work is of course required to bring in additional observations, such as gravity sensors or radar stratigraphy observations, which will involve development of new cost functions, and scalability in 3D. Though this is complex, in that it requires integrated resiliency and adjoint checkpointing schemes for long-running transient modeling scenarios, our approach has proven flexible, and should lead to a brand new set of data assimilation capabilities that have already been available to other Earth Science communities for a long time. Indeed, by allowing temporal data assimilation for a large number of sensors and models, as demonstrated here with the use of altimetry and radar sensors for mass transport and stress balance models respectively, ISSM paves the way for wider integration between the modeling and observational Cryosphere communities.

Code Availability
The ISSM code and its AD components are available at http://issm.jpl.nasa.gov. The instructions for the compilation of ISSM in AD mode, along with test cases, are presented in the supplement attached to this manuscript.