The recently developed 3-D TenStream radiative transfer solver was integrated into the University of California, Los Angeles large-eddy simulation (UCLA-LES) cloud-resolving model. This work documents the overall performance of the TenStream solver as well as the technical challenges of migrating from 1-D to 3-D schemes. In particular, the employed Monte Carlo spectral integration needed to be reexamined in conjunction with 3-D radiative transfer. Even though the spectral sampling has to be performed uniformly over the whole domain, we find that the Monte Carlo spectral integration remains valid. To understand the performance characteristics of the coupled TenStream solver, we conducted weak- as well as strong-scaling experiments. In this context, we investigate two matrix preconditioners: geometric algebraic multigrid preconditioning (GAMG) and block Jacobi incomplete LU (ILU) factorization, and find that algebraic multigrid preconditioning performs well for complex scenes and highly parallelized simulations. The TenStream solver is tested for up to 4096 cores and shows a parallel scaling efficiency of 80–90 % on various supercomputers. Compared to the widely employed 1-D delta-Eddington two-stream solver, the computational costs for the radiative transfer solver alone increase by a factor of 5–10.

To improve climate predictions and weather forecasts we need to understand the delicate linkage between clouds and radiation. A trusted tool to further our understanding in atmospheric science is the class of models known as large-eddy simulations (LESs). These models are capable of resolving the most energetic eddies and were successfully used to study boundary layer structure as well as shallow and deep convective systems.

Radiative heating and cooling drive convective motion and influence cloud droplet growth and microphysics.

While radiative transfer is probably the best-understood physical process in atmospheric models, it is extraordinarily expensive (computationally) to use fully 3-D radiative transfer solvers in LES models.

One reason for the computational complexity involved in radiative transfer calculations is the fact that solvers are not only called once per time step but the radiative transfer also has to be integrated over the solar and thermal spectral ranges. A canonical approach for the spectral integration is the class of so-called “correlated-k” approximations

However, even when using simplistic 1-D radiative transfer solvers and correlated-k methods for the spectral integration, the computation of radiative heating rates is very demanding. As a consequence, radiation is usually not calculated at each time step but rather updated infrequently. This is problematic, in particular in the presence of rapidly changing clouds. Further strategies are needed to render the radiative transfer calculations computationally feasible.

One such strategy was proposed by

Another reason for the computational burden is the complexity of the
radiation solver alone. Fully 3-D solvers such as
Monte Carlo

To that end, there is still considerable effort being put into the
development of fast parameterizations to account for 3-D effects. Recent
works incorporate 3-D effects in low-resolution subgrid cloud-aware
models (GCMs) by means of overlap assumptions or additional horizontal
exchange coefficients

The TenStream solver

Section

In Sect.

The LES that we coupled the TenStream solver to is the
UCLA-LES model. A description and details of the LES model can be found
in

In the case of 3-D radiative transfer we need to solve the
entire domain for one spectral band at once. This is in contrast to 1-D radiative transfer solvers where the heating rate

The TenStream radiative transfer model is a parallel
approximate solver for the full 3-D radiative transfer
equation

The coupling of radiative fluxes in the TenStream solver can be written as a huge but sparse matrix (i.e., most entries are zero). The TenStream matrix is positive definite (strictly diagonally dominant) and asymmetric. Equation systems with sparse matrices are usually solved with iterative methods because direct methods such as Gaussian elimination or LU factorization quickly exceed memory limitations. The PETSc library includes several solvers and preconditioners to choose from.

For 3-D systems of partial differential equations with many degrees of freedom, iterative methods are often more efficient computationally and memory wise.

The three most widely used classes today are the conjugate gradient (CG), generalized minimal residual (GMRES), and biconjugate gradient methods
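As a toy illustration of such Krylov methods (unrelated to the actual TenStream matrix, whose structure is far more involved), SciPy's sparse solvers can be applied to a small diagonally dominant, nonsymmetric system; the matrix below is made up purely for demonstration:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# A small, strictly diagonally dominant, nonsymmetric tridiagonal system,
# loosely mimicking the matrix properties stated above.
n = 200
A = sp.diags([-1.3, 3.0, -0.7], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# GMRES and BiCGStab both handle nonsymmetric matrices; plain CG would not.
x_gmres, info_gmres = spla.gmres(A, b)
x_bcgs, info_bcgs = spla.bicgstab(A, b)

# In SciPy's convention, info == 0 signals successful convergence.
assert info_gmres == 0 and info_bcgs == 0
```

For a well-conditioned system like this toy example, both methods converge in a handful of iterations; the matrices arising in 3-D radiative transfer are far larger and harder, which is where preconditioning becomes essential.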

Perhaps even more important than the selection of a suitable solver is the
choice of matrix preconditioning. In order to improve the rate of
convergence, we try to find a transformation for the matrix that increases
the efficiency of the main iterative solver. We can use a preconditioner

This study suggests two preconditioners for the TenStream solver. We are fully aware that our choices are probably not an optimal solution but they give reasonable results.

The first setup uses a so-called stabilized biconjugate gradient solver with incomplete LU factorization (ILU). Direct LU factorizations tend to fill in the zero entries (sparsity pattern) of the matrix and quickly become exceedingly expensive in terms of memory. A workaround is to fill the preconditioner matrix only until a certain threshold of filled entries is reached. A fill level factor of 0 prescribes that the preconditioner matrix has the same number of nonzeros as the original matrix. The ILU preconditioner is only available sequentially; in the case of parallelized simulations, each processor applies the preconditioner independently (called “block Jacobi”). Consequently, the preconditioner cannot propagate information beyond its local part and we will see in Sect.
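A minimal sketch of the idea behind ILU preconditioning, using SciPy's `spilu` as a stand-in for PETSc's implementation (the TenStream setup itself uses PETSc's block Jacobi/ILU, not SciPy; the matrix below is again a made-up toy system):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy sparse system (not the TenStream matrix).
n = 500
A = sp.diags([-1.0, 4.0, -1.2], offsets=[-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# Incomplete LU: fill_factor bounds how many nonzeros the factors may
# acquire relative to A, limiting the memory use of the preconditioner.
ilu = spla.spilu(A, fill_factor=1.0)
M = spla.LinearOperator((n, n), matvec=ilu.solve)

# The incomplete factorization is passed to BiCGStab as preconditioner M.
x, info = spla.bicgstab(A, b, M=M)
assert info == 0
```

In a distributed-memory run, each rank would factorize only its local diagonal block, which is exactly why the block Jacobi/ILU combination cannot propagate residual information across processor boundaries.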

The second setup uses a flexible GMRES solver with geometric algebraic multigrid preconditioning (GAMG). Traditional iterative solvers like Gauss–Seidel or block Jacobi are very efficient at reducing local residuals at adjacent entries (often termed high-frequency errors). This is why they are called “smoothers”. However, long-range (low-frequency) residuals, e.g., a reflection at a distant location, are damped only slowly. The general idea of multigrid is to solve the problem on a hierarchy of successively coarser grids. This way, the smoother is used optimally in the sense that, on each grid representation, the targeted residual appears as a high-frequency error. The coarsening is continued until the problem is ultimately small enough to be solved with direct methods. Considerable effort has been put into the development of black-box multigrid preconditioners. In this context, black-box means that the user, in this case the TenStream solver, does not have to supply the coarse grid representations. Rather, the coarse grids are constructed directly from the matrix representation. The PETSc solvers are commonly configured via command-line
parameters (see Listing
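Based on the two setups described in the appendix listings, the corresponding PETSc command-line options might look roughly as follows (an illustrative sketch only; the exact options used are given in the listings themselves):

```shell
# Setup 1 (sketch): BiCGStab with block Jacobi / ILU subdomain solves
-ksp_type bcgs
-pc_type bjacobi
-sub_pc_type ilu
-sub_pc_factor_levels 1

# Setup 2 (sketch): flexible GMRES with aggregation-based GAMG
-ksp_type fgmres
-pc_type gamg
-pc_gamg_type agg
-pc_gamg_threshold 0.1
-mg_levels_ksp_type richardson
-mg_levels_pc_type sor
-mg_levels_ksp_max_it 5
```

The appeal of this run-time configuration is that solver and preconditioner can be swapped without recompiling the model.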

There are two reasons why radiative transfer is so expensive computationally.
On one hand, a single monochromatic calculation is already quite complex. On
the other hand, radiative transfer calculations have to be integrated over a
wide spectral range. Even if correlated-k methods are used, the number of
radiative transfer calculations is on the order of 100. As a result, it
becomes unacceptable to perform a full spectral integration at every
dynamical time step, even with simple 1-D two-stream solvers. This means
that in most models, radiative transfer is performed at a lower rate than
other physical processes.
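The idea behind Monte Carlo spectral integration can be sketched in a few lines: rather than summing all correlated-k bands at every radiation call, one band is drawn at random with probability proportional to its weight, which yields an unbiased estimate of the full spectral integral (the band weights and heating rates below are made up purely for illustration):

```python
import random

# Hypothetical band weights and per-band heating rates (illustrative only).
weights = [0.5, 0.3, 0.2]
heating = [1.0, 4.0, 10.0]

# Full spectral integration: weighted sum over all bands.
full = sum(w * h for w, h in zip(weights, heating))

def mcsi_sample(rng):
    # Draw band i with probability weights[i]; the importance-weighted
    # estimate w_i * h_i / p_i reduces to h_i when p_i = w_i.
    (i,) = rng.choices(range(len(weights)), weights=weights)
    return heating[i]

# A single call is cheap but noisy; averaged over many radiation calls
# (i.e., many time steps), the estimate converges to the full integration.
rng = random.Random(0)
n_calls = 100_000
avg = sum(mcsi_sample(rng) for _ in range(n_calls)) / n_calls
assert abs(avg - full) < 0.1
```

The spectral noise of a single call is tolerated because the dynamics effectively average the heating rates over consecutive time steps.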

There, they used the model setup for the DYCOMS-II
simulation (details in

Intercomparison of the DYCOMS-II simulation, once forced with the full radiation (solid line), with the original Monte Carlo spectral integration (dotted) and with the uniform version (dashed). The dash-dotted line is a calculation with full spectral integration but with the four-stream solver instead of the two-stream solver. The top panel displays the vertically integrated turbulent kinetic energy, the middle panel displays the mean liquid water content (conditionally sampled and weighted by physical height), and the bottom panel displays the mean cloud top height.

Volume-rendered perspective on liquid water content and solar
atmospheric heating rates of the warm-bubble experiment (initialized without
horizontal wind). The two upper panels depict a simulation which was driven
by 1-D radiative transfer and the two lower panels show a simulation where
radiative transfer is computed with the TenStream solver (solar zenith angle

To determine the parallel scaling behavior when using an increasing number of processors, one usually conducts two experiments. First, a so-called strong-scaling experiment is performed, where the problem size stays constant while the number of processors is gradually increased. We speak of linear strong-scaling behavior if the time needed to solve the problem is reduced in proportion to the number of processors used. Second, a weak-scaling experiment is performed, where the problem size and the number of processors are increased together, i.e., the workload per processor is fixed. Linear weak-scaling efficiency implies that the time to solution remains constant.
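The two definitions above translate directly into efficiency metrics; a minimal sketch with hypothetical timings (the efficiencies reported in this paper are, of course, taken from the actual model runs):

```python
def strong_scaling_efficiency(t_ref, n_ref, t_n, n):
    """Fixed problem size: ideal runtime on n cores is t_ref * n_ref / n."""
    return (t_ref * n_ref) / (t_n * n)

def weak_scaling_efficiency(t_ref, t_n):
    """Fixed work per core: ideal runtime is constant, so efficiency is t_ref / t_n."""
    return t_ref / t_n

# Hypothetical example: 100 s on 64 cores vs. 15 s on 512 cores.
eff = strong_scaling_efficiency(100.0, 64, 15.0, 512)
assert 0.83 < eff < 0.84  # roughly 83 % strong-scaling efficiency
```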

Two strong-scaling tests for a clear sky and a strongly forced
scenario. Vertical axis is the increase of computational time normalized to a
delta-Eddington two-stream calculation (solvers only). Horizontal axis is
for different solar zenith angles (

We hypothesized earlier (Sect.

Both scenarios have principally the same setup with a domain length of
10 km at a horizontal resolution of 100 m. The
model domain is divided into 50 vertical layers with 70 m
resolution at the surface and a vertical grid stretching of 2 %.
The atmosphere is moist and neutrally stable (see Sect.

Both scenarios are run forward in time for an hour for different solar zenith
angles and with varying matrix solvers and preconditioners (presented in
Sect.

Figure

The performance of GAMG is less affected by parallelization: the number of iterations until convergence stays close to constant, independent of the number of processors. GAMG preconditioning outperforms ILU preconditioning on multicore systems, whereas the setup of the coarse grids as well as of the interpolation and restriction operators is more expensive if the problem is solved on only a few cores. In summary, we expect the increase in runtime compared to traditionally employed 1-D two-stream solvers to be in the range of 5–10 times.

Details on the computers used in this work.
Mistral and Blizzard are Intel–Haswell and
IBM Power6 supercomputers at DKRZ, Hamburg, respectively.
Thunder denotes a Linux Cluster at ZMAW, Hamburg.
Columns are the number of MPI ranks used per compute node, the number
of sockets and cores, and the maximum memory bandwidth per node as measured by the streams

We examine the weak-scaling behavior using the
earlier presented simulation (see Sect.

Figure

Weak-scaling efficiency running UCLA-LES with interactive radiation
schemes. Experiments measure the time for the radiation solvers only (i.e., no
dynamics or computation of optical properties).
Timings are given as the best of 10 runs.
Weak-scaling efficiency is given for the TenStream solver (triangle markers)
as well as for a two-stream solver (hexagonal markers).
Left: scaling behavior compared to single-core computations (remaining on one
compute node). Right: compute-node parallel scaling (normalized against a
single node). The individually colored lines correspond to different
machines (see Table

We described the necessary steps to couple the 3-D TenStream radiation solver to the UCLA-LES model. From a technical perspective, this involved the reorganization of the loop structure, i.e., first calculate the optical properties for the entire domain and then solve the radiative transfer.

It was not obvious that the Monte Carlo spectral integration would still be
valid for 3-D radiative transfer. To that end, we conducted numerical
experiments (DYCOMS-II) in close resemblance to the work
of

The convergence rate of iterative solvers is highly dependent on the applied matrix preconditioner. In this work, we tested two different matrix preconditioners for the TenStream solver: first, an incomplete LU decomposition and second, the algebraic multigrid preconditioner, GAMG. We found that the GAMG preconditioning is superior to the ILU in most cases and especially so for highly parallel simulations.

The increase in runtime depends on the complexity of the simulation (how much the atmosphere changes between radiation calls) and the solar zenith angle. We evaluated the performance of the TenStream solver in weak- and strong-scaling experiments and presented runtime comparisons to a 1-D delta-Eddington two-stream solver. The increase in runtime for the radiation calculations alone ranges from a factor of 5 to 10, while the total runtime of the LES simulation increased roughly by a factor of 2–3. A merely 2- to 3-fold increase in total runtime allows extensive studies concerning the impact of 3-D radiative heating on cloud evolution and organization.

This study aimed at documenting the performance and applicability of the TenStream solver in the context of high-resolution modeling. Subsequent work has to quantify the impact of 3-D radiative heating rates on the dynamics of the model.

The UCLA-LES model is publicly available at

To obtain a copy of the TenStream code, please contact one of the authors. This study used the TenStream model at git revision “e0252dd9591579d7bfb8f374ca3b3e6ce9788cd2”. For the sake of reproducibility, we provide the input parameters for the here-mentioned UCLA-LES computations along with the TenStream sources.

Biconjugate gradient squared iterative solver. The block Jacobi preconditioner performs an incomplete LU factorization on each rank with fill level 1, independent of its neighboring ranks.

Flexible GMRES solver with algebraic multigrid preconditioning. This uses plain aggregation to generate the coarse representations (dropping values less than 0.1 to reduce coarse matrix complexity) and uses up to five iterations of SOR on the coarse grids.

This work was funded by the Federal Ministry of Education and Research (BMBF) through the High Definition Clouds and Precipitation for Climate Prediction (HD(CP)2) project (FKZ: 01LK1208A). Many thanks to Bjorn Stevens and the DKRZ, Hamburg for providing us with the computational resources to conduct our studies. Edited by: K. Gierens