We present a generic algorithm for numbering and then efficiently iterating over the data values attached to an extruded mesh. An extruded mesh is formed by replicating an existing mesh, assumed to be unstructured, to form layers of prismatic cells. Applications of extruded meshes include, but are not limited to, the representation of three-dimensional, high-aspect-ratio domains employed by geophysical finite element simulations. These meshes are structured in the extruded direction. The algorithm presented here exploits this structure to avoid the performance penalty traditionally associated with unstructured meshes. We evaluate the implementation of this algorithm in the Firedrake finite element system on a range of low-compute-intensity operations, which constitute worst cases for data layout performance exploration. The experiments show that having structure along the extruded direction enables the cost of the indirect data accesses to be amortized after 10–20 layers as long as the underlying mesh is well ordered. We characterize the resulting spatial and temporal reuse in a representative set of both continuous-Galerkin and discontinuous-Galerkin discretizations. On meshes with realistic numbers of layers the performance achieved is between 70 and 90 % of a theoretical hardware-specific limit.

In the field of numerical simulation of fluids and structures, there is traditionally considered to be a tension between the computational efficiency and ease of implementation of structured grid models, and the flexible geometry and resolution offered by unstructured meshes.

In particular, one of the grand challenges in simulation science is modelling
the ocean and atmosphere for the purposes of predicting the weather or
understanding the Earth's climate system. The current generation of
large-scale operational atmosphere and ocean models almost all employ
structured meshes

The ocean and atmosphere are thin shells on the Earth's surface, with typical domain aspect ratios in the thousands (oceans are a few kilometres deep but thousands of kilometres across). Additionally the direction of gravity and the stratification of the ocean and atmosphere create important scale separations between the vertical and horizontal directions. The consequence of this is that even unstructured mesh models of the ocean and atmosphere are in fact only unstructured in the horizontal direction, while the mesh is composed of aligned layers in the vertical direction. In other words, the meshes employed in the new generation of models are the result of extruding an unstructured two-dimensional mesh to form a layered mesh of prismatic elements.

This layered structure was exploited in

Exploiting the anisotropic nature of domains has driven software
developments across several fields. For example,

A key motivation for this work was to provide an efficient mechanism for the
implementation of the layered finite element numerics which have been adopted
by the UK Met Office's Gung Ho programme to develop a new atmospheric
dynamical core. The algorithms here have been adopted by the Met Office for
this purpose

We generalize the numbering algorithm in

We demonstrate the effectiveness of the algorithm with respect to absolute hardware performance limits.

In this section we briefly restate the data model for unstructured meshes
introduced in

When describing a mesh, we need some way of specifying the neighbours of a
given entity. This is always possible using

In what follows we start with a

We will also employ the definition of a

A mesh is a decomposition of a simulation domain into non-overlapping polygonal or polyhedral cells. We consider meshes used in algorithms for the automatic numerical solution of partial differential equations. These meshes combine topology and geometry. The topology of a mesh is composed of mesh entities (such as vertices, edges, and cells) and the adjacency relationships between them (cells to vertices or edges to cells). The geometry of the mesh is represented by coordinates which define the position of the mesh entities in space.
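The split between topology and geometry described above can be made concrete with a minimal sketch: adjacency relationships stored as entity-to-entity lists, and coordinates attached to the vertex entities. The names and layout here are purely illustrative, not the representation used by any particular mesh library.

```python
# Topology: two triangles sharing an edge, given as a cell-to-vertex
# adjacency relationship (one tuple of vertex entities per cell).
cell_vertices = [
    (0, 1, 2),  # cell 0
    (1, 3, 2),  # cell 1
]

# Geometry: coordinates defining the position of each vertex entity.
vertex_coords = [
    (0.0, 0.0),  # vertex 0
    (1.0, 0.0),  # vertex 1
    (0.0, 1.0),  # vertex 2
    (1.0, 1.0),  # vertex 3
]
```

Other adjacency relationships (edges to cells, cells to edges) would be stored as further lists of the same shape.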

Every mesh entity has a topological dimension given by the minimum number of
spatial dimensions required to represent that entity. We define

A mesh can be represented by several graphs. Each graph consists of a
multi-type set

We write

Every mesh entity has a number of adjacent entities. The mesh–element
connectivity relationships are used to specify the way mesh entities are
connected. For a given mesh of topological dimension

We write

In a mesh with a very regular topology, there may be a closed-form
mathematical expression for the adjacency relationship

Every mesh entity has a number of values associated with it. These values are
also known as

A

The data associated with the mesh also need to be numbered. The choice of
numbering can have a significant effect on the computational efficiency of
calculations over the mesh

The most common operation performed on meshes is the local application of a
function or

In the unstructured case, we store an explicit list (also known as a
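The indirect addressing this implies can be sketched as follows: every kernel application gathers its data through the explicit neighbour list. This is a toy illustration of the access pattern, not the PyOP2/Firedrake implementation, and the names are hypothetical.

```python
import numpy as np

dof_values = np.array([1.0, 2.0, 3.0, 4.0])        # one value per vertex
cell_to_vertex = np.array([[0, 1, 2], [1, 3, 2]])  # the explicit list ("map")

def kernel(local_vals):
    # A trivial stand-in for a finite element kernel.
    return local_vals.sum()

# Indirect gather: each kernel application reads through the map.
results = [kernel(dof_values[cell_to_vertex[c]]) for c in range(2)]
# results[0] == 1.0 + 2.0 + 3.0 == 6.0
```

Every cell visit incurs one map lookup per entity in the stencil; this is the per-cell overhead that the extruded numbering amortizes over a column.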

In Sect.

An extruded mesh consists of a base mesh which is replicated a fixed number
of times in a layered structure

For ease of exposition, we discuss the case where each mesh column contains the same number of layers; however, this is not a limitation of the method and algorithms presented here

A mesh of topological dimension

The mesh definition can be extended to include extruded meshes. Let mesh

The effect of the extrusion process on the base mesh can always be captured
by associating a line segment with the vertical direction. We write

As a consequence, the cells of the extruded mesh are prisms formed by taking
the tensor product of the base mesh cell with the vertical line segment. For
example, each triangle becomes a triangular prism. The construction of tensor
product cells and finite element spaces on them is considered in more detail
in

The extrusion process introduces new types of mesh entities reflecting the
connectivity between layers. The pairs of corresponding entities of dimension

Extruded mesh entities derived from the base mesh (left to right): vertices, horizontal edges, horizontal facets.

Mesh entities used in the extrusion process to connect entities in
Fig.

The topological dimension on its own is no longer enough to distinguish
between the different types of entities and their orientation. Instead
entities are characterized by a pair composed of the horizontal and vertical
dimensions. In the case of a two-dimensional triangular base mesh, the set of
dimensions is

Topological dimensions of extruded mesh entities.

We write

The complete set of extruded mesh entities is then

Similarly we must extend the indexing of the adjacency relationships, writing

Identically to the case of non-extruded meshes, function spaces over an extruded mesh associate degrees of freedom with the (extended) set of mesh entities. A constant number of degrees of freedom is associated with each entity of a given type.

If we can arrange that the degrees of freedom are numbered such that the
vertical entities are “innermost”, it is possible to use direct addressing
for the vertical part of any mesh iteration, significantly reducing the
computational penalty introduced by using an indirectly addressed,
unstructured base mesh. Algorithm 1 implements this “vertical innermost”
numbering algorithm. The critical feature of this algorithm is that degrees
of freedom associated with vertically adjacent entities have adjacent global
numbers. The outcome of this vertical numbering is shown in
Fig.
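The key property of the numbering can be sketched in a few lines: the global number of a degree of freedom is computed so that consecutive layers of the same column occupy consecutive numbers. This is an illustrative rendering of the "vertical innermost" idea, not Algorithm 1 verbatim, and the function and argument names are assumptions.

```python
def global_number(base_entity, layer, k, num_layers, dofs_per_entity):
    """Global number of local dof k on copy `layer` of `base_entity`."""
    column_size = num_layers * dofs_per_entity  # dofs in one vertical column
    return base_entity * column_size + layer * dofs_per_entity + k

# With one dof per vertex and 4 layers, the column over base vertex 1
# occupies the contiguous range [4, 8): consecutive layers are adjacent.
nums = [global_number(1, layer, 0, 4, 1) for layer in range(4)]
# nums == [4, 5, 6, 7]
```

Because vertically adjacent copies differ by a constant, walking up a column never needs another map lookup.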

Numbering of the topological entities of an extruded cell
for the case of an extruded triangle. The cell itself has
numbering

Vertical numbering of degrees of freedom (shown in filled circles) associated with vertices and horizontal edges. Only one set of vertically aligned degrees of freedom of each type is shown. The arrows outline the order in which the degrees of freedom are numbered.

Iterating over the mesh and applying a kernel to a set of connected entities (stencil) is the key operation used in mesh-based computations.

The global numbering of the degrees of freedom allows stencils to be
calculated using a direct addressing scheme when accessing the degrees of
freedom of vertically adjacent entities. We assume that the traversal of the
mesh occurs over a set of mesh entities which is homogeneous (a set
containing only cells for example). Degrees of freedom belonging to
vertically adjacent entities, accessed by two consecutive kernel applications
on the same column, have a constant offset between them. The offset is given
by the sum of degrees of freedom attached to the two vertically adjacent
entities contained in the stencil:

Let

The lists of degrees of freedom accessed by

For a given stencil function

The algorithm for computing the vertical offset is presented in Algorithm 2. Note that since the offset for two vertically aligned entity types is the same, only the base mesh entity type is considered.
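The offset rule can be stated as a one-line sketch: for each base entity type in the stencil, the offset between two consecutive kernel applications in the same column is the sum of the dof counts of the two vertically adjacent entity types. The dof counts below are hypothetical examples, and this is our rendering of the rule rather than Algorithm 2 itself.

```python
def vertical_offset(dofs_horiz, dofs_vert):
    """Offset for one base entity type: dofs on the (d, 0) entity plus
    dofs on the connecting (d, 1) entity directly above it."""
    return dofs_horiz + dofs_vert

# P1 in the vertical: 1 dof per vertex, 0 on the vertical edge interior.
p1_offset = vertical_offset(1, 0)  # == 1
# P2 in the vertical: 1 dof per vertex, 1 on the vertical edge interior.
p2_offset = vertical_offset(1, 1)  # == 2
```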

If

Algorithm 3 shows the iteration algorithm working for a single field
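The shape of that iteration can be sketched as follows: the base-cell map is dereferenced once per column (indirect), after which the per-layer dof numbers follow by adding the constant offsets (direct). This is an illustrative sketch of the iteration pattern under the numbering above, not Algorithm 3 verbatim.

```python
import numpy as np

def iterate(field, cell_map, offsets, num_layers, kernel):
    for base_cell_dofs in cell_map:       # indirect: one map lookup per column
        dofs = np.array(base_cell_dofs)
        for _ in range(num_layers):       # direct: constant-offset walk upward
            kernel(field, dofs)
            dofs = dofs + offsets

# Toy usage: one base cell with two vertex columns, three layers,
# a kernel that increments every dof it touches.
field = np.zeros(8)
iterate(field, [[0, 4]], np.array([1, 1]), 3,
        lambda f, d: np.add.at(f, d, 1.0))
```

After the call, dofs 0–2 and 4–6 have each been touched exactly once, while the unvisited tops of the columns remain zero.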

In this section, we test the hypothesis that iteration exploiting the extruded structure of the mesh amortizes the unstructured base mesh overhead of accessing memory through explicit neighbour lists. We also show that the more layers the mesh contains, the closer its performance is to the hardware limits of the machine.

We validate our hypotheses in the Firedrake finite element framework

In Sect.

The design space to be explored is parameterized by number of layers and the manner in which the data are associated with the mesh and therefore accessed. In establishing the relationship between the performance and the hardware we examine performance on two generations of processors and varying process counts.

Numerical computations of integrals are the core mesh iteration operation in the finite element method. We focus on residual (vector) assembly for two reasons. First, in contrast to Jacobian assembly, there are no overheads due to sparse matrix insertion; the experiment is purely a test of data access via the mesh indirections. Second, residual evaluation is the assembly operation with the lowest computational intensity and therefore constitutes a worst-case scenario for data layout performance exploration.

Since we are interested in data accesses, we choose the simplest non-trivial
residual assembly operation:

In addition to the output field

Tensor product finite elements with different data layout and cell-to-cell data re-use.

The construction of a wide variety of finite element spaces on extruded
meshes was introduced in

For the purposes of data access, the distinguishing feature of different finite element spaces is the extent to which degrees of freedom are shared between adjacent cells.

We choose a set of finite element spaces spanning the combinations of horizontal and vertical reuse patterns found on extruded meshes: horizontal and vertical reuse, only horizontal, only vertical, or no reuse at all.

We employ low-order continuous and discontinuous discretizations (abbreviated
as

The set of discretizations is

Both Firedrake and our numbering algorithm support a much larger range of finite element spaces than this. However, the more complex and higher degree spaces will result in more computationally intensive kernels but not materially different data reuse. The lowest-order spaces are the most severe test of our approach since they are more likely to be memory bound.

We vary the number of layers between 1 and 100. This is a realistic range for current ocean and atmosphere simulations. The number of cells in the extruded mesh is kept approximately constant by shrinking the base mesh as the number of layers increases. The mesh size is chosen such that the data volume far exceeds the total last level cache capacity of each chosen architecture (L3 cache in all cases). This minimizes caching benefits and is therefore the strongest test of our algorithms. The overall mesh size is fixed at approximately 15 million cells, which yields a data volume of between 300 and 840 MB, depending on discretization.

The order in which the entities of the unstructured mesh are numbered is
known to be critical for data access performance. To characterize this effect
and distinguish it from the impact of the number of layers, we employ two
variants of each base mesh. The first is a mesh for which the traversal is
optimized using a reverse Cuthill–McKee ordering
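The reverse Cuthill–McKee idea can be sketched in pure Python: a breadth-first traversal from a low-degree node, visiting neighbours in order of increasing degree, with the resulting order reversed. This is an illustration of the technique only; production codes use library implementations rather than this sketch.

```python
from collections import deque

def rcm(adj):
    """adj: dict mapping node -> list of neighbours. Returns the RCM order."""
    visited, order = set(), []
    degree = {v: len(n) for v, n in adj.items()}
    for start in sorted(adj, key=lambda v: degree[v]):  # lowest degree first
        if start in visited:
            continue
        queue = deque([start])
        visited.add(start)
        while queue:
            v = queue.popleft()
            order.append(v)
            for n in sorted(adj[v], key=lambda u: degree[u]):
                if n not in visited:
                    visited.add(n)
                    queue.append(n)
    return order[::-1]  # reverse the Cuthill-McKee order

# A path graph 0-1-2-3: RCM keeps neighbouring nodes close in the numbering.
order = rcm({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]})
```

A good ordering keeps topologically adjacent base-mesh entities close in memory, which is what gives the well-ordered meshes their spatial locality.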

The specification of the hardware used to conduct the experiments is shown in
Table

Hardware used.

The experiments are run on a single two-socket machine using MPI (Message Passing Interface) parallelism. The number of MPI processes varies from one up to two processes per physical core (exploiting hyperthreading). We pin the processes evenly across physical cores to ensure load balance and prevent process migration between cores.

The Firedrake platform performs integral computations by automatically
generating

Runtime is measured using a nanosecond precision timer. Each experiment is performed 10 times and we report the minimum runtime. Exclusive access to the hardware has been ensured for all experiments.

We model the data transfer from main memory to CPU assuming a perfect cache:
each piece of data is only loaded from main memory once. We define the

Different discretizations lead to different data volumes due to the way data
are shared between cells.

To evaluate the impact of different data volumes we compare the valuable
bandwidth with the maximum bandwidth achieved for the STREAM triad
benchmark

Maximum STREAM triad (

The floating point operations – adds, multiplies, and, on Haswell, fused
multiply–add (FMA) operations – are counted automatically using the Intel
Architecture Code Analyzer

The performance of the extruded iteration depends on the efficiency of the
generated finite element kernel (payload) code which for some cases may not
be vectorized (as outlined in

To a first approximation the performance of a numerical algorithm will be limited by either the memory bandwidth or the floating point throughput. The STREAM benchmark provides an effective upper bound on the achievable memory bandwidth. The floating point bounds employed are based on the theoretical maximum given the clock frequency of the processor.
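This first-approximation model can be written down as a toy roofline-style bound: runtime is limited by whichever of the two resources is exhausted first. The machine numbers below are hypothetical, chosen only to show the mechanics.

```python
def runtime_bound(flops, bytes_moved, peak_flops, peak_bw):
    """Lower bound on runtime: limited by FLOP throughput or by bandwidth."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

# Hypothetical machine: 100 GFLOP/s peak and 10 GB/s STREAM bandwidth.
# A kernel moving 4 GB while doing 2 GFLOP is bandwidth-limited:
t = runtime_bound(flops=2e9, bytes_moved=4e9, peak_flops=100e9, peak_bw=10e9)
# t == max(0.02, 0.4) == 0.4 seconds, set by the bandwidth term
```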

The Intel architectures considered are capable of executing both a floating point addition and a floating point multiplication on each clock cycle. The Haswell processor can execute a fused multiply–add instruction (FMA) instead of either an addition or multiplication operation.

The achievable FLOP rate may therefore be as much as twice the clock rate
depending on the mix of instructions executed. The achievable speed-up over
the clock rate,

The processors employed support 256 bit wide vector floating point
instructions. The double precision FLOP rate of a fully vectorized code can
be as much as 4 times that of an unvectorized code. GCC automatically
vectorizes only a fraction of the floating point instructions.
The ratio between the number of vector (packed) floating point instructions
and the total number of floating point instructions (scalar and packed)
characterizes the impact of partial vectorization on the floating point bound
through the vectorization factor
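One plausible way to fold partial vectorization into the floating point bound is to scale the scalar peak by the average number of flops per floating point instruction, given the packed/scalar instruction split. This is a sketch of our reading of the ratio described above, not necessarily the paper's exact formula, and all numbers are hypothetical.

```python
def vectorized_peak(scalar_peak, vec_width, packed, scalar):
    """Scale the scalar FLOP bound by the average flops per FP instruction,
    given counts of packed (vector) and scalar FP instructions."""
    total = packed + scalar
    avg_flops_per_instr = (packed * vec_width + scalar) / total
    return scalar_peak * avg_flops_per_instr

# Hypothetical: 3 GFLOP/s scalar peak, 4-wide vectors, half the FP
# instructions packed -> bound of 3e9 * (0.5 * 4 + 0.5) = 7.5 GFLOP/s.
peak = vectorized_peak(3e9, 4, packed=50, scalar=50)
```

Fully scalar code recovers the scalar peak, and fully packed code recovers the 4x vector peak, so the factor interpolates between the two bounds.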

Performance of the

To control the impact of the kernel computation (payload) on the evaluation,
we compare the measured floating point throughput with a theoretical peak
which incorporates the payload instruction balance and the degree of
vectorization. Let

For the Sandy Bridge and Haswell architectures, the best performance is
achieved in the 100-layer case run with 24 and 32 processes respectively
(hyperthreading enabled). The results in Tables

Performance of the

On Sandy Bridge, the proportion of peak theoretical floating point throughput is between 71 and 85 %, while on Haswell it is between 71 and 92 %. In contrast, the proportion of peak bandwidth achieved varies between 7 and 51 % on Sandy Bridge and 9 and 75 % on Haswell. The higher and much more consistent peak FLOP results lead us to the conclusion that we are in an operation- rather than bandwidth-limited regime. The performance figures are therefore presented with respect to this metric.

When the base mesh is well ordered (Fig.

Percentage of STREAM bandwidth and theoretical throughput achieved
by the computation of integral

Percentage of STREAM bandwidth and theoretical throughput achieved
by the computation of integral

Performance of the

The performance of the extruded mesh iteration is constrained by the
properties of the mesh and the kernel computation. The total number of
computations is based on the number of degrees of freedom per cell. The range
of discretizations used in this paper (Fig.

The numbering algorithm ensures good temporal locality between vertically aligned cells. Any degrees of freedom which are shared vertically are reused when the iteration algorithm visits the next element. The reuse distance along the vertical is therefore minimal.

For

Figures

The gap between well-ordered and badly ordered mesh performance identifies the sources of the performance gains. Horizontal data reuse dominates performance at low layer counts, while spatial locality and vertical temporal locality (ensured by the numbering and iteration algorithms) account for most of the gains as the number of layers increases.

We note, once again, that these results are for the lowest-order spaces which represent a worst case. Higher-order methods both access more contiguous data in each column and require many more FLOPs. As a result, we would expect to reach performance plateaus at lower numbers of layers.

In this paper we have presented efficient, locality-aware algorithms for numbering and iterating over extruded meshes. For a sufficient number of layers, the cost of using an unstructured base mesh is amortized. Achieved performance ranges from 70 to 90 % of our best estimate for the hardware's performance capabilities and current level of kernel optimization. Benefits of spatial and temporal locality vary with the number of layers: as the number of layers is increased, the benefits of spatial locality increase, while those of temporal locality decrease.

This paper employed two simplifying constraints: the number of layers is the same in every column, and the number of degrees of freedom associated with each entity type is constant. These assumptions are not fundamental to the numbering algorithm presented here, or to its performance. We intend to relax these constraints as they become important for the use cases for which Firedrake is employed.

The current code generation scheme can be extended to include inter-kernel
vectorization (an optimization mentioned in

In future work we intend to generalize some of the optimizations which extrusion enables for both residual and Jacobian assembly: inter-kernel optimizations, grouping of addition of contributions to the global system, and exploiting the vertical alignment at the level of the sparse representation of the global system matrix. In addition to the CPU results presented in this paper, we also plan to explore the performance portability issues of extruded meshes on graphical processing units and Intel Xeon Phi accelerators.

The packages used to perform the experiments have been archived using Zenodo:
Firedrake

The scripts used to perform the experiments as well as the results are
archived using Zenodo: Sandy Bridge

Gheorghe-Teodor Bercea designed the generalized extrusion algorithm, and performed the extension of the Firedrake and PyOP2 packages to support extruded meshes, the performance evaluation, and the preparation of the graphs and tables. Andrew T. T. McRae extended components of the Firedrake toolchain to support the finite element types used in the experiments, and made minor contributions to the extruded mesh iteration functionality. David A. Ham was the proponent of a generalized extrusion algorithm. Lawrence Mitchell, Florian Rathgeber, and Fabio Luporini developed related features and framework improvements in Firedrake, PyOP2, and COFFEE. Luigi Nardi is responsible for the use of the floating point balance metric. David A. Ham and Paul H. J. Kelly are the principal investigators for this paper. Gheorghe-Teodor Bercea prepared the manuscript with contributions from all the authors. All authors contributed with feedback during the paper's write-up process.

This work was supported by an Engineering and Physical Sciences Research Council prize studentship (ref. 1252364), the Grantham Institute and Climate-KIC, the Natural Environment Research Council (grant numbers NE/K006789/1, NE/K008951/1, and NE/M013480/1) and the Department of Computing, Imperial College London. The authors would like to thank J. (Ram) Ramanujam at Louisiana State University for the insightful discussions and feedback during the writing of this paper. We are thankful to Francis Russell at Imperial College London for the feedback on this paper. Edited by: S. Unterstrasser Reviewed by: two anonymous referees