The netCDF-4 format is widely used for large gridded scientific data sets and includes several compression methods: lossy linear scaling and the non-lossy deflate and shuffle algorithms. Many multidimensional geoscientific data sets exhibit considerable variation over one or several spatial dimensions (e.g., vertically) with less variation in the remaining dimensions (e.g., horizontally). On such data sets, linear scaling with a single pair of scale and offset parameters often entails considerable loss of precision. We introduce an alternative compression method called “layer-packing” that simultaneously exploits lossy linear scaling and lossless compression. Layer-packing stores arrays (instead of a scalar pair) of scale and offset parameters. An implementation of this method is compared with lossless compression, storing data at fixed relative precision (bit-grooming) and scalar linear packing in terms of compression ratio, accuracy and speed.

When viewed as a trade-off between compression and error, layer-packing yields similar results to bit-grooming (storing between 3 and 4 significant figures). Bit-grooming and layer-packing offer significantly better control of precision than scalar linear packing. Relative performance, in terms of compression and errors, of bit-groomed and layer-packed data were strongly predicted by the entropy of the exponent array, and lossless compression was well predicted by entropy of the original data array. Layer-packed data files must be “unpacked” to be readily usable. The compression and precision characteristics make layer-packing a competitive archive format for many scientific data sets.

The volume of both computational and observational geophysical data has grown dramatically in recent decades, and this trend is likely to continue. While disk storage costs have fallen, constraints remain on data storage. Hence, in practice, compression of large data sets continues to be important for efficient use of storage resources.

Two important sources of large volumes of data are computational modeling
and remote-sensing (principally from satellites). These data often have a
“hypercube” structure and are stored in formats such as HDF5

In this study, we examine the advantages and trade-offs of allowing for
different treatment of dimensions in the compression process. One motivation
for this work is improving the lossy compression ratios typically achieved
with HDF5 and netCDF-4, so that they are more comparable with the impressive
compression achieved by GRIB2. Records in GRIB2 are strictly two-dimensional,
and this format allows for only a limited set of predefined metadata. However, GRIB2
offers excellent compression efficiency

Layer-packing is a variant of compression via linear scaling that exploits the clustering of data values along dimensional axes. We contrast this with other compression methods (rounding to fixed precision, and simple linear scaling) that are readily available within the HDF5–netCDF-4 framework. The performance is quantified in terms of the resultant loss of precision and compression ratio. We also examine various statistical properties of data sets that are predictive for their overall compressibility and the relative performance of layer-packing or rounding to fixed precision.

This section outlines the implementation of the layer-packing and layer-unpacking methods, the test data sets, evaluation metrics and the performance of the compression methods on the test cases considered.

The “deflate” algorithm

When used in the HDF5–netCDF-4 framework, the deflate algorithm is often
applied together with the shuffle filter

In all of the following work, we have used the deflate algorithm and the
shuffle filter together, henceforth referred to collectively as
DEFLATE for brevity. In the results presented below, compression via DEFLATE
was performed with the

When using the netCDF-4 framework to compress a variable, the user must define
the size of the “chunk” within which compression occurs. In all the results
presented here (both for DEFLATE and for the other compression methods), the
chunk size was set equal to the size of the layers packed using layer-packing (see
Sect.

One can compress a field using a lower-precision representation (e.g.,
2-byte integers rather than 4-byte floating-point numbers) and a
linear transformation between the original and compressed
representations; this process is termed “packing”. In the common
case of representing 4-byte floats as 2-byte integers, there is
already a saving of 50 % relative to the original representation
(ignoring the small overhead due to storing parameters involved in the
transformation). We call this “scalar linear packing” (or LIN, for
short), and it is the standard method of packing in the geoscience
community. Its attributes are defined in the netCDF User Guide

Scalar linear packing to convert floating-point data to an unsigned 2-byte
integer representation is outlined below:
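The packing arithmetic can be sketched as follows. This is a minimal NumPy illustration only (the function names are ours); the scale/offset convention matches the standard netCDF attributes `scale_factor` and `add_offset`, where unpacked = packed * scale_factor + add_offset:

```python
import numpy as np

def pack_linear(data):
    """Scalar linear packing: map a float array onto unsigned 2-byte
    integers with a single scale/offset pair for the whole array."""
    dmin, dmax = float(data.min()), float(data.max())
    # Map [dmin, dmax] onto the full uint16 range [0, 65535].
    scale = (dmax - dmin) / 65535.0 or 1.0  # guard against constant fields
    packed = np.round((data - dmin) / scale).astype(np.uint16)
    return packed, scale, dmin

def unpack_linear(packed, scale, offset):
    """Invert the packing transform (lossy: values were quantized)."""
    return packed.astype(np.float32) * scale + offset

# Example: temperatures in kelvin spanning 250-310 K.
x = np.linspace(250.0, 310.0, 1000, dtype=np.float32)
p, scale, offset = pack_linear(x)
x_hat = unpack_linear(p, scale, offset)
max_err = float(np.abs(x - x_hat).max())  # bounded by ~scale/2
```

The maximum quantization error is roughly half the scale factor, i.e., half of (data range)/65535.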

Data packed with this method can often be compressed substantially more than
the 50 % noted above. This is achieved by applying DEFLATE to the packed
data, as was done for all data sets compressed with LIN here. In this study,
compression via DEFLATE was performed with the

In many applications in the geophysical sciences, a multidimensional gridded variable varies dramatically across one dimension while exhibiting a limited range of values within slices of this variable. Examples include variation in atmospheric density and water vapor mixing ratio with respect to height, variation in ocean temperature and current velocity with respect to depth, and variation in atmospheric concentrations of nitrogen dioxide with respect to height and time.

We will use the term “thick dimensions” to denote those dimensions that account for the majority of the variation in such variables, and “thin dimensions” to denote the remaining dimensions. In the case of the first example above (assuming a global grid and a geographic coordinate system), the vertical dimension (pressure or height) is thick, and the horizontal dimensions (latitude, longitude) and time are thin. We will use the term “thin slice” to describe a slice through the hypercube for fixed values of the thick dimensions. Note that there are cases with multiple thick dimensions, such as the third example noted above.

Scalar linear packing applied to such gridded variables will result in a
considerable loss of accuracy. This is because, in order to cover the range of
variation spanned along the thick dimensions, the values within each individual
thin slice are mapped onto only a subset of the available discrete values. To reduce
loss of precision within the thin slice, one can store arrays of scale and
offset parameters for each variable, with one scale and offset pair per thin
slice.

The 2-byte representation halves the storage cost of the array itself. However, arrays of scale factors and linear offsets must also be stored, which adds to the total space required (although this overhead is usually negligible relative to the size of the packed data array). Compression is generally improved significantly by applying DEFLATE, as was done for all data sets presented here.
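The scheme can be sketched in NumPy as follows (an illustration with our own function names, not the packaged tools; the thick dimension is assumed to be axis 0):

```python
import numpy as np

def layer_pack(data, axis=0):
    """Layer-packing: pack each thin slice along `axis` with its own
    scale factor and offset, stored as arrays (one entry per slice)
    rather than as a single scalar pair."""
    work = np.moveaxis(np.asarray(data, dtype=np.float64), axis, 0)
    nlev = work.shape[0]
    scales = np.empty(nlev)
    offsets = np.empty(nlev)
    packed = np.empty(work.shape, dtype=np.uint16)
    for k in range(nlev):
        lo, hi = work[k].min(), work[k].max()
        scales[k] = (hi - lo) / 65535.0 if hi > lo else 1.0
        offsets[k] = lo
        packed[k] = np.round((work[k] - lo) / scales[k])
    return packed, scales, offsets

def layer_unpack(packed, scales, offsets, axis=0):
    """Invert layer-packing by broadcasting the per-slice parameters."""
    shape = (-1,) + (1,) * (packed.ndim - 1)
    out = packed.astype(np.float64) * scales.reshape(shape) + offsets.reshape(shape)
    return np.moveaxis(out, 0, axis)

# A field that decays strongly with the vertical (thick) dimension but
# varies little within each horizontal (thin) slice.
rng = np.random.default_rng(42)
field = np.exp(-0.5 * np.arange(10))[:, None, None] * (1.0 + 0.01 * rng.random((10, 4, 5)))
packed, scales, offsets = layer_pack(field)
recovered = layer_unpack(packed, scales, offsets)
max_rel_err = float(np.max(np.abs(recovered - field) / field))
```

Because each slice uses the full uint16 range for its own (narrow) range of values, the relative error within each slice stays small even though the field spans several orders of magnitude.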

Further details about the implementation and the storage format are given in
the Supplement. The code to perform layer-packing described in this article
was written as stand-alone command-line tools in Python (v. 2.7.6). These are
freely available on

One can store the data at a fixed precision (i.e., a chosen number of
significant digits, or NSD). This method is known as “bit-grooming” and is
detailed by

Single-precision floating-point numbers occupy 32

Bit-grooming quantizes data to a fixed number of significant digits (NSD) using bitmasks rather than floating-point arithmetic. Here, quantization means mapping, via a process similar to rounding, from a large set of possible inputs (in this case the full set of real numbers representable as floating-point values) to a smaller set (in this case those floating-point values defined to a chosen NSD). The NSD bitmasks alter the IEEE floating-point mantissa by changing to 1 (bit setting) or 0 (bit shaving) the least significant bits that are superfluous to the specified precision. Sequential values of the data are alternately shaved and set, which nearly eliminates any mean bias due to quantization.

In the following, we compared storing 2, 3, 4 and 5 significant digits; these
are denoted NSD2, NSD3, NSD4 and NSD5, respectively. Similar to LIN and LAY,
DEFLATE was applied together with rounding. In this study, compression via
bit-grooming was performed using the
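The bitmask operations just described can be illustrated with the following sketch for float32 data. The digit-to-bit mapping used here is a heuristic assumption (roughly 3.32 mantissa bits per decimal digit, plus slack), and the production bit-grooming implementation differs in detail, e.g., in its handling of zeros, NaNs and fill values:

```python
import numpy as np

def bitgroom(data, nsd):
    """Quantize float32 values to ~nsd significant decimal digits by
    alternately shaving (zeroing) and setting (filling with ones) the
    trailing mantissa bits that exceed the requested precision.
    Sketch only; zeros, NaNs and fill values need special treatment."""
    # Heuristic: ~3.32 mantissa bits per decimal digit, plus slack.
    keep = min(23, int(np.ceil(nsd * np.log2(10))) + 3)
    drop = 23 - keep
    bits = data.astype(np.float32).reshape(-1).view(np.uint32).copy()
    shave_mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)
    set_mask = np.uint32((1 << drop) - 1)
    bits[0::2] &= shave_mask  # even-indexed values: bit shaving (to 0)
    bits[1::2] |= set_mask    # odd-indexed values: bit setting (to 1)
    return bits.view(np.float32).reshape(data.shape)

x = np.linspace(1.0, 2.0, 101, dtype=np.float32)
groomed = bitgroom(x, 3)
max_rel_err = float(np.max(np.abs(groomed - x) / x))
```

The runs of identical trailing bits created this way are what DEFLATE (with the shuffle filter) subsequently compresses so effectively.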

In the following tests, we compared a total of 255 variables from six data
sets. Each variable was extracted individually to file as uncompressed
netCDF, and the file was then compressed using the methods described,
allowing for computation of compression and error metrics described below.
The data sets are summarized in Table

The variables chosen from these data sets were those with the largest number of data points overall. For example, in data sets 2–5, variables without a vertical coordinate were not considered in the analysis, since these account for only a small fraction of the total data. A small number of variables that would otherwise be included (based on the dimensions alone) were excluded due to the occurrence of seemingly random data (i.e., of all magnitudes) in regions of the array where values were not defined (e.g., sea-surface temperatures over land points), which we believe should have been masked with a fill value. The rationale for excluding these variables is, first, that these regions did not appear to contain meaningful data and, second, that the extreme range of the seemingly random data led to gross outliers in the distribution of error statistics for LIN and LAY in particular.

Summary of the data sets used in this study. Abbreviations: ID –
index, NWP – numerical weather prediction, CTM – chemistry-transport model,
Dims – dimensions of the variable, TS Dims – dimensions of the thin slice,
# vars – number of variables per data set. The dimension sizes are indicated
as

The methods are compared with two metrics. The first relates to the
compression efficiency. Compression ratios are defined as (uncompressed
size)

Error was quantified by the root-mean-square difference between the original and the compressed variables. However, in order to compare results across variables with different scales and units, the errors must be normalized. We considered four different normalization methods, which emphasize different aspects of the error profile: the errors were normalized either by the standard deviation or by the mean of the original data, and these were calculated either separately per thin slice or across the entire variable. The rationale is as follows.
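The four variants can be sketched as follows (an illustration with our own function name; the thin-slice axis is assumed to be axis 0):

```python
import numpy as np

def normalized_rmse(orig, approx, norm="std", per_slice=False):
    """RMSE between original and compressed fields, normalized by the
    mean or standard deviation of the original, computed either across
    the whole variable or per thin slice (axis 0) and then averaged."""
    o = np.asarray(orig, dtype=float)
    err2 = (o - np.asarray(approx, dtype=float)) ** 2
    if per_slice:
        axes = tuple(range(1, o.ndim))
        rmse = np.sqrt(err2.mean(axis=axes))
        denom = o.std(axis=axes) if norm == "std" else o.mean(axis=axes)
        return float(np.mean(rmse / denom))  # average of per-slice ratios
    denom = o.std() if norm == "std" else o.mean()
    return float(np.sqrt(err2.mean()) / denom)

rng = np.random.default_rng(0)
orig = rng.uniform(1.0, 2.0, size=(4, 5, 6))
approx = orig + 0.01  # a constant absolute error of 0.01
whole = normalized_rmse(orig, approx, norm="mean")
sliced = normalized_rmse(orig, approx, norm="mean", per_slice=True)
```

For a constant absolute error, the whole-variable version reduces to 0.01 divided by the overall mean, while the per-slice version weights each slice's ratio equally.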

When normalizing by the mean (of the entire variable, or of the thin slice),
variables with a low mean-to-standard-deviation ratio (e.g., potential
vorticity) will show larger errors using the layer-packing compression.
However, when normalizing by the standard deviation, variables with a high
mean-to-standard-deviation ratio will show larger errors using the
bit-grooming compression (e.g., atmospheric temperatures stored with units of
kelvin, concentrations of well-mixed atmospheric trace gases such as

If we calculate the ratio of the RMSE to the normalization factor (i.e., the mean
or standard deviation) per thin slice and then average the normalized
errors across slices, the resulting metric will be more sensitive to large
relative errors within subsections of the data array. The alternative is to
calculate the normalization factor across the whole variable, and the
resulting metric will be more reflective of relative errors across the entire
data array. This may be understood in the context of a hypothetical
three-dimensional array, with values ranging from

In order to make sense of which variables compress well or poorly with different methods, a range of statistics were calculated for each variable. Most of these statistics were calculated over two-dimensional hyperslices of the original data arrays, and then the value for an individual variable was taken as the average over these hyperslices. A full list of the statistics calculated is given in the Supplement.

The two most informative statistics that arose from this analysis were based
on the entropy of either the original data field or the corresponding
exponent field (i.e., decomposing the data array into significand and
exponent, and then calculating the entropy of the exponent array). The
entropy is a measure of statistical dispersion, based on the frequency with
which each value appears in each data set. Let us denote as P

To account for these limitations when comparing entropies of finite data
sets, in each case the entropy was normalized by the maximum theoretical
value attainable for that data set, which was taken to be
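These two statistics can be sketched as follows. The normalization constant is elided in the text above; for illustration we assume the maximum attainable entropy of n data points, log2(n), which is reached when every value is distinct:

```python
import numpy as np

def normalized_entropy(values):
    """Shannon entropy of the empirical value distribution, normalized
    by log2(n) for n data points (an assumed normalization; reached
    when all values are distinct)."""
    _, counts = np.unique(np.asarray(values).ravel(), return_counts=True)
    p = counts / counts.sum()
    h = -np.sum(p * np.log2(p))
    hmax = np.log2(counts.sum())
    return float(h / hmax) if hmax > 0 else 0.0

# Example: a field spanning many orders of magnitude.
data = np.geomspace(1e-3, 1e3, 4096).astype(np.float32)
nedf = normalized_entropy(data)       # normalized entropy of the data field
_, exponents = np.frexp(data)         # significand/exponent decomposition
neef = normalized_entropy(exponents)  # normalized entropy of the exponent field
```

Here `np.frexp` performs the significand/exponent decomposition described above; because the example field takes 4096 distinct values but only about 20 distinct exponents, its NEDF is much larger than its NEEF.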

It was found that some of the data sets compressed significantly using
DEFLATE only. This was often due to a high proportion of zero or “missing”
values. Variables were classified as either “sparse” (highly compressible
or otherwise relatively simple) or “dense” (all other variables). Sparse
variables were chosen to be those satisfying any one of the following
conditions: the compression ratio is greater than 5.0 using DEFLATE, the
fraction of values equal to the most common value in the entire variable is
greater than 0.2, or the fraction of hyperslices in which all values were
identical is greater than 0.2. These definitions were somewhat arbitrary and
other classifications may be preferable, but it is seen (e.g., in
Figs.
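The three conditions above can be sketched as a simple classifier (an illustration with our own function name; the hyperslice axis is assumed to be axis 0, and the DEFLATE compression ratio is taken as a precomputed input):

```python
import numpy as np

def is_sparse(data, deflate_ratio):
    """Classify a variable as 'sparse' if any condition holds: DEFLATE
    compression ratio above 5.0, modal-value fraction above 0.2, or
    fraction of constant hyperslices (along axis 0) above 0.2."""
    data = np.asarray(data)
    _, counts = np.unique(data, return_counts=True)
    modal_fraction = counts.max() / data.size
    slices = data.reshape(data.shape[0], -1)
    constant_fraction = float(np.mean(np.ptp(slices, axis=1) == 0))
    return bool(deflate_ratio > 5.0
                or modal_fraction > 0.2
                or constant_fraction > 0.2)

rng = np.random.default_rng(1)
dense_var = rng.normal(size=(8, 16, 16))
sparse_var = np.zeros((8, 16, 16))
sparse_var[0] = rng.normal(size=(16, 16))  # mostly zeros: highly compressible
```

A variable of mostly zeros triggers the modal-value condition regardless of its DEFLATE ratio, whereas a dense random field fails all three conditions.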

Figure

The median compression times (Fig.

The ERA-Interim

For all methods considered except DEFLATE, the compression comes at the
expense of precision; the distribution of resultant errors is shown in
Fig.

The choice of error metric is somewhat ambiguous, and different choices lead to slightly
different results. Four different normalization factors for the RMSE were
considered (shown in Fig. S1). It can be seen that the comments about the
bit-grooming and LAY methods in the preceding paragraph hold regardless of
the normalization method, whereas LIN shows much higher standardized errors if
errors are normalized within thin slices and then averaged (due to the
reasons explained in the last part of Sect.

The pairwise relationship between compression and error is shown in
Fig.

The relationship between normalized errors and compression ratio for
the lossy compression methods considered. The three contours for each method
show the bounds within which the two-dimensional kernel-smoothed distribution
integrates to 0.25, 0.5 and 0.75, respectively. Only dense variables were
used to produce this plot. The normalized errors (the

Relative performance of LAY compared with NSD3 (left column) or NSD4
(right column), in terms of compression (top row) and errors (bottom row).
The grey dashed line indicates values of 1.0 on the

The question of when LAY is preferable to NSD3 or NSD4 can be addressed with
reference to the complexity statistics. Among those complexity metrics
considered, the normalized entropy of the data field proved to be the best
predictor of compression in the lossless case (DEFLATE). By contrast, the
best predictor of the relative performance of LAY to the two bit-grooming
methods was the normalized entropy of the exponent field (NEEF). By “best
predictor”, we mean that these were, respectively, the most strongly
correlated among the metrics considered with the DEFLATE compression ratios
and the relative error or compressed file size. In the case of lossless
compression, the correlation of the log of the compression ratio with the
normalized entropy was over 0.9 (the next highest correlation was below 0.8;
all variables were included), while for differentiating bit-grooming and LAY,
the absolute correlations between the NEEF and the log of the file size ratio
or the log of the RMSE ratio as shown in Fig.

The trade-off between error and compression is evident: as the NEEF
increases, the bit-groomed file sizes become larger than the corresponding
LAY file sizes, while the errors resulting from LAY grow relative to those
of bit-grooming (Fig.

The reason that the NEEF differentiates the relative performance between
bit-grooming and LAY can be understood from the nature of the errors induced by
the two techniques. Bit-grooming guarantees constant

One interesting aspect of these results is that the entropy is defined

The clear relationship between the normalized entropy of the data field
(NEDF) and the compression ratios achieved by DEFLATE alone
(Fig.

Layer-packing was compared with scalar linear packing, bit-grooming and
lossless compression via the DEFLATE algorithm. The lossy methods form a
continuum when one compares the resultant compression ratios and normalized
errors. The trade-off between error and compression has been shown elsewhere

In this study, we have effectively separated the precision reduction from the compression itself. This is because the bit-grooming, LAY and LIN methods can be thought of as preconditioners for the same lossless compression algorithm (namely DEFLATE). The reduction in entropy due to these preconditioners explains a large part of the improved compression above lossless compression. This concept could be extended to develop methods for automatically determining the right precision to retain in a data set.

This study did not extend to comparison with other lossless filters for
compression of the precision-reduced fields, nor did it compare the results
with other lossy compression techniques. While it is possible that the
findings presented here extend beyond the deflate and shuffle technique,
other lossy and non-lossy compression algorithms operate in fundamentally
different ways. Such an extension to this study may be considered in future,
for example, by taking advantage of the fact that the HDF5 API allows for a
range of alternative compression filters to be loaded dynamically

A number of such compression filters (both lossy and lossless) have been
developed specifically for geophysical data sets, which have different
properties compared to plain text, for example. They tend to be
multi-dimensional, stored as floating-point numbers and in many cases are
relatively smooth

Other authors have shown impressive data-compression rates using methods
originally developed for image processing. For example, the GRIB2 format
allows for compression using the JPEG-2000 algorithm and format (based on
wavelet transforms) to store numerical fields

A number of studies have assessed a range of compression methods on a variety
of data sets

Layer-packing achieves compression and error results roughly in between those of bit-grooming storing three and four significant figures. Layer-packing and bit-grooming control error in different ways: bit-grooming guarantees fixed relative errors for every individual datum, while layer-packing yields roughly constant relative errors within the hyperslice across which the packing is applied.

The idea itself behind layer-packing is not new and forms the basis of
compression within the GRIB format, in which each field is a two-dimensional
array and compression is performed on fields individually. In one sense
layer-packing is more general than the scheme used in GRIB, in that thin slices are
not restricted to being two-dimensional. Our preliminary results (not shown)
showed that the JPEG2000 algorithm yields greater compression than the
methods presented here for the same level of error; this echoes the findings
of

Although the methods described here are used only on netCDF data files, they apply equally to other data formats (e.g., HDF4, HDF5). These results are presented as a proof of concept only.

The timing information is presented mainly for completeness and a caveat should be raised. As noted above, the compression via DEFLATE, bit-grooming and LIN were performed using tools from the NCO bundle (written in C and C++), whereas the LAY compression was implemented in Python. The code is presented as a demonstration of layer-packing rather than production code.

Both the scalar linear packing and the layer-packing use the same representation of the data (i.e., 2-byte integers). However large differences in the compression and relative errors were found. This is because the compressibility of a packed field is related to the distribution of values within the scaling range. Across a given thin slice, layer-packing will represent values using the full range of 2-byte integers, whereas scalar linear packing will typically use a smaller portion of that range. This increases the loss of precision resulting from scalar linear packing but also entails greater compression.

When considering which compression method to use for an individual data set, one needs to consider several factors. First, space constraints and the size of the data sets in question will vary considerably for different applications. Second, the degree of precision required will also depend on the application and may differ between variables within a data set. The bit-grooming (storing at least three significant digits) and layer-packing techniques achieved average normalized errors of 0.05 % or better, which in many geoscientific applications is much less than the model or measurement errors. Third, data sets vary considerably in their inherent compressibility, which in the cases considered appeared strongly related to the normalized entropy of the data array. Fourth, how data are stored should also reflect how they will be used (e.g., active use versus archiving). A major disadvantage of the layer-packing as described here is that it is essentially an archive format and needs to be unpacked by a custom application before it can be easily interpreted. Scalar linear packing is similarly dependent on unpacking (although many netCDF readers will automatically unpack such data, from 2-byte integers to 4-byte floating point, by default), whereas bit-grooming requires no additional software. Finally, other issues relating to portability, the availability of libraries and consistency within a community also play a role in determining the most appropriate storage format.

This paper considers layer-packing, scalar linear packing and bit-grooming as a basis for compressing large gridded data sets. When viewed in terms of the compression–error trade-off, layer-packing was found to fit within the continuum of bit-grooming (i.e., when varying the number of significant digits to store), roughly in between storing three and four significant digits. The relative performance of layer-packing and bit-grooming was strongly related to the normalized entropy of the exponent array and again highlighted the trade-off between compression and errors. Given the variation in compression and accuracy achieved for the different data sets considered, the results highlight the importance of testing compression methods on a realistic sample of the data.

If space is limited and a large data set must be stored, then we recommend that the standard deflate and shuffle methods be applied. If this does not save enough space, then careful thought should be given to precisely which variables and which subsets of individual variables will be required in the future; it often arises that despite a wealth of model output, only a limited portion will ever be examined. Many tools exist for subsetting such data sets. Beyond this, if further savings are required and if the data need not be stored in full precision, then the appropriate relative precision for each variable should be selected and applied via bit-grooming. Layer-packing should be considered when choosing a compression technique for specialist archive applications.

The Python-based command line utilities used for the layer-packing, unpacking
and relative-error analysis are available freely at

Of the data sets used in the tests (listed in Sect.

Jeremy D. Silver wrote the layer-packing Python software, performed the compression experiments and wrote most of the manuscript. Charles S. Zender contributed to the design of the study, provided some of the test data sets used in the experiments and contributed to the text.

The authors declare that they have no conflict of interest.

The work of Jeremy D. Silver was funded by the University of Melbourne's McKenzie Postdoctoral Fellowship program. The work of Charles S. Zender was funded by NASA ACCESS NNX12AF48A and NNX14AH55A and by DOE ACME DE-SC0012998. We thank Peter J. Rayner (University of Melbourne) for useful discussions. Three anonymous reviewers provided constructive comments and suggestions on the manuscript. Edited by: P. Ullrich Reviewed by: three anonymous referees