Geoscientific models and measurements generate false precision (scientifically meaningless data bits) that wastes storage space. False precision can mislead (by implying noise is signal) and be scientifically pointless, especially for measurements. By contrast, lossy compression can be both economical (save space) and heuristic (clarify data limitations) without compromising the scientific integrity of data. Data quantization can thus be appropriate regardless of whether space limitations are a concern. We introduce, implement, and characterize a new lossy compression scheme suitable for IEEE floating-point data. Our new Bit Grooming algorithm alternately shaves (to zero) and sets (to one) the least significant bits of consecutive values to preserve a desired precision. This is a symmetric, two-sided variant of an algorithm sometimes called Bit Shaving that quantizes values solely by zeroing bits. Our variation eliminates the artificial low bias produced by always zeroing bits, and makes Bit Grooming more suitable for arrays and multi-dimensional fields whose mean statistics are important.

Bit Grooming relies on standard lossless compression to achieve the actual reduction in storage space, so we tested Bit Grooming by applying the DEFLATE compression algorithm to bit-groomed and full-precision climate data stored in netCDF3, netCDF4, HDF4, and HDF5 formats. Bit Grooming reduces the storage space required by initially uncompressed and compressed climate data by 25–80 and 5–65 %, respectively, for single-precision values (the most common case for climate data) quantized to retain 1–5 decimal digits of precision. The potential reduction is greater for double-precision datasets. When used aggressively (i.e., preserving only 1–2 digits), Bit Grooming produces storage reductions comparable to other quantization techniques such as Linear Packing. Unlike Linear Packing, whose guaranteed precision rapidly degrades within the relatively narrow dynamic range of values that it can compress, Bit Grooming guarantees the specified precision throughout the full floating-point range. Data quantization by Bit Grooming is irreversible (i.e., lossy) yet transparent, meaning that no extra processing is required by data users/readers. Hence Bit Grooming can easily reduce data storage volume without sacrificing scientific precision or imposing extra burdens on users.

The increased resolution of geoscientific models and measurements (GSMMs) leads to increases in dataset size that outpace improvements in both accuracy (nearness to true values) and precision (degree of repeatability). Numerical precision that exceeds true or assumed knowledge of the underlying phenomena is called false precision, and a significant fraction of GSMM storage bits archives this false precision as essentially random (and therefore hard to compress) bits that lack scientific content. Lossy compression techniques can reduce storage requirements without sacrificing scientific content by eliminating unused ranges and/or false precision of stored fields. We introduce a new algorithm, Bit Grooming, that preserves a specified level of precision, is statistically unbiased, retains the full representable range of floating-point data, and yet requires no additional software tools or filters to read or write.

For measurements there is never a scientific reason to retain false
precision, as it amounts to storing random bits. Reasons to retain false
precision during prognostic integrations of geoscientific models include
numerical stability, conservation checks (e.g., mass, energy, momentum), and
correct treatment of threshold and resonance phenomena. There are fewer
reasons to retain false precision after than during a simulation. Most GSMMs
store their data as either four- or eight-byte IEEE floating-point numbers.
IEEE single-precision (SP, four-byte) and double-precision (DP, eight-byte)
formats

Data compression is well-studied

The compression ratios of lossless techniques are limited by the need to recover the exact data compressed. Lossy compression (also called quantization) relaxes this requirement and can “trade off” precision for compression. The acceptable loss for some forms of data, such as the quality of photographic images, can only be determined subjectively. In contrast, researchers can, at least in principle, know a priori the scientifically defensible precision of GSMMs. False precision can mislead (by implying noise is signal) and be scientifically pointless, especially for measurements. By contrast, lossy compression can be both economical (save space) and heuristic (clarify data limitations). Data quantization can thus be appropriate regardless of whether space limitations are a concern. Thus, after presenting our quantitative results, we describe techniques that make Bit Grooming simple and practical.

This paper is organized into four more sections. Section

A primary motivation in developing Bit Grooming is to reduce the storage of
climate-related datasets. We implemented and tested Bit Grooming in the
netCDF Operators, NCO

First, NCO can read and write data encoded with the (lossless) DEFLATE
algorithm

The three lossy compression algorithms NCO employs are packing and two
precision-preserving algorithms (including Bit Grooming). Packing quantizes
(usually) floating-point data into a lower precision type (fewer bytes per
value) that represents a much smaller range. By convention netCDF defines a
linear-packing algorithm that depends on two parameters
(
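In lieu of the elided formula, the linear-packing idea can be sketched as follows. The attribute names scale_factor and add_offset are the standard netCDF ones; the convention itself fixes only the unpacking relation unpacked = scale_factor × packed + add_offset, and the particular formulas for choosing the two parameters below are one common option, not a normative definition:

```python
def pack_values(vals, nbits=16):
    """Linearly pack floats into unsigned nbits-bit integers (a sketch).

    The netCDF convention fixes only the unpacking relation:
        unpacked = scale_factor * packed + add_offset
    The parameter choice below (offset at the minimum, scale spreading
    the range over all representable integers) is one common option.
    """
    vmin, vmax = min(vals), max(vals)
    scale_factor = (vmax - vmin) / (2**nbits - 1)
    add_offset = vmin
    packed = [round((v - add_offset) / scale_factor) for v in vals]
    return packed, scale_factor, add_offset

def unpack_values(packed, scale_factor, add_offset):
    return [scale_factor * p + add_offset for p in packed]

vals = [273.15, 280.0, 290.5, 310.0]   # illustrative values, e.g., kelvin
packed, sf, ao = pack_values(vals)
restored = unpack_values(packed, sf, ao)
# Round-trip error is bounded by half the packing resolution,
# scale_factor / 2, regardless of how precise the inputs were.
assert all(abs(r - v) <= sf / 2 + 1e-12 for r, v in zip(restored, vals))
```

Note that the packed integers carry no meaning without the two unpacking parameters, which is the root of the portability drawbacks discussed below.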

Packing floating-point data into integers has benefits and drawbacks. The
type conversion frees up the IEEE 754 exponent bits (8 bits for SP and
11 bits for DP), which then contribute to the dynamic range of the packed
integers (16 and 32 bits for

Another limitation of Linear Packing is that the precision of packed data
cannot be specified or guaranteed in advance because it depends on the
distribution of the values to be packed. While the numeric resolution (i.e.,
the smallest resolvable difference) of unpacked data always equals

Consider the same dynamic range used previously except now offset by

The other two lossy compression algorithms considered both perform
precision-preserving compression (PPC). The operational definition of
“significant digit” in our precision-preserving algorithms is that the
exact value, before rounding or quantization, is within one-half of the value
of the decimal place occupied by the least significant digit (LSD) of the
rounded value. For example, the value

One PPC algorithm preserves the specified total number of significant digits
(NSD) of the value. For example, there is
only one significant digit in the weight of most “eight-hundred pound
gorillas” that you will encounter, i.e., so

The other PPC algorithm preserves the number of decimal significant digits
(DSD), i.e., the number of significant digits following (positive, by
convention) or preceding (negative) the decimal point. For example,

Their fundamental difference is that NSD is independent of dimensional units
and DSD is not. The NSD for a given GSMM value depends on the intrinsic
accuracy and error characteristics of the model or measurements. The
appropriate DSD for a given value depends on these intrinsic characteristics
and, in addition, the dimensional units with which values are stored. Our
eight-hundred pound gorilla has

The time penalty for compressing and uncompressing data varies according to
the algorithm.

NSD algorithms create a bitmask to alter the significand (mantissa or
fraction) of IEEE 754 floating-point data. For instance, the bitmask for the
NSD technique called Bit Shaving is one for all bits to be retained and zero
for ignored bits
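A minimal sketch of this bitmask manipulation for IEEE 754 single precision, using Python's struct module to expose the bits. Bit Grooming alternates between the two fill choices shown (zeroing is Bit Shaving; setting is its one-filling counterpart); the 12 retained bits are illustrative, not NCO's tuned value:

```python
import math
import struct

MANT_BITS = 23  # explicit significand bits in IEEE 754 single precision

def quantize(value, keep_bits, fill):
    """Zero (fill=0, Bit Shaving) or set (fill=1) all but the leading
    keep_bits explicit significand bits of a single-precision value."""
    (u,) = struct.unpack("<I", struct.pack("<f", value))
    if fill == 0:
        u &= (0xFFFFFFFF << (MANT_BITS - keep_bits)) & 0xFFFFFFFF
    else:
        u |= (1 << (MANT_BITS - keep_bits)) - 1  # touch only discarded bits
    return struct.unpack("<f", struct.pack("<I", u))[0]

keep = 12                          # illustrative, not NCO's tuned value
shaved = quantize(math.pi, keep, 0)
filled = quantize(math.pi, keep, 1)

# Shaving can only decrease a positive value; setting can only increase it.
assert shaved <= math.pi <= filled
# Either way the error stays below one unit in the last kept bit:
# for pi (in [2, 4)) that unit is 2**(23 - keep) * 2**-22 = 2**-11.
assert abs(shaved - math.pi) < 2.0 ** -11
assert abs(filled - math.pi) < 2.0 ** -11
```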

The DSD algorithm, by contrast, uses rounding to remove undesired precision.
The rounding zeroes the greatest number of (base-2) significand bits
consistent with the desired (base-10) decimal precision. Our NCO
implementation rounds with the internal math library

Maintaining non-biased statistical properties during lossy compression
requires special attention.
Decimal Rounding uses

Exact and lossy IEEE single-precision floating-point Pi. IEEE-754
single-precision binary representations of

Bit Grooming Pi.
Same as Table

To demonstrate the change in IEEE representation caused by quantization,
consider again the case of

Reducing the preserved precision of NSD rounding produces increasingly long
strings of identical bits amenable to compression (Table
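The effect can be reproduced directly by inspecting the bit patterns; the 7 kept mantissa bits below are an aggressive, illustrative choice rather than a value from the table:

```python
import math
import struct

def f32_bits(value):
    """32-character binary string of a value's IEEE single-precision form."""
    (u,) = struct.unpack("<I", struct.pack("<f", value))
    return format(u, "032b")

# Aggressively quantize pi: keep sign, exponent, and only 7 explicit
# mantissa bits (mask 0xFFFF0000 preserves the leading 16 bits).
(u,) = struct.unpack("<I", struct.pack("<f", math.pi))
shaved_bits = format(u & 0xFFFF0000, "032b")

# The quantized value agrees with pi in its leading bits but now ends in
# a long, highly compressible run of identical (zero) bits.
assert f32_bits(math.pi)[:16] == shaved_bits[:16]
assert shaved_bits.endswith("0" * 16)
```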

The consumption of about 3 bits per digit of base-10 precision is evident, as
is the coincidence of a quantized value that greatly exceeds the mandated
precision for

How can one be sure lossy data are sufficiently precise?
We define several metrics to quantify quantization error.
The mean error

The three most important error metrics for quantization are

Traditional Bit Shaving bit-shifts zeros into the least significant
bits (LSBs) of true values

All three metrics are expressed in terms of the fraction of the tens place
occupied by the LSD. If the LSD is the hundreds digit or the thousandths
digit, then the metrics are fractions of 100 or

To demonstrate these principles we conduct error analyses on an artificial,
reproducible dataset, and on an actual dataset

The artificial
dataset employed is one million evenly spaced values from 1.0 to 2.0. The
analysis data are
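The bias-cancelling effect of alternating the fill bit can be checked on a similar (though, for brevity, smaller) evenly spaced sample; keep_bits=6 is illustrative, and details such as NCO's exact alternation order may differ from this sketch:

```python
import struct

MANT_BITS = 23  # explicit significand bits in IEEE 754 single precision

def groom(values, keep_bits, shave_all=False):
    """Quantize values as float32, either shaving every value (Bit Shaving)
    or alternately shaving/setting discarded bits (Bit Grooming)."""
    zero_mask = (0xFFFFFFFF << (MANT_BITS - keep_bits)) & 0xFFFFFFFF
    one_mask = (1 << (MANT_BITS - keep_bits)) - 1
    out = []
    for i, v in enumerate(values):
        (u,) = struct.unpack("<I", struct.pack("<f", v))
        if shave_all or i % 2 == 0:
            u &= zero_mask          # shave: discarded bits -> 0 (biases low)
        else:
            u |= one_mask           # set: discarded bits -> 1 (biases high)
        out.append(struct.unpack("<f", struct.pack("<I", u))[0])
    return out

n = 100001                         # smaller than the paper's one million
vals = [1.0 + i / (n - 1) for i in range(n)]
true_mean = sum(vals) / n
shaved_mean = sum(groom(vals, 6, shave_all=True)) / n
groomed_mean = sum(groom(vals, 6)) / n

# Bit Shaving is biased low; Bit Grooming's alternation cancels the bias.
assert shaved_mean < true_mean
assert abs(groomed_mean - true_mean) < abs(shaved_mean - true_mean) / 10
```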

For

Error metrics for Bit Grooming vs. Bit Shaving.

Compression ratios for low-resolution initially uncompressed model netCDF3 data.

The artificial data have a much smaller mean error

Compression ratios for high-resolution initially uncompressed model
data

PPC quantization enhances compression of typical climate datasets. The degree
of enhancement depends, of course, on the required precision. Model results
are often computed as

The first dataset tested (Table

Packing SP floating-point data into two-byte integers yields
CR

The second dataset tested (Table

Compression ratios for high-resolution initially compressed observed HDF4 data.

Compression ratios for initially compressed HDF5 data.

NASA uses HDF4 format to store and distribute the third dataset tested
(Table

NASA uses HDF5 format to store and distribute the fourth dataset tested
(Table

PPC algorithms preserve all significant digits of every value.
The Bit Grooming (NSD) algorithm uses a theoretical approach (3.32 bits
per base-10 digit), tuned and tested to ensure the
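The arithmetic behind the 3.32 figure is a quick check; how many guard bits NCO retains beyond this theoretical minimum is part of the tuning described in the text and is not reproduced here:

```python
import math

# One decimal digit of precision costs log2(10) ~ 3.32 binary digits.
bits_per_digit = math.log2(10)
assert abs(bits_per_digit - 3.32) < 0.01

# So preserving, e.g., 5 significant decimal digits requires about
# ceil(5 * 3.32) = 17 explicit significand bits, leaving 6 of the 23
# explicit single-precision mantissa bits free to groom away.
assert math.ceil(5 * bits_per_digit) == 17
```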

While Bit Grooming works on top of any lossless compression technique, we
demonstrated it with the DEFLATE algorithm

Factors influencing the choice of lossy compression technique include
precision, accuracy, dynamic range, compression ratio, and portability

These other factors may include the greater transparency, dynamic range, and guaranteed precision of Bit Grooming relative to packing. Regarding transparency, Bit-Groomed data are valid IEEE floating-point data immediately suitable for analysis and plotting, whereas packed data must first be unpacked and reconstituted into intelligible floating-point data. Hence Bit-Groomed data are more portable than packed data.

Another important consideration is precision. Bit Grooming guarantees that
its lossy quantization will preserve a specified number of (decimal)
significant digits. Packing into two-byte integers

In terms of range, Bit Grooming has the same dynamic ranges as IEEE SP and DP
data,

Offering multiple quantization and compression algorithms with a consistent
and simple interface is important so that users can easily find the algorithm
that best suits their needs. This section describes the NCO implementation of
the three quantization algorithms and single lossless compression algorithm
that NCO exposes to user control. We focus on the new PPC algorithms (Bit
Grooming and Decimal Rounding) whose characteristics are the subject of most
of this study, but we begin with a brief summary of the DEFLATE and packing
implementations that have been in NCO for 10–20 years. NCO triggers lossless
DEFLATE compression with the

Although Bit Grooming instantly reduces data precision, on-disk
storage reduction occurs only once the data are compressed either
internally (e.g., by netCDF) or externally (by a user-supplied
mechanism).
It is straightforward to compress data internally using the built-in
compression and decompression supported by netCDF4/HDF5.
For convenience, NCO automatically activates file-wide DEFLATE
compression at deflation level one (i.e.,

NCO users can invoke PPC with the long option

NCO users can specify the precision of an entire dataset with many variables
in one simple command. Setting

NCO users access PPC through a single switch,

The

To request, say, 5 significant digits (

We introduced a new lossy and precision-preserving compression (PPC) algorithm called Bit Grooming, and evaluated it against its nearest cousin, Bit Shaving, as well as against packing and lossless techniques. Bit Grooming replaces the (unwanted) least significant bits of the IEEE significand with a string of identical values that alternates between zeroes and ones for consecutive elements of an array. We quantified the trade-offs involved in the choice of lossy compression technique for four climate-related datasets. We found that PPC compression reduces the volume of single-precision compressed data by roughly 10 % per decimal digit quantized (or “groomed”) after the third such digit. Bit Grooming reduces the storage space required by initially uncompressed and compressed climate data by 25–80 and 5–65 %, respectively, for single-precision values (the most common case for climate data) quantized to retain 1–5 decimal digits of precision. Bit-Groomed and Bit-Shaved data compress equally well, and Bit Grooming eliminates the undesirable statistical artifacts of Bit Shaving. By alternately using zero and one as the fill bit, Bit Grooming produces no mean bias, whereas Bit Shaving is negatively biased.

The lossy technique of Linear Packing, followed by lossless compression, produces significantly better compression ratios than PPC algorithms like Bit Grooming for most precision levels. Bit Grooming yields comparable or better compression than packing only when retaining 1 or 2 significant digits of precision. Packing, however, can only encode values from a much smaller dynamic range than Bit Grooming, and its guaranteed precision degrades rapidly (1 digit per decade) outside the largest decade of values to be quantized. Moreover, packed data require additional software overhead to unpack. Bit Grooming, in contrast, works on all ranges of floating-point values, has well-defined and guaranteed precision, and requires no additional software interface to read. By understanding the trade-offs between precision, statistical accuracy, numerical range, and storage space of common lossy compression techniques, producers can make better decisions regarding how much precision to archive in their datasets, and how to discard the false precision.

NCO source code is available from GitHub at

Two anonymous reviewers and J. D. Silver provided helpful comments that improved the quality of this paper. R. Signell originally suggested we investigate Decimal Rounding. Supported by NASA ACCESS NNX12AF48A and NNX14AH55A and by DOE ACME DE-SC0012998. Edited by: D. Ham Reviewed by: two anonymous referees