H5Z-ZFP and the HDF5 filter’s cd_values

Note

The details described here are likely relevant only to developers of the H5Z-ZFP filter. If you just want to use the filter, you can ignore this material.

The HDF5 library uses an array of values, named cd_values in the formal arguments of various API functions, to manage auxiliary data for a filter. Instances of this cd_values array are used in two subtly different ways within HDF5.

The first use is in passing auxiliary data for a filter from the caller to the library when initially creating a dataset. This happens directly in an H5Pset_filter() call.

The second use is in persisting auxiliary data for a filter to the dataset’s object header in a file. This happens indirectly as part of an H5Dcreate() call.

When a dataset creation property list includes a filter, the filter’s set_local() method is called (see H5Zregister()) as part of the H5Dcreate() call. In the filter’s set_local() method, the cd_values that were passed by the caller (in H5Pset_filter()) are often modified (via H5Pmodify_filter()) before they are persisted to the dataset’s object header in the file.

Among other things, this design allows a filter to be generally configured for any dataset in a file and then adjusted as necessary to handle such things as data type and/or dimensions when it is applied to a specific dataset. Long story short, the cd_values stored in the dataset’s object header in the file are often not the same as the values passed by the caller when the dataset was created.
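The gap between caller-supplied and persisted values can be illustrated with a toy model. The sketch below is plain C with no HDF5 calls; every name in it (example_cd_t, example_set_local, EXAMPLE_MAX_CD) is hypothetical, and the choice to append the element size and dimensions is only an assumption about what a filter might do, not what H5Z-ZFP actually persists.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a filter's auxiliary data; not an HDF5 type. */
enum { EXAMPLE_MAX_CD = 8 };

typedef struct {
    size_t   nelmts;                  /* number of entries in use */
    unsigned values[EXAMPLE_MAX_CD];  /* the cd_values themselves */
} example_cd_t;

/* Toy model of what a set_local() callback might do: start from the
 * generic values the caller supplied and append dataset-specific facts
 * (element size and dimensions) before the result is persisted. */
static example_cd_t example_set_local(const example_cd_t *from_caller,
                                      unsigned type_size,
                                      const unsigned *dims, size_t ndims)
{
    example_cd_t persisted = *from_caller;  /* copy the caller's values */

    assert(from_caller->nelmts + 1 + ndims <= EXAMPLE_MAX_CD);

    persisted.values[persisted.nelmts++] = type_size;
    for (size_t i = 0; i < ndims; i++)
        persisted.values[persisted.nelmts++] = dims[i];

    return persisted;
}
```

In this model a caller passing one value for a 16 x 32 dataset of 8-byte elements ends up with four persisted values, which is the kind of mismatch a developer sees when comparing the arguments of H5Pset_filter() with what a tool reports from the file.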

To make matters a tad more complex, HDF5 treats the cd_values data as an array of 4-byte unsigned integer values (C type unsigned int). Furthermore, regardless of the endianness of the data producer, the persisted values are always stored in little-endian format in the dataset object header in the file. Nonetheless, if the persisted cd_values data is ever retrieved (e.g. via H5Pget_filter_by_id()), the HDF5 library ensures the data is returned to callers with proper endianness. As a result, when command-line tools like h5ls and h5dump print cd_values, the data is displayed correctly.
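The effect of a fixed little-endian on-disk format can be sketched in a few lines of portable C. This is not HDF5's implementation; the function names are hypothetical and the sketch only shows why shifting and masking gives the same stored bytes, and the same decoded values, on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store n 4-byte unsigned values in little-endian byte order.
 * Shifts and masks operate on values, so the result does not depend
 * on the byte order of the machine running this code. */
static void example_encode_le32(const uint32_t *vals, size_t n,
                                unsigned char *buf)
{
    assert(vals != NULL && buf != NULL);
    for (size_t i = 0; i < n; i++) {
        buf[4 * i + 0] = (unsigned char)( vals[i]        & 0xFFu);
        buf[4 * i + 1] = (unsigned char)((vals[i] >>  8) & 0xFFu);
        buf[4 * i + 2] = (unsigned char)((vals[i] >> 16) & 0xFFu);
        buf[4 * i + 3] = (unsigned char)((vals[i] >> 24) & 0xFFu);
    }
}

/* Rebuild native 4-byte unsigned values from little-endian bytes. */
static void example_decode_le32(const unsigned char *buf, size_t n,
                                uint32_t *vals)
{
    assert(vals != NULL && buf != NULL);
    for (size_t i = 0; i < n; i++) {
        vals[i] =  (uint32_t)buf[4 * i + 0]
                | ((uint32_t)buf[4 * i + 1] <<  8)
                | ((uint32_t)buf[4 * i + 2] << 16)
                | ((uint32_t)buf[4 * i + 3] << 24);
    }
}
```

Encoding 0x01020304 always produces the bytes 04 03 02 01, and decoding them returns 0x01020304 on both little-endian and big-endian hosts. That per-entry guarantee is what lets tools print 4-byte cd_values correctly.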

Handling double precision auxiliary data via cd_values is still more complicated because a single double precision value will span multiple entries in cd_values in almost all cases. Setting aside the possibility of differing floating point formats between the producer and consumers, any endianness handling the HDF5 library does for the 4-byte entries in cd_values will certainly not ensure proper endianness handling of larger values. It is impossible for command-line tools like h5ls and h5dump to display such data correctly.
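To see why per-entry handling falls short, consider the following sketch. It assumes an 8-byte double and uses hypothetical helper names. It splits a double across two 4-byte entries and compares two byte reorderings: swapping within each 4-byte word, which is all that per-entry handling can do, and reversing all 8 bytes, which is what an 8-byte value needs when moving between byte orders.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the raw bytes of a double into two 4-byte entries.
 * Assumes sizeof(double) == 8, which holds on common platforms. */
static void example_double_to_words(double d, uint32_t words[2])
{
    assert(sizeof d == 2 * sizeof words[0]);
    memcpy(words, &d, sizeof d);
}

/* Reverse the bytes inside each 4-byte word independently. */
static void example_swap_per_word(const unsigned char in[8],
                                  unsigned char out[8])
{
    for (int w = 0; w < 2; w++)
        for (int b = 0; b < 4; b++)
            out[4 * w + b] = in[4 * w + (3 - b)];
}

/* Reverse all 8 bytes as a single unit. */
static void example_swap_all8(const unsigned char in[8],
                              unsigned char out[8])
{
    for (int b = 0; b < 8; b++)
        out[b] = in[7 - b];
}
```

For input bytes 0 1 2 3 4 5 6 7, the per-word swap yields 3 2 1 0 7 6 5 4 while the full swap yields 7 6 5 4 3 2 1 0. The two results differ in the order of the 4-byte halves, so a reader on a machine of the opposite byte order would reassemble the wrong double.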

Fortunately, the ZFP library has already been designed to handle these issues as part of ZFP’s native stream header, and it does so in an endian-agnostic way. Consequently, the H5Z-ZFP filter uses the cd_values that are persisted to a dataset’s object header to store ZFP’s stream header. ZFP’s stream header is stored starting at &cd_values[1]. cd_values[0] is used to store H5Z-ZFP filter, ZFP library, and ZFP encoder version information.
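The layout just described, one entry of version information followed by the raw stream header, might be sketched as follows. This is an illustration and not the filter's source: the function names are hypothetical, the bit positions used to pack the three version fields are an assumption made for the example, and the header is treated as an opaque byte buffer with no byte-order handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Entries needed: one for version info plus enough whole unsigned ints
 * to hold hdr_bytes bytes of stream header. */
static size_t example_cd_nelmts(size_t hdr_bytes)
{
    return 1 + (hdr_bytes + sizeof(unsigned) - 1) / sizeof(unsigned);
}

/* Hypothetical packing of three version fields into a single entry.
 * The field widths (16, 4 and 12 bits) are assumptions for this sketch. */
static unsigned example_pack_versions(unsigned lib_ver, unsigned codec,
                                      unsigned filter_ver)
{
    assert(lib_ver <= 0xFFFFu && codec <= 0xFu && filter_ver <= 0xFFFu);
    return (lib_ver << 16) | (codec << 12) | filter_ver;
}

/* Fill cd_values: slot 0 gets the version word, the header bytes are
 * copied starting at &cd_values[1]. Returns the number of entries used. */
static size_t example_fill_cd_values(unsigned *cd_values, size_t capacity,
                                     unsigned version_word,
                                     const void *hdr, size_t hdr_bytes)
{
    size_t n = example_cd_nelmts(hdr_bytes);

    assert(cd_values != NULL && hdr != NULL && n <= capacity);

    memset(cd_values, 0, n * sizeof cd_values[0]);  /* zero any padding */
    cd_values[0] = version_word;
    memcpy(&cd_values[1], hdr, hdr_bytes);
    return n;
}
```

A reader of such an array would first inspect cd_values[0] to decide whether the versions are compatible, then treat the bytes starting at &cd_values[1] as the stream header.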

This also means that H5Z-ZFP avoids the overhead of duplicating the ZFP stream header in each dataset chunk. For larger chunks, these savings are probably not very significant.