HDF5 files are packed efficiently and allow to speed up calculations when dealing large quantities of data.

HDF5 files can be quickly explored with the line commands h5ls and h5dump.

HDF5 files can be handled in many programming languages, but here I will focus on dealing with them with python.

Opening a hdf5 file in python

Python can handle HDF5 files using the h5py librery. Thus the first step is to import this library and then define an object that will contain the information from our file:

import h5py
thefile = 'myfile.hdf5'

Usually, HDF5 files contain a Header and the different types of data stored in different groups, which are similar to folders.

Writing an hdf5 file in python

We can create an output HDF5 file using the write clause:

hf = h5py.File(thefile, 'w')

We create a header for this file with some attributes containing general information

# Header
head = hf.create_dataset('header',(1,))	
head.attrs[u'volume']       = 1000.
head.attrs[u'units_volume'] = u'(Mpc/h)**3'

We generate some data to be stored in our file.

# Create the edges and mid points of bins
import numpy as np
step = 1
edges = np.array(np.arange(9.,16.,step))
mhist = edges[1:]-0.5*step

# Create some random integer positions between 0 and 10
pos = np.random.randint(10, size=(100, 3))

We store the data generated above in a group called ‘data’.

hfdat = hf.create_group('data')

We add some labels with the units of the generated data.

hfdat.create_dataset('mass',data=mhist)
hfdat['mass'].dims[0].label = 'Mass (Msun/h)'

hfdat.create_dataset('pos',data=pos)
hfdat['pos'].dims[0].label = 'x,y,z (Mpc/h)'

The last thing to do is to close the newly created hdf5 file:

# Close the output file
hf.close()

We can check the content of out file using h5ls. To see what is in the ‘data’ group we can do:

h5ls myfile.hdf5/data

Reading a hdf5 file in python

To access the information in our HDF5 file, we open it in a read only way:

f = h5py.File(thefile, 'r')

We can acces the information in the groups of our file by either defining new variables or including the name of the group as a path (we see this later).

header = f['Header']
data = f['Data']

Reading the information in the header

To list all of the attributes in the header:

print(list(header.attrs.items()))  

Once we know the names of the attributes, we can access their values. Let’s assume get the side of the cube from the volume:

boxsize = header.attrs['volume'] **(1/3)

Reading the data

The file we have created contains within the data group an array mass, and a matrix pos. We can store these as numpy arrays doing the following:

mass = data['mass'][:]
pos = data['pos'][:]

or

mass = f['data/mass'][:]
pos = f['data/pos'][:]

Note that mass will be a numpy array and pos a matrix.

Instead of getting all the information within mass, we could have only read the entries from 5 to 10, by doing:

mass5 = f['data/mass'][5:10]

We can also read the labels that accompany the information stored in the group data

print([dim.label for dim in data['pos'].dims])
print(data['pos'].dims[0].label)

Closing the hdf5 file in python

f.close()

Append values to a dataset

To append data we need to resize the datasets, and this can be only done if they don’t have a maximum shape. In the example above, the datasets have been created with a fixed shape and cannot be extended. In order to resize the datasets, these have to be declared with the relevant maxshape set to None (the rest of the code will be the same as above):

hfdat.create_dataset('mass',data=mhist,maxshape=(None))
hfdat.create_dataset('pos',data=pos,maxshape=(None,3))

Once the file has been created with datasets without a maximum shape, we can append data to them. Let’s start by creating some extra positions to append:

# Create 5 random integer positions between 0 and 100
nadd = 5
newpos = np.random.randint(100, size=(nadd, 3))

In order to append these data, we need to open the h5 file with the append option:

f = h5py.File(thefile, 'a')

We define the dataset and resize it. The extension to the dataset is filled with 0s.

dset = f['data/pos']
dset.resize(dset.shape[0]+nadd, axis=0) 

Now we fill the extension with newpos, note the syntax using comma [-nadd:,:]:

dset[-nadd:,:] = newpos