HDF5 files are packed efficiently and can speed up calculations when dealing with large quantities of data.
HDF5 files can be quickly explored with the command-line tools h5ls and h5dump.
HDF5 files can be handled in many programming languages, but here I will focus on dealing with them in Python.
Opening an hdf5 file in python
Python can handle HDF5 files using the h5py library. Thus, the first step is to import this library and then define the name of the file we will work with:
import h5py
thefile = 'myfile.hdf5'
Usually, HDF5 files contain a Header and the different types of data stored in different groups, which are similar to folders.
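As a quick illustration of this folder-like layout, the sketch below builds a small file (hypothetical name 'example.hdf5') with a header dataset and a 'data' group, and then lists its contents much like directory listings:

```python
import h5py
import numpy as np

# Create a file with a header dataset and a 'data' group
with h5py.File('example.hdf5', 'w') as f:
    f.create_dataset('header', (1,))
    grp = f.create_group('data')
    grp.create_dataset('values', data=np.arange(5))

# Groups behave like folders: list the entries at each level
with h5py.File('example.hdf5', 'r') as f:
    top = sorted(f.keys())            # top-level entries
    inner = sorted(f['data'].keys())  # entries inside the 'data' group

print(top)    # ['data', 'header']
print(inner)  # ['values']
```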
Generating an hdf5 file in python
We can create an output HDF5 file by opening it in write mode ('w'):
hf = h5py.File(thefile, 'w')
Note that the option ‘w’ will overwrite any existing file with the same name.
We generate some data to be stored in our file.
import numpy as np
# Create the edges and mid points of bins
step = 1
edges = np.arange(9., 16., step)
mhist = edges[1:] - 0.5*step
# Create some random integer positions between 0 and 10
pos = np.random.randint(10, size=(100, 3))
We store the data generated above in a group called ‘data’.
hfdat = hf.create_group('data')
We add some labels with the units of the generated data.
hfdat.create_dataset('mass',data=mhist)
hfdat['mass'].dims[0].label = 'Mass (Msun/h)'
hfdat.create_dataset('pos',data=pos)
hfdat['pos'].dims[0].label = 'x,y,z (Mpc/h)'
The last thing to do is to close the newly created hdf5 file:
# Close the output file
hf.close()
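Alternatively, a with statement closes the file automatically, even if an error occurs in between. A minimal sketch using the same data as above:

```python
import h5py
import numpy as np

# Same example data as above
step = 1
edges = np.arange(9., 16., step)
mhist = edges[1:] - 0.5*step
pos = np.random.randint(10, size=(100, 3))

# The file is closed automatically when the 'with' block ends
with h5py.File('myfile.hdf5', 'w') as hf:
    hfdat = hf.create_group('data')
    hfdat.create_dataset('mass', data=mhist)
    hfdat.create_dataset('pos', data=pos)
```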
Inspecting the content of an hdf5 file in the command line
We can check the content of our file using h5ls:
h5ls myfile.hdf5
To see what is in the ‘data’ group we can do:
h5ls myfile.hdf5/data
To see the values in one of the datasets use the ‘-d’ option:
h5ls -d myfile.hdf5/data/mass
The units of a dataset can be seen using the verbose ‘-v’ option:
h5ls -v myfile.hdf5/data/mass
The verbose option can also be useful to show the header contents:
h5ls -v myfile.hdf5/header
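If h5ls is not available, a similar listing can be produced from python itself with h5py's visititems. A self-contained sketch (it rebuilds a small file with the same layout as above):

```python
import h5py
import numpy as np

# Build a small example file with the same structure as in the text
with h5py.File('myfile.hdf5', 'w') as hf:
    grp = hf.create_group('data')
    grp.create_dataset('mass', data=np.arange(9.5, 15.5, 1.))
    grp.create_dataset('pos', data=np.random.randint(10, size=(100, 3)))

# Walk the file and record every group and dataset, h5ls-style
listing = []
def show(name, obj):
    kind = 'Group' if isinstance(obj, h5py.Group) else f'Dataset {obj.shape}'
    listing.append(f'{name}  {kind}')

with h5py.File('myfile.hdf5', 'r') as f:
    f.visititems(show)

print('\n'.join(listing))
```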
Adding a new dataset to an existing hdf5 file
We can add datasets to an already existing HDF5 file by opening it in append mode ('a'):
hf = h5py.File('myfile.hdf5', 'a')
We create a header for this file, with some attributes containing general information:
# Header
head = hf.create_dataset('header',(1,))
head.attrs[u'volume'] = 1000.
head.attrs[u'units_volume'] = u'(Mpc/h)**3'
We close the updated hdf5 file:
hf.close()
The header dataset will now be visible when you check the hdf5 file content in the command line:
h5ls myfile.hdf5
Reading an hdf5 file in python
To access the information in our HDF5 file, we open it in read-only mode:
f = h5py.File(thefile, 'r')
We can access the information in the groups of our file either by defining new variables or by including the name of the group as a path (we will see this later).
header = f['header']
data = f['data']
Reading the information in the header
To list all of the attributes in the header:
print(list(header.attrs.items()))
Once we know the names of the attributes, we can access their values. For example, let’s get the side of the cube from the volume:
boxsize = header.attrs['volume']**(1/3)
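With the header value used above, volume = 1000 (Mpc/h)**3, this should give a box side close to 10 Mpc/h. A small self-contained check (note that a fractional power is subject to floating-point rounding; np.cbrt is more robust for cube roots):

```python
import numpy as np

volume = 1000.            # header attribute from above, in (Mpc/h)**3
boxsize = volume**(1/3)   # fractional power; subject to float rounding
print(boxsize)            # close to 10, but possibly not exactly 10.0
print(np.cbrt(volume))    # dedicated cube root, more robust
```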
Reading the data
The file we have created contains, within the data group, a 1D array mass and a 2D array pos. We can store these as numpy arrays as follows:
mass = data['mass'][:]
pos = data['pos'][:]
or
mass = f['data/mass'][:]
pos = f['data/pos'][:]
Note that mass will be a 1D numpy array, while pos will be a 2D one.
Instead of getting all the information within mass, we could have read only the entries from index 5 up to (but not including) 10, by doing:
mass5 = f['data/mass'][5:10]
We can also read the labels that accompany the information stored in the group data:
print([dim.label for dim in data['pos'].dims])
print(data['pos'].dims[0].label)
Closing the hdf5 file in python
f.close()
Appending values to a dataset
To append data we need to resize the datasets, and this can only be done if they have been created as resizable. In the example above, the datasets have been created with a fixed shape and cannot be extended. In order to be able to resize the datasets, they have to be declared with the relevant maxshape dimension set to None (the rest of the code is the same as above):
hfdat.create_dataset('mass',data=mhist,maxshape=(None,))
hfdat.create_dataset('pos',data=pos,maxshape=(None,3))
Once the file has been created with datasets without a maximum shape, we can append data to them. Let’s start by creating some extra positions to append:
# Create 5 random integer positions between 0 and 100
nadd = 5
newpos = np.random.randint(100, size=(nadd, 3))
In order to append these data, we need to open the HDF5 file in append mode:
f = h5py.File(thefile, 'a')
We select the dataset and resize it. The extension to the dataset is initially filled with 0s.
dset = f['data/pos']
dset.resize(dset.shape[0]+nadd, axis=0)
Now we fill the extension with newpos; note the syntax with the comma, [-nadd:,:]:
dset[-nadd:,:] = newpos
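Putting the appending steps together, a self-contained sketch (hypothetical file name 'append_demo.hdf5', same layout as above, with maxshape set at creation):

```python
import h5py
import numpy as np

# Create a file whose 'pos' dataset can grow along axis 0
with h5py.File('append_demo.hdf5', 'w') as hf:
    pos = np.random.randint(10, size=(100, 3))
    hf.create_dataset('data/pos', data=pos, maxshape=(None, 3))

# Append 5 new rows: resize first, then fill the extension
nadd = 5
newpos = np.random.randint(100, size=(nadd, 3))
with h5py.File('append_demo.hdf5', 'a') as f:
    dset = f['data/pos']
    dset.resize(dset.shape[0] + nadd, axis=0)
    dset[-nadd:, :] = newpos

# Check the new shape and the appended values
with h5py.File('append_demo.hdf5', 'r') as f:
    shape = f['data/pos'].shape
    tail = f['data/pos'][-nadd:, :]

print(shape)  # (105, 3)
```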