HDF5 files store data compactly and can speed up calculations when dealing with large quantities of data.
HDF5 files can be quickly explored with the command-line tools h5ls and h5dump.
HDF5 files can be handled in many programming languages, but here I will focus on dealing with them with Python.
Opening a hdf5 file in python
Python can handle HDF5 files using the h5py library. Thus the first step is to import this library and then define a variable that holds the name of our file:
import h5py
thefile = 'myfile.hdf5'
Usually, HDF5 files contain a Header and the different types of data stored in different groups, which are similar to folders.
Writing an hdf5 file in python
We can create an output HDF5 file by opening it in write mode ('w'), which overwrites any existing file with the same name:
hf = h5py.File(thefile, 'w')
We create a header for this file with some attributes containing general information:
# Header
head = hf.create_dataset('header',(1,))
head.attrs[u'volume'] = 1000.
head.attrs[u'units_volume'] = u'(Mpc/h)**3'
We generate some data to be stored in our file.
# Create the edges and mid points of bins
import numpy as np
step = 1
edges = np.array(np.arange(9.,16.,step))
mhist = edges[1:]-0.5*step
# Create some random integer positions between 0 and 9 (the upper bound, 10, is excluded)
pos = np.random.randint(10, size=(100, 3))
We store the data generated above in a group called ‘data’.
hfdat = hf.create_group('data')
We add some labels with the units of the generated data. Labels are attached to a given dimension of a dataset, here the first one:
hfdat.create_dataset('mass',data=mhist)
hfdat['mass'].dims[0].label = 'Mass (Msun/h)'
hfdat.create_dataset('pos',data=pos)
hfdat['pos'].dims[0].label = 'x,y,z (Mpc/h)'
The last thing to do is to close the newly created hdf5 file:
# Close the output file
hf.close()
We can check the content of our file using h5ls, for instance to see what is in the ‘data’ group.
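The same kind of check can also be done from Python. The sketch below is not the output of h5ls: it writes a small file with the same layout as above (the temporary file name example_tree.hdf5 is hypothetical) and walks it with visititems, collecting the name, kind and shape of every object.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a temporary directory, so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), 'example_tree.hdf5')

# Write a file with the same layout as in the text: a header and a 'data' group
with h5py.File(path, 'w') as hf:
    hf.create_dataset('header', (1,))
    grp = hf.create_group('data')
    grp.create_dataset('mass', data=np.arange(9.5, 15.0, 1.0))
    grp.create_dataset('pos', data=np.zeros((100, 3), dtype=int))

found = []

def collect(name, obj):
    # visititems calls this once for every group or dataset below the root
    kind = 'group' if isinstance(obj, h5py.Group) else 'dataset'
    shape = getattr(obj, 'shape', None)  # groups have no shape
    found.append((name, kind, shape))

with h5py.File(path, 'r') as f:
    f.visititems(collect)

for entry in found:
    print(entry)
```

Returning nothing from the callback lets the walk continue through the whole file.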
Reading a hdf5 file in python
To access the information in our HDF5 file, we open it in read-only mode:
f = h5py.File(thefile, 'r')
We can access the information in the groups of our file by either defining new variables or including the name of the group as a path (we will see this later).
header = f['header']
data = f['data']
Reading the information in the header
All of the attributes in the header can be listed through header.attrs, which behaves like a dictionary.
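A minimal sketch of such a listing follows. It assumes a header like the one written earlier, recreated here in a temporary file (the name example_header.hdf5 is hypothetical) so that the block runs on its own.

```python
import os
import tempfile

import h5py

# Hypothetical temporary file holding a header like the one in the text
path = os.path.join(tempfile.mkdtemp(), 'example_header.hdf5')

with h5py.File(path, 'w') as hf:
    head = hf.create_dataset('header', (1,))
    head.attrs['volume'] = 1000.
    head.attrs['units_volume'] = '(Mpc/h)**3'

with h5py.File(path, 'r') as f:
    header = f['header']
    # attrs supports keys(), values() and items(), like a dictionary
    names = list(header.attrs.keys())
    # Copy the values out while the file is still open
    attributes = {key: header.attrs[key] for key in names}

for key, value in attributes.items():
    print(key, '=', value)
```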
Once we know the names of the attributes, we can access their values. Let’s get the side of the cube from the volume:
boxsize = header.attrs['volume'] **(1/3)
Reading the data
The file we have created contains within the data group an array mass, and a matrix pos. We can store these as numpy arrays doing the following:
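mass = data['mass'][:]
pos = data['pos'][:]
# Equivalently, including the name of the group as a path
mass = f['data/mass'][:]
pos = f['data/pos'][:]
Note that both are returned as numpy arrays: mass is one-dimensional and pos is two-dimensional, with one row per position.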
Instead of reading everything in mass, we could read only part of it, for example the entries with indices 5 to 9 (the end of a slice is excluded):
mass5 = f['data/mass'][5:10]
We can also read the labels that accompany the information stored in the group data, either for all dimensions at once or for a single one:
print([dim.label for dim in data['pos'].dims])
print(data['pos'].dims[0].label)
Closing the hdf5 file in python
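Once we are done reading, the file should be closed, just as after writing. A minimal sketch, using a hypothetical temporary file example_close.hdf5: we can either call close() on the file object, or open the file in a with block, which closes it automatically on exit.

```python
import os
import tempfile

import h5py

# Hypothetical temporary file with a single group, just to have something to open
path = os.path.join(tempfile.mkdtemp(), 'example_close.hdf5')
with h5py.File(path, 'w') as hf:
    hf.create_group('data')

# Explicit close, mirroring hf.close() in the writing section
f = h5py.File(path, 'r')
groups = list(f.keys())
f.close()
closed_after_close = not f  # a closed h5py.File evaluates to False

# Alternative: the with statement closes the file when the block ends
with h5py.File(path, 'r') as f2:
    groups_again = list(f2.keys())
closed_after_with = not f2
```

Any data still needed after closing has to be copied into numpy arrays or Python objects first, as done with [:] above.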
Append values to a dataset
To append data we need to resize the datasets, and this can only be done up to their maximum shape. In the example above, the datasets have been created with a fixed shape and cannot be extended. In order to resize the datasets, these have to be declared with the relevant entry of maxshape set to None, for example by passing maxshape=(None,) to create_dataset for a one-dimensional dataset (the rest of the code will be the same as above).
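As a sketch of such a declaration, the block below recreates the two datasets with an unlimited first axis. The temporary file name example_resizable.hdf5 is hypothetical, and keeping the three columns of pos fixed while leaving only the number of rows unlimited is my assumption.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical temporary file, so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), 'example_resizable.hdf5')

# Same data as in the writing section
step = 1
edges = np.arange(9., 16., step)
mhist = edges[1:] - 0.5*step
pos = np.random.randint(10, size=(100, 3))

with h5py.File(path, 'w') as hf:
    hfdat = hf.create_group('data')
    # None means no limit along that axis; the 3 columns of pos stay fixed
    hfdat.create_dataset('mass', data=mhist, maxshape=(None,))
    hfdat.create_dataset('pos', data=pos, maxshape=(None, 3))

with h5py.File(path, 'r') as f:
    mass_maxshape = f['data/mass'].maxshape
    pos_maxshape = f['data/pos'].maxshape
    pos_chunks = f['data/pos'].chunks  # resizable datasets are stored in chunks

print(mass_maxshape, pos_maxshape, pos_chunks)
```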
Once the file has been created with datasets without a maximum shape, we can append data to them. Let’s start by creating some extra positions to append:
# Create 5 random integer positions between 0 and 99 (the upper bound, 100, is excluded)
nadd = 5
newpos = np.random.randint(100, size=(nadd, 3))
In order to append these data, we need to open the h5 file with the append option:
f = h5py.File(thefile, 'a')
We select the dataset and resize it along its first axis. The extension to the dataset is filled with 0s.
dset = f['data/pos']
dset.resize(dset.shape[0]+nadd, axis=0)
Now we fill the extension with newpos. Note the comma in the index, which selects the last nadd rows and all the columns. Finally, we close the file:
dset[-nadd:,:] = newpos
f.close()
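To check the whole append procedure end to end, here is a self-contained sketch. It uses a hypothetical temporary file example_append.hdf5 and a placeholder dataset of ones instead of the positions above, so that the zero-filled extension is easy to tell apart from the original rows.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical temporary file, so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), 'example_append.hdf5')

# The dataset must be resizable, hence maxshape=(None, 3)
with h5py.File(path, 'w') as hf:
    hf.create_dataset('data/pos', data=np.ones((100, 3), dtype=int),
                      maxshape=(None, 3))

nadd = 5
newpos = np.random.randint(100, size=(nadd, 3))

with h5py.File(path, 'a') as f:
    dset = f['data/pos']
    dset.resize(dset.shape[0] + nadd, axis=0)
    padding = dset[-nadd:, :]  # the new rows, read before they are filled
    dset[-nadd:, :] = newpos

# Reopen the file to confirm that the new rows were stored
with h5py.File(path, 'r') as f:
    final_shape = f['data/pos'].shape
    tail = f['data/pos'][-nadd:, :]

print(final_shape)
```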