HDF5 files are packed efficiently and allow one to speed up calculations when dealing with large quantities of data.
Opening an HDF5 file in Python
import h5py

thefile = 'myfile.hdf5'
Generating an HDF5 file in Python
We can create an output HDF5 file by opening it in write mode:
hf = h5py.File(thefile, 'w')
Note that the option ‘w’ will overwrite any existing file with the same name.
We generate some data to be stored in our file.
# Create the edges and mid points of bins
import numpy as np

step = 1
edges = np.arange(9., 16., step)
mhist = edges[1:] - 0.5*step

# Create some random integer positions between 0 and 9
pos = np.random.randint(10, size=(100, 3))
We store the data generated above in a group called ‘data’.
hfdat = hf.create_group('data')
We add some labels with the units of the generated data.
hfdat.create_dataset('mass', data=mhist)
hfdat['mass'].dims[0].label = 'Mass (Msun/h)'
hfdat.create_dataset('pos', data=pos)
hfdat['pos'].dims[0].label = 'x,y,z (Mpc/h)'
The last thing to do is to close the newly created HDF5 file:
# Close the output file
hf.close()
Inspecting the content of an HDF5 file on the command line
We can check the content of our file using h5ls:
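For instance, assuming the file created above:

```shell
# List the groups and datasets at the root of the file
h5ls myfile.hdf5
```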
To see what is in the ‘data’ group we can do:
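```shell
# List the datasets inside the 'data' group
h5ls myfile.hdf5/data
```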
To see the values in one of the datasets use the ‘-d’ option:
h5ls -d myfile.hdf5/data/mass
Adding a new dataset to an existing hdf5 file
We can add datasets to an already existing HDF5 file by opening it in append mode:
hf = h5py.File('myfile.hdf5', 'a')
We create a header dataset for this file, with some attributes containing general information:
# Header
head = hf.create_dataset('header', (1,))
head.attrs[u'volume'] = 1000.
head.attrs[u'units_volume'] = u'(Mpc/h)**3'
We close the updated HDF5 file:
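Putting the append step together, a self-contained sketch (the ‘a’ mode creates the file if it does not yet exist; the guard makes the sketch safe to re-run):

```python
import h5py

# Open in append mode, add the header dataset with its attributes, then close
hf = h5py.File('myfile.hdf5', 'a')
if 'header' not in hf:
    hf.create_dataset('header', (1,))
head = hf['header']
head.attrs['volume'] = 1000.
head.attrs['units_volume'] = '(Mpc/h)**3'
hf.close()
```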
The header dataset will now be visible when you check the file content with h5ls on the command line.
Reading an HDF5 file in Python
To access the information in our HDF5 file, we open it in read-only mode:
f = h5py.File(thefile, 'r')
We can access the information in the groups of our file either by defining new variables or by including the name of the group as a path (we will see this later).
header = f['header']
data = f['data']
Reading the information in the header
To list all of the attributes in the header:
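One way to list them, recreating a small file first so the sketch runs on its own (the file name ‘attrs_demo.hdf5’ is hypothetical):

```python
import h5py

# Set up a file with a header dataset carrying attributes
with h5py.File('attrs_demo.hdf5', 'w') as hf:
    head = hf.create_dataset('header', (1,))
    head.attrs['volume'] = 1000.
    head.attrs['units_volume'] = '(Mpc/h)**3'

f = h5py.File('attrs_demo.hdf5', 'r')
header = f['header']
print(list(header.attrs.keys()))   # names of all the attributes
for name, value in header.attrs.items():
    print(name, value)             # name and value of each attribute
f.close()
```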
Once we know the names of the attributes, we can access their values. For example, let’s get the side of the cube from the volume:
boxsize = header.attrs['volume']**(1/3)
Reading the data
The file we have created contains, within the ‘data’ group, an array ‘mass’ and a matrix ‘pos’. We can store these as numpy arrays by doing the following:
mass = data['mass'][:]
pos = data['pos'][:]

Alternatively, we can include the group name as part of the path:

mass = f['data/mass'][:]
pos = f['data/pos'][:]
Note that mass will be a 1D numpy array and pos a 2D array of shape (100, 3).
Instead of reading all the information within mass, we could have read only entries 5 to 9 with a slice:
mass5 = f['data/mass'][5:10]
We can also read the labels that accompany the information stored in the group ‘data’:
print([dim.label for dim in data['pos'].dims])
print(data['pos'].dims[0].label)
Closing the HDF5 file in Python
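Once done reading, release the file handle with close(). A minimal sketch (the file name ‘close_demo.hdf5’ is hypothetical; a with-block closes the file automatically instead):

```python
import h5py

# Create a small file so the read example is runnable on its own
with h5py.File('close_demo.hdf5', 'w') as hf:
    hf.create_dataset('x', data=[1, 2, 3])

f = h5py.File('close_demo.hdf5', 'r')
x = f['x'][:]   # data read before closing remains available afterwards
f.close()       # release the file handle when done
```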
Appending values to a dataset
To append data we need to resize the datasets, and this can only be done if the relevant dimension does not have a fixed maximum size. In the example above, the datasets were created with a fixed shape and cannot be extended. In order to make the datasets resizable, they have to be declared with the relevant maxshape entry set to None (the rest of the code stays the same as above):
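For example, recreating the datasets from above with resizable first axes (a sketch; note that ‘w’ mode will overwrite myfile.hdf5):

```python
import h5py
import numpy as np

step = 1
edges = np.arange(9., 16., step)
mhist = edges[1:] - 0.5*step
pos = np.random.randint(10, size=(100, 3))

with h5py.File('myfile.hdf5', 'w') as hf:
    hfdat = hf.create_group('data')
    # maxshape=None along an axis makes that axis resizable (unlimited)
    hfdat.create_dataset('mass', data=mhist, maxshape=(None,))
    hfdat.create_dataset('pos', data=pos, maxshape=(None, 3))
```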
Once the file has been created with datasets without a maximum shape, we can append data to them. Let’s start by creating some extra positions to append:
# Create 5 random integer positions between 0 and 99
nadd = 5
newpos = np.random.randint(100, size=(nadd, 3))
In order to append these data, we need to open the HDF5 file in append mode:
f = h5py.File(thefile, 'a')
We define the dataset and resize it. The extension to the dataset is filled with 0s.
dset = f['data/pos']
dset.resize(dset.shape[0] + nadd, axis=0)
Now we fill the extension with newpos; note the indexing syntax using a comma:
dset[-nadd:,:] = newpos
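Putting the appending steps together, a self-contained sketch (it recreates myfile.hdf5 with a resizable ‘pos’ dataset before extending it):

```python
import h5py
import numpy as np

# Re-create the file with a resizable 'pos' dataset
pos = np.random.randint(10, size=(100, 3))
with h5py.File('myfile.hdf5', 'w') as hf:
    hfdat = hf.create_group('data')
    hfdat.create_dataset('pos', data=pos, maxshape=(None, 3))

# New rows to append
nadd = 5
newpos = np.random.randint(100, size=(nadd, 3))

f = h5py.File('myfile.hdf5', 'a')
dset = f['data/pos']
dset.resize(dset.shape[0] + nadd, axis=0)  # grow along the first axis
dset[-nadd:, :] = newpos                   # fill the new rows
f.close()
```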