- A pandas DataFrame is a two-dimensional data structure that can hold heterogeneous Python objects.
- By default, the pandas library uses NumPy's ndarray as the underlying storage for both the pandas.DataFrame and pandas.Series classes.
- HDF5 is a standard for storing multi-dimensional data hierarchically. An HDF5 dataset can have up to 32 dimensions.
- An HDF5 file can hold huge volumes of data in datasets organized into groups, with the root group at the top, much like the UNIX file system.
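- HDF5 itself is independent of NumPy, but Python libraries such as PyTables and h5py read and write HDF5 datasets as NumPy ndarrays, which makes the format a natural fit for pandas.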
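The first two points can be illustrated with a minimal sketch. The column names and values below are hypothetical sample data, not part of the original example. It shows a DataFrame whose columns hold different types and how its data is exposed as NumPy ndarrays.

```python
import numpy as np
import pandas as pds

# A DataFrame whose columns hold different types (hypothetical sample data)
df = pds.DataFrame({
    "name": ["a", "b", "c"],
    "score": [0.7, 0.5, 0.8],
    "count": [1, 2, 3],
})

# Each column is a Series with its own dtype
print(df.dtypes)

# A single numeric column converts to a NumPy ndarray of that dtype
scores = df["score"].to_numpy()
print(type(scores), scores.dtype)

# Converting the whole mixed-type frame falls back to the generic object dtype
mixed = df.to_numpy()
print(mixed.dtype, mixed.shape)
```

A frame with only numeric columns would convert to a numeric ndarray instead of an object array.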
Exporting a pandas DataFrame to an HDF5 file:
- An HDF5 file is organized into groups, starting from / (the root).
- The to_hdf() method exports a pandas DataFrame object to an HDF5 file.
- The key parameter specifies the HDF5 group under which the DataFrame is stored.
- The to_hdf() method internally uses the PyTables library to store the DataFrame in an HDF5 file.
- The read_hdf() function reads a pandas object, such as a DataFrame or Series, from an HDF5 file.
# Example Python program that writes a pandas DataFrame
# into an HDF5 file and reads it back
import pandas as pds

# Create a DataFrame for a 3x3 matrix
data = [(0.7, 0.6, 0.4),
        (0.5, 0.6, 0.5),
        (0.8, 0.5, 0.4)]
df = pds.DataFrame(data)

# Export the pandas DataFrame into HDF5 under the group /data/d1
h5File = "fromdf.h5"
df.to_hdf(h5File, key="/data/d1")

# Use pandas again to read data from the HDF5 file into a pandas DataFrame
df1 = pds.read_hdf(h5File, key="/data/d1")
print("DataFrame read from the HDF5 file through pandas:")
print(df1)
Output:
DataFrame read from the HDF5 file through pandas:
     0    1    2
0  0.7  0.6  0.4
1  0.5  0.6  0.5
2  0.8  0.5  0.4
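The group hierarchy described earlier can also be sketched with more than one DataFrame in the same file. This is an illustrative example, not part of the original program: the frames, the file name groups_demo.h5, and the key /data/d2 are assumptions, and it assumes the PyTables package (tables) is installed. It uses pandas.HDFStore to list the keys stored in the file.

```python
import os
import tempfile

import pandas as pds

# Hypothetical sample frames and file name, for illustration only
df_a = pds.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
df_b = pds.DataFrame({"z": [5, 6, 7]})

path = os.path.join(tempfile.mkdtemp(), "groups_demo.h5")

# Each key names the HDF5 group that holds one pandas object
df_a.to_hdf(path, key="/data/d1", mode="w")  # mode="w" creates or overwrites the file
df_b.to_hdf(path, key="/data/d2", mode="a")  # mode="a" adds to the existing file

# HDFStore lists the stored keys and reads objects back by key
with pds.HDFStore(path, mode="r") as store:
    keys = sorted(store.keys())
    restored = store["/data/d1"]

print(keys)
print(restored)
```

Writing with mode="a" keeps the objects already in the file, so both DataFrames end up under the /data group.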