Tags: hdf5, python, python-3.x

Load group from hdf5

Published on 2020-03-31 22:58:06

I have an hdf5 file that contains datasets inside groups. Example:

group1/dataset1
group1/dataset2
group1/datasetX


group2/dataset1
group2/dataset2
group2/datasetX

I'm able to read each dataset independently. This is how I read a dataset from a .hdf5 file:

import h5py

def hdf5_load_dataset(hdf5_filename, dsetname):
    """Open the file and read one dataset fully into memory."""
    with h5py.File(hdf5_filename, 'r') as f:
        dset = f[dsetname]
        return dset[...]

# pseudo-code of how I call the hdf5_load_dataset() function
groups = {'group1': ['dataset1', 'dataset2', ...], 'group2': ['dataset1', ...], ...}

for group in groups:
    for dataset in groups[group]:
        dset_value = hdf5_load_dataset(path_hdf5_file, f'{group}/{dataset}')
        # do stuff

I would like to know if it's possible to load all the datasets of group1 into memory, then those of group2, and so on, as a dictionary or similar, with a single file read per group. My script takes quite some time (~4 min) to read ~200k datasets; there are 2k groups with 100 datasets each. Loading one group into memory at a time would not overload the memory, and I would gain in speed.

This is pseudo-code of what I'm looking for:

for group in groups:
    dset_group_as_dict = hdf5_load_group(path_hdf5_file, f'{group}')

    for dataset in dset_group_as_dict:
        # do stuff
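
Something like the following sketch is what I have in mind for hdf5_load_group (the name is mine; it assumes every member of the group is a dataset, not a nested group):

import h5py

def hdf5_load_group(hdf5_filename, groupname):
    """Read every dataset of one group into memory with a single file open."""
    with h5py.File(hdf5_filename, 'r') as f:
        # Group.items() yields (name, dataset) pairs; dset[()] reads the full array
        return {name: dset[()] for name, dset in f[groupname].items()}

This would open the file once per group (2k opens) instead of once per dataset (200k opens); keeping the file open across all groups would reduce the overhead further.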

EDIT:

Inside each .csv file:

time, amplitude
1.000e-08, -1.432e-07
1.001e-08, 7.992e-07
1.003e-08, -1.838e-05
1.003e-08, 2.521e-05

For each .csv file in each folder I have a dataset for the time and one for the amplitude. The structure of the HDF5 file is like this:

XY_1/impact_X/time
XY_1/impact_Y/amplitude

where

time = np.array([1.000e-08, 1.001e-08, 1.003e-08, ...])  # 500 items
amplitude = np.array([-1.432e-07, 7.992e-07, -1.838e-05, ...])  # 500 items

XY_1 is a position in space.

impact_X means that X was impacted at position XY_1, so the X amplitude has changed.

So, XY_1 must be in a different group from XY_2, and likewise impact_X, impact_Y, etc., since they represent data for a particular XY position.
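
For reference, a minimal sketch of how such a layout can be written with h5py (the file name results.hdf5 is a placeholder of mine, and here both arrays go under the same impact group for brevity):

import h5py
import numpy as np

# arrays parsed from one .csv file (truncated here)
time = np.array([1.000e-08, 1.001e-08, 1.003e-08])
amplitude = np.array([-1.432e-07, 7.992e-07, -1.838e-05])

with h5py.File('results.hdf5', 'a') as f:
    grp = f.require_group('XY_1/impact_X')  # creates intermediate groups as needed
    grp.create_dataset('time', data=time)
    grp.create_dataset('amplitude', data=amplitude)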

I need to create plots from each (time, amplitude) pair, or from only one of them (configurable). I also need to compare the amplitudes against a "golden" array to see the differences and calculate other quantities. To perform that calculation I will read all the datasets, do the computation and save the result.
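
As a sketch of that step (golden_amplitude, the function name and the tolerance are placeholders of mine):

import numpy as np
import matplotlib.pyplot as plt

def plot_and_compare(time, amplitude, golden_amplitude, rtol=1e-3):
    """Plot one (time, amplitude) trace against the golden one and return the difference."""
    plt.plot(time, amplitude, label='measured')
    plt.plot(time, golden_amplitude, label='golden')
    plt.xlabel('time')
    plt.ylabel('amplitude')
    plt.legend()
    plt.show()
    print('matches golden:', np.allclose(amplitude, golden_amplitude, rtol=rtol))
    return amplitude - golden_amplitude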

I have more than 200k .csv files for each test case, and more than 5M in total. Doing 5M reads from disk would take quite some time. For the 200k files, exporting all the .csv data to a single JSON file brings execution down to ~40 s, while reading the .csv files directly takes ~4 min. I can no longer use a single JSON file because of memory issues when loading it, which is why I chose HDF5 as an alternative.

EDIT 2:

How I read the csv file:

import csv

def read_csv_return_list_of_rows(csv_file, _delimiter):
    """Read one .csv file and return all of its rows as a list of lists."""
    csv_file_list = list()
    with open(csv_file, 'r') as f_read:
        csv_reader = csv.reader(f_read, delimiter=_delimiter)
        for row in csv_reader:
            csv_file_list.append(row)
    return csv_file_list
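
Since each .csv holds two numeric columns with a header, it could also be parsed straight into the two arrays that go into the HDF5 file; a sketch (read_csv_as_arrays is a name of mine):

import numpy as np

def read_csv_as_arrays(csv_file):
    """Parse one time/amplitude .csv into two NumPy arrays, skipping the header row."""
    data = np.genfromtxt(csv_file, delimiter=',', skip_header=1)
    return data[:, 0], data[:, 1]  # time, amplitude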
Questioner: Raphael
titusjan 2020-01-31 19:43

No, there is no single function that reads multiple groups or datasets at once. You have to build it yourself from lower-level functions that read one group or dataset at a time.

And can you give us some further context? What kind of data is it and how do you want to process it? (Do you want to compute statistics? Make plots? Etcetera.) What are you ultimately trying to achieve? This may help us avoid the classical XY problem.

In your earlier question you said you converted a lot of small CSV files into one big HDF file. Can you tell us why? What is wrong with having many small CSV files?

In my experience, HDF files with a huge number of groups and datasets are fairly slow, as you are experiencing now. It is better to have relatively few, but larger, datasets. Is it possible for you to somehow merge multiple datasets into one? If not, HDF may not be the best solution for your problem.
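
For example (a sketch with placeholder names), all amplitude traces of one test case could be stacked into a single 2-D dataset, with the trace names stored alongside, so everything is read back with a couple of calls instead of 200k:

import h5py
import numpy as np

def write_merged(hdf5_filename, names, amplitudes):
    """Store all 500-sample traces as one (n_traces, 500) dataset plus a name index."""
    with h5py.File(hdf5_filename, 'w') as f:
        f.create_dataset('amplitude', data=np.vstack(amplitudes))
        f.create_dataset('trace_name', data=names, dtype=h5py.string_dtype())

def read_merged(hdf5_filename):
    """Read the name index and the full amplitude matrix in two calls."""
    with h5py.File(hdf5_filename, 'r') as f:
        return f['trace_name'][()], f['amplitude'][()]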