Tags: hdf5, python, python-3.x

Load group from hdf5

Published on 2020-03-31 22:58:06

I have an hdf5 file that contains datasets inside groups. Example:

group1/dataset1
group1/dataset2
group1/datasetX


group2/dataset1
group2/dataset2
group2/datasetX

I'm able to read each dataset independently. This is how I read a dataset from a .hdf5 file:

import h5py

def hdf5_load_dataset(hdf5_filename, dsetname):
    """Open the file and read one dataset fully into memory."""
    with h5py.File(hdf5_filename, 'r') as f:
        dset = f[dsetname]
        return dset[...]

# pseudo-code of how I call the hdf5_load_dataset() function
groups = {'group1': ['dataset1', 'dataset2', ...], 'group2': ['dataset1', ...], ...}

for group in groups:
    for dataset in groups[group]:
        dset_value = hdf5_load_dataset(path_hdf5_file, f'{group}/{dataset}')
        # do stuff

I would like to know if it's possible to load all the datasets of group1 into memory, then those of group2, and so on, as a dictionary or similar, with a single file read per group. My script takes quite some time (~4 min) to read ~200k datasets; there are 2k groups with 100 datasets each. Loading one group into memory at a time would not overload the memory, and I would gain in speed.

This is pseudo-code of what I'm looking for:

for group in groups:
    dset_group_as_dict = hdf5_load_group(path_hdf5_file, f'{group}')

    for dataset in dset_group_as_dict:
        # do stuff
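
Something like the following sketch is what I have in mind for hdf5_load_group (the name is mine; it assumes every member of the group is a dataset, not a nested group):

import h5py

def hdf5_load_group(hdf5_filename, groupname):
    """Read every dataset of one group into memory with a single file open."""
    with h5py.File(hdf5_filename, 'r') as f:
        # Group.items() yields (name, dataset) pairs; dset[()] reads the full array
        return {name: dset[()] for name, dset in f[groupname].items()}

This would open the file once per group (2k opens) instead of once per dataset (200k opens); keeping the file open across all groups would reduce the overhead further.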

EDIT:

Inside each .csv file:

time, amplitude
1.000e-08, -1.432e-07
1.001e-08, 7.992e-07
1.003e-08, -1.838e-05
1.003e-08, 2.521e-05

For each .csv file in each folder I have a dataset for the time and one for the amplitude. The structure of the HDF5 file is like this:

XY_1/impact_X/time
XY_1/impact_Y/amplitude

where

time = np.array([1.000e-08, 1.001e-08, 1.003e-08, ...])  # 500 items
amplitude = np.array([-1.432e-07, 7.992e-07, -1.838e-05, ...])  # 500 items

XY_1 is a position in space.

impact_X means that X was impacted at position XY_1, so the X amplitude has changed.

So, XY_1 must be in a different group from XY_2, and likewise impact_X, impact_Y, etc., since they represent data for a particular XY position.
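
For reference, a minimal sketch of how such a layout can be written with h5py (the file name results.hdf5 is a placeholder of mine, and here both arrays go under the same impact group for brevity):

import h5py
import numpy as np

# arrays parsed from one .csv file (truncated here)
time = np.array([1.000e-08, 1.001e-08, 1.003e-08])
amplitude = np.array([-1.432e-07, 7.992e-07, -1.838e-05])

with h5py.File('results.hdf5', 'a') as f:
    grp = f.require_group('XY_1/impact_X')  # creates intermediate groups as needed
    grp.create_dataset('time', data=time)
    grp.create_dataset('amplitude', data=amplitude)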

I need to create plots from each (time, amplitude) pair, or from only one of them (configurable). I also need to compare the amplitudes against a "golden" array to see the differences and calculate other quantities. To perform that calculation I will read all the datasets, do the computation and save the result.
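
As a sketch of that step (golden_amplitude, the function name and the tolerance are placeholders of mine):

import numpy as np
import matplotlib.pyplot as plt

def plot_and_compare(time, amplitude, golden_amplitude, rtol=1e-3):
    """Plot one (time, amplitude) trace against the golden one and return the difference."""
    plt.plot(time, amplitude, label='measured')
    plt.plot(time, golden_amplitude, label='golden')
    plt.xlabel('time')
    plt.ylabel('amplitude')
    plt.legend()
    plt.show()
    print('matches golden:', np.allclose(amplitude, golden_amplitude, rtol=rtol))
    return amplitude - golden_amplitude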

I have more than 200k .csv files for each test case, and more than 5M in total. Doing 5M reads from disk would take quite some time. For the 200k files, exporting all the .csv data to a single JSON file brings execution down to ~40 s, while reading the .csv files directly takes ~4 min. I can no longer use a single JSON file because of memory issues when loading it, which is why I chose HDF5 as an alternative.

EDIT 2:

How I read the csv file:

import csv

def read_csv_return_list_of_rows(csv_file, _delimiter):
    """Read one .csv file and return all of its rows as a list of lists."""
    csv_file_list = list()
    with open(csv_file, 'r') as f_read:
        csv_reader = csv.reader(f_read, delimiter=_delimiter)
        for row in csv_reader:
            csv_file_list.append(row)
    return csv_file_list
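
Since each .csv holds two numeric columns with a header, it could also be parsed straight into the two arrays that go into the HDF5 file; a sketch (read_csv_as_arrays is a name of mine):

import numpy as np

def read_csv_as_arrays(csv_file):
    """Parse one time/amplitude .csv into two NumPy arrays, skipping the header row."""
    data = np.genfromtxt(csv_file, delimiter=',', skip_header=1)
    return data[:, 0], data[:, 1]  # time, amplitude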
Questioner: Raphael
titusjan 2020-01-31 19:43

No, there is no single function that reads multiple groups or datasets at once. You have to build it yourself from lower-level functions that read one group or dataset at a time.

And can you give us some further context? What kind of data is it and how do you want to process it? (Do you want to compute statistics? Make plots? Etcetera.) What are you ultimately trying to achieve? This may help us avoid the classical XY problem.

In your earlier question you said you converted a lot of small CSV files into one big HDF file. Can you tell us why? What is wrong with having many small CSV files?

In my experience, HDF files with a huge number of groups and datasets are fairly slow, as you are experiencing now. It is better to have relatively few, but larger, datasets. Is it possible for you to somehow merge multiple datasets into one? If not, HDF may not be the best solution for your problem.
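
For example (a sketch with placeholder names), all amplitude traces of one test case could be stacked into a single 2-D dataset, with the trace names stored alongside, so everything is read back with a couple of calls instead of 200k:

import h5py
import numpy as np

def write_merged(hdf5_filename, names, amplitudes):
    """Store all 500-sample traces as one (n_traces, 500) dataset plus a name index."""
    with h5py.File(hdf5_filename, 'w') as f:
        f.create_dataset('amplitude', data=np.vstack(amplitudes))
        f.create_dataset('trace_name', data=names, dtype=h5py.string_dtype())

def read_merged(hdf5_filename):
    """Read the name index and the full amplitude matrix in two calls."""
    with h5py.File(hdf5_filename, 'r') as f:
        return f['trace_name'][()], f['amplitude'][()]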