This article is reproduced from serverfault.com.

Randomly selected files take longer to load with numpy.load than sequential ones

Published on 2020-11-26 21:11:03

Context

While training a neural network, I realized the time spent per batch increased when I increased the size of my dataset (without changing the batch size). The important part: I need to fetch 20 .npy files per data point, and this number doesn't depend on the dataset size.

Problem

Training goes from 2s/iteration to 10s/iteration, and there is no apparent reason why it should take longer. However, I managed to track down the bottleneck: it seems to be the loading of the .npy files.


To reproduce this behavior, here's a small script you can run to generate 10,000 dummy .npy files:

import os
import random

import numpy as np

def path(i):
    # Build '../datasets/test/<i>' in an OS-independent way
    return os.sep.join(('../datasets/test', str(i)))

def create_dummy_files(N=10000):
    # Save N arrays of shape (100, 100) under random 128-bit names
    for i in range(N):
        x = np.random.random((100, 100))
        np.save(path(random.getrandbits(128)), x)

Then you can run the following two scripts and compare them yourself:

  • The first script, where 20 .npy files are randomly selected and loaded:

    L = os.listdir('../datasets/test')
    S = random.sample(L, 20)
    for s in S:
        np.load(path(s)) # <- timed this
    
  • The second version, where 20 'sequential' .npy files are selected and loaded:

    L = os.listdir('../datasets/test')
    i = 100
    S = L[i: i + 20]
    for s in S:
        np.load(path(s)) # <- timed this
    

I ran both scripts 100 times each (in the second script I used the iteration count as the value of i so the same files would not be loaded twice). I wrapped the np.load(path(s)) line with time.time() calls, so I'm timing only the loading, not the sampling. Here are the results:
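For reference, the comparison described above can be sketched as one self-contained harness. Assumptions: a temporary directory stands in for '../datasets/test', the file and run counts are reduced so it finishes quickly, and time.perf_counter is used in place of time.time for better resolution:

```python
import os
import random
import tempfile
import time

import numpy as np

# A temporary directory stands in for '../datasets/test'; counts are
# reduced (200 files, 10 runs per mode) to keep the sketch quick.
data_dir = tempfile.mkdtemp()
for _ in range(200):
    arr = np.random.random((100, 100))
    np.save(os.path.join(data_dir, str(random.getrandbits(128))), arr)

def time_loads(names):
    """Time only the np.load calls, not the selection of names."""
    t0 = time.perf_counter()
    for name in names:
        np.load(os.path.join(data_dir, name))
    return time.perf_counter() - t0

L = os.listdir(data_dir)
random_times = [time_loads(random.sample(L, 20)) for _ in range(10)]
# Overlapping windows, exactly as in the original second script
sequential_times = [time_loads(L[i:i + 20]) for i in range(10)]

print(f'random mean:     {sum(random_times) / len(random_times):.4f}s')
print(f'sequential mean: {sum(sequential_times) / len(sequential_times):.4f}s')
```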

  • Random loads (times roughly stay between 0.1s and 0.4s, average is 0.25s):

    [plot of the 100 load times omitted]

  • Non-random loads (times roughly stay between 0.010s and 0.014s, average is 0.01s):

    [plot of the 100 load times omitted]


I'm assuming the variation within each run is related to other CPU activity while the scripts run, but that doesn't explain the gap between the two. Why are these two results so different? Does it have something to do with the way files are indexed?

Edit: I printed S in the random-sample script, copied the list of 20 filenames, then ran the script again with S hard-coded as that list literal. The time it took was comparable to the 'sequential' script. This means it's not related to the files being non-sequential in the filesystem or anything like that. It seems the random sampling gets counted in the timer, yet the timing is defined as:

t = time.time()
np.load(path(s))
print(time.time() - t)

I also tried wrapping np.load (and only np.load) with cProfile: same result.
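The check from the edit can be sketched like this: the 20 filenames are fixed once, outside the timed section (the equivalent of hard-coding the printed list), and only np.load is timed on each repetition. A temporary directory and reduced counts again stand in for the real dataset:

```python
import os
import random
import tempfile
import time

import numpy as np

# Temporary directory and reduced counts (50 files, 10 repetitions) are
# assumptions standing in for the real dataset.
data_dir = tempfile.mkdtemp()
for _ in range(50):
    np.save(os.path.join(data_dir, str(random.getrandbits(128))),
            np.random.random((100, 100)))

# Sample the 20 names once, OUTSIDE the timer -- the equivalent of
# pasting the printed list back in as a literal.
S = random.sample(os.listdir(data_dir), 20)

times = []
for _ in range(10):
    t = time.perf_counter()
    for s in S:
        np.load(os.path.join(data_dir, s))
    times.append(time.perf_counter() - t)

print(f'mean load time for a fixed random selection: {sum(times) / len(times):.4f}s')
```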

Questioner: Ivan
Ivan 2020-11-29 18:54:35

I did say:

I tested both scripts and ran them 100 times each (in the 2nd script I used the iteration count as the value for i so the same files are not loaded twice)

But as tevemadar mentioned

i should be randomized

I completely messed up the selection of different files in the second version. My code was timing the scripts 100 times like so:

from tqdm import trange  # progress bar over range

for i in trange(100):
    if rand:
        S = random.sample(L, 20)
    else:
        S = L[i: i + 20]  # <- each loop adds only 1 new file to the selection;
                          #    the other 19 were already cached by the previous fetch

For the second script, it should rather be S = L[100*i : 100*i + 20]!

And yes, when timed that way, the results are comparable.
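The corrected comparison, with disjoint sequential windows so no file is pre-cached by an earlier iteration, can be sketched as follows (the temporary directory and reduced file sizes and counts are assumptions to keep the sketch quick):

```python
import os
import random
import tempfile
import time

import numpy as np

# Temporary directory with 1000 small files stands in for the real
# '../datasets/test' with 10000 files.
data_dir = tempfile.mkdtemp()
for _ in range(1000):
    np.save(os.path.join(data_dir, str(random.getrandbits(128))),
            np.random.random((10, 10)))

def time_loads(names):
    """Time only the np.load calls."""
    t0 = time.perf_counter()
    for name in names:
        np.load(os.path.join(data_dir, name))
    return time.perf_counter() - t0

L = os.listdir(data_dir)
random_times = [time_loads(random.sample(L, 20)) for _ in range(10)]
# Disjoint windows: no file appears in two iterations, so nothing was
# cached by a previous fetch.
disjoint_times = [time_loads(L[100 * i: 100 * i + 20]) for i in range(10)]

print(f'random mean:   {sum(random_times) / len(random_times):.4f}s')
print(f'disjoint mean: {sum(disjoint_times) / len(disjoint_times):.4f}s')
```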