EDITED:
It turned out that a base class much higher in the inheritance tree inherits from tf.keras.Model. When that inheritance is present, the behaviour described below is observable. With a plain Python class, the performance difference compared to the plain script is negligible.
I haven't found any documentation of this behaviour; if I find some, updates will follow.
EDIT 2:
I have not found any documentation about this (so confirmation is needed), but it appears that:
If I assign any object to self (in a class inheriting from tf.keras.Model), all the tf.Variable objects it contains get extracted and appear in the trainable_variables attribute of the class. So my assumption is that keras.Model inspects every assignment to self, looking for specific objects, and that this inspection is what makes assigning a huge dict to self slow.
For reference: the inspection goes deep into nested lists and dictionaries, but it does not inspect class instances unless their class extends tf.keras.Model or tf.keras.layers.Layer.
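To make the assumption above concrete, here is a minimal pure-Python sketch of a base class that inspects every assignment to self. It is not Keras's actual implementation: `TrackedLeaf`, `find_leaves`, `InspectingBase` and `PlainHolder` are hypothetical names invented for this illustration. It only shows why walking a large nested container on assignment costs time proportional to its size, while a plain class just binds a name.

```python
import time


class TrackedLeaf:
    """Hypothetical stand-in for an object a framework wants to find."""

    def __init__(self, name):
        self.name = name


def find_leaves(value):
    """Recursively walk dicts, lists and tuples, collecting TrackedLeaf objects."""
    if isinstance(value, TrackedLeaf):
        return [value]
    if isinstance(value, dict):
        children = value.values()
    elif isinstance(value, (list, tuple)):
        children = value
    else:
        # Strings, numbers and arbitrary class instances are not inspected.
        return []
    found = []
    for child in children:
        found.extend(find_leaves(child))
    return found


class InspectingBase:
    """Hypothetical base class that inspects every attribute assignment."""

    def __init__(self):
        # Bypass our own __setattr__ while creating the registry.
        object.__setattr__(self, "tracked", [])

    def __setattr__(self, name, value):
        # The whole container is traversed before the name is bound.
        self.tracked.extend(find_leaves(value))
        object.__setattr__(self, name, value)


class PlainHolder:
    """Ordinary class: assignment just binds a name."""


big = {str(i): i for i in range(200_000)}
nested = {"weights": [TrackedLeaf("w"), {"bias": TrackedLeaf("b")}]}

model = InspectingBase()
model.nested = nested
print([leaf.name for leaf in model.tracked])  # ['w', 'b']

# Compare the cost of the same assignment on both kinds of class.
for cls in (PlainHolder, InspectingBase):
    obj = cls()
    start = time.perf_counter()
    obj.direct = big
    elapsed = time.perf_counter() - start
    print(f"{cls.__name__}: {elapsed:.6f} s")
```

In this sketch an instance of a plain class is never traversed, which matches the observation that the inspection stops at ordinary classes.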
ORIGINAL QUESTION:
I have a list col of strings (~300k rows of ~30-character strings, to give you an idea). To be precise, it is a pandas.DataFrame column, not a list.
I'm creating a lookup dictionary for future use like so:
    direct = {}
    inverse = {}
    # progressive
    progressive = 0
    # create direct map
    for label in col:
        # skip if present
        if str(label) in direct:
            continue
        # else add to direct
        direct[str(label)] = progressive
        inverse[progressive] = str(label)
        progressive += 1
Nothing strange here: it takes 0.15 seconds and the Python process memory usage is reasonable.
Then I moved my code into a class, and here things get strange. Here are two slightly different versions of the same function.
Version A:
    def fromDataset(self, column):
        # reset map
        self.direct = {}
        self.inverse = {}
        # progressive
        progressive = 0
        # create direct map
        for label in column:
            # skip if present
            if str(label) in self.direct:
                continue
            # else add to direct
            self.direct[str(label)] = progressive
            self.inverse[progressive] = str(label)
            progressive += 1
Version B:
    def fromDataset(self, column):
        # reset map
        direct = {}
        inverse = {}
        # progressive
        progressive = 0
        # create direct map
        for label in column:
            # skip if present
            if str(label) in direct:
                continue
            # else add to direct
            direct[str(label)] = progressive
            inverse[progressive] = str(label)
            progressive += 1
        self.direct = direct
        self.inverse = inverse
Both versions yield the same result (a dictionary of ~120k entries with a ~30 MB RAM footprint).
I can accept that Version A is slower than Version B because it accesses attributes on self many times, but what I can't understand is how it is possible that Version B takes 2.16 seconds (14x longer than the script), whereas Version A cannot even be timed (after 10+ minutes there is still no result, and the process memory usage has grown by 500+ MB).
What is even stranger, Version B takes 0.17 seconds to create the dictionaries and ~2 seconds to perform:
    self.direct = direct
    self.inverse = inverse
After losing an entire day to this, I started wondering whether I'm missing something about Python memory allocation. The only meaningful assumption I came to is that self.direct = direct causes Python to actually move/copy the dictionary in RAM.
Can anyone explain to me what is going on in Version A and Version B that makes them differ so radically from the plain script version?
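On the copy assumption: in a plain Python class (one with no custom `__setattr__`), attribute assignment only binds a reference and does not copy the dictionary. A minimal check, using a hypothetical `Holder` class that is not part of the original project:

```python
class Holder:
    """Plain Python class with no custom __setattr__."""


direct = {str(i): i for i in range(100_000)}

holder = Holder()
holder.direct = direct

# Both names refer to the very same dict object; nothing was copied.
print(holder.direct is direct)  # True

# A change made through one name is visible through the other.
direct["new-label"] = -1
print("new-label" in holder.direct)  # True
```

So if the assignment itself takes seconds, the time is being spent in something that intercepts the assignment, not in Python's ordinary name binding.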
I created a reproducible example and encountered no issue:
    import random
    import string
    import pandas

    def gen_random_word(word_length=30):
        return ''.join((random.choice(string.ascii_letters) for _ in range(word_length)))

    # I create a list of 300000 labels (but only 150k distinct labels) of 30 characters
    labels = [gen_random_word() for _ in range(150000)]
    labels = labels + labels
    random.shuffle(labels)

    # A dataframe here is useless but I try to get close to your own example
    df = pandas.DataFrame(
        {'labels': labels}
    )
    class A:
        def __init__(self):
            self.direct = {}
            self.inverse = {}
            self.progressive = 0

        def fromDataset(self, column):
            # create direct map
            for label in column:
                # skip if present
                if str(label) in self.direct:
                    continue
                # else add to direct
                self.direct[str(label)] = self.progressive
                self.inverse[self.progressive] = str(label)
                self.progressive += 1

    class B:
        def __init__(self):
            self.direct = {}
            self.inverse = {}
            self.progressive = 0

        def fromDataset(self, column):
            # reset map
            direct = {}
            inverse = {}
            # progressive
            progressive = 0
            # create direct map
            for label in column:
                # skip if present
                if str(label) in direct:
                    continue
                # else add to direct
                direct[str(label)] = progressive
                inverse[progressive] = str(label)
                progressive += 1
            self.direct = direct
            self.inverse = inverse
            self.progressive = progressive
First test:
    %%time
    # remove the line above if you are not using Jupyter and use another timing solution
    direct = {}
    inverse = {}
    # progressive
    progressive = 0
    # create direct map
    for label in df['labels']:
        # skip if present
        if str(label) in direct:
            continue
        # else add to direct
        direct[str(label)] = progressive
        inverse[progressive] = str(label)
        progressive += 1
Wall time: 267 ms
Second test:
    %%time
    a = A()
    a.fromDataset(df['labels'])
Wall time: 249 ms
Third test:
    %%time
    b = B()
    b.fromDataset(df['labels'])
Wall time: 220 ms
So... nothing significant.
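For anyone not using Jupyter, the `%%time` cells can be replaced with `time.perf_counter`. The following is only a sketch: `timed` and `build_maps` are hypothetical helper names, the loop is a function form of the plain script version, and a plain list stands in for the DataFrame column so the snippet runs without pandas.

```python
import random
import string
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def build_maps(column):
    """Build label -> index and index -> label maps, skipping duplicates."""
    direct, inverse = {}, {}
    for label in column:
        key = str(label)
        if key in direct:
            continue
        index = len(direct)  # next free index, same role as `progressive`
        direct[key] = index
        inverse[index] = key
    return direct, inverse


# Smaller than the original data set, but the same shape: every label appears twice.
labels = ["".join(random.choices(string.ascii_letters, k=30)) for _ in range(50_000)]
labels = labels + labels
random.shuffle(labels)

(direct, inverse), seconds = timed(build_maps, labels)
print(f"{len(direct)} distinct labels in {seconds * 1000:.0f} ms")
```

The same `timed` helper can wrap `a.fromDataset` or `b.fromDataset` to compare the class versions outside a notebook.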
Your reproduction is right. I've removed everything else from the class (I mean other functions and even ancestor classes) and execution is now normal. By testing each superclass one at a time, I found one in the project that inherits from tf.keras.Model. Is it possible that a Keras model modifies memory management for functions that are not decorated with @tf.function? (When I inherit from it again, the behaviour goes back to what I described.)
Well, of course tf.keras.Model can do a lot of stuff that affects memory and execution time.
If you know more about it, I would like you to update your answer. Thank you.