Tf.keras.Model class ~ self variables performance

Question

Note: This article is reproduced from stackoverflow.com.

dictionary python tensorflow2.0

Posted on 2020-04-07 10:15:01

EDITED:

It turned out that a base class much higher in the inheritance tree inherits from tf.keras.Model. When this inheritance is present, the behaviour described below is observable. Without it, the performance difference between a plain Python class and the plain script is negligible.

I haven't found any documentation of this behaviour; if I find some, I will post an update.

EDIT 2:

I have not found any documentation about this (so confirmation is needed), but it appears that if I assign any object to self (in a class inheriting from tf.keras.Model), all the tf.Variables it contains get extracted and appear in the trainable_variables attribute of the class. So my assumption is that keras.Model inspects every assignment to self, looking for specific kinds of objects, and that this inspection is what makes assigning a huge dict to self slow.

For reference: the inspection goes deep into nested lists and dictionaries, but it does not inspect class instances unless they extend tf.keras.Model or tf.keras.layers.Layer.
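To make that assumption concrete, here is a minimal pure-Python sketch of a base class whose `__setattr__` walks every assigned value. It is only an illustration of the suspected mechanism, not the actual Keras implementation: `FakeVariable`, `TrackingBase`, `PlainHolder` and `_find_variables` are hypothetical names, and no TensorFlow code is involved.

```python
class FakeVariable:
    """Hypothetical stand-in for tf.Variable; not a TensorFlow class."""

    def __init__(self, name):
        self.name = name


class TrackingBase:
    """Hypothetical base class that inspects every attribute assignment."""

    def __init__(self):
        # Bypass our own hook while creating the bookkeeping list.
        object.__setattr__(self, "tracked", [])

    def __setattr__(self, name, value):
        # Walk the assigned value; the cost grows with the number of nested items.
        self.tracked.extend(_find_variables(value))
        object.__setattr__(self, name, value)


def _find_variables(value):
    """Recursively collect FakeVariable objects from lists, tuples and dicts."""
    if isinstance(value, FakeVariable):
        return [value]
    if isinstance(value, TrackingBase):
        return list(value.tracked)
    if isinstance(value, dict):
        value = value.values()
    elif not isinstance(value, (list, tuple)):
        return []  # instances of ordinary classes are not inspected
    found = []
    for item in value:
        found.extend(_find_variables(item))
    return found


class PlainHolder:
    """An ordinary class: its contents are invisible to the walker."""

    def __init__(self, var):
        self.var = var


model = TrackingBase()
model.lookup = {"a": [FakeVariable("w1")], "b": {"c": FakeVariable("w2")}}
model.hidden = PlainHolder(FakeVariable("w3"))
tracked_names = [v.name for v in model.tracked]
print(tracked_names)
```

In this sketch the variables nested inside the dict are picked up, while the one held by `PlainHolder` is not. Every assignment of a container pays for a full traversal, so a dictionary with ~120k entries would be walked entry by entry.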

ORIGINAL QUESTION:

I have a list col of strings (~300k rows of ~30-character strings, to give you an idea). To be precise, it is a pandas.DataFrame column, not a list.

I'm creating a lookup dictionary for future use like so:

direct = {}
inverse = {}
# progressive
progressive = 0

# create direct map
for label in col:
    # skip if present
    if str(label) in direct:
        continue
    # else add to direct
    direct[str(label)] = progressive
    inverse[progressive] = str(label)
    progressive += 1

Nothing strange here: it takes 0.15 seconds and the Python process memory usage is reasonable.

Then I moved my code into a class, and here things get strange. Here are two slightly different versions of the same function.
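For anyone who wants to reproduce the baseline, here is a rough timing sketch of the plain script. The data is synthetic and only an assumption about the shape of the real column (300k labels of 30 characters, ~120k of them distinct); actual timings will depend on the machine.

```python
import time

# Synthetic stand-in for the real column: 300k labels, 120k distinct values,
# each label 30 characters long.
col = [f"label-{i % 120_000:024d}" for i in range(300_000)]

start = time.perf_counter()
direct = {}
inverse = {}
for label in col:
    key = str(label)
    # skip if present
    if key in direct:
        continue
    # else add to both maps, using the next free index
    index = len(direct)
    direct[key] = index
    inverse[index] = key
elapsed = time.perf_counter() - start

print(f"{len(direct)} entries in {elapsed:.3f} s")
```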

Version A:

def fromDataset(self, column):
    # reset map
    self.direct = {}
    self.inverse = {}

    # progressive
    progressive = 0

    # create direct map
    for label in column:
        # skip if present
        if str(label) in self.direct:
            continue
        # else add to direct
        self.direct[str(label)] = progressive
        self.inverse[progressive] = str(label)
        progressive += 1

Version B:

def fromDataset(self, column):
    # reset map
    direct = {}
    inverse = {}

    # progressive
    progressive = 0

    # create direct map
    for label in column:
        # skip if present
        if str(label) in direct:
            continue
        # else add to direct
        direct[str(label)] = progressive
        inverse[progressive] = str(label)
        progressive += 1

    self.direct = direct
    self.inverse = inverse

All the proposed functions yield the same result (a dictionary of ~120k entries with a ~30 MB RAM footprint).

I can accept that Version A will be slower than Version B because it accesses self variables many times, but what I can't understand is how it is possible that Version B takes 2.16 seconds (14x longer than before), whereas Version A cannot even be tested (after 10+ minutes there is no result yet and the process memory usage has grown by 500+ MB).

What is even stranger is that Version B takes 0.17 seconds to create the dictionaries and ~2 seconds to perform:

self.direct = direct
self.inverse = inverse

After losing an entire day on this, I started wondering whether I'm missing something about Python memory allocation. The only meaningful assumption I came up with is that self.direct = direct causes Python to actually move or copy the dictionary in RAM.

Can anyone explain what is going on in Version A and in Version B that makes them differ so radically from the plain script version?
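As a sanity check on that assumption, here is a minimal sketch with a plain Python class (no custom `__setattr__`; `Holder` is a hypothetical name). In plain Python, attribute assignment only binds a reference to the existing dictionary, so no copy should take place:

```python
import time


class Holder:
    """Plain Python class with no custom __setattr__ (hypothetical name)."""


big = {str(i): i for i in range(120_000)}
holder = Holder()

start = time.perf_counter()
holder.direct = big  # binds a reference; the dict itself is not copied
elapsed = time.perf_counter() - start

# The attribute and the local name refer to the very same object...
same_object = holder.direct is big
# ...so a later change made through one name is visible through the other.
big["extra"] = -1
visible_through_attribute = "extra" in holder.direct

print(same_object, visible_through_attribute, f"{elapsed:.6f} s")
```

If the assignment is slow in the real class, the extra time therefore has to come from something that class (or one of its base classes) does on assignment, not from Python copying the dictionary.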

Questioner

Newbie
