Note: this article is reproduced from stackoverflow.com.
dictionary python tensorflow2.0

tf.keras.Model class ~ self variables performance

Published on 2020-04-07 10:15:01

EDITED:

It turned out that a base class much higher in the inheritance tree inherits from tf.keras.Model. When this inheritance is present, the behaviour described below is observable. Without it, the overhead of a plain Python class is negligible compared to the plain script.

I haven't found any documentation of this behaviour; if I find some, updates will follow.

EDIT 2:

I have not found any documentation about this (so confirmation is needed), but it appears that if I assign any object to self (in a class inheriting from tf.keras.Model), all the tf.Variable objects it contains get extracted and appear in the trainable_variables attribute of the class. So my assumption is that keras.Model inspects every assignment to self, looking for specific objects, and that this inspection is what makes assigning a huge dict to self slow.

For reference: the inspection descends into nested lists and dictionaries, but it does not inspect class instances unless they extend tf.keras.Model or tf.keras.layers.Layer.
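
The suspected mechanism can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not Keras's actual implementation: `TrackingBase`, `FakeVariable`, and `Holder` are hypothetical names invented for this illustration. A base class overrides `__setattr__` and walks every assigned list or dict looking for variable-like objects, so assigning a large dict costs a full traversal, while a dict hidden inside a plain object is not walked.

```python
class FakeVariable:
    """Hypothetical stand-in for tf.Variable."""

    def __init__(self, name):
        self.name = name


class TrackingBase:
    """Hypothetical stand-in for a base class that inspects attribute assignments."""

    def __init__(self):
        # Bypass our own __setattr__ while creating the bookkeeping attributes.
        object.__setattr__(self, "tracked", [])
        object.__setattr__(self, "visited", 0)

    def _walk(self, value):
        # Count every visited object so the cost of the inspection is visible.
        object.__setattr__(self, "visited", self.visited + 1)
        if isinstance(value, FakeVariable):
            self.tracked.append(value)
        elif isinstance(value, dict):
            for item in value.values():
                self._walk(item)
        elif isinstance(value, (list, tuple)):
            for item in value:
                self._walk(item)
        # Any other object (including plain class instances) is left alone.

    def __setattr__(self, name, value):
        self._walk(value)
        object.__setattr__(self, name, value)


class Holder:
    """Plain object: the walker above does not look inside it."""

    def __init__(self, data):
        self.data = data


model = TrackingBase()

# Variables nested inside a dict of lists are found by the walker.
model.weights = {"layer": [FakeVariable("w"), FakeVariable("b")]}
tracked_names = [v.name for v in model.tracked]

big = {str(i): i for i in range(100_000)}

# Assigning the dict directly triggers one visit per entry, plus the dict itself.
before = model.visited
model.direct = big
visits_plain = model.visited - before

# Assigning the same dict wrapped in a plain object triggers a single visit.
before = model.visited
model.wrapped = Holder(big)
visits_wrapped = model.visited - before

print(tracked_names, visits_plain, visits_wrapped)
```

In this sketch the cost of an assignment grows with the size of the assigned container. If Keras behaves similarly, that would be consistent with the slow `self.direct = direct` described below, but this remains an assumption until confirmed against the TensorFlow source or documentation.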

ORIGINAL QUESTION:

I have a list col of strings (~300k rows of ~30-character strings, to give you an idea). To be precise, it is a pandas.DataFrame column, not a list.

I'm creating a lookup dictionary for future use like so:

direct = {}
inverse = {}
# progressive
progressive = 0

# create direct map
for label in col:
    # skip if present
    if str(label) in direct:
        continue
    # else add to direct
    direct[str(label)] = progressive
    inverse[progressive] = str(label)
    progressive += 1

Nothing strange here: it takes 0.15 seconds and the Python process memory usage is reasonable.

Then I moved my code into a class, and here things get strange. Here are two slightly different versions of the same function.

Version A:

def fromDataset(self, column):
    # reset map
    self.direct = {}
    self.inverse = {}

    # progressive
    progressive = 0

    # create direct map
    for label in column:
        # skip if present
        if str(label) in self.direct:
            continue
        # else add to direct
        self.direct[str(label)] = progressive
        self.inverse[progressive] = str(label)
        progressive += 1

Version B:

def fromDataset(self, column):
    # reset map
    direct = {}
    inverse = {}

    # progressive
    progressive = 0

    # create direct map
    for label in column:
        # skip if present
        if str(label) in direct:
            continue
        # else add to direct
        direct[str(label)] = progressive
        inverse[progressive] = str(label)
        progressive += 1

    self.direct = direct
    self.inverse = inverse

All the proposed functions yield the same result (a dictionary of ~120k entries with a ~30 MB RAM footprint).

I can accept that Version A will be slower than Version B, since it accesses self variables many times. What I can't understand is how it is possible that Version B takes 2.16 seconds (14x longer than before), whereas Version A cannot even be tested (after 10+ minutes there is still no result, and the process memory usage has grown by 500+ MB).

What is even stranger is that Version B takes 0.17 seconds to create the dictionaries and ~2 seconds to perform:

self.direct = direct
self.inverse = inverse

After losing an entire day on this, I started wondering whether I'm missing something about Python memory allocation. The only meaningful assumption I came up with is that self.direct = direct causes Python to actually move or copy the dictionary in RAM.
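
For a plain Python object, attribute assignment only binds a new reference to the same dictionary; nothing is moved or copied. A minimal sketch that checks this assumption (the class name `Plain` is made up for the illustration, and the dictionary only approximates the size described above):

```python
import time


class Plain:
    """Ordinary class with no custom __setattr__."""


direct = {str(i): i for i in range(120_000)}

obj = Plain()
start = time.perf_counter()
obj.direct = direct
elapsed = time.perf_counter() - start

# Identity check: the attribute refers to the very same dict object.
same_object = obj.direct is direct

# A later change to the original dict is visible through the attribute.
direct["new-label"] = -1
visible_through_attribute = "new-label" in obj.direct

print(f"assignment took {elapsed:.6f} s, same object: {same_object}")
```

If the assignment copied the dictionary, the identity check would fail and the later insertion would not be visible through `obj.direct`. A slow `self.direct = direct` therefore suggests that some code runs on assignment, for example a custom `__setattr__` inherited from a base class.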

Can anyone explain to me what is going on in Version A and in Version B that makes them differ so radically from the plain scripted version?

Questioner: Newbie
Corentin Limier 2020-01-31 23:48

I created a reproducible example and encountered no issue:

import random
import string

import pandas


def gen_random_word(word_length=30):
    return ''.join((random.choice(string.ascii_letters) for _ in range(word_length)))

# I create a list of 300000 labels (but only 150k distinct labels) of 30 characters
labels = [gen_random_word() for _ in range(150000)]
labels = labels + labels
random.shuffle(labels)

# A dataframe here is useless but I try to get close to your own example
df = pandas.DataFrame(
    {'labels': labels}
)

class A:
    def __init__(self):
        self.direct = {}
        self.inverse = {}
        self.progressive = 0

    def fromDataset(self, column):

        # create direct map
        for label in column:
            # skip if present
            if str(label) in self.direct:
                continue
            # else add to direct
            self.direct[str(label)] = self.progressive
            self.inverse[self.progressive] = str(label)
            self.progressive += 1

class B:
    def __init__(self):
        self.direct = {}
        self.inverse = {}
        self.progressive = 0

    def fromDataset(self, column):
        # reset map
        direct = {}
        inverse = {}

        # progressive
        progressive = 0

        # create direct map
        for label in column:
            # skip if present
            if str(label) in direct:
                continue
            # else add to direct
            direct[str(label)] = progressive
            inverse[progressive] = str(label)
            progressive += 1

        self.direct = direct
        self.inverse = inverse
        self.progressive = progressive

First test:

%%time
# remove the line above if you are not using Jupyter, and use another timing solution

direct = {}
inverse = {}
# progressive
progressive = 0

# create direct map
for label in df['labels']:
    # skip if present
    if str(label) in direct:
        continue
    # else add to direct
    direct[str(label)] = progressive
    inverse[progressive] = str(label)
    progressive += 1

Wall time: 267 ms

Second test:

%%time
a = A()
a.fromDataset(df['labels'])

Wall time: 249 ms

Third test:

%%time
b = B()
b.fromDataset(df['labels'])

Wall time: 220 ms

So... nothing significant.
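
Outside Jupyter, one possible replacement for `%%time` is `time.perf_counter` from the standard library. The sketch below is an assumption about how the measurement could be done, not the code used for the numbers above. The helpers `timed` and `build_maps` are hypothetical names, and a small generated list of labels stands in for the dataframe column to keep the example self-contained.

```python
import time


def timed(func, *args):
    """Run func(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def build_maps(column):
    """Build the label -> index and index -> label dictionaries."""
    direct = {}
    inverse = {}
    for label in column:
        key = str(label)
        # Skip labels that already have an index.
        if key in direct:
            continue
        index = len(direct)
        direct[key] = index
        inverse[index] = key
    return direct, inverse


# 5000 labels, 1000 of them distinct.
labels = [f"label-{i % 1000}" for i in range(5000)]

(direct, inverse), seconds = timed(build_maps, labels)
print(f"built {len(direct)} entries in {seconds * 1000:.1f} ms")
```

The same `timed` helper can wrap `a.fromDataset` or `b.fromDataset` to compare the class-based versions with the plain loop.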