This article is reproduced from stackoverflow.com.
Tags: arrays, multithreading, python-3.x

How to use multiprocessing to create gzip file from dataframe in python

Posted on 2020-03-29 12:47:42

I have a process that's becoming IO bound: I pull a large dataset from a database into a pandas dataframe, do some row-by-row processing, and then persist the results to a gzip file. I'm trying to find a way to use multiprocessing to split the creation of the gzip file across multiple processes and then merge them into one file, or to process in parallel without one worker overwriting another's output. I found the package p_tqdm, but I'm running into EOF issues, probably because the workers overwrite each other. Here's a sample of my current solution:

import gzip

import pandas as pd
from p_tqdm import p_map

def process(row):
    # every worker reopens the same output file in "wb" mode,
    # so concurrent writers clobber each other's output
    with gzip.open("final.gz", "wb") as f:
        value = do_somthing(row)
        f.write(value.encode())

df = pd.read_sql(some_sql, engine)
things = []
for index, row in df.iterrows():
    things.append(row)
p_map(process, things)
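
For reference, the literal split-and-merge approach from the question can be sketched as follows: each worker writes its own part file, and the parts are then concatenated. This works because the gzip format allows multiple members in a single file, so the merged file decompresses as one continuous stream. A minimal sketch, where do_somthing and things are the question's placeholders and the part names and worker count are illustrative:

import gzip
import multiprocessing
import shutil

def write_part(args):
    # each worker writes its own part file, so no two processes share a handle
    part_path, rows = args
    with gzip.open(part_path, "wb") as f:
        for row in rows:
            f.write(do_somthing(row).encode())
    return part_path

def merge_parts(part_paths, final_path):
    # gzip allows concatenated members, so the merged file
    # decompresses as one continuous stream
    with open(final_path, "wb") as out:
        for path in part_paths:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, out)

if __name__ == "__main__":
    n = 4                                               # illustrative worker count
    jobs = [(f"part_{i}.gz", things[i::n]) for i in range(n)]
    with multiprocessing.Pool(n) as pool:
        merge_parts(pool.map(write_part, jobs), "final.gz")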
Questioner: dweeb
Viewed: 95

Answer by Marek Schwarz, 2020-01-31 17:24

I don't know p_tqdm, but if I understand your question correctly, this can easily be done with multiprocessing.

Something like this:

import gzip
import multiprocessing

import pandas as pd

def process(row):
    # note: do_somthing must return an object with an encode() method (e.g. a str)
    return do_somthing(row)

df = pd.read_sql(some_sql, engine)
things = []
for index, row in df.iterrows():
    things.append(row)

with gzip.open("final.gz", "wb") as f, multiprocessing.Pool() as pool:
    for processed_row in pool.imap(process, things):
        f.write(processed_row.encode())
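
This keeps a single writer: the workers only compute, and the parent process is the only one touching final.gz, so the output cannot be corrupted by interleaved writes. As a bonus, pool.imap yields results lazily and in input order, so rows land in the file in their original order without all results being held in memory at once.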

Just a few side notes:

  • The pandas iterrows method is slow - avoid it if possible (see "Does pandas iterrows have performance issues?"); the first sketch after this list swaps in the faster itertuples.

  • Also, you don't need to create things: just pass an iterable to imap (even passing df.iterrows() directly should be possible) and save yourself some memory. See the first sketch below.

  • And finally, since it appears that you are reading SQL data, why not connect to the db directly and iterate over the cursor from the SELECT ... query, skipping pandas altogether (second sketch below).
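
Putting the first two notes together, here is a minimal sketch that drops the things list and swaps iterrows for itertuples(index=False, name=None), which is faster and yields plain tuples that pickle cleanly between processes (note that do_somthing then receives a tuple rather than a Series; some_sql and engine are the question's placeholders):

import gzip
import multiprocessing

import pandas as pd

def process(row):
    # row is a plain tuple here, not a pandas Series
    return do_somthing(row)

if __name__ == "__main__":
    df = pd.read_sql(some_sql, engine)
    with gzip.open("final.gz", "wb") as f, multiprocessing.Pool() as pool:
        # name=None yields plain tuples, which pickle reliably across processes;
        # the generator feeds imap lazily, so no intermediate list is built
        rows = df.itertuples(index=False, name=None)
        for processed_row in pool.imap(process, rows):
            f.write(processed_row.encode())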
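
And for the last note, a pandas-free sketch. It assumes a PEP 249 (DB-API) driver whose cursor is iterable (most are); get_connection() is a hypothetical stand-in for your driver's connect() call:

import gzip
import multiprocessing

def process(row):
    # row arrives as a plain tuple straight from the cursor
    return do_somthing(row)

if __name__ == "__main__":
    conn = get_connection()  # hypothetical: replace with your driver's connect()
    cur = conn.cursor()
    cur.execute(some_sql)    # same query as before, no dataframe in between
    with gzip.open("final.gz", "wb") as f, multiprocessing.Pool() as pool:
        # rows stream from the cursor to the workers and back to a single writer
        for processed_row in pool.imap(process, cur):
            f.write(processed_row.encode())
    cur.close()
    conn.close()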