This article is reproduced from stackoverflow.com.
Tags: arrays, multithreading, python-3.x

How to use multiprocessing to create gzip file from dataframe in python

Posted on 2020-03-29 12:47:42

I have a process that's becoming IO bound: I pull a large dataset from a database into a pandas dataframe, do some row-by-row processing, and then persist the results to a gzip file. I'm trying to find a way to use multiprocessing to split the creation of the gzip file across multiple processes and then merge them into one file, or to process in parallel without one worker overwriting another's output. I found the package p_tqdm, but I'm running into EOF issues, probably because the workers overwrite each other. Here's a sample of my current solution:

import gzip

import pandas as pd
from p_tqdm import p_map

def process(row):
    # every worker reopens the same output file in "wb" mode,
    # so concurrent writers clobber each other's output
    with gzip.open("final.gz", "wb") as f:
        value = do_somthing(row)
        f.write(value.encode())

df = pd.read_sql(some_sql, engine)
things = []
for index, row in df.iterrows():
    things.append(row)
p_map(process, things)
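
For reference, the literal split-and-merge approach from the question can be sketched as follows: each worker writes its own part file, and the parts are then concatenated. This works because the gzip format allows multiple members in a single file, so the merged file decompresses as one continuous stream. A minimal sketch, where do_somthing and things are the question's placeholders and the part names and worker count are illustrative:

import gzip
import multiprocessing
import shutil

def write_part(args):
    # each worker writes its own part file, so no two processes share a handle
    part_path, rows = args
    with gzip.open(part_path, "wb") as f:
        for row in rows:
            f.write(do_somthing(row).encode())
    return part_path

def merge_parts(part_paths, final_path):
    # gzip allows concatenated members, so the merged file
    # decompresses as one continuous stream
    with open(final_path, "wb") as out:
        for path in part_paths:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, out)

if __name__ == "__main__":
    n = 4                                               # illustrative worker count
    jobs = [(f"part_{i}.gz", things[i::n]) for i in range(n)]
    with multiprocessing.Pool(n) as pool:
        merge_parts(pool.map(write_part, jobs), "final.gz")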
Questioner: dweeb
Viewed: 95

Answer by Marek Schwarz, 2020-01-31 17:24

I don't know p_tqdm, but if I understand your question correctly, this can easily be done with multiprocessing.

Something like this:

import gzip
import multiprocessing

import pandas as pd

def process(row):
    # note: do_somthing must return an object with an encode() method (e.g. a str)
    return do_somthing(row)

df = pd.read_sql(some_sql, engine)
things = []
for index, row in df.iterrows():
    things.append(row)

with gzip.open("final.gz", "wb") as f, multiprocessing.Pool() as pool:
    for processed_row in pool.imap(process, things):
        f.write(processed_row.encode())
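
This keeps a single writer: the workers only compute, and the parent process is the only one touching final.gz, so the output cannot be corrupted by interleaved writes. As a bonus, pool.imap yields results lazily and in input order, so rows land in the file in their original order without all results being held in memory at once.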

Just a few side notes:

  • The pandas iterrows method is slow - avoid it if possible (see "Does pandas iterrows have performance issues?"); the first sketch after this list swaps in the faster itertuples.

  • Also, you don't need to create things: just pass an iterable to imap (even passing df.iterrows() directly should be possible) and save yourself some memory. See the first sketch below.

  • And finally, since it appears that you are reading SQL data, why not connect to the db directly and iterate over the cursor from the SELECT ... query, skipping pandas altogether (second sketch below).
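
Putting the first two notes together, here is a minimal sketch that drops the things list and swaps iterrows for itertuples(index=False, name=None), which is faster and yields plain tuples that pickle cleanly between processes (note that do_somthing then receives a tuple rather than a Series; some_sql and engine are the question's placeholders):

import gzip
import multiprocessing

import pandas as pd

def process(row):
    # row is a plain tuple here, not a pandas Series
    return do_somthing(row)

if __name__ == "__main__":
    df = pd.read_sql(some_sql, engine)
    with gzip.open("final.gz", "wb") as f, multiprocessing.Pool() as pool:
        # name=None yields plain tuples, which pickle reliably across processes;
        # the generator feeds imap lazily, so no intermediate list is built
        rows = df.itertuples(index=False, name=None)
        for processed_row in pool.imap(process, rows):
            f.write(processed_row.encode())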
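
And for the last note, a pandas-free sketch. It assumes a PEP 249 (DB-API) driver whose cursor is iterable (most are); get_connection() is a hypothetical stand-in for your driver's connect() call:

import gzip
import multiprocessing

def process(row):
    # row arrives as a plain tuple straight from the cursor
    return do_somthing(row)

if __name__ == "__main__":
    conn = get_connection()  # hypothetical: replace with your driver's connect()
    cur = conn.cursor()
    cur.execute(some_sql)    # same query as before, no dataframe in between
    with gzip.open("final.gz", "wb") as f, multiprocessing.Pool() as pool:
        # rows stream from the cursor to the workers and back to a single writer
        for processed_row in pool.imap(process, cur):
            f.write(processed_row.encode())
    cur.close()
    conn.close()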