Note: This article is reproduced from stackoverflow.com.
Tags: pandas, postgresql, python, django-database

Pandas dataframe to Object instances array efficiency for bulk DB insert

Posted on 2020-03-31 23:01:03

I have a Pandas dataframe in the form of:

Time    Temperature    Voltage    Current
0.0     7.8            14         56
0.1     7.9            12         58
0.2     7.6            15         55
... So on for a few hundred thousand rows...

I need to bulk insert the data into a PostgreSQL database, as fast as possible. This is for a Django project, and I'm currently using the ORM for DB operations and building queries, but open to suggestions if there are more efficient ways to accomplish the task.

My data model looks like this:

class Data(models.Model):
    time = models.DateTimeField(db_index=True)
    parameter = models.ForeignKey(Parameter, on_delete=models.CASCADE)
    parameter_value = models.FloatField()

So Time is taken from each row, and then for each header column I grab the value that corresponds to it, using the header as the parameter. The first row of the example table would therefore generate 3 Data objects in my database:

Data(time=0.0, parameter="Temperature", parameter_value=7.8)
Data(time=0.0, parameter="Voltage", parameter_value=14)
Data(time=0.0, parameter="Current", parameter_value=56)

Our application allows the user to parse data files that are measured in milliseconds. So we generate a LOT of individual data objects from a single file. My current task is to improve the parser to make it much more efficient, until we hit I/O constraints on a hardware level.

My current solution is to go through each row and create one Data object for each time + parameter + value combination, appending each object to an array so I can call Data.objects.bulk_create(all_data_objects) through Django. Of course I am aware that this is inefficient and could probably be improved a lot.

Using this code:

# Convert DataFrame to dict
df_records = df.to_dict('records')

# Start empty data array
all_data_objects = []

# Go through each row creating objects and appending to data array
for row in df_records:
    for parameter, parameter_value in row.items():
        if parameter != "Time":
            all_data_objects.append(Data(
                    time=row["Time"],
                    parameter_value=parameter_value,
                    parameter=parameter))

# Commit data to Postgres DB
Data.objects.bulk_create(all_data_objects)

Currently, the entire operation excluding the DB insert (writing to disk), i.e. just generating the Data objects array, takes around 370 seconds for a 55 MB file that produces about 6 million individual Data objects. The df_records = df.to_dict('records') line alone takes about 83 seconds. Times were measured with time.time() at both ends of each section and taking the difference.
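
For reference, this is the timing pattern I mean (a minimal sketch; the variable names are illustrative):

import time

start = time.time()
df_records = df.to_dict('records')
print(f"to_dict('records') took {time.time() - start:.1f} s")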

How can I improve these times?

Questioner: Mormoran
Viewed: 19

Answer from villoro (2020-02-04 17:40):

If you really need a fast solution, I suggest you dump the table directly using pandas.

First let's create the data for your example:

import pandas as pd

data = {
    'Time': {0: 0.0, 1: 0.1, 2: 0.2},
    'Temperature': {0: 7.8, 1: 7.9, 2: 7.6},
    'Voltage': {0: 14, 1: 12, 2: 15},
    'Current': {0: 56, 1: 58, 2: 55}
}
df = pd.DataFrame(data)

Now you should transform the dataframe with melt so that you get the desired long format, one row per (Time, parameter) pair:

df = df.melt(["Time"], var_name="parameter", value_name="parameter_value")

At this point you should map the parameter names to their foreign key ids. I will use params as an example:

params = {"Temperature": 1, "Voltage": 2, "Current": 3}
df["parameter"] = df["parameter"].map(params)

At this point the dataframe will look like:

   Time  parameter  parameter_value
0   0.0          1              7.8
1   0.1          1              7.9
2   0.2          1              7.6
3   0.0          2             14.0
4   0.1          2             12.0
5   0.2          2             15.0
6   0.0          3             56.0
7   0.1          3             58.0
8   0.2          3             55.0

And now to export using pandas you can use:

import sqlalchemy as sa
engine = sa.create_engine("use your connection data")
df.to_sql(name="my_table", con=engine, if_exists="append", index=False)
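
If you stay with to_sql, it may also be worth passing chunksize and method="multi" so pandas batches the INSERTs; both parameters are part of the pandas API, though I have not benchmarked them for this case:

df.to_sql(name="my_table", con=engine, if_exists="append", index=False,
          chunksize=10000, method="multi")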

However, when I used that it was not fast enough to meet our requirements, so I suggest you use cursor.copy_from instead, since it is faster:

from io import StringIO

output = StringIO()
df.to_csv(output, sep=';', header=False, index=False, columns=df.columns)
# jump to start of stream
output.seek(0)

# Insert df into Postgres using COPY (much faster than individual INSERTs)
connection = engine.raw_connection()
with connection.cursor() as cursor:
    cursor.copy_from(output, "my_table", sep=';', null="NULL", columns=df.columns)
    connection.commit()

We tried this with a few million rows and it was the fastest way we found when using PostgreSQL.
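
Since the question is about Django, the same COPY approach can also be driven through Django's own connection instead of a separate SQLAlchemy engine. A rough sketch, assuming the database driver is psycopg2 (Django's cursor wrapper forwards copy_from to the raw psycopg2 cursor) and that the Data model uses Django's default table and column names ("myapp_data", "parameter_id"); adjust the names to your schema:

from io import StringIO

from django.db import connection

output = StringIO()
df.to_csv(output, sep=';', header=False, index=False)
output.seek(0)  # rewind before handing the buffer to COPY

with connection.cursor() as cursor:
    # Column order must match the CSV column order: Time, parameter, parameter_value
    cursor.copy_from(output, "myapp_data", sep=';', null="NULL",
                     columns=("time", "parameter_id", "parameter_value"))
    # Under Django's default autocommit this is committed immediately;
    # otherwise wrap the block in transaction.atomic().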