Note: This article is reproduced from stackoverflow.com.
Tags: pandas, postgresql, python, django-database

Pandas dataframe to Object instances array efficiency for bulk DB insert

Posted on 2020-03-31 23:01:03

I have a Pandas dataframe in the form of:

Time    Temperature    Voltage    Current
0.0     7.8            14         56
0.1     7.9            12         58
0.2     7.6            15         55
... So on for a few hundred thousand rows...

I need to bulk insert the data into a PostgreSQL database, as fast as possible. This is for a Django project, and I'm currently using the ORM for DB operations and building queries, but open to suggestions if there are more efficient ways to accomplish the task.

My data model looks like this:

class Data(models.Model):
    time = models.DateTimeField(db_index=True)
    parameter = models.ForeignKey(Parameter, on_delete=models.CASCADE)
    parameter_value = models.FloatField()

So Time is taken from each row, and then for each header column I grab the value that corresponds to it, using the header as the parameter. The first row of the example table would therefore generate 3 Data objects in my database:

Data(time=0.0, parameter="Temperature", parameter_value=7.8)
Data(time=0.0, parameter="Voltage", parameter_value=14)
Data(time=0.0, parameter="Current", parameter_value=56)

Our application allows the user to parse data files that are measured in milliseconds. So we generate a LOT of individual data objects from a single file. My current task is to improve the parser to make it much more efficient, until we hit I/O constraints on a hardware level.

My current solution is to go through each row and create one Data object for each time + parameter + value combination, appending each object to an array so I can call Data.objects.bulk_create(all_data_objects) through Django. Of course I am aware that this is inefficient and could probably be improved a lot.

Using this code:

# Convert DataFrame to dict
df_records = df.to_dict('records')

# Start empty data array
all_data_objects = []

# Go through each row creating objects and appending to data array
for row in df_records:
    for parameter, parameter_value in row.items():
        if parameter != "Time":
            all_data_objects.append(Data(
                    time=row["Time"],
                    parameter_value=parameter_value,
                    parameter=parameter))

# Commit data to Postgres DB
Data.objects.bulk_create(all_data_objects)

Currently, the entire operation excluding the DB insert (writing to disk), i.e. just generating the Data objects array, takes around 370 seconds for a 55 MB file that produces about 6 million individual Data objects. The df_records = df.to_dict('records') line alone takes about 83 seconds. Times were measured with time.time() at both ends of each section and taking the difference.
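
For reference, this is the timing pattern I mean (a minimal sketch; the variable names are illustrative):

import time

start = time.time()
df_records = df.to_dict('records')
print(f"to_dict('records') took {time.time() - start:.1f} s")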

How can I improve these times?

Questioner: Mormoran
Viewed: 19

Answer from villoro (2020-02-04 17:40):

If you really need a fast solution, I suggest you dump the table directly using pandas.

First let's create the data for your example:

import pandas as pd

data = {
    'Time': {0: 0.0, 1: 0.1, 2: 0.2},
    'Temperature': {0: 7.8, 1: 7.9, 2: 7.6},
    'Voltage': {0: 14, 1: 12, 2: 15},
    'Current': {0: 56, 1: 58, 2: 55}
}
df = pd.DataFrame(data)

Now you should transform the dataframe with melt so that you get the desired long format, one row per (Time, parameter) pair:

df = df.melt(["Time"], var_name="parameter", value_name="parameter_value")

At this point you should map the parameter names to their foreign key ids. I will use params as an example:

params = {"Temperature": 1, "Voltage": 2, "Current": 3}
df["parameter"] = df["parameter"].map(params)

At this point the dataframe will look like:

   Time  parameter  parameter_value
0   0.0          1              7.8
1   0.1          1              7.9
2   0.2          1              7.6
3   0.0          2             14.0
4   0.1          2             12.0
5   0.2          2             15.0
6   0.0          3             56.0
7   0.1          3             58.0
8   0.2          3             55.0

And now to export using pandas you can use:

import sqlalchemy as sa
engine = sa.create_engine("use your connection data")
df.to_sql(name="my_table", con=engine, if_exists="append", index=False)
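
If you stay with to_sql, it may also be worth passing chunksize and method="multi" so pandas batches the INSERTs; both parameters are part of the pandas API, though I have not benchmarked them for this case:

df.to_sql(name="my_table", con=engine, if_exists="append", index=False,
          chunksize=10000, method="multi")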

However, when I used that it was not fast enough to meet our requirements, so I suggest you use cursor.copy_from instead, since it is faster:

from io import StringIO

output = StringIO()
df.to_csv(output, sep=';', header=False, index=False, columns=df.columns)
# jump to start of stream
output.seek(0)

# Insert df into Postgres using COPY (much faster than individual INSERTs)
connection = engine.raw_connection()
with connection.cursor() as cursor:
    cursor.copy_from(output, "my_table", sep=';', null="NULL", columns=df.columns)
    connection.commit()

We tried this with a few million rows and it was the fastest way we found when using PostgreSQL.
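
Since the question is about Django, the same COPY approach can also be driven through Django's own connection instead of a separate SQLAlchemy engine. A rough sketch, assuming the database driver is psycopg2 (Django's cursor wrapper forwards copy_from to the raw psycopg2 cursor) and that the Data model uses Django's default table and column names ("myapp_data", "parameter_id"); adjust the names to your schema:

from io import StringIO

from django.db import connection

output = StringIO()
df.to_csv(output, sep=';', header=False, index=False)
output.seek(0)  # rewind before handing the buffer to COPY

with connection.cursor() as cursor:
    # Column order must match the CSV column order: Time, parameter, parameter_value
    cursor.copy_from(output, "myapp_data", sep=';', null="NULL",
                     columns=("time", "parameter_id", "parameter_value"))
    # Under Django's default autocommit this is committed immediately;
    # otherwise wrap the block in transaction.atomic().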