Warm tip: This article is reproduced from serverfault.com, please click

Best way to perform multiple amount of Pandas lookups between two DataFrames

发布于 2020-11-27 23:56:10

I am trying to use two data frames for a simple lookup using Pandas. I have a main master data frame (left) and a lookup data frame (right). I want to left join them on the matching integer code and return the item title from the item_df.

I see a slight solution with a key value pair idea but it seems cumbersome. My idea is to merge the data frames together using col3 and name as key columns and keep the value from the right frame that I want which will be title. Thus I decide to drop the key column that I joined on so all I have left is the value. Now lets say I want to do this several times with my own manual naming conventions. For this I use rename to rename the value that I merged in. Now I would repeat this merge operation and rename my next join to something like second_title (see example below).

Is there a less cumbersome way to perform this repeated operation without constantly dropping the extra columns that are merged in and renaming the new column between each merge step?

Example code below:

import pandas as pd

master_dict: dict = {'col1': [3,4,8,10], 'col2': [5,6,9,10], 'col3': [50,55,59,60]}
master_df: pd.DataFrame = pd.DataFrame(master_dict)
item_dict: dict = {'name': [55,59,50,5,6,7], 'title': ['p1','p2','p3','p4','p5','p6']}
item_df: pd.DataFrame = pd.DataFrame(item_dict)
    
print(master_df.head())
   col1  col2  col3
0     3     5    50
1     4     6    55
2     8     9    59
3    10    10    60
print(item_df.head())
   name title
0    55    p1
1    59    p2
2    50    p3
3     5    p4
4     6    p5

# merge on col3 and name
combined_df = pd.merge(master_df, item_df, how = 'left', left_on = 'col3', right_on = 'name')
# rename title to "first_title"
combined_df.rename(columns = {'title':'first_title'}, inplace = True)
combined_df.drop(columns = ['name'], inplace = True) # remove 'name' column that was joined in from right frame
# repeat operation for "second_title"
combined_df = pd.merge(combined_df, item_df, how = 'left', left_on = 'col2', right_on = 'name')
combined_df.rename(columns = {'title': 'second_title'}, inplace = True)
combined_df.drop(columns = ['name'], inplace = True)
print(combined_df.head())
   col1  col2  col3 first_title second_title
0     3     5    50          p3           p4
1     4     6    55          p1           p5
2     8     9    59          p2          NaN
3    10    10    60         NaN          NaN
Questioner
Coldchain9
Viewed
0
sammywemmy 2020-11-29 08:47:00

We could use your key:value mapping with the map function:

The code below gets a dictionary of values of item_df name column that is in master_df col3 and col2 respectively.

col3 = dict(zip(*(value for _, value in
                  item_df[item_df.name.isin(master_df.col3)].items()))
           )

col2 = dict(zip(*(value for _, value in
                 item_df[item_df.name.isin(master_df.col2)].items()))
           )


col3
{55: 'p1', 59: 'p2', 50: 'p3'}

col2
{5: 'p4', 6: 'p5'}

Next is to use assign and create the columns first_title and second_title:

master_df.assign(
    first_title=master_df.col3.map(col3),
    second_title=master_df.col2.map(col2)
    )



   col1 col2    col3    first_title second_title
0   3   5       50      p3            p4
1   4   6       55      p1            p5
2   8   9       59      p2            NaN
3   10  10      60      NaN           NaN

UPDATE

I thought about your comment for a single dictionary, and it seems achievable by using a Series. This will greatly reduce the bloated code I shared earlier. In this case, we convert item_df to a series and map it to each relevant column:

item_df = item_df.set_index("name").loc[:, "title"]

item_df

name
55    p1
59    p2
50    p3
5     p4
6     p5
7     p6
Name: title, dtype: object

Now create your specific columns using the assign function:

master_df.assign(first_title=master_df.col3.map(item_df), 
                 second_title=master_df.col2.map(item_df)
                 )

   col1 col2    col3    first_title second_title
0   3   5       50      p3            p4
1   4   6       55      p1            p5
2   8   9       59      p2            NaN
3   10  10      60      NaN           NaN

Much simpler and straight to the point.