This article is reproduced from serverfault.com.

Python: Memory efficient, quick lookup in python for 100 million pairs of data?

Published on 2020-11-28 01:06:15

This is my first time asking a question on here, so apologies if I am doing something wrong.

I am looking to create some sort of dataframe/dict/list where I can check if the ID in one column has seen a specific value in another column before.

For example, for one pandas dataframe like this (90 million rows):

ID  Another_ID
1   10
1   20
2   50
3   10
3   20
4   30

And another like this (10 million rows):

ID  Another_ID
1   30
2   30
2   50
2   20
4   30
5   70         

I want to end up with a third column that is like this:

ID  Another_ID seen_before
1   30         0
2   30         0
2   50         1
2   20         0
4   30         1
5   70         0

I am looking for a memory efficient but quick way to do this, any ideas? Thanks!
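
For reference, the two example frames above can be constructed as follows (a minimal sketch; df1 and df2 are assumed names for the 90-million-row and 10-million-row frames):

import pandas as pd

# df1: the larger "history" frame (only the example rows are shown)
df1 = pd.DataFrame({'ID': [1, 1, 2, 3, 3, 4],
                    'Another_ID': [10, 20, 50, 10, 20, 30]})

# df2: the frame that should get the seen_before column
df2 = pd.DataFrame({'ID': [1, 2, 2, 2, 4, 5],
                    'Another_ID': [30, 30, 50, 20, 30, 70]})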

Questioner: BrendanA
Answered by Quang Hoang on 2020-11-28 11:42:34:

A merge is a good idea here; you want to merge on both columns:

df1['seen_before'] = 1

df2.merge(df1, on=['ID', 'Another_ID'], how='left')

Output:

   ID  Another_ID  seen_before
0   1          30          NaN
1   2          30          NaN
2   2          50          1.0
3   2          20          NaN
4   4          30          1.0
5   5          70          NaN
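
To get the 0/1 integer column from the question rather than NaN/1.0, the non-matches can be filled in afterwards (a minimal sketch, assuming the merge result is assigned to a variable named out):

out = df2.merge(df1, on=['ID', 'Another_ID'], how='left')

# rows with no match in df1 come back as NaN; turn them into 0 and downcast
out['seen_before'] = out['seen_before'].fillna(0).astype('int8')

Using a small integer dtype such as int8 keeps the extra column cheap at 10 million rows.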

Note: this assumes that df1 has no duplicate (ID, Another_ID) pairs. If you are not sure about this, replace df1 with df1.drop_duplicates() in the merge.
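
Concretely, that variant would look like this (same assumed names as above); dropping duplicates can also shrink the right-hand side of the merge, which helps with memory at 90 million rows:

out = df2.merge(df1.drop_duplicates(), on=['ID', 'Another_ID'], how='left')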