I have two dataframes. The first one describes users and looks like this:
user_id city_id
0 a
1 a
2 b
3 a
4 c
.. and so on
The second one gives the fraction of each city that belongs to each district, something like this:
city_id district_id probability
a a1 0.01
a a2 0.02
a a3 0.02
a a4 0.56
a a5 0.39
b b1 0.63
b b2 0.07
b b3 0.30
and so on..
I need to assign each user to a district of their city based on these probabilities, so that (for example) approximately 56% of the users who live in city a end up in district a4, and so on. The final df would have one row per user with user_id, city_id and district_id.
My first idea was to give each user a random number and compare it with the probability.
My second idea was to group the rows by city_id, look up the second table, and fill in the third column according to the probabilities. For city a, that means I would select 56% of the rows in the group and give them the district value a4, and so on. But I am not sure that this is mathematically the best way.
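For what it's worth, the first idea (one random number per user, compared against the probabilities) can be sketched for a single city as below. This is only an illustration under assumptions: the arrays `districts` and `probs` are typed in by hand from the city a rows above, and the names `u`, `edges`, `picked` and `share_a4` are made up for the example.

```python
import numpy as np

# City "a" rows from the second table (typed in by hand for illustration).
districts = np.array(["a1", "a2", "a3", "a4", "a5"])
probs = np.array([0.01, 0.02, 0.02, 0.56, 0.39])

rng = np.random.default_rng(0)
u = rng.random(10_000)        # one uniform number in [0, 1) per user

edges = np.cumsum(probs)      # running totals of the probabilities
edges[-1] = 1.0               # guard against floating-point rounding in the last edge

# For each draw, find the first running total that is strictly greater than it.
picked = districts[np.searchsorted(edges, u, side="right")]

share_a4 = (picked == "a4").mean()   # should come out close to 0.56
```

Each user is drawn independently here, so the shares are only approximately 56%, 39% and so on; they get closer as the number of users grows.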
If df1 and df2 are your two dataframes:
    import numpy as np

    def get_district(city):
        dlist = list(df2.loc[df2['city_id'] == city, 'district_id'])  # get list of districts
        p = list(df2.loc[df2['city_id'] == city, 'probability'])  # get corresponding odds
        return np.random.choice(dlist, p=p)  # give weighted random choice from list
And apply this:
    df1['district_id'] = df1.city_id.apply(get_district)
After @JoeCondron's helpful comments, another method:
    def get_city_district(city, df1, df2):
        l = len(df1[df1.city_id == city])
        d = df2[df2['city_id'] == city]
        ds, p = list(d['district_id']), list(d['probability'])
        df1.loc[df1.city_id == city, 'district_id'] = np.random.choice(ds, size=l, p=p)
        return df1

    def f(df1, df2):
        df1['district_id'] = None
        for i in set(df1.city_id):
            df1 = get_city_district(i, df1, df2)
        return df1
This was much faster when I tested it, though only with a few cities.
This will call get_district for every single row in df, which will in turn slice df2 each time unnecessarily. We only need to get the weightings for each unique city once. Also, you are generating the same Boolean key twice.
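One way to act on the comment above is sketched here: look up each city's weightings once, build each Boolean mask once, and draw all of that city's users in a single call. The function name `assign_districts`, its `seed` parameter, the rescaling of the probabilities, and the choice to leave users of a city with no listed weightings as None are all assumptions made for this example, not part of the original answer.

```python
import numpy as np
import pandas as pd

def assign_districts(users, weights, seed=None):
    """Return a copy of `users` with a randomly drawn district_id column."""
    rng = np.random.default_rng(seed)

    # Look up each city's districts and probabilities exactly once.
    lookup = {
        city: (g["district_id"].to_numpy(), g["probability"].to_numpy(dtype=float))
        for city, g in weights.groupby("city_id")
    }

    cities = users["city_id"].to_numpy()
    result = np.empty(len(users), dtype=object)  # filled with None by default

    for city in pd.unique(cities):
        if city not in lookup:
            continue  # assumption: cities without weightings stay None
        mask = cities == city  # the Boolean key is built once per city
        districts, p = lookup[city]
        # Assumption: rescale so rounding in the table cannot break the draw.
        result[mask] = rng.choice(districts, size=int(mask.sum()), p=p / p.sum())

    out = users.copy()
    out["district_id"] = result
    return out

# Small frames shaped like the ones in the question.
df1 = pd.DataFrame({"user_id": [0, 1, 2, 3, 4],
                    "city_id": ["a", "a", "b", "a", "c"]})
df2 = pd.DataFrame({
    "city_id": ["a"] * 5 + ["b"] * 3,
    "district_id": ["a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3"],
    "probability": [0.01, 0.02, 0.02, 0.56, 0.39, 0.63, 0.07, 0.30],
})
assigned = assign_districts(df1, df2, seed=0)
```

Unlike f above, this sketch does not modify df1 in place; it returns a new dataframe, which makes it easier to rerun with a different seed.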