This article is reproduced from stackoverflow.com.
dataframe pandas python

Classify dataframe rows based on probability

Published on 2020-03-27 15:45:41

I have two dataframes. The first one contains users and looks like this:

user_id    city_id
  0           a
  1           a
  2           b
  3           a
  4           c
.. and so on

The second one gives the share of each city that belongs to each district, something like this:

 city_id     district_id    probability
    a             a1           0.01
    a             a2           0.02
    a             a3           0.02
    a             a4           0.56
    a             a5           0.39
    b             b1           0.63
    b             b2           0.07
    b             b3           0.30
 and so on.. 

I need to assign each user to a district of their city according to these probabilities. For example, approximately 56% of the users who live in city a should end up in district a4, and so on. The final df would have one row per user with user_id, city_id and district_id.

My first idea was to give each user a random number and compare it with the probability.
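
One way to read this first idea is as inverse-CDF sampling: draw a uniform number per user and find the district whose cumulative-probability interval contains it. The following is only a sketch under that reading; the function name `assign_by_random_number`, the `seed` parameter and the sample frames are illustrative assumptions, not part of the question.

```python
import numpy as np
import pandas as pd

def assign_by_random_number(users, districts, seed=None):
    """Hypothetical helper: map one uniform random number per user to a
    district via the cumulative probabilities of the user's city."""
    rng = np.random.default_rng(seed)
    out = users.copy()
    out["district_id"] = None
    for city, grp in districts.groupby("city_id"):
        mask = (out["city_id"] == city).to_numpy()
        if not mask.any():
            continue
        # Upper edges of each district's interval on [0, 1].
        cum = grp["probability"].cumsum().to_numpy(dtype=float)
        cum = cum / cum[-1]  # normalise so the last edge is exactly 1
        u = rng.random(int(mask.sum()))  # one number in [0, 1) per user
        idx = np.searchsorted(cum, u, side="right")
        out.loc[mask, "district_id"] = grp["district_id"].to_numpy()[idx]
    return out

# Assumed sample data shaped like the tables in the question.
users = pd.DataFrame({"user_id": range(5), "city_id": ["a", "a", "b", "a", "c"]})
districts = pd.DataFrame({
    "city_id": ["a"] * 5 + ["b"] * 3,
    "district_id": ["a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3"],
    "probability": [0.01, 0.02, 0.02, 0.56, 0.39, 0.63, 0.07, 0.30],
})
print(assign_by_random_number(users, districts, seed=0))
```

In this sketch, users whose city has no rows in the second table (city c in the sample) keep `district_id` as None.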

My second idea was to group the rows by city_id, look up the second table, and fill in a third column according to the probabilities. For city a, that means selecting 56% of the rows in the group and giving them the district value a4, and so on. But I am not sure this is mathematically the best way.
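
The second idea could be sketched as a quota allocation: within each city, give every district a fixed number of users proportional to its probability, then shuffle who gets which label. This is a sketch under assumptions; the name `assign_by_quota`, the `seed` parameter and the largest-remainder rounding rule are choices made here for illustration, not something the question specifies.

```python
import numpy as np
import pandas as pd

def assign_by_quota(users, districts, seed=None):
    """Hypothetical helper: fill each city's districts with a fixed share
    of its users instead of drawing each user independently."""
    rng = np.random.default_rng(seed)
    out = users.copy()
    out["district_id"] = None
    for city, grp in districts.groupby("city_id"):
        mask = (out["city_id"] == city).to_numpy()
        n = int(mask.sum())
        if n == 0:
            continue
        p = grp["probability"].to_numpy(dtype=float)
        exact = n * p / p.sum()               # ideal, possibly fractional, counts
        counts = np.floor(exact).astype(int)
        # Largest-remainder rounding (an assumption): leftover users go to
        # the districts with the biggest fractional parts.
        leftover = n - counts.sum()
        order = np.argsort(exact - counts)[::-1]
        counts[order[:leftover]] += 1
        labels = np.repeat(grp["district_id"].to_numpy(), counts)
        out.loc[mask, "district_id"] = rng.permutation(labels)
    return out

# Assumed sample data: 100 users in city a, with the shares from the question.
users = pd.DataFrame({"user_id": range(100), "city_id": ["a"] * 100})
districts = pd.DataFrame({
    "city_id": ["a"] * 5,
    "district_id": ["a1", "a2", "a3", "a4", "a5"],
    "probability": [0.01, 0.02, 0.02, 0.56, 0.39],
})
print(assign_by_quota(users, districts, seed=0)["district_id"].value_counts())
```

The difference from independent draws is that the shares here match the table as closely as rounding allows, whereas random draws only match it approximately and fluctuate from run to run.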

Questioner: Anajlim
Viewed: 36

Answer by Jim Eisenberg, 2020-02-01 00:54

If df1 and df2 are your two dataframes:

import numpy as np

def get_district(city):
    rows = df2[df2['city_id'] == city]   # rows of df2 for this city
    dlist = list(rows['district_id'])    # list of districts
    p = list(rows['probability'])        # corresponding probabilities (must sum to 1)
    return np.random.choice(dlist, p=p)  # weighted random choice from the list

And apply this:

df1['district_id'] = df1.city_id.apply(get_district)

After @JoeCondron's helpful comments, another method:

def get_city_district(city, df1, df2):
    n = len(df1[df1.city_id == city])    # number of users in this city
    d = df2[df2['city_id'] == city]      # districts of this city
    ds, p = list(d['district_id']), list(d['probability'])
    # draw all districts for this city in one call
    df1.loc[df1.city_id == city, 'district_id'] = np.random.choice(ds, size=n, p=p)
    return df1

def f(df1,df2):
    df1['district_id'] = None
    for i in set(df1.city_id):
        df1 = get_city_district(i,df1,df2)

    return df1

Much faster when tested, but only with a few cities.
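
To check that the assigned shares come out roughly as requested, the observed frequencies can be compared with the probability table. The sketch below is illustrative only: `assign_districts` is a seeded, non-mutating variant of the per-city draw written for this example, and `observed_shares`, the `seed` parameter and the sample data are assumptions rather than part of the answer.

```python
import numpy as np
import pandas as pd

def assign_districts(users, districts, seed=None):
    """Hypothetical variant of the per-city draw that returns a copy."""
    rng = np.random.default_rng(seed)
    out = users.copy()
    out["district_id"] = None
    for city, grp in districts.groupby("city_id"):
        mask = (out["city_id"] == city).to_numpy()
        if not mask.any():
            continue
        p = grp["probability"].to_numpy(dtype=float)
        out.loc[mask, "district_id"] = rng.choice(
            grp["district_id"].to_numpy(), size=int(mask.sum()), p=p / p.sum()
        )
    return out

def observed_shares(assigned, districts):
    """Put the empirical share of each district next to its target probability."""
    counts = (
        assigned.groupby(["city_id", "district_id"]).size().rename("n").reset_index()
    )
    counts["observed"] = counts["n"] / counts.groupby("city_id")["n"].transform("sum")
    merged = districts.merge(
        counts[["city_id", "district_id", "observed"]],
        on=["city_id", "district_id"],
        how="left",
    )
    # Districts that received no users have no count row, so fill with 0.
    return merged.fillna({"observed": 0.0})

# Assumed sample data: many users, so the observed shares settle near the targets.
users = pd.DataFrame({"user_id": range(20000), "city_id": ["a"] * 10000 + ["b"] * 10000})
districts = pd.DataFrame({
    "city_id": ["a"] * 5 + ["b"] * 3,
    "district_id": ["a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3"],
    "probability": [0.01, 0.02, 0.02, 0.56, 0.39, 0.63, 0.07, 0.30],
})
print(observed_shares(assign_districts(users, districts, seed=0), districts))
```

With only a handful of users per city the observed column can differ noticeably from the probabilities, since each user is drawn independently.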