Classify dataframe rows based on probability

Jim Eisenberg 2020-02-01 00:54

If df1 and df2 are your two dataframes:

import numpy as np
def get_district(city):
    dlist = list(df2.loc[df2['city_id']==city, 'district_id'])  # districts for this city
    p = list(df2.loc[df2['city_id']==city, 'probability'])  # corresponding probabilities
    return np.random.choice(dlist, p=p)  # weighted random choice from the list

And apply this:

df1['district_id'] = df1.city_id.apply(get_district)
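To make the row-by-row approach concrete, here is a small self-contained run. The sample values are made up for illustration; only the column names (city_id, district_id, probability) come from the answer. Note that np.random.choice expects the probabilities for each city to sum to 1.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for this illustration.
df1 = pd.DataFrame({"city_id": [1, 1, 2, 2, 2]})
df2 = pd.DataFrame({
    "city_id":     [1, 1, 2, 2],
    "district_id": [10, 11, 20, 21],
    "probability": [0.3, 0.7, 0.5, 0.5],  # sums to 1 within each city
})

def get_district(city):
    # Districts and their probabilities for the given city.
    dlist = list(df2.loc[df2["city_id"] == city, "district_id"])
    p = list(df2.loc[df2["city_id"] == city, "probability"])
    # One weighted random draw.
    return np.random.choice(dlist, p=p)

# One draw per row of df1.
df1["district_id"] = df1["city_id"].apply(get_district)
print(df1)
```

Each row ends up with a district drawn from its own city's list, so the output differs from run to run.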

After @JoeCondron's helpful comments, another method:

def get_city_district(city, df1, df2):
    mask = df1.city_id == city  # rows of df1 in this city, built once and reused
    d = df2[df2['city_id'] == city]
    ds, p = list(d['district_id']), list(d['probability'])
    # draw one district per matching row in a single call
    df1.loc[mask, 'district_id'] = np.random.choice(ds, size=mask.sum(), p=p)
    return df1

def f(df1,df2):
    df1['district_id'] = None
    for i in set(df1.city_id):
        df1 = get_city_district(i,df1,df2)

    return df1

This was much faster when I tested it, although I only tried it with a few cities.

JoeCondron 2020-02-01 00:21:57

This will call get_district for every single row of the dataframe, which in turn slices df2 each time unnecessarily. We only need to get the weightings for each unique city once. Also, you are generating the same Boolean mask twice.
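One way to act on this comment, as a rough sketch rather than the commenter's own code: group df2 by city so each city's districts and weights are looked up once, then draw all of that city's samples in a single call. The function name assign_districts, the rng parameter, and the sample data are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

def assign_districts(df1, df2, rng=None):
    """Return a copy of df1 with a sampled district_id column (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    cities = df1["city_id"].to_numpy()
    # Object array, so cities missing from df2 simply stay None.
    result = np.empty(len(df1), dtype=object)
    # groupby visits each city in df2 exactly once.
    for city, group in df2.groupby("city_id"):
        mask = cities == city          # Boolean mask built once per city
        n = int(mask.sum())
        if n:
            result[mask] = rng.choice(
                group["district_id"].to_numpy(),
                size=n,
                p=group["probability"].to_numpy(),
            )
    out = df1.copy()
    out["district_id"] = result
    return out

# Hypothetical sample data, invented for this illustration.
df1 = pd.DataFrame({"city_id": [1, 2, 1, 2, 2]})
df2 = pd.DataFrame({
    "city_id":     [1, 1, 2, 2],
    "district_id": [10, 11, 20, 21],
    "probability": [0.3, 0.7, 0.5, 0.5],
})
out = assign_districts(df1, df2, rng=np.random.default_rng(0))
print(out)
```

Returning a copy leaves the caller's df1 untouched, and passing a seeded generator makes the draws reproducible.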
