Note: this post is translated from stackoverflow.com. Original title: python - Classify dataframe rows based on probability

dataframe pandas python

python - Classify dataframe rows based on probability

Posted 2020-03-27 16:11:43

I have two dataframes. The first one is about users and looks like this:

user_id    city_id
  0           a
  1           a
  2           b
  3           a
  4           c
.. and so on

The second one gives the share of each city that belongs to each of its districts, like this:

 city_id     district_id    probability
    a             a1           0.01
    a             a2           0.02
    a             a3           0.02
    a             a4           0.56
    a             a5           0.39
    b             b1           0.63
    b             b2           0.07
    b             b3           0.30
 and so on..

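I need to assign the users to the districts of their city according to these probabilities. So, for example, roughly 56% of the people who live in city a should come from district a4, and so on. Basically, the final df would have rows with user_id, city_id and district_id.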

My first idea was to give each user a random number and compare it against the probabilities.
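A minimal sketch of this first idea, assuming each city's probabilities sum to 1: draw one uniform number per user and pick the district whose cumulative-probability interval contains it. The sample frames `users` and `districts`, the helper `pick_by_random_number` and the seed are illustrative assumptions, not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data shaped like the two tables in the question.
users = pd.DataFrame({"user_id": range(5), "city_id": list("aabab")})
districts = pd.DataFrame({
    "city_id": ["a"] * 5 + ["b"] * 3,
    "district_id": ["a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3"],
    "probability": [0.01, 0.02, 0.02, 0.56, 0.39, 0.63, 0.07, 0.30],
})

def pick_by_random_number(city, u):
    """Return the district whose cumulative-probability interval contains u (0 <= u < 1)."""
    rows = districts[districts["city_id"] == city]
    cum = rows["probability"].cumsum().to_numpy()
    # Index of the first running total that exceeds u.
    # The min() guards against float round-off in the last running total.
    idx = min(int(np.searchsorted(cum, u, side="right")), len(cum) - 1)
    return rows["district_id"].iloc[idx]

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
users["u"] = rng.random(len(users))
users["district_id"] = [
    pick_by_random_number(city, u)
    for city, u in zip(users["city_id"], users["u"])
]
```

This still filters the district table once per user, so it is meant to show the idea rather than to be fast.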

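My second idea was to group the rows by city_id, look up the second table and choose by probability (writing the value into a third column). So for city a, this means I would select 56% of the rows in the group and set their district to a4, and so on. But I am not sure this is the best approach mathematically.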
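For comparison, here is a rough sketch of that second idea for a single city. Unlike random sampling, it fixes the district counts to the rounded proportions and only randomizes which user receives which label. The function name `allocate_exact` and the largest-remainder rounding are assumptions made for illustration.

```python
import numpy as np

def allocate_exact(n_users, district_ids, probs, rng):
    """Return n_users labels whose counts match probs as closely as whole numbers allow."""
    raw = np.asarray(probs, dtype=float) * n_users
    counts = np.floor(raw).astype(int)
    # Give the users lost to rounding to the districts with the largest remainders.
    leftover = int(n_users - counts.sum())
    order = np.argsort(raw - counts)[::-1]
    counts[order[:leftover]] += 1
    labels = np.repeat(np.asarray(district_ids), counts)
    rng.shuffle(labels)  # randomize which user gets which label
    return labels

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
labels_a = allocate_exact(
    100, ["a1", "a2", "a3", "a4", "a5"], [0.01, 0.02, 0.02, 0.56, 0.39], rng
)
labels_b = allocate_exact(10, ["b1", "b2", "b3"], [0.63, 0.07, 0.30], rng)
```

With 100 users in city a, exactly 56 of the labels are a4. Whether fixed counts or independent random draws are preferable depends on whether the shares must hold exactly or only on average.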

Asked by Anajlim


Jim Eisenberg 2020-02-01 00:54

If df1 and df2 are your two dataframes:

import numpy as np

def get_district(city):
    dlist = list(df2.loc[df2['city_id'] == city, 'district_id'])  # districts of this city
    p = list(df2.loc[df2['city_id'] == city, 'probability'])  # matching probabilities
    return np.random.choice(dlist, p=p)  # weighted random choice from the list

and apply it:

df1['district_id'] = df1.city_id.apply(get_district)

Following @JoeCondron's helpful comment, another approach is:

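def get_city_district(city, df1, df2):
    # Number of users living in this city
    n = len(df1[df1.city_id == city])
    # Districts of this city and their probabilities
    d = df2[df2['city_id'] == city]
    ds, p = list(d['district_id']), list(d['probability'])
    # Draw the districts for all of this city's users in one call
    df1.loc[df1.city_id == city, 'district_id'] = np.random.choice(ds, size=n, p=p)
    return df1

def f(df1, df2):
    df1['district_id'] = None
    # One pass per unique city rather than one per user
    for i in set(df1.city_id):
        df1 = get_city_district(i, df1, df2)

    return df1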

Tested to be much faster, though only with a small number of cities.

JoeCondron 2020-02-01 00:21:57

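This will call get_district for every row of df, and each call slices df2 unnecessarily. We only need to get the weights for each unique city once. Also, you are generating the same boolean key twice.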
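One way to read this comment, as a sketch only: look up each city's weights once, then draw all of that city's users in a single call. The function `assign_districts`, its `seed` parameter and the sample frames are hypothetical names; leaving cities that are missing from the probability table as None is also an assumption.

```python
import numpy as np
import pandas as pd

def assign_districts(users, districts, seed=None):
    """Return a copy of users with a district_id drawn per city from districts."""
    rng = np.random.default_rng(seed)
    # Slice the probability table once per unique city, not once per user.
    weights = {
        city: (g["district_id"].to_numpy(), g["probability"].to_numpy())
        for city, g in districts.groupby("city_id")
    }
    out = users.copy()
    out["district_id"] = None
    for city, idx in out.groupby("city_id").groups.items():
        if city not in weights:
            continue  # assumption: cities without probabilities stay None
        ids, p = weights[city]
        out.loc[idx, "district_id"] = rng.choice(ids, size=len(idx), p=p)
    return out

# Hypothetical sample data shaped like the two tables in the question.
users = pd.DataFrame({"user_id": range(6), "city_id": list("aabacb")})
districts = pd.DataFrame({
    "city_id": ["a"] * 5 + ["b"] * 3,
    "district_id": ["a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3"],
    "probability": [0.01, 0.02, 0.02, 0.56, 0.39, 0.63, 0.07, 0.30],
})
result = assign_districts(users, districts, seed=0)
```

Working on a copy keeps the caller's frame untouched, which differs from the answer's functions that modify df1 in place.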
