Note: this article is translated from stackoverflow.com. Original: python - How to get around slow groupby for a sparse matrix?

pandas python sparse-matrix

python - How to get around slow groupby for a sparse matrix?

Posted on 2020-03-29 21:58:06

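I have a large matrix (about 200 million rows) listing the actions that occurred each day (there are roughly 10,000 possible actions). My end goal is a co-occurrence matrix showing which actions occur on the same day.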

Here is an example dataset:

import pandas as pd

data = {'date':   ['01', '01', '01', '02','02','03'],
        'action': [100, 101, 989855552, 100, 989855552, 777]}
df = pd.DataFrame(data, columns = ['date','action'])

I tried creating a sparse matrix with pd.get_dummies, but unraveling the matrix and running groupby on it is extremely slow: it takes 6 minutes for just 5,000 rows.

# Create a sparse matrix of dummies
dum = pd.get_dummies(df['action'], sparse = True)
df = df.drop(['action'], axis = 1)
df = pd.concat([df, dum], axis = 1)

# Use groupby to get a single row for each date, showing whether each action occurred.
# The groupby command here is the bottleneck.
cols = list(df.columns)
del cols[0]
df = df.groupby('date')[cols].max()

# Create a co-occurrence matrix by using dot-product of sparse matrices
cooc = df.T.dot(df)

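I have also tried:

1. Getting the dummies in non-sparse format;
2. Aggregating with groupby;
3. Converting to sparse format before the matrix multiplication.

But this fails at step 1, because there is not enough RAM to create such a large matrix.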
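
The out-of-memory failure at step 1 is consistent with a rough size estimate. The sketch below uses the approximate figures from the question; the bytes-per-cell and bytes-per-entry values are assumptions chosen for illustration, not measurements of what pandas actually allocates.

```python
# Back-of-the-envelope sizes, using the approximate figures from the question.
n_rows = 200_000_000   # ~200 million (date, action) rows
n_actions = 10_000     # ~10,000 distinct actions

# Dense dummies: one cell per (row, action).
# Assumption: 1 byte per cell (bool or uint8).
dense_bytes = n_rows * n_actions

# Sparse dummies: only the nonzeros are stored, exactly one per input row.
# Assumption: a COO-style triplet of 4-byte row index, 4-byte column index
# and 1-byte value per nonzero.
sparse_bytes = n_rows * (4 + 4 + 1)

print(f"dense:  {dense_bytes / 1e12:.1f} TB")
print(f"sparse: {sparse_bytes / 1e9:.1f} GB")
```

Under these assumptions the dense table is about three orders of magnitude larger than the sparse one, which is why only a sparse representation is practical here.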

Any help is greatly appreciated.

Asked by

Dudelstein


Dudelstein 2020-01-31 18:51

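Based on this post, I came up with an answer that uses sparse matrices only. The code is fast: 10 million rows take about 10 seconds (my previous code took 6 minutes for 5,000 rows and did not scale).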

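The time and memory savings come from staying in sparse format until the very last step, when the (already small) co-occurrence matrix has to be unraveled into dense form before exporting.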

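import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

## Get unique values for date and action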
date_c = CategoricalDtype(sorted(df.date.unique()), ordered=True)
action_c = CategoricalDtype(sorted(df.action.unique()), ordered=True)

## Add an auxiliary variable
df['count'] = 1

## Define a sparse matrix
row = df.date.astype(date_c).cat.codes
col = df.action.astype(action_c).cat.codes
sparse_matrix = csr_matrix((df['count'], (row, col)),
                shape=(date_c.categories.size, action_c.categories.size))

## Compute dot product with sparse matrix
cooc_sparse = sparse_matrix.T.dot(sparse_matrix)

## Unravel co-occurrence matrix into dense shape
cooc = pd.DataFrame(cooc_sparse.todense(), 
       index = action_c.categories, columns = action_c.categories)
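
As a sanity check, the same approach can be run end to end on the sample data from the question. This is a minimal sketch, not the answer's exact code: the extra duplicate row, the name `incidence`, and the clipping step are assumptions added here. Because `csr_matrix` sums repeated (row, col) entries, an action logged twice on one day would otherwise be counted twice, whereas the question's `groupby(...).max()` only records whether the action occurred.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

# Sample data from the question, plus one hypothetical duplicate row
# ('01', 100) added here to exercise the duplicate handling.
df = pd.DataFrame({
    'date':   ['01', '01', '01', '02', '02', '03', '01'],
    'action': [100, 101, 989855552, 100, 989855552, 777, 100],
})

date_c = CategoricalDtype(sorted(df.date.unique()), ordered=True)
action_c = CategoricalDtype(sorted(df.action.unique()), ordered=True)

# Integer codes give the row (date) and column (action) of each entry.
row = df.date.astype(date_c).cat.codes.to_numpy()
col = df.action.astype(action_c).cat.codes.to_numpy()
ones = np.ones(len(df), dtype=np.int64)

incidence = csr_matrix((ones, (row, col)),
                       shape=(date_c.categories.size, action_c.categories.size))

# Assumption: an action counts at most once per day. Merge duplicate
# entries, then reset every stored value to 1 (presence/absence).
incidence.sum_duplicates()
incidence.data[:] = 1

# Co-occurrence counts: entry (a, b) is the number of days with both a and b.
cooc_sparse = incidence.T.dot(incidence)
cooc = pd.DataFrame(cooc_sparse.toarray(),
                    index=action_c.categories, columns=action_c.categories)

# Dense cross-check with pandas; feasible only because the sample is tiny.
b = pd.crosstab(df['date'], df['action']).clip(upper=1)
expected = b.T.dot(b)

print(cooc)
```

The diagonal of `cooc` holds the number of days on which each action occurred, and the off-diagonal entries hold the number of days shared by each pair of actions. If raw event counts are wanted instead of presence, the clipping step can be dropped.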
