Note: this post is translated from stackoverflow.com.

pandas python

python - Write unique rows to CSV with pandas in chunks

Posted on 2020-03-27 11:18:26

I have a CSV with 100 million rows, which I have to read in chunks with pandas, like this:

import pandas

df_chunks = pandas.read_csv(
    'my-file.csv.gz',
    sep='\t',
    chunksize=100000,
    compression='gzip')

for df in df_chunks:
    # here I filter some rows and columns and after that
    # I write to a new csv
    filtered_df = df  # placeholder: the actual filtering step is not shown
    filtered_df.to_csv(
        'my_filtered.csv.gz',
        sep=',',
        columns=['id', 'date'],
        compression='gzip',
        mode='a')

The data I want to write looks like this, with only 2 columns:

id,date
42517544,2019-06-30
42517544,2019-06-30
42517544,2019-07-01
...

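Now I could use something like df.drop_duplicates(), but since I am writing in chunks, I could still end up with duplicates across chunks. Note that the file is big, about 10 GB, so I need to read and write in chunks.

I would like to find a way to do this with pandas, perhaps with an in-memory set that does not consume too much memory, since memory is also a constraint.

What would be a good approach?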
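As a rough sketch of the in-memory set idea: keep a set of the (id, date) keys already written and drop any row whose key has been seen in an earlier chunk. The helper name `iter_unique_chunks` and the tiny inline sample are hypothetical, used only for illustration. The set grows with the number of distinct rows, so this only fits when the unique keys are far fewer than the total rows.

```python
import io

import pandas as pd


def iter_unique_chunks(chunks, key_cols=("id", "date")):
    """Yield each chunk with rows already seen in earlier chunks removed."""
    seen = set()  # memory use grows with the number of distinct keys
    for chunk in chunks:
        # Remove duplicates inside the chunk first.
        chunk = chunk.drop_duplicates(subset=list(key_cols))
        keys = list(zip(*(chunk[c] for c in key_cols)))
        # Keep only rows whose key did not appear in a previous chunk.
        is_new = [k not in seen for k in keys]
        seen.update(keys)
        yield chunk.loc[is_new]


# Tiny in-memory stand-in for the real file, read 2 rows at a time.
sample = (
    "id,date\n"
    "1,2019-06-30\n"
    "1,2019-06-30\n"
    "1,2019-07-01\n"
    "1,2019-06-30\n"
    "2,2019-07-01\n"
)
chunks = pd.read_csv(io.StringIO(sample), chunksize=2)
result = pd.concat(list(iter_unique_chunks(chunks)), ignore_index=True)
```

In the real loop, each yielded chunk would be appended to the output file instead of being concatenated in memory.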

Asked by

PepperoniPizza


piRSquared 2019-07-03 22:35

Setup

One million rows

import numpy as np
import pandas as pd

np.random.seed([3, 1415])
n = 1_000_000
dfout = pd.DataFrame({
    'id': np.random.randint(1000, size=n),
    'date': np.random.choice(pd.date_range('2019-01-01', periods=1000), size=n)
})

dfout.to_csv('my-file.csv.gz', compression='gzip', sep='\t', index=False)

Solution

Read in chunks, as you did

df_chunks = pd.read_csv(
    'my-file.csv.gz',
    sep='\t',
    chunksize=100000,
    compression='gzip')

Write the rows for each unique date to their own file

for i, df in enumerate(df_chunks):
    for date, d in df.groupby('date'):
        date = pd.Timestamp(date)
        d.drop_duplicates().to_csv(
            f'{date:%Y%m%d}.csv.gz',
            compression='gzip',
            mode='a',
            index=False,
            header=False
        )
    print(f'\r{i}', end='')
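The per-date split is safe because two identical rows necessarily have the same date, so they always end up in the same file. De-duplicating each date file on its own therefore gives the same result as de-duplicating the whole dataset at once. A small in-memory check of that reasoning, using made-up toy data:

```python
import pandas as pd

# Toy data (hypothetical) with duplicates spread across two dates.
toy = pd.DataFrame({
    "id": [1, 1, 2, 1, 2, 2],
    "date": ["2019-06-30", "2019-06-30", "2019-06-30",
             "2019-07-01", "2019-07-01", "2019-07-01"],
})

# De-duplicate globally: the desired result, but it needs all rows in memory.
global_unique = toy.drop_duplicates()

# De-duplicate each date partition separately, then stitch the parts together.
per_date_unique = pd.concat(
    [part.drop_duplicates() for _, part in toy.groupby("date")]
)


def normalized(frame):
    """Sort rows and reset the index so two frames can be compared directly."""
    return frame.sort_values(["date", "id"]).reset_index(drop=True)


same = normalized(global_unique).equals(normalized(per_date_unique))
```

Here `same` is expected to be true: both routes keep exactly one copy of each (id, date) pair.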

Read in each date file, drop_duplicates, and write the result back out

from pathlib import Path

path = Path('.')

for i, fh in enumerate(path.glob('[0-9]' * 8 + '.csv.gz')):
    df = pd.read_csv(fh, header=None)
    df.drop_duplicates().to_csv(
        'my_filtered.csv.gz',
        compression='gzip',
        mode='a',
        index=False,
        header=False
    )
    print(f'\r{i}: {fh}', end='')

df = pd.read_csv(
    'my_filtered.csv.gz',
    compression='gzip',
    header=None,
    names=['id', 'date']
)

Verification

assert len(df) == len(dfout) - dfout.duplicated().sum()
assert df.duplicated().sum() == 0

PepperoniPizza 2019-07-04 10:56:02

A very detailed approach, and this really is an optimal solution. I will give it a try, thanks!
