Note: this post is translated from stackoverflow.com. Original post: python - How to create term frequency matrix for multiple text files?

list matrix python python-3.x

python - How to create term frequency matrix for multiple text files?

Posted on 2020-04-10 16:59:55

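I have four text files, each containing some different keywords: test1.txt, test2.txt, test3.txt and test4.txt. I want to turn their keyword counts into a matrix (a list of lists). This is the code I have so far: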

temp = [''] + list(sample_collection)
values = list(sample_collection['test1.txt'])

sample_collection = [temp] + [[x] + [v.get(x, 0) for v in sample_collection.values()] for x in values]

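However, I want to modify it so that it includes not only the keywords from test1.txt but also every other unique keyword from the other files. I don't know how to do that. Is there a piece of code that can do this?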

Expected output:

[['', 'test1.txt', 'test2.txt', 'test3.txt', 'test4.txt'],
['apple', 1, 0, 2, 1],
['banana', 1, 1, 1, 1],
['lemon', 1, 1, 0, 0],
['grape', 0, 0, 0, 1]]

Asked by Lana_Del_Neigh

Views: 295


Green 2020-02-02 01:29

I would use scikit-learn (sklearn).

It is not part of Python's standard library, so you need to install it (pip install scikit-learn).

Then, import CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

Read your files and store their contents in a list. Let's say you call it my_corpus. You now have a list named my_corpus with 4 members, one string per file.
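Reading the files into my_corpus could look like the sketch below. It is a minimal illustration, not part of the original answer: read_corpus is a hypothetical helper name, and the file contents are made up so that the example runs on its own in a temporary directory.

```python
import tempfile
from pathlib import Path


def read_corpus(paths):
    """Read each file and return its contents as one string per file."""
    return [Path(p).read_text(encoding="utf-8") for p in paths]


# Demo only: write four small sample files so the sketch is self-contained.
# The keywords below are invented for illustration.
samples = {
    "test1.txt": "apple banana lemon",
    "test2.txt": "banana lemon",
    "test3.txt": "apple apple banana",
    "test4.txt": "apple banana grape",
}

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in samples.items():
        path = Path(tmp) / name
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    my_corpus = read_corpus(paths)

print(len(my_corpus))
```

With real data, pass the paths of your own files to read_corpus and drop the temporary-directory part.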

Then just use:

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(my_corpus)

Also, if you do not wish to use any other packages, do the following:

corpus = ['I like dogs', 'I like cats', 'cats like milk', 'You likes me']
token_corpus = [s.split() for s in corpus]

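# Map each token to a list of counts, one slot per document.
vocabulary = {}
for i, f in enumerate(token_corpus):
    for t in f:
        if t not in vocabulary:
            # First time this token is seen: start with zero in every document.
            vocabulary[t] = [0] * len(corpus)
        vocabulary[t][i] += 1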

vocabulary
{'I': [1, 1, 0, 0], 'like': [1, 1, 1, 0], 'dogs': [1, 0, 0, 0], 'cats': [0, 1, 1, 0], 'milk': [0, 0, 1, 0], 'You': [0, 0, 0, 1], 'likes': [0, 0, 0, 1], 'me': [0, 0, 0, 1]}

If you want to keep it in a list, use:

list(map(list, vocabulary.items()))
[['I', [1, 1, 0, 0]], ['like', [1, 1, 1, 0]], ['dogs', [1, 0, 0, 0]], ['cats', [0, 1, 1, 0]], ['milk', [0, 0, 1, 0]], ['You', [0, 0, 0, 1]], ['likes', [0, 0, 0, 1]], ['me', [0, 0, 0, 1]]]
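For the exact list-of-lists layout in the question's expected output, the same idea can be applied to sample_collection directly. The sketch below assumes sample_collection maps each file name to a dict of keyword counts, which is inferred from the question's use of .values() and .get(); the sample counts are reconstructed from the expected output, and term_frequency_matrix is a hypothetical helper name.

```python
# Assumed shape: {filename: {keyword: count}}.
# The counts are reconstructed from the expected output in the question.
sample_collection = {
    "test1.txt": {"apple": 1, "banana": 1, "lemon": 1},
    "test2.txt": {"banana": 1, "lemon": 1},
    "test3.txt": {"apple": 2, "banana": 1},
    "test4.txt": {"apple": 1, "banana": 1, "grape": 1},
}


def term_frequency_matrix(collection):
    """Return a header row followed by one row per unique keyword."""
    header = [""] + list(collection)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    keywords = list(
        dict.fromkeys(k for counts in collection.values() for k in counts)
    )
    rows = [
        [k] + [counts.get(k, 0) for counts in collection.values()]
        for k in keywords
    ]
    return [header] + rows


matrix = term_frequency_matrix(sample_collection)
for row in matrix:
    print(row)
```

The only change from the code in the question is that the keywords come from every file rather than from test1.txt alone.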

Lana_Del_Neigh 2020-02-02 01:14:53

Thanks! But is there a way to do this without external libraries?

Green 2020-02-02 01:15:39

Sure, I will edit my answer.
