How to create term frequency matrix for multiple text files?

Green 2020-02-02 01:29

I would use the scikit-learn (sklearn) library.

It isn't part of the Python standard library, so you will need to install it (pip install scikit-learn).

Then import CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

Read your files and store their contents in a list, one string per file. Let's say you call it my_corpus; with four files, you now have a list named my_corpus with 4 members.
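A minimal sketch of that reading step, assuming the files are UTF-8 plain text. The helper name read_corpus and the "texts" directory are made up for illustration, not something from the question:

```python
from pathlib import Path


def read_corpus(paths):
    """Return one string per file, in the same order as the given paths."""
    # Assumption: every file is UTF-8 encoded plain text.
    return [Path(p).read_text(encoding="utf-8") for p in paths]


# Hypothetical usage: all .txt files in a "texts" directory, sorted by name.
# my_corpus = read_corpus(sorted(Path("texts").glob("*.txt")))
```

Sorting the paths keeps the row order of the resulting matrix predictable, so row i always corresponds to the i-th file name.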

Then just use:

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(my_corpus)

Alternatively, if you would rather not use any other packages, just do:

corpus = ["I like dogs", "I like cats", "cats like milk", "You likes me"]
token_corpus = [s.split() for s in corpus]

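# Map each token to a list of counts, one entry per document
vocabulary = {}
for i, f in enumerate(token_corpus):
    for t in f:
        if t not in vocabulary:
            # First time we see this token: start with zero counts everywhere
            vocabulary[t] = [0] * len(corpus)
        vocabulary[t][i] += 1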

vocabulary
{'I': [1, 1, 0, 0], 'like': [1, 1, 1, 0], 'dogs': [1, 0, 0, 0], 'cats': [0, 1, 1, 0], 'milk': [0, 0, 1, 0], 'You': [0, 0, 0, 1], 'likes': [0, 0, 0, 1], 'me': [0, 0, 0, 1]}

If you want to save it in a list, just use:

list(map(list, vocabulary.items()))
[['I', [1, 1, 0, 0]], ['like', [1, 1, 1, 0]], ['dogs', [1, 0, 0, 0]], ['cats', [0, 1, 1, 0]], ['milk', [0, 0, 1, 0]], ['You', [0, 0, 0, 1]], ['likes', [0, 0, 0, 1]], ['me', [0, 0, 0, 1]]]
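The dictionary above stores one list per term, with one count per document. If you would rather have one row per document, which is the orientation CountVectorizer returns, you can transpose it. This is only a sketch: the function name term_document_counts and the variables terms and rows are chosen here for illustration.

```python
def term_document_counts(corpus):
    """Map each whitespace-separated token to its per-document counts."""
    vocabulary = {}
    for i, text in enumerate(corpus):
        for token in text.split():
            if token not in vocabulary:
                vocabulary[token] = [0] * len(corpus)
            vocabulary[token][i] += 1
    return vocabulary


corpus = ["I like dogs", "I like cats", "cats like milk", "You likes me"]
vocabulary = term_document_counts(corpus)

# Column labels, in the order the terms were first seen.
terms = list(vocabulary)

# zip(*...) transposes term-by-document lists into one row per document.
rows = [list(doc_counts) for doc_counts in zip(*vocabulary.values())]

print(terms)
for row in rows:
    print(row)
```

Unlike CountVectorizer, this version is case-sensitive and splits only on whitespace, so punctuation stays attached to the words.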