Producing a Term-Document Matrix Processed with TF-IDF¶

(Term Frequency - Inverse Document Frequency: a word's weight rises with how often it appears in a given document and falls with how many documents in the collection contain it, so words that are frequent here but rare elsewhere score highest.)
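Before handing the work to scikit-learn below, the weighting itself can be sketched by hand. This is a minimal illustration on a toy two-word corpus (not the documents used later); it follows scikit-learn's default smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, but skips the L2 normalization the vectorizer also applies.

```python
import math

# Toy corpus: two tokenized "documents".
docs = [["the", "chance", "the"], ["the", "magic"]]
n_docs = len(docs)

def tfidf(term, doc):
    tf = doc.count(term)                       # raw count of the term in this document
    df = sum(term in d for d in docs)          # number of documents containing the term
    idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF, as in sklearn's default
    return tf * idf

# "the" appears in every document, so its IDF bottoms out at 1.0;
# "magic" appears in only one document, so it is weighted higher per occurrence.
print(tfidf("the", docs[0]))    # 2 * 1.0 = 2.0
print(tfidf("magic", docs[1]))  # 1 * (ln(3/2) + 1) ≈ 1.405
```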

Import Libraries¶

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Sample Pieces of Text¶

In [2]:
document_1 = "But there was one other thing that the grown-ups also knew, and it was this: that however small the chance might be of striking lucky, the chance is there. The chance had to be there."
document_2 = "I will not pretend I wasn't petrified. I was. But mixed in with the awful fear was a glorious feeling of excitement. Most of the really exciting things we do in our lives scare us to death. They wouldn't be exciting if they didn't."
document_3 = "A person who has good thoughts cannot ever be ugly. You can have a wonky nose and a crooked mouth and a double chin and stick-out teeth, but if you have good thoughts they will shine out of your face like sunbeams and you will always look lovely."
document_4 = "Never do anything by halves if you want to get away with it. Be outrageous. Go the whole hog. Make sure everything you do is so completely crazy it's unbelievable"
document_5 = "And above all, watch with glittering eyes the whole world around you because the greatest secrets are always hidden in the most unlikely places. Those who don't believe in magic will never find it."
In [3]:
documents = [document_1, document_2, document_3, document_4, document_5]

Process List Using TF-IDF¶

In [4]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

Produce the Term-Document Matrix¶

In [5]:
terms = tfidf_vectorizer.get_feature_names_out()

tfidf_df = pd.DataFrame(
    tfidf_matrix.T.toarray(),
    index=terms,
    columns=["Document 1", "Document 2", "Document 3", "Document 4", "Document 5"],
)

tfidf_df
Out[5]:
Document 1 Document 2 Document 3 Document 4 Document 5
above 0.000000 0.000000 0.000000 0.000000 0.182704
all 0.000000 0.000000 0.000000 0.000000 0.182704
also 0.144571 0.000000 0.000000 0.000000 0.000000
always 0.000000 0.000000 0.110176 0.000000 0.147404
and 0.096821 0.000000 0.365825 0.000000 0.122359
... ... ... ... ... ...
wonky 0.000000 0.000000 0.136561 0.000000 0.000000
world 0.000000 0.000000 0.000000 0.000000 0.182704
wouldn 0.000000 0.160672 0.000000 0.000000 0.000000
you 0.000000 0.000000 0.274369 0.262054 0.122359
your 0.000000 0.000000 0.136561 0.000000 0.000000

119 rows × 5 columns
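Notice that function words such as "and" and "you" carry some of the largest weights above, because they are frequent within a document even though IDF discounts them. One common refinement, sketched here on a hypothetical two-sentence corpus rather than the documents above, is the vectorizer's built-in English stop-word list, which drops such words before weighting:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two tiny stand-in sentences for illustration.
docs = ["the chance was always there", "you will always find the magic"]

plain = TfidfVectorizer().fit_transform(docs)
filtered = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Stop-word removal shrinks the vocabulary: function words like
# "the", "was", "you", and "will" are dropped, so the filtered
# matrix has fewer columns (terms) than the plain one.
print(plain.shape[1], filtered.shape[1])
```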

Print the Words From Document 1 in Order of Importance¶

In [6]:
filtered_tfidf_df = tfidf_df[tfidf_df['Document 1'] != 0]
sorted_tfidf_df = filtered_tfidf_df.sort_values(by='Document 1', ascending=False)
word_list = sorted_tfidf_df.index.tolist()
word_list
Out[6]:
['chance',
 'there',
 'the',
 'that',
 'was',
 'be',
 'one',
 'ups',
 'this',
 'thing',
 'striking',
 'small',
 'other',
 'also',
 'might',
 'lucky',
 'knew',
 'however',
 'had',
 'grown',
 'is',
 'and',
 'it',
 'to',
 'but',
 'of']
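The filter-then-sort in cell [6] can also be collapsed into a single chained expression on the column itself. A small sketch with made-up weights (stand-ins for the real TF-IDF values) shows the equivalent pattern:

```python
import pandas as pd

# Toy stand-in for tfidf_df: a few terms with invented weights.
tfidf_df = pd.DataFrame(
    {"Document 1": [0.30, 0.0, 0.14, 0.0]},
    index=["chance", "above", "also", "world"],
)

# Keep nonzero weights and sort descending, all in one chain.
ranked = tfidf_df["Document 1"].loc[lambda s: s > 0].sort_values(ascending=False)
print(ranked.index.tolist())  # ['chance', 'also']
```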